ISO/IEC JTC1/SC22/WG21 N2401 = J16/07-0261

Code Conversion Facets for the Standard C++ Library

P.J. Plauger
Dinkumware, Ltd.
pjp@dinkumware.com

2007-09-03

With the acceptance of N2007 (Proposed Library Additions for Code Conversion) we now have template classes wbuffer_convert and wstring_convert, as well as basic_filebuf, that accept code-conversion facets as template parameters. Unfortunately, the current draft C++ Standard defines only the default codecvt facet, with weakly specified properties. This paper proposes adding several facets that cover the most common Unicode encodings.

Add the header <codecvt> with the following definitions:

    namespace std {
    enum codecvt_mode {
        consume_header = 4,
        generate_header = 2,
        little_endian = 1};

    template<class Elem,
        unsigned long Maxcode = 0x10ffff,
        codecvt_mode Mode = (codecvt_mode)0>
        class codecvt_utf8
            : public std::codecvt<Elem, char, mbstate_t> {
        // facet for converting between Elem and UTF-8 byte sequences
        .....
        };

    template<class Elem,
        unsigned long Maxcode = 0x10ffff,
        codecvt_mode Mode = (codecvt_mode)0>
        class codecvt_utf16
            : public std::codecvt<Elem, char, mbstate_t> {
        // facet for converting between Elem and UTF-16 multibyte sequences
        .....
        };

    template<class Elem,
        unsigned long Maxcode = 0x10ffff,
        codecvt_mode Mode = (codecvt_mode)0>
        class codecvt_utf8_utf16
            : public std::codecvt<Elem, char, mbstate_t> {
        // facet for converting between UTF-16 Elem and UTF-8 byte sequences
        .....
        };
    }  // namespace std

For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:

-- Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.
-- Maxcode is the largest wide-character code that the facet will read or write without reporting a conversion error.
-- If (Mode & consume_header), the facet consumes an optional initial header sequence when reading a multibyte sequence to determine the endianness of the subsequent multibyte sequence to be read.
-- If (Mode & generate_header), the facet generates an initial header sequence when writing a multibyte sequence to advertise the endianness of the subsequent multibyte sequence to be written.
-- If (Mode & little_endian), the facet generates a multibyte sequence in little-endian order, as opposed to the default big-endian order.

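
The effect of the three mode flags is easiest to see on a UTF-16 byte stream, where endianness matters. The following is a minimal hand-written sketch, not the facet itself. The names sketch_mode, write_utf16_bytes, read_utf16_bytes, and mode_flags_example are hypothetical and introduced only for illustration. The sketch assumes the header sequence is the Unicode byte order mark U+FEFF, and it assumes that little_endian also selects the input byte order when no header is found, which the description above states only for output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposed codecvt_mode, using the same values.
enum sketch_mode : unsigned {
    sketch_consume_header = 4,
    sketch_generate_header = 2,
    sketch_little_endian = 1
};

// Write 16-bit code units as bytes, honoring generate_header and little_endian.
std::string write_utf16_bytes(const std::vector<std::uint16_t>& units,
                              unsigned mode) {
    const bool le = (mode & sketch_little_endian) != 0;
    std::string out;
    auto put = [&](std::uint16_t u) {
        const char hi = static_cast<char>(u >> 8);
        const char lo = static_cast<char>(u & 0xFF);
        if (le) { out += lo; out += hi; } else { out += hi; out += lo; }
    };
    if (mode & sketch_generate_header)
        put(0xFEFF);  // assumed header sequence: the byte order mark
    for (std::uint16_t u : units)
        put(u);
    return out;
}

// Read bytes back into 16-bit code units, honoring consume_header.
std::vector<std::uint16_t> read_utf16_bytes(const std::string& bytes,
                                            unsigned mode) {
    // Assumption: the little_endian flag is the fallback when no header is seen.
    bool le = (mode & sketch_little_endian) != 0;
    std::size_t pos = 0;
    auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if ((mode & sketch_consume_header) && bytes.size() >= 2) {
        if (byte(0) == 0xFE && byte(1) == 0xFF) { le = false; pos = 2; }
        else if (byte(0) == 0xFF && byte(1) == 0xFE) { le = true; pos = 2; }
    }
    std::vector<std::uint16_t> units;
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned a = byte(pos), b = byte(pos + 1);
        units.push_back(static_cast<std::uint16_t>(le ? (b << 8 | a)
                                                      : (a << 8 | b)));
    }
    return units;
}

// Round trip: a little-endian stream with a header is read back correctly
// by a reader that only sets consume_header.
void mode_flags_example() {
    const std::vector<std::uint16_t> text = {0x0041, 0x20AC};  // "A", euro sign
    const std::string bytes = write_utf16_bytes(
        text, sketch_generate_header | sketch_little_endian);
    assert(read_utf16_bytes(bytes, sketch_consume_header) == text);
}
```

Because the writer advertises its byte order in the header, the reader in mode_flags_example does not need to know in advance that the stream is little-endian.
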
For the facet codecvt_utf8:

-- The facet converts between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
-- Endianness does not affect how multibyte sequences are read or written.
-- The multibyte sequence can be written as either a text or a binary file.

For the facet codecvt_utf16:

-- The facet converts between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
-- Endianness affects how multibyte sequences are read or written.
-- The multibyte sequence must be written as a binary file.

For the facet codecvt_utf8_utf16:

-- The facet converts between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.
-- Endianness does not affect how multibyte sequences are read or written.
-- The multibyte sequence can be written as either a text or a binary file.
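
The difference between codecvt_utf8 and codecvt_utf8_utf16 shows up for characters outside the Basic Multilingual Plane. The sketch below plugs both facets into wstring_convert from N2007. The helper names utf8_to_ucs4, ucs4_to_utf8, utf8_to_utf16, and facet_example are hypothetical. The example assumes a library that provides these facets in <codecvt> and wstring_convert in <locale>, as they were later standardized in C++11, and a toolchain on which the char16_t and char32_t instantiations are usable. Newer compilers may flag these components as deprecated.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UCS4: one char32_t element per character.
std::u32string utf8_to_ucs4(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// UCS4 -> UTF-8 bytes, the reverse direction of the same facet.
std::string ucs4_to_utf8(const std::u32string& text) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

// UTF-8 bytes -> UTF-16: one or two char16_t code units per character.
std::u16string utf8_to_utf16(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);
}

// U+1F600 needs four UTF-8 bytes, one UCS4 element, and two UTF-16 code units.
void facet_example() {
    const std::string utf8 = "\xF0\x9F\x98\x80";  // UTF-8 encoding of U+1F600

    const std::u32string ucs4 = utf8_to_ucs4(utf8);
    assert(ucs4.size() == 1 && ucs4[0] == 0x1F600);

    const std::u16string utf16 = utf8_to_utf16(utf8);
    assert(utf16.size() == 2);
    assert(utf16[0] == 0xD83D && utf16[1] == 0xDE00);  // surrogate pair

    assert(ucs4_to_utf8(ucs4) == utf8);  // round trip back to UTF-8
}
```

Neither helper sets a mode flag, consistent with the statement that endianness does not affect how these two facets read or write multibyte sequences.
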