Additional character types
==========================

Document: WG14 N969
Date:     2002-03-14


1. Introduction
===============

The Unicode Technical Committee suggested considering support for a
portable data type in the C/C++ language which is based on UTF-16
(see N959). Following the discussions during the WG14 meeting in
October 2001, a proposal for a new work item "Additional character
types" (N962) has been submitted to SC22. In WG14 there was consensus
to treat the questions of character data types in a sufficiently
general context.

In what follows, we proceed from simple to more advanced solutions.
Much is based on ideas which came up during the discussions on the
WG14 reflector between October 2000 and October 2001.


1.1. Remarks on the data types
------------------------------

The C Standard provides the minimum-width integer types uint_least8_t,
uint_least16_t, uint_least32_t, uint_least64_t and their signed
counterparts (7.18.1.2). Using typedef, new character types can be
introduced.

The character constant 'c' has the type int, and L'c' has the type
wchar_t (6.4.4.4p11).

The type wint_t is used in the functions of <wctype.h>, in the I/O
functions of <wchar.h> and in btowc and wctob. As long as
corresponding functions for the new character types are not defined,
there is no need to define any counterparts of wint_t. However, to
facilitate implementations of such functions, defining these types
must be considered; they should be the promoted types of the
corresponding character types.

The types char and wchar_t may be signed or unsigned. However, signed
character types are a frequent source of errors when they are subject
to default argument promotions. Thus we find unsigned character types
preferable. In some of the proposals below, new character types may
have the same size as wchar_t; in these cases we have to take into
account that wchar_t may be signed.

In some implementations the signedness of char depends on compiler
options.
It may be a source of confusion if typedefs for new character types
depend on the same options.

In C++, support of overloading may be desirable. This would require
built-in types rather than typedefs. On the other hand, the aim may be
a facility to use string literals in order to initialize integer
arrays of quite general (existing) types.


1.2. Usage of string and character literals
-------------------------------------------

String literals can occur in three different contexts:

- when a pointer to the character type is expected;
- when initializing an array of the character type;
- as the argument of the sizeof operator.

Constant expressions in #if directives may contain character and wide
character constants. This feature seems unimportant for new character
types.


2. Proposals for character types and literals
=============================================

2.1. Simple UTF-16 as in the proposal by the Unicode Technical Committee
------------------------------------------------------------------------

A new type

    typedef uint_least16_t utf16_t;

is introduced. For string literals we propose to use a one-letter
prefix, similar to the notation L"str" for wchar_t literals, e.g.

    u"str"

The literal is used to initialize an array of utf16_t. The
corresponding character constants, which have the type utf16_t, are

    u'c'

During translation, there will be a conversion from the source
character set to UTF-16. This conversion should not pose great
difficulties because the C Standard already assumes knowledge of
universal character names as specified by ISO/IEC 10646. (The
universal character names \Unnnnnnnn and \unnnn are converted to
characters of the execution character set.)

For members of the basic character set and for universal character
names the conversion to UTF-16 is defined unambiguously. For other
members of the source character set, the conversion will be
implementation-defined and may depend on the current locale, similar
to subclause 6.4.5p5.
This seems to be sufficient because the literals often contain only
members of the basic character set while they must be of the same type
as the characters processed by the application.


2.2. Extension to UTF-32
------------------------

The previous proposal can easily be extended to support UTF-32 as
well, using the type

    typedef uint_least32_t utf32_t;

The encoding of wchar_t is implementation-defined and need not be
based on Unicode. If wchar_t has the same size as uint_least32_t, it
seems appropriate that utf32_t is the same type as wchar_t. Then some
of the functions in <wchar.h> can be used for utf32_t.

A possible notation for string literals and character constants is

    U"str"  and  U'c'

The prefixes correspond to the notation for universal character names.

Compared with the previous proposal, this proposal has the advantage
that UTF-16 and UTF-32 are considered as equally justified encoding
forms.


2.3. Extension to UTF-8
-----------------------

It may be desirable to support UTF-8 while at the same time the usual
execution character set for the type char is a non-Unicode encoding.
In particular, this can be useful in environments where the default
execution character set is not based on ASCII, but, e.g., on EBCDIC.
The type may be

    typedef uint_least8_t utf8_t;

For UTF-8 literals, a special prefix or a built-in macro is needed,
and a conversion from the source character set to UTF-8 will be done.

This proposal can be extended to support string and character literals
of the type char, provided that char has at least 8 bits (otherwise it
is unsuitable for UTF-8). Then many of the char-type standard library
functions can be used.


2.4. Let the encoding be implementation-defined
-----------------------------------------------

In this proposal, just the types and the syntax for literals are
specified, but not the character set. The types may be

    typedef uint_least16_t char16_t;
    typedef uint_least32_t char32_t;

and possibly char8_t.
Literals may be written as

    l"String", l'c', LL"String", LL'c'.

The conversion from the source to the execution character set may be
controlled by compiler options and may be done by calling the iconv
function (which is outside the scope of the C Standard).

One of the original intentions was to improve portability, which is
not really achieved by this proposal.


2.5. Use a locale-dependent character set
-----------------------------------------

As in the previous proposal, the character set of the new types and
literals is not necessarily Unicode-based. The execution character set
used for the literals will depend on the locale at translation time,
similar to subclause 6.4.5p5. The types may be

    typedef uint_least16_t char16_t;
    typedef uint_least32_t char32_t;

One of these types may coincide with wchar_t, using the same execution
character set. Literals may be written as

    l"String", l'c', LL"String", LL'c'.

If wchar_t uses UCS-4 or UTF-32, char16_t will use UTF-16: First, the
literals are converted by the mbstowcs function as described in the
C Standard in 6.4.5p5. Then a conversion from UTF-32 to UTF-16 will be
done. If mbstowcs does not convert to Unicode, the execution character
set of the literals will be implementation-defined.

Currently there are platforms where it depends on the locale whether
wchar_t is Unicode or not. There may be locales where mbstowcs
converts from UTF-8 to UCS-4, or from a non-Unicode multibyte
character set to a non-Unicode wide character set, but the desired
conversion, e.g. from non-Unicode multibyte to Unicode, may be
missing.

Locales were invented as a mechanism to add country, region and
culture specific features without changing the application code or the
runtime libraries, usually even without recompiling them. However, we
find that it is not practical to expect everyone who needs a specific
character set for literals to create their own locales (which would
then be used during translation).
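
The second conversion step described above (UTF-32 to UTF-16) can be
sketched in C as follows. This is only an illustration under stated
assumptions: the function name utf32_to_utf16 and its error convention
are hypothetical and not part of any proposal; the typedef names are
those used in sections 2.1 and 2.2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Typedef names as in sections 2.1 and 2.2 of this paper. */
typedef uint_least16_t utf16_t;
typedef uint_least32_t utf32_t;

/*
 * Hypothetical helper (not part of any proposal): convert n UTF-32 code
 * units in src to UTF-16 code units in dst, which has room for cap units.
 * Returns the number of UTF-16 units written, or (size_t)-1 if src holds
 * an invalid code point or dst is too small.
 */
size_t utf32_to_utf16(const utf32_t *src, size_t n, utf16_t *dst, size_t cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        utf32_t c = src[i];

        /* Reject values beyond U+10FFFF and lone surrogate code points. */
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return (size_t)-1;

        if (c < 0x10000) {
            /* Basic Multilingual Plane: one code unit, value unchanged. */
            if (cap - out < 1)
                return (size_t)-1;
            dst[out++] = (utf16_t)c;
        } else {
            /* Supplementary planes: split 20 bits into a surrogate pair. */
            if (cap - out < 2)
                return (size_t)-1;
            c -= 0x10000;
            dst[out++] = (utf16_t)(0xD800 + (c >> 10));
            dst[out++] = (utf16_t)(0xDC00 + (c & 0x3FF));
        }
    }
    return out;
}
```

For example, U+1D11E lies outside the Basic Multilingual Plane and
becomes the surrogate pair 0xD834 0xDD1E, so a UTF-16 literal may need
more array elements than the corresponding UTF-32 literal.
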

In a really general approach, the source character set would be
determined by the locale at translation time while the execution
character set could be chosen independently.


2.6. A type-generic approach
----------------------------

C++ provides templates to support generic programming. This makes it
possible to implement string processing functions for almost arbitrary
character types. (The C++ standard specifies the template class
basic_string with specializations for the types char and wchar_t.)

A corresponding approach for literals is to use built-in macros like

    __ustr( "str", utf32_t )
    __uchr( 'c', utf32_t )

where, in addition to utf32_t, at least the types utf16_t and utf8_t
would be supported, as well as char if the latter has at least 8 bits.
The first argument is converted to UTF-32, UTF-16, or UTF-8,
respectively.


2.7. Cover arbitrary execution character sets
---------------------------------------------

In the notation

    __str_lit( "str", utf32_t, "UTF-32" )
    __chr_lit( 'c', utf32_t, "UTF-32" )

the second argument specifies the type and can be any of the
minimum-width integer types, any of the standard integer types, or
char. The third argument specifies the character set. A function like
iconv will be used to convert the first argument. (iconv is not part
of the C Standard.)

It is the responsibility of the programmer to make sure that the
specified character set is actually supported and that the converted
literal can reasonably be represented as an array of the specified
type.

For less frequently used character sets, the behavior of iconv at
translation time may be different from the behavior of iconv or
mbstowcs at execution time, leading to unexpected results.

Since the C Standard does not provide any function like iconv, it
remains to be decided whether such a general approach as in this
proposal is appropriate. It seems to be sufficient to require that
this technique be implemented for certain specified pairs of types and
encodings.
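
For comparison, the kind of conversion the translator would perform
can be tried at execution time on systems that provide the POSIX iconv
interface. The sketch below is an illustration only, not part of the
proposal: the helper name convert_encoding is hypothetical, and it
assumes that the local iconv accepts the encoding names passed to it
(here "UTF-8" and "UTF-16BE", which are common but not guaranteed).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: convert the NUL-terminated string src, encoded as
 * `from`, into the encoding `to`, writing at most cap bytes to dst.
 * Returns the number of bytes written, or (size_t)-1 if the pair of
 * encodings is unsupported, the input is invalid, or dst is too small.
 */
size_t convert_encoding(const char *to, const char *from,
                        const char *src, unsigned char *dst, size_t cap)
{
    assert(to != NULL && from != NULL && src != NULL && dst != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* this pair is not supported */

    char *in = (char *)src;             /* iconv does not modify the input */
    size_t in_left = strlen(src);
    char *out = (char *)dst;
    size_t out_left = cap;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out, &out_left); /* flush shift state */
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : cap - out_left;
}
```

Naming the byte order in the target encoding, as in "UTF-16BE", fixes
the order of the output bytes and avoids a leading byte order mark, so
the result does not depend on the machine performing the conversion.
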

A header file may provide macros like

    #define UTF16(x) __str_lit( x, utf16_t, "UTF-16" )

This proposal would also allow non-native endianness by specifying it
explicitly, e.g. "UTF-16BE" or "UTF-16LE".


3. Library functions
====================

During the translation, a conversion from the source character set,
which usually is the multibyte character set of the active locale, to
some execution character set is performed. It may be desirable to
offer the same conversion in the runtime library.

In a type-generic approach, there would be some limitations. (C++
would offer templates.) For the most general case, it does not seem to
be reasonable to invent new functions with broadly the same
functionality as iconv.

Other library functions are deliberately not included in this paper.
(They can be implemented using the existing Standard C.)


References:
===========

Programming languages - C, ISO/IEC 9899:1999

Programming languages - C++, ISO/IEC 14882:1998

N959  2001-05-05/2001-10-08
      Proposal for a C/C++ language extension to support portable UTF-16

N962  2001-11-05
      Proposal for a work item "Additional character types"


Appendix:
=========

Some ideas and arguments that came up during the discussions, quoted
more or less verbatim (but far from complete):

Define data types for UTF-8, UTF-16 and UTF-32 ...
(SC22WG14.8186)

UTF-16 cannot be used as the coded character set for the multibyte
encoding
(SC22WG14.8213, cf. Subclause 5.2.1.2p1: The basic character set shall
be present and each character shall be encoded as a single byte. [...]
A byte with all bits zero shall not occur in the second or subsequent
bytes of a multibyte character.)

Some things that used to be single-character function calls should not
be so: in particular to_upper of (say) ß (sharp s) is "SS", which can
only be represented as two characters.
(SC22WG14.8221)

If you want library functions specifically for UTF-16, that can be
done without any changes to the C standard.
(SC22WG14.8470)

An option that makes wchar_t 16 bits wide is not satisfactory: For an
implementation that already conforms to ISO C (and that offers Unicode
through 32-bit wchar_t, UTF-32 wide character codes, and UTF-8
multibyte codes) supporting UTF-16 in this manner would require an
"alternate ABI" compilation option and an additional set of UTF-16
library routines having the same *wc* names as their existing library
routines for 32-bit wchar_t.
(SC22WG14.8493)

I think compound-literal initializers could be part of a solution, as
well as potentially useful for other purposes, not just ad hoc. It
would be more convenient if we could find a way to allow a string
literal (of either form) in initializers of arrays of types other than
char and wchar_t, including utf16_t:

    static utf16_t message[] = { L"whatever\U1234\n" };

(SC22WG14.8499)

A general string literal conversion mechanism is too complicated for
anyone to consider.
(SC22WG14.8774)

I think the point of using a string to identify the encoding is to
leave it open-ended at the lowest level, which is a good thing. [...]
Using a string opens up the possibility of having mappings provided by
an external vendor - e.g. using a system() command to pipe the source
characters through a separate iconv-like program provided by the
vendor, or by calling an entry point of that name in a shared library
provided by the vendor.
(SC22WG14.8796)

...

    #define UTF32(x) __ustr(x, utf32_t, "UTF-32");

__ustr() would need to be builtin to the translator because, among
other things, it needs to handle a typename (like the sizeof
operator).
(SC22WG14.8816)

Endianness might matter when processing Java byte-codes with its
16-bit UTF-16 chars which are encoded just one way (not always the
native way on a machine).

Several people have pointed out that the C library (API) does not meet
their needs for character processing, so doing a new API based upon
UTF-16 but with the same functionality is not what they want.

Support for UTF-16 should not be required of all implementations,
e.g., make it optional like IEEE-754.

Need general facility that supports all of UTF-8/16/32 and UCS-2/4
along with big and little endian encodings.