SC22/WG20 N830
From: Asmus Freytag [asmusf@ix.netcom.com]
Sent: Friday, May 04, 2001 5:02 PM

Proposal for a C/C++ language extension to support portable UTF-16

At the most recent International Unicode Conference, SAP and Fujitsu reported on the need for and experience with a C/C++ language extension to support portable programs using UTF-16. This document summarizes the issues and the proposed solution, and brings them to the attention of JTC1 SC22/WG20 for input into the relevant programming language committees. For a discussion of the Unicode Encoding Forms, see Appendix B.

Reasons why UTF-16 is a rational choice for processing code
-----------------------------------------------------------

UTF-16 is a common choice for processing code. It has the following advantages:

1) For average text it almost always uses 50% less space than UTF-32 (see Appendix A)
2) It is easy to migrate to from UCS-2 implementations
3) It is the platform character set of Java, OS-X, and Windows
4) It is supported by databases and middleware

However, it is difficult to write portable C and C++ programs supporting UTF-16, since it is neither a multibyte nor a wide character data type. While UTF-16 is a variable-length encoding, average text encoded in UTF-16 contains few if any pairs of UTF-16 code units. This is the opposite of multibyte codes (including UTF-8), where single-byte characters are the exception. As a result, the performance of UTF-16 as a processing code tends to be quite good. The performance of UTF-32 as a processing code for the same data may actually be worse, since the additional memory overhead means that cache limits will be exceeded more often and paging will occur more frequently. On systems whose processors penalize 16-bit aligned access but which have very large memories, this effect may be smaller. For these reasons, UTF-16 will remain a viable choice of processing code for large, portable applications.

Difficulties of writing portable programs supporting UTF-16
-----------------------------------------------------------

It is certainly possible to write portable C and C++ programs that use an unsigned, fixed 16-bit integral data type as the character type. However, doing so means that one can use neither literal strings nor the platform's runtime libraries. The need to duplicate the runtime libraries is annoying, but it may not be too significant an issue, especially for larger applications. On the other hand, existing larger applications contain thousands of literal strings, even after all user-visible strings have been externalized for internationalization. Lack of support for literal strings is therefore a significant barrier to porting existing programs. Supporting literal strings as UTF-16 is currently possible only by providing a custom compiler extension, or by reformulating them into arrays with static initializers, which degrades source code readability (a sketch follows the proposal below).

Proposed solution
-----------------

In addition to the wchar_t data type, support a utf16_t data type with the following features:

1) utf16_t is an unsigned, 16-bit quantity on all platforms
2) u"abc" is a literal that will be compiled into a string of utf16_t

For the runtime library function names, the prefix "uc" can be used, in analogy to the current use of "wc". Since the goal is to support the porting of large existing bodies of software to UTF-16, the recommendation is to provide all existing "wc" runtime functions in a "uc" version, even though more modern approaches to internationalization exist.
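To make the contrast concrete, here is a minimal sketch of the two ways of obtaining a UTF-16 literal. It is illustrative only: the array names and the string "Hello" are invented for the example, and utf16_t and the u-prefix literal are the proposal's own names, so the second form compiles only under a compiler implementing this extension.

    /* Today: the 16-bit type must be rolled by hand (unsigned short
       is assumed to be exactly 16 bits on the target platforms), and
       every literal must be spelled out as a static initializer. */
    typedef unsigned short utf16_t;

    static const utf16_t hello_initializer[] =
        { 0x0048, 0x0065, 0x006C, 0x006C, 0x006F, 0x0000 };  /* "Hello" */

    /* Proposed: the compiler provides utf16_t and u"" literals, so
       the same data is as readable as an ordinary string literal. */
    static const utf16_t hello_literal[] = u"Hello";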
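Similarly, a "uc" runtime function under the proposed naming convention might look as follows. ucslen is not an existing library function; it is sketched here, using the utf16_t type from the sketch above, as the analogue of wcslen.

    #include <stddef.h>

    /* Return the number of utf16_t code units (not characters)
       preceding the terminating zero, in analogy to wcslen(). */
    size_t ucslen(const utf16_t *s)
    {
        const utf16_t *p = s;
        while (*p != 0)
            ++p;
        return (size_t)(p - s);
    }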
Appendix A: Frequency of characters that require two UTF-16 code units
----------------------------------------------------------------------

For average text, UTF-16 uses 49.99% to 50% less space than UTF-32: UTF-32 spends four bytes on every character, while UTF-16 averages only fractionally more than two bytes, because characters that require two code units are so rare.

For any "average" Han text, clearly more than 99.99% of character tokens will be accounted for by the CJK Unified Ideographs block and CJK Unified Ideographs Extension A, both of which are encoded using single code units. CJK Extension B characters, which require two units, are going to be quite rare in regular text, except perhaps in special applications such as dictionaries. Even an estimate of "1 in 1,000" is not stretching things, by any means.

As for the rest of the supplementary characters requiring two code units, "average" text will never need to touch them. Only very unusual corpora, such as texts in historic scripts (e.g. Gothic), will make extensive use of them, and those unusual corpora are themselves quite likely to constitute less than 0.01% of text by bulk. Documents using formal mathematical notation will use the Mathematical Alphanumeric Symbols for a few percent of their characters, but except in a scientific publications database, such texts will rarely be the "average" text.

Appendix B: Unicode Encoding Forms
----------------------------------

The following has been adapted from the Technical Introduction (http://www.unicode.org/unicode/standard/principles.html).

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double-word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites. However, extensive processing in UTF-8 is expensive due to the variable length of characters encoded in UTF-8.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units (a short decoding sketch appears at the end of this appendix).

UTF-32 is popular where memory space is no concern but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (32 bits) of data for each character. For some common operations on text data it is necessary to consider sequences of Unicode characters (for example, a base letter followed by combining accents), which reduces the advantage of using a fixed-width encoding form.

For more information see Unicode Technical Report #17, Unicode Character Encoding Model, available at http://www.unicode.org/unicode/reports/tr17/
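To illustrate how pairs of 16-bit code units are used, the following sketch decodes one character from a UTF-16 buffer. The function name next_code_point is invented for the example, and the validation a real decoder would perform on ill-formed sequences is reduced to comments.

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the character starting at s[*i], advancing *i by one
       or two code units. Code points U+0000..U+FFFF occupy a single
       unit; U+10000..U+10FFFF are stored as a high surrogate
       (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF). */
    uint32_t next_code_point(const uint16_t *s, size_t *i)
    {
        uint16_t hi = s[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF) {
            uint16_t lo = s[(*i)++];  /* a real decoder would verify
                                         that lo is 0xDC00..0xDFFF */
            return 0x10000u + (((uint32_t)(hi - 0xD800) << 10)
                             | (uint32_t)(lo - 0xDC00));
        }
        return hi;  /* single-unit character; a real decoder would
                       reject an unpaired low surrogate here */
    }

End of document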