JTC1/SC22/WG14 N969


                 Additional character types
                 ==========================
Document: WG14 N969
Date:     2002-03-14

1. Introduction
===============

The Unicode Technical Committee has suggested that support be
considered for a portable data type, based on UTF-16, in the C and
C++ languages (see N959).

Following the discussions during the WG14 meeting in October 2001, a
proposal for a new work item "Additional character types" (N962) has
been submitted to SC22.  In WG14 there was consensus to treat the
questions of character data types in a sufficiently general context.
In what follows, we proceed from simple to more advanced solutions.
Much is based on ideas which came up during the discussions on the
WG14 reflector between October 2000 and October 2001.


1.1. Remarks on the data types
------------------------------

The C Standard provides the minimum-width integer types

  uint_least8_t, uint_least16_t, uint_least32_t, uint_least64_t

and their signed counterparts (7.18.1.2). Using typedef, new character
types can be introduced.

The character constant 'c' has the type int, and L'c' has the type
wchar_t (6.4.4.4p11).  The type wint_t is used in the functions of
<wctype.h>, in the I/O functions of <wchar.h> and in btowc and
wctob. As long as corresponding functions for the new character types
are not defined, there is no necessity to define any counterparts of
wint_t. However, to facilitate implementations of such functions,
defining these types must be considered; they should be the promoted
types of the corresponding character types.

The types char and wchar_t may be signed or not. However, signed
character types are a frequent source of errors when they are subject
to default argument promotions. Thus we find unsigned character types
preferable. In some of the proposals below, new character types may
have the same size as wchar_t; in these cases we have to take into
account that wchar_t may be signed.
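
The promotion problem can be illustrated with a short sketch. The
helper names below are hypothetical and not part of any proposal; each
one simply returns the value a callee would see after its argument has
been converted to int.

```c
#include <limits.h>

/* Hypothetical helpers: each returns its argument after the implicit
   conversion to int, i.e. the value seen after promotion. */
int promoted_from_signed(signed char c)
{
    return c;   /* a negative character code stays negative */
}

int promoted_from_unsigned(unsigned char c)
{
    return c;   /* always in the range 0..UCHAR_MAX */
}

/* Hypothetical check: can the promoted value index a table that has
   one entry per unsigned char value? */
int is_valid_table_index(int v)
{
    return v >= 0 && v <= UCHAR_MAX;
}
```

A character code with the high bit set passes the check only when it
was held in the unsigned type, which is the kind of error the paragraph
above has in mind.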

In some implementations the signedness of char depends on compiler
options. It may be a source of confusion if typedefs for new character
types depend on the same options.

In C++, support of overloading may be desirable. This would require
built-in types rather than typedefs. On the other hand, the aim may be
a facility to use string literals in order to initialize integer
arrays of quite general (existing) types.


1.2. Usage of string and character literals 
-------------------------------------------

String literals can occur in three different contexts:
- when a pointer to the character type is expected;
- when initializing an array of the character type;
- as the argument of the sizeof operator.
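
As an illustration, the three contexts can be shown with the existing
wide string literals; literals of any new character type would
presumably have to work in the same places. The function names are
hypothetical and serve only to isolate each context.

```c
#include <stddef.h>
#include <wchar.h>

/* Context 1: a pointer to the character type is expected. */
const wchar_t *wide_pointer(void)
{
    return L"abc";
}

/* Context 2: the literal initializes an array of the character type.
   The array size is taken from the literal, including the terminator. */
size_t wide_array_elements(void)
{
    wchar_t a[] = L"abc";
    return sizeof a / sizeof a[0];
}

/* Context 3: the literal is the operand of sizeof. */
size_t wide_literal_bytes(void)
{
    return sizeof L"abc";
}
```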

Constant expressions in #if directives may contain character and wide
character constants. This feature seems unimportant for new character
types.


2. Proposals for character types and literals
=============================================

2.1. Simple UTF-16 as in the proposal by the Unicode Technical Committee 
------------------------------------------------------------------------

A new type

  typedef  uint_least16_t  utf16_t;

is introduced. For string literals we propose to use a one-letter
prefix, similar to the notation L"str" for wchar_t literals, e.g.

  u"str"

The literal is used to initialize an array of utf16_t.  The
corresponding character constants, which have the type utf16_t, are

  u'c'

During translation, there will be a conversion from the source
character set to UTF-16. This conversion should not imply great
difficulties because the C Standard already assumes knowledge of
universal character names as specified by ISO/IEC 10646. (The
universal character names \Unnnnnnnn and \unnnn are converted to
characters of the execution character set.)

For members of the basic character set and for universal character
names the conversion to UTF-16 is defined unambiguously. For other
members of the source character set, the conversion will be
implementation-defined and may depend on the current locale, similar
to subclause 6.4.5p5. This seems to be sufficient: the literals
often contain only members of the basic character set, yet they must
be of the same type as the characters processed by the application.
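
To make the intended result concrete, the following sketch writes out
by hand the array that a literal such as u"A\u00E9\U00010400" would be
expected to initialize under this proposal. It uses no new syntax; the
names utf16_sample and utf16_units are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t utf16_t;   /* as proposed in this subclause */

/* Hand-written UTF-16 code units for: 'A', U+00E9, U+10400, terminator.
   U+10400 lies outside the BMP and therefore needs a surrogate pair. */
const utf16_t utf16_sample[] = { 0x0041, 0x00E9, 0xD801, 0xDC00, 0x0000 };

/* Hypothetical helper: counts code units up to the terminating zero. */
size_t utf16_units(const utf16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Three characters occupy four code units here, so the length of a
UTF-16 string in code units is not the number of characters.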


2.2. Extension to UTF-32
------------------------

The previous proposal can easily be extended to support UTF-32 also,
using the type

  typedef  uint_least32_t  utf32_t;

The encoding of wchar_t is implementation-defined and need not be
based on Unicode. If wchar_t has the same size as uint_least32_t, it
seems appropriate that utf32_t is the same type as wchar_t. Then some
of the functions in <wchar.h> can be used for utf32_t.

A possible notation for string literals and character constants is

  U"str"     and     U'c' .

The prefixes correspond to the notation for universal character names.

Compared with the previous proposal, this proposal has the advantage
that it treats UTF-16 and UTF-32 as encoding forms of equal
standing.


2.3. Extension to UTF-8
-----------------------

It may be desirable to support UTF-8 while at the same time the usual
execution character set for the type char is a non-Unicode encoding.
In particular, this can be useful in environments where the default
execution character set is not based on ASCII, but, e.g., on EBCDIC.

The type may be 

  typedef  uint_least8_t  utf8_t;

For UTF-8 literals, a special prefix or a built-in macro is needed,
and a conversion from the source character set to UTF-8 will be done.

This proposal can be extended to support string and character literals
of the type char, provided that char has at least 8 bits (otherwise it
is unsuitable for UTF-8). Then many of the char-type standard library
functions can be used.
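
The conversion of a single character to UTF-8 might look as follows.
This is only a sketch: the function name utf8_encode is hypothetical,
and the restriction to the range U+0000..U+10FFFF (at most four bytes,
no surrogates) is an assumption made here, not something stated in the
proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least8_t  utf8_t;    /* as proposed in this subclause */
typedef uint_least32_t utf32_t;   /* as proposed in 2.2 */

/* Hypothetical helper: encodes one code point as UTF-8.
   Returns the number of bytes written (1..4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(utf32_t cp, utf8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (utf8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (utf8_t)(0xC0 | (cp >> 6));
        out[1] = (utf8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (utf8_t)(0xE0 | (cp >> 12));
        out[1] = (utf8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (utf8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (utf8_t)(0xF0 | (cp >> 18));
        out[1] = (utf8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (utf8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (utf8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Because the function works on code point values rather than on
characters of type char, its result does not depend on whether the
execution character set is based on ASCII or on EBCDIC.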


2.4. Let the encoding be implementation-defined
-----------------------------------------------

In this proposal, just the types and the syntax for literals are
specified, but not the character set. The types may be

   typedef uint_least16_t  char16_t;
   typedef uint_least32_t  char32_t;

and possibly char8_t. Literals may be written as

   l"String",   l'c',
   LL"String",  LL'c'.

The conversion from the source to the execution character set may be
controlled by compiler options and may be done by calling the iconv
function (which is outside the scope of the C Standard).

One of the original intentions was to improve portability, which is 
not really achieved by this proposal.


2.5. Use a locale-dependent character set
-----------------------------------------

As in the previous proposal, the character set of the new types and
literals is not necessarily Unicode-based.  The execution character
set used for the literals will depend on the locale at translation
time, similar to subclause 6.4.5p5. The types may be

   typedef uint_least16_t  char16_t;
   typedef uint_least32_t  char32_t;

One of these types may coincide with wchar_t, using the same execution
character set. Literals may be written as

   l"String",   l'c',
   LL"String",  LL'c'.

If wchar_t uses UCS-4 or UTF-32, char16_t will use UTF-16: First, the
literals are converted by the mbstowcs function as described in the C
Standard in 6.4.5p5. Then a conversion from UTF-32 to UTF-16 will be
done.
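
The second step, from UTF-32 to UTF-16, can be sketched as below. The
function name utf16_encode is hypothetical, and the local typedefs
merely repeat the definitions given above so that the fragment is
self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t char16_t_;   /* stands in for char16_t above */
typedef uint_least32_t char32_t_;   /* stands in for char32_t above */

/* Hypothetical helper: converts one UTF-32 code point to UTF-16.
   Returns the number of code units written (1 or 2), or 0 if cp is
   a surrogate value or lies above U+10FFFF. */
size_t utf16_encode(char32_t_ cp, char16_t_ out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* not a valid code point */
    if (cp < 0x10000) {
        out[0] = (char16_t_)cp;         /* BMP: one code unit */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                  /* 20 bits remain */
        out[0] = (char16_t_)(0xD800 + (cp >> 10));    /* high surrogate */
        out[1] = (char16_t_)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

The typedef names carry a trailing underscore only to avoid clashing
with identifiers an implementation might already provide.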

If mbstowcs does not convert to Unicode, the execution character set
of the literals will be implementation-defined.

Currently there are platforms where it depends on the locale whether
wchar_t is Unicode or not.  There may be locales where mbstowcs
converts from UTF-8 to UCS-4, or from a non-Unicode multibyte
character set to a non-Unicode wide character set, but the desired
conversion, e.g. from non-Unicode multibyte to Unicode, may be
missing.

Locales were invented as a mechanism to add country, region and
culture specific features without changing the application code or the
runtime libraries, usually even without recompiling them.  However,
it is not practically feasible to expect everyone who needs a
specific character set for literals to create their own locales
(which would then be used during translation).

In a really general approach, the source character set would be
determined by the locale at translation time while the execution
character set could be chosen independently.


2.6. A type-generic approach
----------------------------

C++ provides templates to support generic programming.  This allows
string processing functions to be implemented for almost arbitrary
character types. (The C++ standard specifies the template class
basic_string with specializations for the types char and wchar_t.)

A corresponding approach for literals is to use built-in macros like

  __ustr( "str", utf32_t )
  __uchr( 'c',   utf32_t )

where, in addition to utf32_t, at least the types utf16_t and utf8_t
would be supported, as well as char if the latter has at least 8
bits. The first argument is converted to UTF-32, UTF-16, or UTF-8,
respectively.


2.7. Cover arbitrary execution character sets
---------------------------------------------

In the notation

  __str_lit( "str", utf32_t, "UTF-32" )
  __chr_lit( 'c',   utf32_t, "UTF-32" )

the second argument specifies the type and can be any of the
minimum-width integer types, any of the standard integer types, or
char.

The third argument specifies the character set. A function like iconv
will be used to convert the first argument. (iconv is not part of the C
Standard.)

It is the responsibility of the programmer to make sure that the
specified character set is actually supported and that the converted
literal can reasonably be represented as an array of the specified
type. For less frequently used character sets, the behavior of iconv
at translation time may be different from the behavior of iconv or
mbstowcs at execution time, leading to unexpected results.

Since the C Standard does not provide any function like iconv, it
remains to be decided whether such a general approach as in this
proposal is appropriate. It seems to be sufficient to require that
this technique be implemented for certain specified pairs of types and
encodings. A header file may provide macros like

  #define UTF16(x) __str_lit( x,  utf16_t, "UTF-16" )

This proposal would also allow non-native endianness by specifying it
explicitly, e.g. "UTF-16BE" or "UTF-16LE". 
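
What an explicit byte order means for the stored data can be sketched
as follows. The function utf16_to_bytes is hypothetical; it shows how
the same UTF-16 code units would be laid out as "UTF-16BE" or
"UTF-16LE", independently of the endianness of the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t utf16_t;   /* as proposed in 2.1 */

/* Hypothetical helper: writes n code units as 2*n bytes.
   big_endian != 0 selects UTF-16BE, otherwise UTF-16LE. */
void utf16_to_bytes(const utf16_t *units, size_t n, int big_endian,
                    unsigned char *out)
{
    assert(units != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)((units[i] >> 8) & 0xFF);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        out[2 * i]     = big_endian ? hi : lo;
        out[2 * i + 1] = big_endian ? lo : hi;
    }
}
```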


3. Library functions
====================

During the translation, a conversion from the source character set,
which usually is the multibyte character set of the active locale, to
some execution character set is performed. It may be desirable to
offer the same conversion in the runtime library.

In a type-generic approach, there would be some limitations.  (C++
would offer templates.)

For the most general case, it does not seem to be reasonable to invent
new functions with broadly the same functionality as iconv.
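
For comparison, a run-time conversion with the existing iconv
interface might be sketched as below. iconv is specified by POSIX, not
by the C Standard. The wrapper name convert_bytes is hypothetical, and
which encoding names (such as "UTF-8" or "UTF-16BE") are accepted
depends on the implementation.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: converts in_len bytes from encoding `from`
   to encoding `to`.  Returns the number of bytes written to out, or
   (size_t)-1 if the conversion is unsupported or fails. */
size_t convert_bytes(const char *to, const char *from,
                     const char *in, size_t in_len,
                     char *out, size_t out_cap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char  *src      = (char *)in;   /* iconv takes char **; input is only read */
    size_t in_left  = in_len;
    char  *dst      = out;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    if (rc != (size_t)-1)           /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1 || in_left != 0)
        return (size_t)-1;
    return out_cap - out_left;
}
```

A call such as convert_bytes("UTF-16BE", "UTF-8", s, n, buf, cap)
performs at run time the kind of conversion that the proposals above
request at translation time.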

Other library functions are deliberately not included in this paper.
(They can be implemented using the existing Standard C.)


References:
===========

Programming languages - C, ISO/IEC 9899:1999

Programming languages - C++, ISO/IEC 14882:1998

N959 2001-05-05/2001-10-08  Proposal for a C/C++ language extension to
support portable UTF-16

N962 2001-11-05  Proposal for a work item "Additional character types"


Appendix:
=========

Some ideas and arguments that came up during the discussions, 
quoted more or less verbatim (but far from complete):

Define data types for UTF-8, UTF-16 and UTF-32 ... (SC22WG14.8186)

UTF-16 cannot be used as the coded character set for the multibyte
encoding (SC22WG14.8213, cf. Subclause 5.2.1.2p1: The basic character
set shall be present and each character shall be encoded as a single
byte. [...] A byte with all bits zero shall not occur in the second or
subsequent bytes of a multibyte character.)

Some things that used to be single-character function calls should not
be so: in particular to_upper of (say) ß (sharp s) is "SS", which
can only be represented as two characters. (SC22WG14.8221)

If you want library functions specifically for UTF-16, that
can be done without any changes to the C standard. (SC22WG14.8470)

An option that makes wchar_t 16 bits wide is not satisfactory: For an
implementation that already conforms to ISO C (and that offers Unicode
through 32-bit wchar_t, UTF-32 wide character codes, and UTF-8
multibyte codes) supporting UTF-16 in this manner would require an
"alternate ABI" compilation option and an additional set of UTF-16
library routines having the same *wc* names as their existing library
routines for 32-bit wchar_t. (SC22WG14.8493)

I think compound-literal initializers could be part of a solution,
as well as potentially useful for other purposes, not just ad hoc.
It would be more convenient if we could find a way to allow a
string literal (of either form) in initializers of arrays of
types other than char and wchar_t, including utf16_t:
    static utf16_t message[] = { L"whatever\U1234\n" };
(SC22WG14.8499)

A general string literal conversion mechanism is too complicated for
anyone to consider. (SC22WG14.8774)

I think the point of using a string to identify the encoding is to
leave it open-ended at the lowest level, which is a good thing. [...]
Using a string opens up the possibility of having mappings provided by
an external vendor - e.g. using a system() command to pipe the source
characters through a separate iconv-like program provided by the
vendor, or by calling an entry point of that name in a shared library
provided by the vendor. (SC22WG14.8796)

...#define UTF32(x) __ustr(x,  utf32_t, "UTF-32");
__ustr() would need to be builtin to the translator because, among
other things, it needs to handle a typename (like the sizeof
operator). (SC22WG14.8816)

Endianness might matter when processing Java byte-codes with its
16-bit UTF-16 chars which are encoded just one way (not always the
native way on a machine).

Several people have pointed out that the C library (API) does not meet
their needs for character processing, so doing a new API based upon
UTF-16 but with the same functionality is not what they want.

Support for UTF-16 should not be required of all implementations,
e.g., make it optional like IEEE-754.

Need general facility that supports all of UTF-8/16/32 and UCS-2/4
along with big and little endian encodings.