Additional Character Data Types in the Programming Language C

WG14 Document: N977
Date: 2002-05-17

The WG14 meeting in April 2002 discussed document N969. During the discussion, the following basic criteria were considered important for outlining further discussions on additional character data types:

1. WG14 has a requirement from the US and the UTC to consider UTF-16 and UTF-32 support (N966).
2. It is essential for portability that the new data types guarantee a certain width.
3. It is desirable that the additional character data types are as generic as possible. Although the current requirement is for UTF-16 and UTF-32 support, the new data types must in principle cover other encodings as well.
4. String literals need to be specified for the new data types.
5. It is desirable that the encoding of the new data types is implementation independent.

There is no consensus yet on whether both UTF-16 and UTF-32 need to be supported or whether WG14 considers support for UTF-16 alone to be sufficient. Independent of the number of data types, there was a tendency to call the new data types char16_t and char32_t for the time being. The names merely suggest that the widths of the new data types are well defined; the encoding of those data types is still a subject of discussion. The syntax of string literals is also a subject of discussion. A couple of WG14 participants also suggested that none of the N969 proposals is generic enough to meet criterion 3 and to remain compatible with C++.

Based on these discussions, we need to continue to discuss:

How to specify the string literals for new data types: char16_t and char32_t
How to specify the encoding of char16_t, char32_t and their string literals.

1. How to specify the string literals for new data types

1.1. Simple approach with a prefix for literals

A one-letter prefix can be used, similar to the notation L"str" for wide string literals:
  u"str"
The literal is used to initialize an array of char16_t. The corresponding character constants are
  u'c'
and have the type char16_t.
This proposal can easily be extended to cover a 32-bit type, using char32_t and U"str" and U'c'.
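
The following is a minimal sketch of how the prefixed literals of this section would look in use. It assumes an implementation that accepts the u and U prefixes and provides char16_t and char32_t through a header named <uchar.h>, as C11 compilers do; the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* assumption: declares char16_t and char32_t (C11) */

/* String literals with the proposed prefixes initialize arrays of the
   corresponding type; the terminating zero is part of the array. */
static const char16_t s16[] = u"str";
static const char32_t s32[] = U"str";

/* Character constants with the same prefixes. */
static const char16_t c16 = u'c';
static const char32_t c32 = U'c';

/* Compile-time check: three characters plus the terminator. */
static_assert(sizeof s16 / sizeof s16[0] == 4, "u\"str\" has 4 elements");

/* Number of elements in each array, including the terminator. */
size_t s16_length(void) { return sizeof s16 / sizeof s16[0]; }
size_t s32_length(void) { return sizeof s32 / sizeof s32[0]; }
```

Only the element type differs between the two arrays; the element count is the same because every character of "str" fits in one unit of either type.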

1.2. Generic literals in C++

A concern arises that the approach with 1.1 is not generic enough when considering C/C++ compatibility. To illustrate the problem, let us look at the following example (a similar example was posted in the newsgroup comp.std.c++ on May 10, 2002, subject Re: C++0x Wish list, thread started on May 8, 2002).
     template<class CharT, class traits>
     int f(some_class<CharT, traits>& o)
     {
       /* ... */
       CharT digits []
         = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";  /* valid only if CharT is char */
       /* ... */
     }
The code compiles fine when CharT is char, but compilation breaks on the declaration of the digits array if CharT is wchar_t or anything else. One solution proposed in comp.std.c++ was to allow
  CharT d[] = <CharT>"0123...";
  CharT* p  = <CharT>"0123...";
and this can be extended to integral types other than char and wchar_t.

1.2.1. A variant of generic literals in C++

A possible variant of proposal 1.2 is to omit the angle brackets:
  CharT d[] = CharT "0123...";

1.2.2. A proposal based on compound literals

The syntax <char16_t>"str" can be problematic, as, e.g.,
  if ( ch<<char16_t>'c' )
shows: the two adjacent < characters are lexed as the shift operator <<. (Such problems are not new in C++.) A new syntax should be as close as possible to existing standard C, in order to keep the effort for code parsing and analysis tools at a reasonable level. For this reason, let us investigate a solution based on compound literals. Starting from compound literals was already proposed on the WG14 reflector in May 2001, see msg. 8499.
C99 allows compound literals, see Subclause 6.5.2.5. These are examples:
  typedef  uint_least16_t  char16_t;
  int f(char16_t *);
  size_t n = sizeof( (const char16_t []){ 's', 't', 'r', '\0' } );
  f( (char16_t []){ 's', 't', 'r', '\0' } );
It would be a small extension to allow
  char16_t a [] = (char16_t []){ 's', 't', 'r', '\0' };
(Actually gcc version 2.95.3 does allow this.) So the same syntax could be used in three different contexts: as an array initializer, as a pointer argument and as the operand of sizeof. A more convenient way to write the literal would be
  (char16_t []){ "str" } 
Note that this is fully type-generic. Looking more precisely at the encoding, (wchar_t []){ "str" } should be the same as
  (wchar_t []){ L's', L't', L'r', L'\0' };
and for char16_t the rules of Section 1.2 may be applied.
Note: There are some differences between compound literals and conventional string literals. Compound literals have automatic storage duration when they occur within the body of a function. Further, in the above example we cannot call f with a const-qualified compound literal, whereas a string literal can be assigned to a pointer to char, although string literals are not required to be modifiable.
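
The two uses of compound literals that C99 already permits can be sketched as follows. This is an illustration only: the typedef is the one from the example above, and the helper names str16_len, demo_pointer_argument and demo_sizeof_elements are hypothetical. The proposed forms, initializing an array from a compound literal and writing (char16_t []){ "str" }, are not standard C99 and are therefore not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t char16_t;   /* the typedef used in the text above */

/* Counts the elements before the terminating zero. */
size_t str16_len(const char16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Passes a compound literal as a pointer argument. Inside a function the
   unnamed array has automatic storage duration, so it is only valid until
   the enclosing block ends. */
size_t demo_pointer_argument(void)
{
    return str16_len((char16_t []){ 's', 't', 'r', '\0' });
}

/* Uses a compound literal as the operand of sizeof and converts the
   size in bytes to a number of elements. */
size_t demo_sizeof_elements(void)
{
    return sizeof((const char16_t []){ 's', 't', 'r', '\0' })
           / sizeof(char16_t);
}
```

Because str16_len takes a pointer to const, it accepts both the plain and the const-qualified compound literal; the function f in the text, which takes char16_t *, would accept only the former.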

2. How to specify the Encoding of new data types

C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.

A possible suggestion is that char16_t literals use UTF-16 if an implementation defines the macro __STDC_ISO_10646__. (Implementers may support additional encodings for char16_t, and compiler options or #pragmas to activate them.) There shall be a macro __STDC_UTF_16__ (or similar) to indicate that char16_t uses UTF-16. This also allows UTF-16 to be used in char16_t while wchar_t uses a non-Unicode encoding. In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\Unnnnnnnn and \unnnn), because for these the conversion to UTF-16 is defined unambiguously.

The encoding of char32_t can be defined in the same manner using __STDC_UTF_32__.
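
A program could test the suggested macros roughly as sketched below. The sketch assumes an implementation with the u prefix and a <uchar.h> header (as in C11, which defines macros with these names); the function names and the array beyond_bmp are illustrative.

```c
#include <stddef.h>
#include <uchar.h>   /* assumption: declares char16_t and char32_t (C11) */

/* Returns 1 if the implementation announces UTF-16 for char16_t. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;   /* encoding of char16_t is implementation defined */
#endif
}

/* Returns 1 if the implementation announces UTF-32 for char32_t. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;   /* encoding of char32_t is implementation defined */
#endif
}

/* U+10000 is written as a universal character name, so its conversion
   to UTF-16 is unambiguous: it needs a surrogate pair. */
static const char16_t beyond_bmp[] = u"\U00010000";

/* Number of char16_t units used for U+10000, excluding the terminator. */
size_t beyond_bmp_units(void)
{
    return sizeof beyond_bmp / sizeof beyond_bmp[0] - 1;
}
```

When char16_is_utf16() returns 1, the array holds the two code units 0xD800 and 0xDC00 followed by the terminating zero; otherwise its contents depend on the implementation.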

Currently there are platforms where it depends on the locale whether wchar_t is Unicode or not. If mbstowcs converts to UCS-4 (i.e. 4 bytes per character), a subsequent conversion to UTF-16 is easy to perform. The values of character constants (u'c') are limited to the Basic Multilingual Plane of Unicode, i.e. the values representable with 16 bits.
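
The conversion from UCS-4 to UTF-16 mentioned above can be sketched as follows. The function name ucs4_to_utf16 and its interface are assumptions made for this illustration; the arithmetic is the standard UTF-16 surrogate encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one UCS-4 code point as UTF-16.
   Writes 1 or 2 code units to out and returns that count.
   Returns 0, leaving out untouched, if cp has no UTF-16 form. */
size_t ucs4_to_utf16(uint_least32_t cp, uint_least16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogate range is reserved */

    if (cp <= 0xFFFF) {                 /* Basic Multilingual Plane */
        out[0] = (uint_least16_t)cp;
        return 1;
    }

    if (cp <= 0x10FFFF) {               /* supplementary planes */
        cp -= 0x10000;                  /* 20 bits remain */
        out[0] = (uint_least16_t)(0xD800 + (cp >> 10));    /* high surrogate */
        out[1] = (uint_least16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        return 2;
    }

    return 0;                           /* beyond the UTF-16 range */
}
```

A return value of 2 is exactly the case that a single character constant such as u'c' cannot express, which is why its values are limited to 16 bits.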

The encoding of the new data types and their string literals becomes implementation defined when the macro __STDC_UTF_nn__ is not defined.

3. Pros, cons and further remarks

The simple approach with a one-letter prefix for string literals appears to be quite natural in C. The C++ evolution group of WG21 is going to do some work on literals, including literals for certain classes. C and C++ are languages that allow source code to be kept concise, e.g. by operator overloading. It might be possible to take advantage of all suggested solutions by introducing some abbreviations. String literals and character constants of the types mentioned can be written with no prefix or prefixed with u, U or L, respectively:

       <char>"str" is an alias of "str"
   <char16_t>"str" is an alias of u"str"
   <char32_t>"str" is an alias of U"str"
    <wchar_t>"str" is an alias of L"str"
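
The alias idea can be approximated in C with a macro, as sketched below. This is not the syntax proposed in this paper: the macro LIT is hypothetical, and the sketch relies on the C11 features _Generic, <uchar.h> and the u and U prefixes. wchar_t is left out because on some implementations it is the same type as char16_t or char32_t, which a generic selection does not permit.

```c
#include <uchar.h>   /* assumption: declares char16_t and char32_t (C11) */

/* Hypothetical macro: picks the spelling of the literal from a type name.
   The ## operator pastes the prefix onto the string literal token, so
   LIT(char16_t, "str") selects u"str". */
#define LIT(T, s) _Generic((T)0, \
        char:     s,             \
        char16_t: u##s,          \
        char32_t: U##s)

/* Each function returns a pointer to the literal of the named type. */
const char     *lit_char(void)   { return LIT(char, "str"); }
const char16_t *lit_char16(void) { return LIT(char16_t, "str"); }
const char32_t *lit_char32(void) { return LIT(char32_t, "str"); }
```

Unlike the type-generic syntax discussed above, the macro accepts only the types listed in the selection, which matches the suggestion to restrict such literals to character types.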

The type-generic solution in principle allows string literals to be written for any data type. It seems sensible to exclude non-character data types, since they impose a burden on implementers and lead to error-prone implementations.

The suggestions described here seem to fulfill most of the criteria set out at the beginning. A minor shortcoming is that the encoding of the new data types is implementation dependent when the __STDC_UTF_nn__ macros are not defined. This shortcoming can be addressed later, when the requirements on those encodings are available.

Last modified: Fri May 17 13:43:20 METDST 2002