New Character Types in C++

ISO/IEC JTC1 SC22 WG21 N2018 = 06-0088 - 2006-06-20

Lawrence Crowl

Problem

Many users of C++ need to manipulate Unicode character strings. Unfortunately, C++ provides no standard means to do so.

Solution

The ISO C committee has addressed this issue extensively. See ISO/IEC TR 19769:2004 "Extensions for the programming language C to support new character data types" as described in draft report ISO/IEC JTC1 SC22 WG14 N1040 at http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf.

This proposal adopts their work, but with those changes necessary for effective use within C++. In particular, we propose new types to support overloading.

A separate proposal will address specializations for numeric_limits, character traits, basic strings, streams, and insertion operations.

References

See section 2.5 "Encoding Forms" in

The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)
The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode4.0.0/.

See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.

See the Unicode Consortium FAQ "UTF-8, UTF-16, UTF-32 & BOM".

Summary of ISO/IEC TR 19769 (WG14 N1040)

The document ISO/IEC TR 19769 (WG14 N1040) provides motivation, new typedefs for the (at least) 16-bit and (at least) 32-bit character types, macros for reporting ISO 10646 encoding, character and string literals, mixed string concatenation, four library functions, and a new header with appropriate declarations.

Summary of Changes to the C Proposal for C++

The document ISO/IEC TR 19769 (WG14 N1040) can be adopted with few changes. Further changes are possible, but this proposal keeps them to a minimum to ensure maximum interoperability. Specifically, the changes are:

Define new primitive types.

Define char16_t to be a typedef to a distinct new type, named _Char16_t, that has the same size and representation as uint_least16_t. Likewise, define char32_t to be a typedef to a distinct new type, named _Char32_t, that has the same size and representation as uint_least32_t.

[N1040 defined char16_t and char32_t as typedefs to uint_least16_t and uint_least32_t, which makes overloading on these character types impossible.]

[The proposal has this typedef indirection rather than making char16_t and char32_t new keywords because:

The standard library headers should use the non-typedef names.]
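The overloading motivation can be sketched as follows. This is a minimal illustration, assuming a compiler that provides char16_t and char32_t as distinct built-in types (as C++11 and later do) rather than the _Char16_t and _Char32_t spellings proposed here. The Kind enumeration and the kind functions are hypothetical names introduced only for this example, and the sketch assumes a platform where uint_least16_t and uint_least32_t are different types.

```cpp
#include <cstdint>

// Hypothetical tag type used only to report which overload was chosen.
enum class Kind { Char, WChar, Char16, Char32, Least16, Least32 };

// One overload per type. With plain typedefs, as in N1040, the char16_t and
// uint_least16_t overloads would be redefinitions of the same function.
constexpr Kind kind(char)                { return Kind::Char; }
constexpr Kind kind(wchar_t)             { return Kind::WChar; }
constexpr Kind kind(char16_t)            { return Kind::Char16; }
constexpr Kind kind(char32_t)            { return Kind::Char32; }
constexpr Kind kind(std::uint_least16_t) { return Kind::Least16; }
constexpr Kind kind(std::uint_least32_t) { return Kind::Least32; }

static_assert(kind(u'a') == Kind::Char16, "u'a' selects the char16_t overload");
static_assert(kind(U'a') == Kind::Char32, "U'a' selects the char32_t overload");
static_assert(kind(std::uint_least16_t(97)) == Kind::Least16,
              "the integer type keeps its own overload");
```

Because the character types are distinct, a library can treat a 16-bit character differently from a 16-bit integer, for example when inserting into a stream.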

Add C++-specific headers.

Add a new C++ header <cuchar> corresponding to the new C header <uchar.h>.

Clarify literals.

Clarify the handling of universal character names that do not fit within char16_t. In particular, the interaction with ISO 10646 UTF-16 is underspecified in the C proposal.

Changes to the C++ Standard

2.11 Keywords

To "Table 3 -- keywords", add _Char16_t and _Char32_t.

2.13.2 Character literals

To the grammar, add

character-literal:
u' c-char-sequence '
U' c-char-sequence '

To paragraph 1, replace

optionally preceded by the letter L, as in L'x'
with
optionally preceded by one of the letters L, u, or U, as in L'x', u'y', U'z', respectively

To paragraph 2, add

A character literal that begins with the letter u, such as u'y', is a character literal of type _Char16_t. The value of a _Char16_t literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution _Char16_t character set, provided that the encoding is representable within 16 bits. If the value is not representable within 16 bits, the program is ill-formed. A _Char16_t literal containing multiple c-chars is ill-formed. It is implementation-defined whether literals may contain more than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn).

To paragraph 2, add

A character literal that begins with the letter U, such as U'z', is a character literal of type _Char32_t. The value of a _Char32_t literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution _Char32_t character set, provided that the encoding is representable within 32 bits. If the value is not representable within 32 bits, the program is ill-formed. A _Char32_t literal containing multiple c-chars is ill-formed. It is implementation-defined whether literals may contain more than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn).

In paragraph 4, replace

range defined for char (for ordinary literals) or wchar_t (for wide literals)
with
range defined for char (for ordinary literals), _Char16_t (for at-least-16-bit literals), _Char32_t (for at-least-32-bit literals), or wchar_t (for wide literals)
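The literal rules above can be checked at compile time. The sketch below assumes a compiler that spells the new types char16_t and char32_t (as C++11 and later do) and that encodes them as UTF-16 and UTF-32, so that a universal character name yields its ISO 10646 code point. The constant names are illustrative only.

```cpp
#include <type_traits>

// The prefix alone determines the type of the literal.
static_assert(std::is_same<decltype('w'),  char>::value,     "no prefix");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,  "L prefix");
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "u prefix");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "U prefix");

// Universal character names inside the new literals.
constexpr char16_t e_acute = u'\u00E9';      // representable within 16 bits
constexpr char32_t g_clef  = U'\U0001D11E';  // needs more than 16 bits

// Under the wording above, the following would be ill-formed because the
// value is not representable within 16 bits:
// constexpr char16_t bad = u'\U0001D11E';

static_assert(e_acute == 0x00E9, "value is the code point");
static_assert(g_clef == 0x1D11E, "value is the code point");
```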

2.13.4 String literals

To the grammar, add

string-literal:
u" c-char-sequenceopt "
U" c-char-sequenceopt "

To paragraph 1, replace

optionally beginning with the letter L, as in "..." or L"..."
with
optionally beginning with one of the letters L, u, or U, as in "...", L"...", u"...", or U"...", respectively

To paragraph 1, append

A string literal that begins with u, such as u"asdf", is a _Char16_t string literal. A _Char16_t string literal has type array of n const _Char16_t and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters. It is implementation-defined whether literals may contain more than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn). When the macro __STDC_UTF_16__ is defined (see 21.5.2 The __STDC_UTF_16__ and __STDC_UTF_32__ macros), a single c-char may produce more than one _Char16_t. A string literal that begins with U, such as U"asdf", is a _Char32_t string literal. A _Char32_t string literal has type array of n const _Char32_t and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters. It is implementation-defined whether literals may contain more than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn).
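As a brief illustration of the array types described above, the following sketch inspects the new string literals at compile time. It assumes the C++11 spellings char16_t and char32_t in place of _Char16_t and _Char32_t, and a UTF-16 encoding for u"..." literals (the __STDC_UTF_16__ case).

```cpp
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype(u"asdf"), const char16_t (&)[5]>::value,
              "u\"asdf\" is an array of 5 const char16_t");
static_assert(std::is_same<decltype(U"asdf"), const char32_t (&)[5]>::value,
              "U\"asdf\" is an array of 5 const char32_t");

// With UTF-16, one c-char outside the Basic Multilingual Plane produces two
// char16_t elements (a surrogate pair), plus the terminating null.
static_assert(sizeof(u"\U0001D11E") / sizeof(char16_t) == 3,
              "surrogate pair plus terminator");
static_assert(sizeof(U"\U0001D11E") / sizeof(char32_t) == 2,
              "one char32_t plus terminator");
```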

In paragraph 3, replace

In translation phase 6 (2.1), adjacent narrow string literals are concatenated and adjacent wide string literals are concatenated. If a narrow string literal token is adjacent to a wide string literal token, the behavior is undefined.
with
In translation phase 6 (2.1), adjacent string literals are concatenated. If both string literals have the same prefix, the resulting concatenated string literal has that prefix. If one string literal has no prefix, it is treated as a string literal of the same prefix as the other operand. Any other concatenations have conditionally supported behavior. Note that this concatenation is an interpretation, not a conversion. [Example: Here are some examples of valid concatenations:
source       means     source       means     source       means
u"a" u"b"    u"ab"     U"a" U"b"    U"ab"     L"a" L"b"    L"ab"
u"a" "b"     u"ab"     U"a" "b"     U"ab"     L"a" "b"     L"ab"
"a" u"b"     u"ab"     "a" U"b"     U"ab"     "a" L"b"     L"ab"
]
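The concatenation rule can be exercised directly. This is a small sketch, assuming the C++11 spellings char16_t and char32_t; the array names are illustrative.

```cpp
#include <type_traits>

// An unprefixed literal next to a prefixed one takes on that prefix.
constexpr char16_t ab16[] = u"a" "b";   // means u"ab"
constexpr char32_t ab32[] = "a" U"b";   // means U"ab"

static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "the result is a char16_t string literal");
static_assert(ab16[0] == u'a' && ab16[1] == u'b' && ab16[2] == 0,
              "contents of u\"ab\" plus terminator");
static_assert(ab32[0] == U'a' && ab32[1] == U'b' && ab32[2] == 0,
              "contents of U\"ab\" plus terminator");

// Mixing two different prefixes, such as u"a" L"b", is only conditionally
// supported under the wording above and is therefore not shown here.
```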

In paragraph 5, replace

The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L'\0'.
with
The size of a wide, _Char16_t, or _Char32_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L'\0', u'\0', or U'\0', respectively. The universal-character-names must be representable by the type of the literal.
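The counting rule can be illustrated with a small helper. The function literal_size is a hypothetical name introduced for this sketch, which again assumes the C++11 spellings char16_t and char32_t.

```cpp
#include <cstddef>

// Hypothetical helper: returns the number of elements in an array, which for
// a string literal is its size including the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) { return N; }

// One ordinary character, one escape sequence, one universal-character-name,
// plus one for the terminator.
static_assert(literal_size(u"a\n\u00E9") == 4, "3 elements plus terminator");
static_assert(literal_size(U"a\n\u00E9") == 4, "3 elements plus terminator");
static_assert(literal_size(L"") == 1, "empty literal holds only the terminator");
```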

3.9.1 Fundamental Types

At the end of paragraph 5, add

Types _Char16_t and _Char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <stdint.h>, called the underlying types.

The <stdint.h> header is from ISO C as proposed in document WG21 N1835 = 05-0095, and subsequently adopted into ISO/IEC TR 19768: C++ Library Extensions TR1.
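The relationship to the underlying types can be stated as compile-time checks. This sketch assumes the C++11 spellings char16_t and char32_t in place of _Char16_t and _Char32_t, and uses <cstdint> for the underlying types.

```cpp
#include <cstdint>
#include <type_traits>

// Same size, alignment, and signedness as the underlying type ...
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(alignof(char16_t) == alignof(std::uint_least16_t), "same alignment");
static_assert(std::is_unsigned<char16_t>::value ==
              std::is_unsigned<std::uint_least16_t>::value, "same signedness");

static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");
static_assert(alignof(char32_t) == alignof(std::uint_least32_t), "same alignment");
static_assert(std::is_unsigned<char32_t>::value ==
              std::is_unsigned<std::uint_least32_t>::value, "same signedness");

// ... but nevertheless a distinct type.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "distinct");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "distinct");
```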

In paragraph 7, append

_Char16_t, _Char32_t,
to
wchar_t,

4.2 Array-to-pointer conversion

In paragraph 2, replace

A string literal (2.13.4) that is not a wide string literal can be converted to an rvalue of type pointer to char; a wide string literal can be converted to an rvalue of type pointer to wchar_t. In either case
with
A string literal (2.13.4) with no prefix, L prefix, u prefix, or U prefix can be converted to an rvalue of type pointer to char, pointer to wchar_t, pointer to _Char16_t, or pointer to _Char32_t, respectively. In any case

4.5 Integral promotions

In paragraph 2, append

, _Char16_t, or _Char32_t,
to
wchar_t
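The effect of adding the new types to the integral promotions can be observed with unary plus. A minimal sketch, assuming the C++11 spellings char16_t and char32_t; the alias names are illustrative, and the exact promoted types depend on the platform, so they are only described in comments.

```cpp
#include <type_traits>

// Unary + applies the integral promotions without changing the value.
using Promoted16 = decltype(+u'a');
using Promoted32 = decltype(+U'a');

// Like wchar_t, the new character types never stay themselves in arithmetic.
static_assert(!std::is_same<Promoted16, char16_t>::value, "char16_t is promoted");
static_assert(!std::is_same<Promoted32, char32_t>::value, "char32_t is promoted");

// The promoted type is int or a wider integer type. On a platform with a
// 32-bit int this is typically int for char16_t and unsigned int for
// char32_t, but that is an assumption about the platform, not a guarantee.
static_assert(sizeof(Promoted16) >= sizeof(int), "promoted to at least int");
static_assert(sizeof(Promoted32) >= sizeof(char32_t), "no values are lost");
```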

5 Expressions

In footnote 54, append

_Char16_t, _Char32_t,
to
wchar_t,

5.3.3 Sizeof

In paragraph 1, replace

and sizeof(wchar_t),
with
sizeof(wchar_t), sizeof(_Char16_t), and sizeof(_Char32_t),

7.1.5.2 Simple type specifiers

To the grammar, add

simple-type-specifier:
_Char16_t
_Char32_t

To Table 7, add

_Char16_t    _Char16_t
_Char32_t    _Char32_t

8.5 Initializers

In paragraph 14, bullet 2, replace

or an array of wchar_t,
with
an array of wchar_t, an array of _Char16_t, or an array of _Char32_t,

8.5.2 Character arrays

In paragraph 1, replace

A char array (whether plain char, signed char, or unsigned char) can be initialized by a string-literal (optionally enclosed in braces); a wchar_t array can be initialized by a wide string-literal (optionally enclosed in braces); successive characters of the string-literal initialize the members of the array.
with
A char array (whether plain char, signed char, or unsigned char), wchar_t array, _Char16_t array, or _Char32_t array can be initialized by a string-literal (optionally enclosed in braces) with no prefix, with L prefix, with u prefix, or with U prefix, respectively. Successive characters of the string-literal initialize the members of the array.
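The initialization rule pairs each element type with its literal prefix. A short sketch, assuming the C++11 spellings char16_t and char32_t; the array names are illustrative.

```cpp
// Each array is initialized from a literal with the matching prefix.
constexpr char     a8[]  = "abc";
constexpr wchar_t  aw[]  = L"abc";
constexpr char16_t a16[] = u"abc";
constexpr char32_t a32[] = { U"abc" };   // the braces are optional

// A mismatched prefix does not satisfy the rule above:
// char16_t bad[] = "abc";

static_assert(sizeof(a16) / sizeof(a16[0]) == 4, "three characters plus terminator");
static_assert(a16[0] == u'a' && a16[3] == 0, "successive characters, then null");
static_assert(a32[2] == U'c' && a32[3] == 0, "successive characters, then null");
```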

15.1 Throwing an exception

In paragraph 3, replace

char* or wchar_t*
with
char*, _Char16_t*, _Char32_t*, or wchar_t*
Replace
array of const char and array of const wchar_t
with
array of const char, array of const _Char16_t, array of const _Char32_t, or array of const wchar_t
Replace
pointer to char or pointer to wchar_t
with
pointer to char, pointer to _Char16_t, pointer to _Char32_t, or pointer to wchar_t
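The exception wording means that a thrown char16 string literal decays to a pointer, just as a narrow or wide literal does. The sketch below assumes the C++11 spelling char16_t; the functions fail and message_from_fail are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical function that reports failure by throwing a string literal.
inline void fail() { throw u"disk full"; }

// The thrown array of const char16_t is adjusted to pointer to const
// char16_t, so a handler of that pointer type matches.
inline std::u16string message_from_fail() {
    try {
        fail();
    } catch (const char16_t* msg) {
        assert(msg != nullptr);
        return msg;   // copies the null-terminated sequence
    }
    return std::u16string();   // reached only if fail() did not throw
}
```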

17 Library introduction

In paragraph 4, replace

sequences of type wchar_t,
with
sequences of type wchar_t, sequences of type _Char16_t, sequences of type _Char32_t,

17.1.2 character

In paragraph 1, replace

and wchar_t,
with
wchar_t, _Char16_t, and _Char32_t,

17.3.2.1.3.4 [NEW] char16-character sequences

A char16-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type _Char16_t (3.9.1), optionally qualified by any combination of const and volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

A null-terminated char16-character string, or NTC16S, is a char16-character sequence whose highest-addressed element with defined content has the value zero. [Footnote: Many of the objects manipulated by function signatures declared in <cuchar> are char16-character sequences or NTC16Ss.]

The length of an NTC16S is the number of elements that precede the terminating null char16 character. An empty NTC16S has a length of zero.

The value of an NTC16S is the sequence of values of the elements up to and including the terminating null character.

A static NTC16S is an NTC16S with static storage duration. [Footnote: A char16 string literal, such as u"abc", is a static NTC16S.]

17.3.2.1.3.5 [NEW] char32-character sequences

A char32-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type _Char32_t (3.9.1), optionally qualified by any combination of const and volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

A null-terminated char32-character string, or NTC32S, is a char32-character sequence whose highest-addressed element with defined content has the value zero. [Footnote: Many of the objects manipulated by function signatures declared in <cuchar> are char32-character sequences or NTC32Ss.]

The length of an NTC32S is the number of elements that precede the terminating null char32 character. An empty NTC32S has a length of zero.

The value of an NTC32S is the sequence of values of the elements up to and including the terminating null character.

A static NTC32S is an NTC32S with static storage duration. [Footnote: A char32 string literal, such as U"abc", is a static NTC32S.]
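The length definition for NTC16S and NTC32S can be written as one function template. This is a sketch only; ntcs_length is a hypothetical name, and the code assumes the C++11 spellings char16_t and char32_t.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts the elements that precede the terminating
// null character of a null-terminated character sequence.
template <class CharT>
std::size_t ntcs_length(const CharT* s) {
    assert(s != nullptr);            // precondition: s designates a sequence
    std::size_t n = 0;
    while (s[n] != CharT()) ++n;     // CharT() is the null character
    return n;
}
```

For example, ntcs_length(u"abc") is 3 and ntcs_length(U"") is 0. Note that the length counts elements, not characters: with UTF-16, a character encoded as a surrogate pair contributes two elements.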

17.4.1.2 Headers

To table 12, add <cuchar>.

17.4.3.1.3 External linkage

In paragraph 5, footnote 168, add <cuchar>.

21.4 Null-terminated sequence utilities

Add paragraph 20,

Table 50 describes headers <cuchar> and <uchar.h>. The distinction is that <cuchar> defines the function and typedef names within namespace std and that <uchar.h> defines them at global scope.

Add Table 50,

Table 50 -- Headers <cuchar> and <uchar.h> synopsis
Typedef Names
char16_t           char32_t
Macro Names
__STDC_UTF_16__    __STDC_UTF_32__
Function Names
mbrtoc16           c16rtomb
mbrtoc32           c32rtomb

21 Strings Library

Add <cuchar> to table 38 under "Null-terminated sequence utilities".

21.5 [NEW] char16 and char32 characters

The headers <cuchar> and <uchar.h> define typedefs, define macros, and declare functions for use with at-least-16-bit and at-least-32-bit characters.

21.5.1 [NEW] The char16_t and char32_t typedefs

The headers <cuchar> and <uchar.h> define the typedefs:

typedef _Char16_t char16_t;
typedef _Char32_t char32_t;

21.5.2 [NEW] The __STDC_UTF_16__ and __STDC_UTF_32__ macros

If the headers <cuchar> and <uchar.h> define the macro __STDC_UTF_16__, values of type _Char16_t shall have UTF-16 encoding, as defined by ISO 10646.

If the headers <cuchar> and <uchar.h> define the macro __STDC_UTF_32__, values of type _Char32_t shall have UTF-32 encoding, as defined by ISO 10646.

21.5.3 [NEW] The mbrtoc16 function

Synopsis

#include <cuchar>
size_t std::mbrtoc16(char16_t * pc16, const char * s, size_t n, mbstate_t* ps);

Description

If s is a null pointer, the mbrtoc16 function is equivalent to the call:

mbrtoc16(NULL, "", 1, ps)
In this case, the values of the parameters pc16 and n are ignored.

If s is not a null pointer, the mbrtoc16 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the value of the corresponding wide character and then, if pc16 is not a null pointer, stores that value in the object pointed to by pc16. If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.

Returns

The mbrtoc16 function returns the first of the following that applies (given the current conversion state):
0
if the next n or fewer bytes complete the multibyte character that corresponds to the null wide character (which is the value stored).
[1..n]
if the next n or fewer bytes complete a valid multibyte character (which is the value stored); the value returned is the number of bytes that complete the multibyte character.
(size_t)(-3)
if the multibyte sequence converted more than one corresponding char16_t character and not all these characters have yet been stored; the next character in the sequence has now been stored and no bytes from the input have been consumed by this call.
(size_t)(-2)
if the next n bytes contribute to an incomplete (but potentially valid) multibyte character, and all n bytes have been processed (no value is stored).

Note: When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings).

(size_t)(-1)
if an encoding error occurs, in which case the next n or fewer bytes do not contribute to a complete and valid multibyte character (no value is stored); the value of the macro EILSEQ is stored in errno, and the conversion state is unspecified.
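The return values above suggest a conversion loop along the following lines. This is a sketch under several assumptions: the implementation provides <cuchar> with std::mbrtoc16 as specified here, the input is encoded in the multibyte encoding of the current C locale, and a return value of 0 is treated as having consumed one byte. The function name to_char16 is hypothetical, and on an error it simply returns what was converted so far.

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <cwchar>    // std::mbstate_t
#include <string>

// Hypothetical helper: converts a multibyte string to char16_t elements.
inline std::u16string to_char16(const std::string& in) {
    const std::size_t more_output = static_cast<std::size_t>(-3);
    const std::size_t incomplete  = static_cast<std::size_t>(-2);
    const std::size_t bad_input   = static_cast<std::size_t>(-1);

    std::u16string out;
    std::mbstate_t state{};              // initial conversion state
    const char* p = in.data();
    std::size_t left = in.size();

    for (;;) {
        char16_t c16 = 0;
        const std::size_t rc = std::mbrtoc16(&c16, p, left, &state);
        if (rc == more_output) {         // stored another element, no input consumed
            out.push_back(c16);
            continue;
        }
        if (left == 0) break;            // all input consumed, nothing pending
        if (rc == incomplete || rc == bad_input) break;   // stop at the error
        out.push_back(c16);
        const std::size_t used = (rc == 0) ? 1 : rc;      // assumption: null is one byte
        assert(used <= left);
        p += used;
        left -= used;
    }
    return out;
}
```

In the default "C" locale only members of the basic character set can be relied upon to convert, so to_char16("abc") is expected to yield u"abc". Other input requires selecting a suitable locale first.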

21.5.4 [NEW] The c16rtomb function

Synopsis

#include <cuchar>
size_t std::c16rtomb(char * s, char16_t c16, mbstate_t * ps);

Description

If s is a null pointer, the c16rtomb function is equivalent to the call

c16rtomb(buf, L'\0', ps)
where buf is an internal buffer.

If s is not a null pointer, the c16rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c16 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. At most MB_CUR_MAX bytes are stored. If c16 is a null wide character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.

Returns

The c16rtomb function returns the number of bytes stored in the array object; this may be 0 (including any shift sequences). When c16 is not a valid wide character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t)(-1); the conversion state is unspecified.

21.5.5 [NEW] The mbrtoc32 function

Synopsis

#include <cuchar>
size_t std::mbrtoc32(char32_t * pc32, const char * s, size_t n, mbstate_t* ps);

Description

If s is a null pointer, the mbrtoc32 function is equivalent to the call:

mbrtoc32(NULL, "", 1, ps)
In this case, the values of the parameters pc32 and n are ignored.

If s is not a null pointer, the mbrtoc32 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the value of the corresponding wide character and then, if pc32 is not a null pointer, stores that value in the object pointed to by pc32. If the corresponding wide character is the null wide character, the resulting state described is the initial conversion state.

Returns

The mbrtoc32 function returns the first of the following that applies (given the current conversion state):
0
if the next n or fewer bytes complete the multibyte character that corresponds to the null wide character (which is the value stored).
[1..n]
if the next n or fewer bytes complete a valid multibyte character (which is the value stored); the value returned is the number of bytes that complete the multibyte character.
(size_t)(-3)
if the multibyte sequence converted more than one corresponding char32_t character and not all these characters have yet been stored; the next character in the sequence has now been stored and no bytes from the input have been consumed by this call.
(size_t)(-2)
if the next n bytes contribute to an incomplete (but potentially valid) multibyte character, and all n bytes have been processed (no value is stored).

Note: When n has at least the value of the MB_CUR_MAX macro, this case can only occur if s points at a sequence of redundant shift sequences (for implementations with state-dependent encodings).

(size_t)(-1)
if an encoding error occurs, in which case the next n or fewer bytes do not contribute to a complete and valid multibyte character (no value is stored); the value of the macro EILSEQ is stored in errno, and the conversion state is unspecified.

21.5.6 [NEW] The c32rtomb function

Synopsis

#include <cuchar>
size_t std::c32rtomb(char * s, char32_t c32, mbstate_t * ps);

Description

If s is a null pointer, the c32rtomb function is equivalent to the call

c32rtomb(buf, L'\0', ps)
where buf is an internal buffer.

If s is not a null pointer, the c32rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c32 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. At most MB_CUR_MAX bytes are stored. If c32 is a null wide character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state; the resulting state described is the initial conversion state.

Returns

The c32rtomb function returns the number of bytes stored in the array object; this may be 0 (including any shift sequences). When c32 is not a valid wide character, an encoding error occurs: the function stores the value of the macro EILSEQ in errno and returns (size_t)(-1); the conversion state is unspecified.
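The reverse direction can be sketched in the same way. The example assumes an implementation that provides <cuchar> with std::c32rtomb as specified here and uses a buffer of MB_LEN_MAX bytes, which is at least MB_CUR_MAX. The function name to_multibyte is hypothetical.

```cpp
#include <cassert>
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cuchar>
#include <cwchar>    // std::mbstate_t
#include <string>

// Hypothetical helper: appends the multibyte form of each char32_t to out.
// Returns false at the first character that the current locale cannot encode.
inline bool to_multibyte(const std::u32string& in, std::string& out) {
    std::mbstate_t state{};              // initial conversion state
    char buf[MB_LEN_MAX];
    for (char32_t c32 : in) {
        const std::size_t rc = std::c32rtomb(buf, c32, &state);
        if (rc == static_cast<std::size_t>(-1)) {
            return false;                // encoding error; errno holds EILSEQ
        }
        assert(rc <= sizeof buf);
        out.append(buf, rc);             // rc bytes were stored in buf
    }
    return true;
}
```

In the default "C" locale, converting U"abc" is expected to append the three bytes "abc". Which other characters convert depends on the locale in effect.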

A.6 Declarations

To the grammar, add

simple-type-specifier:
_Char16_t
_Char32_t

C.1.1 Clause 2: lexical conventions

At the end of Subclause _lex.string: Change:, add

The type of a char16 string literal is changed from array of some-integer-type to array of const _Char16_t. The type of a char32 string literal is changed from array of some-integer-type to array of const _Char32_t.

C.2.2.4 Header <uchar.h>

Add section.

The typedefs char16_t and char32_t are typedefs to distinct types rather than typedefs to existing integral types.

D.5 Standard C Library Headers

Replace "18 C headers" with "18 C headers and 1 C technical report header".

To table 101, add

<uchar.h>