UTF-8 String Literals

ISO/IEC JTC1 SC22 WG21 N2159 = 07-0019 - 2007-01-10

Lawrence Crowl

Problem

Many users of C++ need to manipulate Unicode character strings. While N2149 New Character Types for C++ addresses most low-level issues, it does not provide a way to guarantee that a string literal is encoded in UTF-8. For portable international code, the standard needs such a mechanism.

Solution

We propose to add a new lexical token for UTF-8 string literals. No new types or other language changes are required. In particular, we do not propose UTF-8 character literals.

Note that this paper does not presume adoption of N2149; if both are adopted, some editorial merging will be necessary.

Likewise, this paper does not presume adoption of N2053 Raw String Literals, which would also require some editorial merging.

References

See section 2.5 "Encoding Forms" in

The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)

The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode5.0.0/.

See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.

See ISO/IEC 10646:2003, which is publicly available in several text and PDF files within a zip archive from http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip.

See the Unicode Consortium FAQ "UTF-8, UTF-16, UTF-32 & BOM".

Changes to the C++ Standard

2.13.4 String literals

To the grammar, add

string-literal:

E" s-char-sequence_opt "

To paragraph 1, replace

optionally beginning with the letter L, as in "..." or L"..."

with

optionally beginning with one of the letters L or E, as in "...", L"...", or E"...", respectively

To paragraph 1, append

A string literal that begins with E, such as E"asdf", is a char string literal. The literal has the type array of n const char, where n is the size of the string as defined below, and is initialized with the given characters encoded in UTF-8. It is implementation-defined whether such literals may contain anything other than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn).
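As an illustration only, and not as part of the proposed wording, the following sketch shows the bytes an implementation would be expected to store for a literal such as E"\u00e9t\u00e9". No compiler is assumed to accept the proposed E prefix, so the expected array is written out by hand with hexadecimal escapes. The helper name append_utf8 and the array name expected_bytes are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the proposal): encode one Unicode code
// point as UTF-8 and append the resulting bytes to `out`.
// Returns false, leaving `out` unchanged, for surrogates and for values
// above U+10FFFF, which have no UTF-8 encoding.
bool append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Hand-written stand-in for the proposed E"\u00e9t\u00e9":
// U+00E9 encodes as C3 A9, and 't' encodes as itself.
// The array has 5 content bytes plus the terminating null, so n is 6.
const char expected_bytes[] = "\xC3\xA9t\xC3\xA9";
static_assert(sizeof(expected_bytes) == 6, "5 UTF-8 bytes plus terminating null");
```

Encoding U+00E9, 't', and U+00E9 in turn with append_utf8 yields the same five bytes as expected_bytes, which is the content the proposed literal would carry regardless of the implementation's execution character set.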

In paragraph 3, append

If any narrow string literal in the concatenation specifies UTF-8 encoding, the resulting string has UTF-8 encoding.

Paragraph 5 already admits a multi-byte encoding of ordinary character string literals.