UTF-8 String Literals

ISO/IEC JTC1 SC22 WG21 N2209 = 07-0069 - 2007-03-08

Lawrence Crowl

This document replaces N2159 = 07-0019 - 2007-01-10.

Problem

Many users of C++ need to manipulate Unicode character strings. While N2149 New Character Types for C++ addresses most low-level issues, it does not provide a mechanism to ensure that a string literal is encoded in UTF-8. For portable international code, the standard needs such a mechanism.

Solution

We propose to add a new lexical token for UTF-8 string literals. No new types or other language changes are required. In particular, we do not propose character literals.

Adoption of this paper requires all conforming implementations to have bytes of at least eight bits in size. We believe that all existing systems already conform.
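As a minimal sketch of the byte-size requirement, an implementation or a program could verify the condition at compile time. The sketch assumes a compiler that supports static_assert and constexpr, which are not part of this proposal, and the helper name byte_holds_utf8_code_unit is hypothetical.

```cpp
#include <climits>

// Fails to compile on any implementation whose byte is narrower than eight bits.
static_assert(CHAR_BIT >= 8, "a byte must be able to hold a UTF-8 code unit");

// Hypothetical helper: true when every eight-bit code unit value (0x00..0xFF)
// is representable in an unsigned char.
constexpr bool byte_holds_utf8_code_unit() {
    return UCHAR_MAX >= 0xFF;
}
```

On an implementation that already has bytes of eight or more bits, both checks pass and have no run-time cost.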

Note that this paper does not presume adoption of N2149 New Character Types for C++, and some editorial merge will be necessary.

Likewise, this paper does not presume adoption of N2053 Raw String Literals, for which some editorial merge will also be necessary.

References

See section 2.5 "Encoding Forms" in

The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)

The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode5.0.0/.

See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.

See ISO/IEC 10646:2003, which is publicly available in several text and PDF files within a zip archive from http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip.

See the Unicode Consortium FAQ "UTF-8, UTF-16, UTF-32 & BOM".

Changes to the C++ Standard

1.7 The C++ memory model [intro.memory]

To paragraph 1, edit

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain ~~any member of the basic execution character set~~ the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.

2.13.4 String literals [lex.string]

To the grammar, edit

string-literal:

" c-char-sequence_opt "

E" c-char-sequence_opt "

L" c-char-sequence_opt "

To paragraph 1, edit

A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally beginning with one of the letters E or L, as in "...", E"...", or L"...". A string literal that does not begin with E or L is an ordinary string literal, and is initialized with the given characters. A string literal that begins with E, such as E"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8. It is implementation-defined whether literals may contain more than members of the basic character set and universal character names (\Unnnnnnnn and \unnnn). Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. An ~~ordinary~~ narrow string literal has type "array of n const char" and static storage duration (3.7), where n is the size of the string as defined below~~, and is initialized with the given characters~~. A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
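To illustrate what "initialized with the given characters as encoded in UTF-8" would mean, the following sketch computes the bytes one would expect a literal such as E"a\u00E9\u20AC" to hold, excluding the terminating null. Because the E prefix is only proposed here, the sketch does not use it; the functions to_utf8 and expected_bytes_example are hypothetical helpers written for this illustration, and the encoder assumes its argument is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: return the UTF-8 encoding of a single code point.
// Assumes cp is a valid Unicode scalar value (at most 0x10FFFF, not a surrogate).
inline std::string to_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // one code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // two code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // three code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // four code units: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Expected contents of the example literal from the lead-in:
// U+0061, U+00E9, U+20AC, each encoded in UTF-8.
inline std::string expected_bytes_example() {
    return to_utf8(0x61) + to_utf8(0xE9) + to_utf8(0x20AC);
}
```

Under these assumptions the example occupies six bytes (one, two, and three code units respectively), so the corresponding literal would have type "array of 7 const char" once the terminating null is counted.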

In paragraph 3, edit

In translation phase 6 (2.1), adjacent string literals are concatenated. If an ordinary string literal token is adjacent to a UTF-8 string literal token, the result is a UTF-8 string literal. If ~~a narrow~~ an ordinary string literal token is adjacent to a wide string literal token, the result is a wide string literal. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed.

Paragraph 5 already admits a multi-byte encoding of narrow string literals.
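The concatenation rules of paragraph 3 can be sketched as a small decision function. This is an illustration only, not proposed wording: the names LiteralKind, can_concatenate, and concatenated_kind are hypothetical, and the sketch assumes that two adjacent literals of the same kind keep that kind.

```cpp
#include <cassert>

// The three kinds of string literal token distinguished by the proposed wording.
enum class LiteralKind { Ordinary, Utf8, Wide };

// A UTF-8 literal adjacent to a wide literal makes the program ill-formed;
// every other pairing is allowed.
inline bool can_concatenate(LiteralKind a, LiteralKind b) {
    return !((a == LiteralKind::Utf8 && b == LiteralKind::Wide) ||
             (a == LiteralKind::Wide && b == LiteralKind::Utf8));
}

// Kind of the literal that results from concatenating two adjacent tokens.
// Precondition: can_concatenate(a, b).
inline LiteralKind concatenated_kind(LiteralKind a, LiteralKind b) {
    assert(can_concatenate(a, b));
    if (a == b) return a;                      // same kind: assumed unchanged
    if (a == LiteralKind::Ordinary) return b;  // ordinary adopts the other kind
    return a;                                  // here b must be Ordinary
}
```

For example, concatenated_kind(LiteralKind::Ordinary, LiteralKind::Utf8) yields LiteralKind::Utf8, while can_concatenate(LiteralKind::Utf8, LiteralKind::Wide) is false, matching the ill-formed case.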

3.9.1 Fundamental Types [basic.fundamental]

To paragraph 1, after the first sentence, add

Objects declared as characters (char) shall be large enough to store either one byte (1.7 [intro.memory]) or any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character.