WG14/N951

String literals and concatenation
Clive Feather
Last changed 2001-08-14


Introduction
============

There is an inconsistency in the rules for string literal concatenation and
the relationship between source and execution character sets. This paper
discusses this inconsistency and suggests a new model and associated changes
to the Standard.

This paper was written following discussions on the WG14 reflector, with
particular input from Tanaka Keishiro and Antoine Leca.


Standard text
=============

The following text from the Standard is relevant.

Translation phase 1:

    Physical source file multibyte characters are mapped, in an
    implementation-defined manner, to the source character set (introducing
    new-line characters for end-of-line indicators) if necessary.

Translation phase 3:

    The source file is decomposed into preprocessing tokens and sequences of
    white-space characters (including comments).

Translation phase 5:

    Each source character set member and escape sequence in character
    constants and string literals is converted to the corresponding member
    of the execution character set; if there is no corresponding member, it
    is converted to an implementation-defined member other than the null
    (wide) character.

Translation phase 6:

    Adjacent string literal tokens are concatenated.

6.4.5#1:

    [#1]
    string-literal:
        " s-char-sequence-opt "
        L" s-char-sequence-opt "

    s-char-sequence:
        s-char
        s-char-sequence s-char

    s-char:
        any member of the source character set except the double-quote ",
            backslash \, or new-line character
        escape-sequence

6.4.5#4:

    [#4] In translation phase 6, the multibyte character sequences specified
    by any sequence of adjacent character and wide string literal tokens are
    concatenated into a single multibyte character sequence. If any of the
    tokens are wide string literal tokens, the resulting multibyte character
    sequence is treated as a wide string literal; otherwise, it is treated
    as a character string literal.
6.4.5#5:

    [#5] In translation phase 7, a byte or code of value zero is appended to
    each multibyte character sequence that results from a string literal or
    literals. The multibyte character sequence is then used to initialize an
    array of static storage duration and length just sufficient to contain
    the sequence. For character string literals, the array elements have
    type char, and are initialized with the individual bytes of the
    multibyte character sequence; for wide string literals, the array
    elements have type wchar_t, and are initialized with the sequence of
    wide characters corresponding to the multibyte character sequence, as
    defined by the mbstowcs function with an implementation-defined current
    locale. The value of a string literal containing a multibyte character
    or escape sequence not represented in the execution character set is
    implementation-defined.


Problems
========

Consider code like:

    L"abc" "def"

The 6.4.5#4 text says that the multibyte sequences in the literals are
concatenated into a single sequence in translation phase 6. But, on the
other hand, multibyte characters were mapped to source characters in TP1,
and the source characters were then mapped to execution character set
characters in TP5. So there are no multibyte sequences available in TP6 to
be concatenated.

There are then further problems. Consider a string literal containing a UCN:

    L"\u8868"

At TP5 this is converted to a member of the execution character set, but at
TP7 (6.4.5#5) this literal is supposed to generate a multibyte character
that can be fed to mbstowcs. Nowhere is it explained where this multibyte
character comes from.

Finally, consider an implementation where the two byte sequence 0x95 0x5C is
the source encoding of U+8868. Look at the following literals:

    L"@\"            (@ represents the byte with value 0x95)
    L"\x95\x5C"
    L"\x95" "\\"

At TP5 the second of these is effectively converted to the first, and after
concatenation in TP6 so is the third.
This means that all of these literals generate an array of one element,
holding the wide character with value 0x955C. This is somewhat
counter-intuitive, and Tanaka-san states that it is not what users will
expect or implementers will produce.

The alternative is to assume that TP1 will convert the first literal to some
internal character. But in this case TP7 lacks anything obvious to pass to
mbstowcs, and the other two cases still generate the "wrong" answer.


Some examples of desired output
===============================

Our next step was to consider a range of examples and note what we thought
they should produce.

    Example             Array type     Array contents
     1: "ABC"           (char [4])     { 0x41, 0x42, 0x43, 0x00 }
     2: "\x12" "34"     (char [4])     { 0x12, 0x33, 0x34, 0x00 }
     3: "\x95" "\\"     (char [3])     { 0x95, 0x5C, 0x00 }
     4: "@\"            (char [3])     { 0x95, 0x5C, 0x00 }
     5: "@" "\\"        (char [3])     { 0x95, 0x5C, 0x00 } OR UNDEFINED
     6: L"ABC"          (wchar_t [4])  { 0x0041, 0x0042, 0x0043, 0x0000 }
     7: L"\u8868"       (wchar_t [2])  { 0x955C, 0x0000 }
     8: L"\x95\\"       (wchar_t [3])  { 0x0095, 0x005C, 0x0000 }
     9: L"\x95" L"\\"   (wchar_t [3])  { 0x0095, 0x005C, 0x0000 }
    10: "\x95" L"\\"    (wchar_t [3])  { 0x0095, 0x005C, 0x0000 }
    11: L"@\"           (wchar_t [2])  { 0x955C, 0x0000 }
    12: L"\x955C"       (wchar_t [2])  { 0x955C, 0x0000 }
    13: L"\x95"         (wchar_t [2])  { 0x0095, 0x0000 }
    14: "@\\"           UNDEFINED
    15: "@" "\"         UNDEFINED

Example 14 is undefined because \" is an escape sequence and so the literal
is unterminated.

Example 15 depends on whether @" is a valid multibyte sequence or not. If it
is, then the third " terminates the literal and the backslash causes a
syntax error. If it is not, the second literal is unterminated. Example 5 is
defined or undefined in the same way.


Principles
==========

From consideration of various examples we can derive a set of basic
principles for string literals.

[P1] The sequences:

    L"a" L"b"
    L"a" "b"
    "a" L"b"

are completely equivalent.
The final type of a concatenated string literal depends only on whether any
of the components have an L prefix, and not on which ones they are.

[P2] The sequences:

    "abc"
    "ab" "c"
    "a" "bc"
    "abc"

are completely equivalent. The division of the string into literals does not
alter the final array. However, this applies only when the literals consist
of the same s-chars; the sequences:

    "\x1234"
    "\x12" "34"

are not equivalent because they involve different s-chars.

[P3] Multibyte sequences are converted to single source characters during
TP1, and so each multibyte sequence is a single s-char.

[P4] The literal "@\" contains one s-char but the literal "\x95\\" contains
two. These are not equivalent, and the latter is not merged to form a
multibyte character later on.

[P5] The two string literals:

    "abc"
    L"abc"

should be related. More precisely, applying mbstowcs to the former should
produce the latter.

[P6] When the final result will be a wchar_t array, each s-char in the
source generates exactly one element of the array.

[P7] When the final result will be a char array:
- a single byte source character generates exactly one byte
- an escape sequence generates exactly one byte
- a non-single byte multibyte source character generates one or more bytes,
  and:
  * mbstowcs applied to the sequence produces a single wide character;
  * where it makes sense, the byte sequence in the array is the direct
    analogue of the source multibyte character.

[P8] When the final result will be a wchar_t array, source shift sequences
are not separate s-chars and do not map to separate elements of the array.

[P9] When the final result will be a char array, source shift sequences
should appear in the array to the extent it makes sense (by analogy with the
last sub-bullet of P7).


New model
=========

Applying these principles to the processes in the Standard, we can construct
a new model.

The source character set contains the 95 required characters and the "new
line" indicator.
It also contains as many additional characters as are defined by every valid
multibyte character (and making allowance for shift states). For example,
suppose that a given encoding consists of:
- codes 1 to 96 are the required characters;
- codes 101 to 120 are always followed by a code from 1 to 100, and each
  pair represents a character;
- codes 121 to 127 each represent one of four characters depending on the
  choice of shift state;
- codes 97 to 100 select a shift state; this only affects codes 121 to 127.

Therefore the entire encoding contains 96+20*100+7*4 = 2124 characters, and
that is the size of the source character set.

Translation phase 1 converts all input to characters from this set. Thus the
sequence:

    1   81  81  78  46  100  122  122  101 54
    A   ?   ?   /   t        $    $    `

is converted to the 6 source character sequence

    A\t$$`

If a source character can be generated in more than one way (e.g. through
the use of alternative shift sequences), an implementation is free to
annotate the character with this information. This annotation is used later.

Within string literals, these sequences are parsed into s-chars during TP3;
in the example above there are 5 such s-chars, because \t is a single
s-char. Other source code also works in terms of these source characters.
TP4 stringisation and token pasting work in terms of these source
characters.

The execution character set needs essentially the same set of characters as
the source had. At TP5 each s-char in a string literal is converted to the
corresponding execution character set character. At this point the
distinction between multibyte characters, UCNs, and other escape sequences
is lost (so \t, \x9 (or whatever), and an actual source tab all produce the
same character).

At TP6 the sequences of characters are simply concatenated without further
change.
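The TP1 step for the example encoding above can be sketched in C. This is only an illustration under stated assumptions, not part of the proposal: the function name tp1_decode, the numbering of the resulting source characters, the initial shift state, and the identifier chosen for the backslash (ID_BACKSLASH) are all hypothetical. The only trigraph handled is ??/, using the codes 81 and 78 shown for ? and / in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifier for the backslash. The paper does not say which
 * code the example encoding assigns to it, so 59 is an arbitrary choice. */
#define ID_BACKSLASH 59

/* Decode codes of the example encoding into source character identifiers
 * (an assumed numbering: 0..95 for single codes, 96..2095 for pairs, 2096
 * upwards for shift-dependent codes). Shift selectors change the state but
 * produce no character. Returns the number of source characters written,
 * or -1 if the input is malformed or out is too small. */
static long tp1_decode(const unsigned char *in, size_t n, int *out, size_t cap)
{
    size_t used = 0;
    int shift = 0;                        /* assumed initial shift state */

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned c = in[i];
        int id;

        if (c >= 97 && c <= 100) {        /* shift selector: no character */
            shift = (int)c - 97;
            continue;
        }
        if (c == 81 && n - i >= 3 && in[i + 1] == 81 && in[i + 2] == 78) {
            id = ID_BACKSLASH;            /* trigraph ??/ becomes backslash */
            i += 2;
        } else if (c >= 1 && c <= 96) {
            id = (int)c - 1;              /* single-code character */
        } else if (c >= 101 && c <= 120) {
            if (i + 1 >= n || in[i + 1] < 1 || in[i + 1] > 100)
                return -1;                /* missing or invalid second code */
            id = 96 + ((int)c - 101) * 100 + (in[i + 1] - 1);
            i++;                          /* the pair is one character */
        } else if (c >= 121 && c <= 127) {
            id = 2096 + ((int)c - 121) * 4 + shift;
        } else {
            return -1;                    /* code 0 or above 127 */
        }
        if (used == cap)
            return -1;
        out[used++] = id;
    }
    return (long)used;
}
```

Applied to the ten codes of the example sequence, this sketch yields six identifiers: the shift selector 100 contributes none, the two occurrences of code 122 map to the same identifier, and the pair 101 54 counts once.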
At TP7 each character in the execution character set generates either:
- a single wide character
- a sequence of characters

In the latter case, if the corresponding s-char came from a multibyte
character the sequence should match it if possible. The annotation mechanism
described above is one way to do this.


Proposed changes
================

The following changes to the Standard are required to put this model into
effect. Firstly we specify this model in some detail:

    5.2.1.3 Character encoding model

    [#1] Translation phase 1 establishes the boundaries between multibyte
    characters in the source. These are converted into /source character
    encoding units/ that encode a single member of the source character set
    (any shift sequences are merged with an adjacent unit). Source character
    encoding units are never split or merged in subsequent translation
    phases.

    [#2] In translation phase 3, each source character encoding unit that is
    not a member of the basic character set will become:
    - an identifier-nondigit within an identifier or pp-number
    - an h-char or q-char in a header-name
    - a c-char within a character-constant
    - an s-char within a string-literal, or
    - a preprocessing-token on its own.

    [#3] In translation phase 5, each c-char or s-char is converted to a
    single /execution character encoding unit/ (ECEU). Each character
    constant and string literal therefore becomes a sequence of ECEUs. Note
    that there may be several representations of the same ECEU:
    - a source character encoding unit, possibly derived from a multibyte
      sequence,
    - a universal character name,
    - a special escape sequence such as \t, or
    - an octal or hexadecimal escape sequence.

    [#4] In translation phase 6, string literals are concatenated by
    concatenating the ECEU sequences into a single sequence; the total
    number of ECEUs involved is unchanged.

    [#5] In translation phase 7, a string literal is converted to an array
    of values by first appending an ECEU, representing the null character,
    to the ECEU sequence.
    If it is a character string literal, each ECEU then generates one or
    more elements of the char array; the precise elements generated may
    depend on the source character encoding unit that the ECEU derives from.
    If it is a wide string literal, each ECEU generates one element of the
    wchar_t array.

    [#6] Two character string literals or two wide string literals derived
    from the same sequence of source character encoding units shall generate
    identical arrays. A character string literal and a wide string literal
    derived from the same sequence shall generate arrays that correspond, as
    defined by the mbstowcs function with an implementation-defined current
    locale.

Next we need to make the explanation of string concatenation in 6.4.5#4 use
this new model. This completely replaces the old text:

    [#4] In translation phase 6, the contents of adjacent character and wide
    string literal tokens are concatenated into a single token as described
    in 5.2.1.3. If any of the tokens are wide string literal tokens, the
    resulting token is a wide string literal; otherwise, it is a character
    string literal.

Finally we need to make the explanation of string literals in 6.4.5#5 also
use this new model. Again, this completely replaces the old text.

    [#5] In translation phase 7, a code of value zero (representing the null
    character) is appended to each string literal. The contents of the
    literal are then used to initialize an array of static storage duration
    and length just sufficient to contain the sequence. For character string
    literals, the array elements have type char; for wide string literals,
    the array elements have type wchar_t. The array is initialized as
    described in 5.2.1.3.

    [No text replaces the last sentence of the current #5, as it duplicates
    a requirement in TP5.]
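For literals that contain only basic characters and hexadecimal escapes, the array types and contents described above can be spot-checked on an existing implementation. The following is a minimal sketch, assuming a C99 compiler (which accepts mixed character and wide concatenation) and an ASCII-compatible execution character set; the helper names and the COUNT macro are hypothetical. It covers principles P1 and P2 and Examples 2, 9 and 10 only. Literals containing the byte 0x95 directly in the source are left out because their behaviour depends on the source encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Number of array elements, including the terminating null. */
#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* P1: the result is wide whichever component carries the L prefix. */
static int p1_holds(void)
{
    static const wchar_t a[] = L"a" L"b";
    static const wchar_t b[] = L"a" "b";
    static const wchar_t c[] = "a" L"b";
    return COUNT(a) == 3 && COUNT(b) == 3 && COUNT(c) == 3
        && wmemcmp(a, b, 3) == 0 && wmemcmp(a, c, 3) == 0;
}

/* P2: the division into literals does not alter the final array. */
static int p2_holds(void)
{
    static const char a[] = "abc";
    static const char b[] = "ab" "c";
    static const char c[] = "a" "bc";
    return sizeof a == 4 && sizeof b == 4 && sizeof c == 4
        && memcmp(a, b, 4) == 0 && memcmp(a, c, 4) == 0;
}

/* Example 2: the hex escape ends at the literal boundary (ASCII assumed). */
static int example2_holds(void)
{
    static const char s[] = "\x12" "34";
    return sizeof s == 4
        && s[0] == 0x12 && s[1] == 0x33 && s[2] == 0x34 && s[3] == 0;
}

/* Examples 9 and 10: two escapes stay two wide elements and are not
 * merged into the single wide character 0x955C. */
static int examples9_10_hold(void)
{
    static const wchar_t e9[]  = L"\x95" L"\\";
    static const wchar_t e10[] = "\x95" L"\\";
    return COUNT(e9) == 3 && COUNT(e10) == 3
        && e9[0] == 0x95 && e9[1] == 0x5C && e9[2] == 0
        && wmemcmp(e9, e10, 3) == 0;
}

/* Convenience wrapper: aborts if any of the checks above fails. */
static void check_all(void)
{
    assert(p1_holds());
    assert(p2_holds());
    assert(example2_holds());
    assert(examples9_10_hold());
}
```

A passing run shows only that one implementation agrees with the desired output for these cases; it says nothing about the multibyte cases that motivate the proposal.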
If it is preferred that 6.4.5#5 not contain the reference to 5.2.1.3, an
alternative way to word the former would be:

    [#5] In translation phase 7, a code of value zero (representing the null
    character) is appended to each string literal. The contents of the
    literal are then used to initialize an array of static storage duration
    and length just sufficient to contain the sequence. For character string
    literals, the array elements have type char; each ECEU in the string
    literal (taken in order) determines the value of one or more elements
    (the precise values may depend on the source character encoding unit(s)
    that the ECEU derives from). For wide string literals, the array
    elements have type wchar_t; each ECEU in the string literal determines
    the initial value for the corresponding element.

If so, 5.2.1.3#5 should be deleted and "in translation phase 7" should be
added to the end of the first sentence of 5.2.1.3#6.
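The TP6 and TP7 steps of the proposed model can be illustrated with a small C sketch. Everything here is an assumption made for illustration: the type name eceu, the function names, the use of unsigned long in place of wchar_t, and above all the representation of an ECEU as either a single-byte code (below 0x100) or a two-byte code such as 0x955C, taken from the implementation in the Problems section where 0x95 0x5C encodes U+8868. Shift sequences and the annotation mechanism are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* One execution character encoding unit. In this sketch its value is a
 * single-byte code (below 0x100) or a two-byte code such as 0x955C. */
typedef unsigned eceu;

/* TP6: append src to dst (capacity cap). No ECEU is split or merged, so
 * the total count is simply the sum of the two counts. */
static size_t tp6_concat(eceu *dst, size_t ndst, size_t cap,
                         const eceu *src, size_t nsrc)
{
    assert(ndst + nsrc <= cap);
    for (size_t i = 0; i < nsrc; i++)
        dst[ndst + i] = src[i];
    return ndst + nsrc;
}

/* TP7, wide literal: exactly one element per ECEU, then the null element.
 * out must have room for n + 1 elements. */
static size_t tp7_wide(const eceu *s, size_t n, unsigned long *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = s[i];
    out[n] = 0;
    return n + 1;
}

/* TP7, character literal: one or more bytes per ECEU, then the null byte.
 * out must have room for 2 * n + 1 bytes. */
static size_t tp7_narrow(const eceu *s, size_t n, unsigned char *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0x100)
            out[k++] = (unsigned char)(s[i] >> 8);   /* lead byte */
        out[k++] = (unsigned char)(s[i] & 0xFF);
    }
    out[k++] = 0;
    return k;
}
```

Under these assumptions the ECEU sequence { 0x95, 0x5C } (from "\x95" and "\\") gives three elements in both the wide and the character array, as in Examples 10 and 3, while the single ECEU { 0x955C } (from "@\") gives two wide elements but three bytes, as in Examples 11 and 4.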