Jason Merrill
2010-03-12
N3077=10-0067

Alternative approach to Raw String issues

Introduction

N2990 deals with the problem of trigraph replacement in raw strings (part of core issue 789) by moving trigraphs out of phase 1, making them alternate spellings instead. This solves the problem for trigraphs, but does not deal with the related issues for UCNs (extended characters being replaced with \uXXXX in phase 1) and line splicing in phase 2.

This paper proposes an alternative approach to these issues: inside a raw string, simply undo the transformations performed in phases 1 and 2. Apparently many compilers already keep track of those transformations internally.

This paper incorporates the proposed wording for issue 872, and addresses UK comment 11 (core issue 789).

Proposed Wording

2.2 [lex.phases]:

3. The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment.¹² Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent. [ Example: see the handling of < within a #include preprocessing directive. -- end example ] Within the r-char-sequence of a raw string literal, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted.
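As an informal illustration, not part of the proposed wording, the following sketch shows the observable effect of reverting phase 1 and 2 transformations inside an r-char-sequence. The function names are hypothetical and chosen only for this example. It assumes a compiler that implements raw string literals as described here, and the continuation lines must start in the first column.

```cpp
#include <cassert>
#include <cstring>

// Raw literal: the six characters \ u 0 0 E 9 are kept exactly as written,
// because the universal-character-name is not interpreted in a raw string.
inline const char* raw_ucn() { return R"(\u00E9)"; }

// Raw literal: the backslash and the new-line that follows it both survive,
// because the phase 2 line splice is reverted.
inline const char* raw_splice() { return R"(x\
y)"; }

// Ordinary literal, for contrast: phase 2 deletes the backslash/new-line pair.
inline const char* plain_splice() { return "x\
y"; }

inline void check_raw_reversion() {
    assert(std::strcmp(raw_ucn(), "\\u00E9") == 0);
    assert(std::strcmp(raw_splice(), "x\\\ny") == 0);
    assert(std::strcmp(plain_splice(), "xy") == 0);
}
```

Calling check_raw_reversion() should pass on such an implementation. Trigraphs are left out of the sketch because many compilers disable or diagnose them by default.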

2.5 [lex.pptoken] paragraph 3:

If the input stream has been parsed into preprocessing tokens up to a given character:

if the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal;

otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

[ Example:
  #define R "x"
  const char* s = R"y"; // ill-formed raw string, not "x" "y"
--end example ]
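A well-formed variant of the example above may make the rule easier to see. This is only a sketch: the function names are hypothetical, and it assumes an implementation that applies the raw-string prefix rule before macro expansion, as the wording requires.

```cpp
#include <cassert>
#include <cstring>

#define R "x"   // same macro as in the example above

// R immediately followed by a double quote begins a raw string literal,
// so the macro R is never considered here. The result is the string y.
inline const char* raw_token() { return R"(y)"; }

// With white space after R, R is an ordinary identifier and the macro
// expands. The adjacent literals "x" and "(y)" are then concatenated.
inline const char* macro_token() { return R "(y)"; }

#undef R        // keep the macro from leaking into later code

inline void check_raw_prefix_rule() {
    assert(std::strcmp(raw_token(), "y") == 0);
    assert(std::strcmp(macro_token(), "x(y)") == 0);
}
```

The only difference between the two functions is the space after R, which is what decides whether the lexer sees a raw string prefix or an identifier.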

2.14.5 [lex.string]:

raw-string:
      " d-char-sequence_opt ~~[~~( r-char-sequence_opt )~~]~~ d-char-sequence_opt "

r-char-sequence:
      r-char
      r-char-sequence r-char

r-char:
  any member of the source character set, except
    (1), a backslash \ followed by a u or U, or
    (2), a right ~~square bracket ]~~ parenthesis ) followed by the initial d-char-sequence
    (which may be empty) followed by a double quote ".
  universal-character-name

d-char-sequence:
      d-char
      d-char-sequence d-char

d-char:
      any member of the basic source character set except:
            space, the left ~~square bracket [~~ parenthesis (, the right ~~square bracket ]~~ parenthesis ),
            the backslash \,
            and the control characters representing horizontal tab,
            vertical tab, form feed, and newline.
...
A string literal is a sequence of characters (as defined in 2.14.3) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.

A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters. [ Footnote: Use of characters with trigraph equivalents in a d-char-sequence may produce unintended results. --end footnote ]

[ Note: The characters '~~[~~(' and '~~]~~)' are permitted in a raw-string. Thus, R"delimiter~~[[a-z]]~~((a|b))delimiter" is equivalent to "~~[a-z]~~(a|b)". -- end note ]
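The role of the d-char-sequence as a delimiter can be sketched as follows. The function names and the delimiter "delim" are hypothetical choices for this illustration, which uses the parenthesis form of the syntax.

```cpp
#include <cassert>
#include <cstring>

// Empty delimiter: the literal ends at the first ) that is directly
// followed by a double quote, so the inner parentheses are ordinary text.
inline const char* simple_raw() { return R"((a|b))"; }

// Non-empty delimiter: only )delim" terminates the literal, so the
// two-character sequence )" may appear inside the r-char-sequence.
inline const char* delimited_raw() { return R"delim(say )" here)delim"; }

inline void check_delimiters() {
    assert(std::strcmp(simple_raw(), "(a|b)") == 0);
    assert(std::strcmp(delimited_raw(), "say )\" here") == 0);
}
```

A delimiter is needed only when the text itself contains a right parenthesis followed by a double quote. Otherwise the empty d-char-sequence suffices.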

[ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal~~, unless preceded by a backslash~~. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
     const char *p = R"(a\
     b
     c)";
     assert(std::strcmp(p, "a\\\nb\nc") == 0);
-- end note ]

...

Escape sequences and universal-character-names in non-raw string literals ~~and universal-character-names in string literals~~ have the same meaning as in character literals ....