Doc. no.   WG21/N2146=J16/07-0006
Date:        2007-01-09
Project:     Programming Language C++
Reply to:   Beman Dawes <bdawes@acm.org>

Raw String Literals (Revision 1)

Introduction
Revision History
Motivating examples
    Regular Expression motivating example
    Markup motivating example
Implementation experience
Raw character literals?
FAQ
Acknowledgements
Proposed wording

Introduction

In recent years it has become more common for C++ to work with regular expressions and with markup languages such as HTML and XML.

Regular expressions use the same backslash escape sequence as C++ does in string literals. The resulting plethora of backslashes is very difficult to write correctly and impenetrable to read. See Regular expressions motivating example.

Markup languages such as XML and HTML use a lot of quotation marks and newlines. The resulting escape sequences in string literals are irritating, cumbersome, and error prone. See Markup motivating example.

Other programming languages, such as Perl, Python, and Lua, have addressed these issues by providing raw string literals in addition to regular string literals. A raw string literal is simply a string literal that does not recognize C++ escape sequences. Raw string literals are well accepted and used regularly (pun intended!) in languages that have them.

This document proposes adding raw string literals to C++0x.

The proposal is a pure extension. It will have no impact on any existing code.

The proposal has some minor interaction with proposal N2018 to add additional character types. See www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html. If N2018 is accepted, four additional lines must be added to the grammar in addition to the two additional grammar lines proposed in N2018.

Revision History

Revision 1:

Initial paper: www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2053.html

Motivating examples

Regular Expression motivating example

Here is an example of the concatenated string literals in an actual C++ program (by John Maddock):

"(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|"
"(//[^\\n]*|/\\*.*?\\*/)|"
"\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)"
"?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|"
"('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"
"\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import"
"|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall"
"|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool"
"|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete"
"|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto"
"|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected"
"|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast"
"|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned"
"|using|virtual|void|volatile|wchar_t|while)\\>"

Note in particular the line that reads:

"('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"

Are the five consecutive backslashes in that line correct or not? Even experts are easily confused. Here is the same line as a raw string literal:

R"[('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|]"

Note that the five-backslash sequence has been reduced to a more manageable two-backslash sequence. And, yes, the original five-backslash sequence was both correct and necessary in C++03.
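The two layers of escaping can be checked in C++03 itself. The following is a minimal sketch, not taken from John Maddock's program; the function names escaped_fragment, regex_view, and check_fragment are hypothetical. It takes only the character-class fragment containing the five backslashes and compares it with the characters the regular-expression engine actually receives.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// The character-class fragment from the line discussed above, spelled as a
// C++03 string literal. In the source, five backslashes appear in a row:
// two escaped backslashes followed by an escaped double quote.
inline const char* escaped_fragment() {
    return "[^\\\\\"]";
}

// The same fragment as the regular-expression engine sees it, written one
// character at a time so that no string-literal escaping is involved.
inline std::string regex_view() {
    const char chars[] = { '[', '^', '\\', '\\', '"', ']' };
    return std::string(chars, sizeof(chars));
}

// Self-check: the compiled literal holds six characters, two of which are
// backslashes (the regex escape for one literal backslash).
inline void check_fragment() {
    assert(std::strlen(escaped_fragment()) == 6);
    assert(regex_view() == escaped_fragment());
}
```

Calling check_fragment() succeeds: four of the five source backslashes collapse into the two that the regular expression needs, and the fifth escapes the double quote for the compiler only.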

Here is the complete example using the raw string proposal:

R"[(^[[:blank:]]*#(?:[^\\\n]|\\[^\n[:punct:][:word:]]*[\n[:punct:][:word:]])*)|]"
R"[(//[^\n]*|/\*.*?\*/)|]"
R"[\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)]"
R"[?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>|]"
R"[('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|]"
R"[\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import]"
"|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall"
"|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool"
"|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete"
"|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto"
"|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected"
"|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast"
"|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned"
R"[|using|virtual|void|volatile|wchar_t|while)\>]"

Markup motivating example

Here is an example of concatenated string literals in an actual C++ program (again by John Maddock):

"<HTML>\n"
"<HEAD>\n"
"<TITLE>Auto-generated html formated source</TITLE>\n"
"<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=windows-1252\">\n"
"</HEAD>\n"
"<BODY LINK=\"#0000ff\" VLINK=\"#800080\" BGCOLOR=\"#ffffff\">\n"
"<P> </P>\n"
"<PRE>\n"

Here is the same example as a raw string literal:

R"[\
<HTML>
<HEAD>
<TITLE>Auto-generated html formated source</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
</HEAD>
<BODY LINK="#0000ff" VLINK="#800080" BGCOLOR="#ffffff">
<P> </P>
<PRE>
]"

There are several reasons the raw string version may be preferred:

Implementation experience

The proposal has been implemented twice, in the GCC compiler and in James Dennett's preprocessor.

In GCC, the code changes affected three existing functions. No new functions were added and no existing function signatures required changes. Because the bulk of the new code is in a path only executed when compiling a raw string literal, there is no measurable effect on compilation times for existing programs.

In James Dennett's preprocessor, as with the GCC implementation, almost no new code was added to existing cases, so there is no slow-down for any C++03 constructs.

Raw character literals?

As a deliberate design choice, raw character (as opposed to string) literals are not proposed because there is no apparent need; escape sequences do not pose the same practical problems in character literals that they do in string literals.

The arguments in favor of raw character literals are symmetry and error-reduction. Knowing that raw string-literals are allowed, programmers are likely to assume raw character-literals are also available. Indeed, a committee member inadvertently made that assumption when reading a draft of this paper. Although the resulting error is easy to fix, there is the argument that it is better to eliminate the possibility of the error by providing raw character-literals in the first place.

I will be happy to provide proposed wording if the committee desires to add raw character literals.

FAQ

What is the purpose of the d-char-sequence in the raw string literal? It eliminates the need for an escape sequence within raw string literals themselves, by allowing the user to choose the delimiter sequence.

Why doesn't the 'R' prefix have a fixed order relative to the 'L' prefix? Either RL or LR prefix ordering is allowed as a convenience to programmers, so that the exact order does not have to be memorized.

Why are trigraphs and backslash new-lines in raw string literals still converted to something else? To do otherwise would require unduly entangling translation phases one to three, yet any benefit would be minimal since the practical need for raw trigraphs and backslash new-line sequences is expected to be rare. Under the current proposal, there is no change whatsoever to phase one and two wording.

Why are universal-character-names in raw string literals still converted? Phase one rules, 2.1 [lex.phases], require incoming non-basic character set characters such as '@' be replaced by their universal-character-name (UCN). It would be a constant source of bugs if UCNs in raw string literals were not converted back to their single character input representation.
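The first question above, on the role of the d-char-sequence, can be illustrated with a small scanner for the proposed syntax. This is a sketch under stated assumptions, not the GCC or preprocessor implementation: the function scan_raw_string is hypothetical, it handles only the narrow R prefix, it expects the input to be exactly one literal, and it ignores translation phases one and two.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the proposal. Given source text of the form
//     R"delim[contents]delim"
// it stores the r-char-sequence in `contents` and returns true. It returns
// false if the text is not exactly one well-formed raw string literal.
inline bool scan_raw_string(const std::string& src, std::string& contents) {
    if (src.size() < 2 || src[0] != 'R' || src[1] != '"') return false;

    // The initial d-char-sequence runs up to the first left square bracket.
    const std::size_t open = src.find('[', 2);
    if (open == std::string::npos) return false;
    const std::string delim = src.substr(2, open - 2);
    if (delim.size() > 16) return false;  // proposed maximum length
    // A d-char may not be ']' or one of the listed control characters.
    if (delim.find_first_of("]\t\v\f\n") != std::string::npos) return false;

    // The literal ends at the first ']' that is followed by the same
    // d-char-sequence and a double quote.
    const std::string close = "]" + delim + "\"";
    const std::size_t end = src.find(close, open + 1);
    if (end == std::string::npos) return false;
    if (end + close.size() != src.size()) return false;  // text after the literal

    contents = src.substr(open + 1, end - open - 1);
    return true;
}

// Self-check: an empty d-char-sequence is enough for simple contents.
inline void check_scan() {
    std::string out;
    assert(scan_raw_string("R\"[abc]\"", out) && out == "abc");
}
```

With the delimiter xx, the contents a]"b can be written as R"xx[a]"b]xx" with no escape sequence. With an empty delimiter the same contents cannot be expressed, because the first ]" already terminates the literal.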

Acknowledgements

This proposal was initiated in response to a posting on the LWG reflector from Thomas Witt, with comments from several other committee members. Thomas credits Scott Meyers with the original suggestion. John Maddock provided insights about the string-literal needs of regular expressions. Robert Klarer provided examples and clarifications. The C++ standards committee's Evolution Working Group provided encouragement and comments at the Portland '06 meeting. Jonathan Adamczewski provided useful comments and suggestions for revision 1. Comments from James Dennett and others identified the need to avoid entanglement of translation phases one through three. Nathan Myers and James Dennett suggested improved examples.

Proposed wording

Added text is shown in green and underlined. Deleted text is shown in red with strikethrough. Commentary is shown in gray shading with italic text and is not part of the proposed wording.

Change 2.1 [lex.phases], paragraph 1:

5. Each source character set member, escape sequence, or universal-character-name in character literals and string literals, and escape sequence in character literals and regular string literals, is converted to the corresponding member of the execution character set (2.13.2, 2.13.4); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.17)

Change 2.13.4 [lex.string] :

string-literal:
    "s-char-sequenceopt"
    L"s-char-sequenceopt"
    R"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"
    LR"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"
    RL"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"
    uR"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"        Applies only if N2018 is accepted
    Ru"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"        Applies only if N2018 is accepted
    UR"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"       Applies only if N2018 is accepted
    RU"d-char-sequenceopt[r-char-sequenceopt]d-char-sequenceopt"       Applies only if N2018 is accepted

s-char-sequence:
    s-char
    s-char-sequence s-char

s-char:
    any member of the source character set except
                        the double-quote ", backslash \, or new-line character
    escape-sequence
    universal-character-name

r-char-sequence:
    r-char
    r-char-sequence r-char

r-char:
    any member of the source character set, except the right square bracket ]
                        when followed by the initial d-char-sequence, if present,
                        followed by the double quote ".
    universal-character-name

d-char-sequence:
    d-char
    d-char-sequence d-char

d-char:
    any member of the source character set, except the left square bracket [, the right square bracket ],
                        or the control characters representing horizontal tab, vertical tab, form feed, or new-line.

Change 2.13.4 [lex.string] paragraph 1:

A string literal is a regular string literal or a raw string literal. A regular string literal does not have an R prefix. A raw string literal has an R prefix, as in R"[...]", RL"**[...]**" or LR"delim[...]delim". A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally prefixed with the letter L, as in L"...", RL"xxx[...]xxx" or LR"--[...]--". A string literal that does not have an L prefix is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type “array of n const char” and static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that has an L prefix, such as L"asdf" or RL"*[\bgd]*", is a wide string literal. A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters. The terminating d-char-sequence of a raw string literal shall be the same sequence of characters as the initial d-char-sequence. The maximum length of d-char-sequence shall be 16 characters.

[Note: A source-file new-line in a raw string-literal results in a new-line in the resulting execution string-literal, unless preceded by a backslash. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

   const char * p = R"[a\
   b
   c]";
   assert(std::strcmp(p, "ab\nc") == 0);

 -- end note]

To 2.13.4 [lex.string] paragraph 4 add:

[Example:

  const char p1[] = R""[A\bC]"" "def\0" R"raw[GHI]raw" "\n";
  const char p2[] = "A\\bCdef\0GHI\n";
  assert(sizeof(p1) == sizeof(p2) && std::memcmp(p1, p2, sizeof(p1)) == 0);

-- end example]
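As commentary on the example above, the contents that p1 is expected to match can be spelled out byte by byte using only C++03 features. This is a sketch and not part of the proposed wording; the names expected and check_p2 are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// The ordinary-literal spelling from the example above.
const char p2[] = "A\\bCdef\0GHI\n";

// The same array written one character at a time: the raw piece A\bC keeps
// its backslash, "def\0" contributes an embedded null character, and the
// final element is the terminating null character.
const char expected[] = {
    'A', '\\', 'b', 'C',
    'd', 'e', 'f', '\0',
    'G', 'H', 'I',
    '\n',
    '\0'
};

// Self-check: twelve characters plus the terminator, identical byte for byte.
inline void check_p2() {
    assert(sizeof(p2) == 13);
    assert(sizeof(p2) == sizeof(expected));
    assert(std::memcmp(p2, expected, sizeof(p2)) == 0);
}
```

The comparison uses sizeof and std::memcmp rather than std::strcmp because of the embedded null character, as the example in the proposed wording does.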


© Beman Dawes 2006