Doc. no.   WG21/N2053=06-0123
Date:        2006-09-06
Project:     Programming Language C++
Reply to:   Beman Dawes <bdawes@acm.org>

Raw String Literals

Introduction
Motivating examples
Regular Expression motivating example
Markup motivating example
Implementation experience
Raw character literals?
Acknowledgements
Proposed wording

Introduction

In recent years it has become more common for C++ to work with regular expressions and with markup languages such as HTML and XML.

Regular expressions use the same backslash escape character as C++ string literals do, so every backslash intended for the regular expression engine must itself be escaped in the literal. The resulting plethora of backslashes is very difficult to write correctly and impenetrable to read. See Regular Expression motivating example.

Markup languages such as XML and HTML use a lot of quotation marks and newlines. The resulting escape sequences in string literals are irritating, cumbersome, and error prone. See Markup motivating example.

Other programming languages, such as Perl, Python, and Lua, have addressed these issues by providing raw string literals in addition to regular string literals. A raw string literal is simply a string literal that does not recognize C++ escape sequences. Raw string literals are well accepted and used regularly (pun intended!) in languages that have them.

This document proposes adding raw string literals to C++0x.

The proposal is a pure extension. It will have no impact on any existing code.

The proposal has some minor interaction with proposal N2018 to add additional character types. See www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html. If N2018 is accepted, four additional lines must be added to the grammar in addition to the two additional grammar lines proposed in N2018.

Motivating examples

Regular Expression motivating example

Here is an example of the concatenated string literals in an actual C++ program (by John Maddock):

"(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|"
"(//[^\\n]*|/\\*.*?\\*/)|"
"\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)"
"?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|"
"('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"
"\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import"
"|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall"
"|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool"
"|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete"
"|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto"
"|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected"
"|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast"
"|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned"
"|using|virtual|void|volatile|wchar_t|while)\\>"

Note in particular the line that reads:

"('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"

Are the five backslashes in the sequence \\\\\" correct or not? Even experts are easily confused. Here is the equivalent line as a raw string:

('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\

Note that the five-backslash sequence has been reduced to a more manageable two-backslash sequence. And, yes, the original five-backslash sequence was both correct and necessary in C++03.
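The correctness of the five backslashes can be checked without raw strings. The following is a minimal sketch and not part of the proposal; the names escaped, expected, five_backslashes_are_correct, and check_escaped_fragment are hypothetical. It takes the character class from the line in question, spells the characters the regular expression engine must receive one at a time, and compares them with the C++03 literal.

```cpp
#include <cassert>
#include <cstring>

// The character class as it must be spelled in a C++03 string literal:
// five backslashes followed by a double quote.
const char escaped[] = "[^\\\\\"]";

// The six characters the regular expression engine must actually receive,
// spelled one at a time:  [ ^ \ \ " ]
const char expected[] = { '[', '^', '\\', '\\', '"', ']', '\0' };

// Hypothetical check: true when the C++03 spelling yields the intended text.
bool five_backslashes_are_correct() {
    return std::strcmp(escaped, expected) == 0;
}

// Usage: call from any test driver.
void check_escaped_fragment() {
    assert(five_backslashes_are_correct());
    assert(std::strlen(escaped) == 6);  // two backslashes survive, not five
}
```

The first four backslashes in the literal pair up into two backslash characters, and the fifth escapes the double quote.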

Here is the complete example using the raw string proposal:

R""(^[[:blank:]]*#(?:[^\\\n]|\\[^\n[:punct:][:word:]]*[\n[:punct:][:word:]])*)|\
(//[^\n]*|/\*.*?\*/)|\
\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)\
?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>|\
('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\
\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import\
|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall\
|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool\
|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete\
|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto\
|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected\
|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast\
|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned\
|using|virtual|void|volatile|wchar_t|while)\>""

Markup motivating example

Here is an example of the concatenated string literals in an actual C++ program (again by John Maddock):

"<HTML>\n"
"<HEAD>\n"
"<TITLE>Auto-generated html formated source</TITLE>\n"
"<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=windows-1252\">\n"
"</HEAD>\n"
"<BODY LINK=\"#0000ff\" VLINK=\"#800080\" BGCOLOR=\"#ffffff\">\n"
"<P> </P>\n"
"<PRE>\n"

Here is the complete example using the raw string proposal:

R"$\
<HTML>
<HEAD>
<TITLE>Auto-generated html formated source</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
</HEAD>
<BODY LINK="#0000ff" VLINK="#800080" BGCOLOR="#ffffff">
<P> </P>
<PRE>
$"

There are several reasons the raw string version is preferred:

It is easier to write, whether by hand or by cut-and-paste from an actual HTML file.
It is easier to read, although perhaps not as markedly easier as with the regular expression example.
Code that does markup language generation often does a lot of it, so the multiplier effect is large. In other words, even moderate gains in writeability and readability in a single example become important when multiplied by many similar uses in a larger program.
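To make the writing cost concrete, the following sketch performs the escaping that a programmer must otherwise do by hand when pasting markup into a C++03 string literal. It is an illustration only: the function names to_cxx03_literal and check_to_cxx03_literal are hypothetical, and only backslash, double quote, and new-line are handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: spell `text` as one C++03 string literal, including
// the surrounding double quotes. Only backslash, double quote, and new-line
// are escaped; other characters are copied unchanged.
std::string to_cxx03_literal(const std::string& text) {
    std::string out = "\"";
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        switch (text[i]) {
            case '\\': out += "\\\\"; break;  // backslash -> two backslashes
            case '"':  out += "\\\""; break;  // quote -> backslash, quote
            case '\n': out += "\\n";  break;  // new-line -> backslash, n
            default:   out += text[i]; break;
        }
    }
    out += '"';
    return out;
}

// Usage: one line of the markup example above.
void check_to_cxx03_literal() {
    const std::string line = "<BODY LINK=\"#0000ff\">\n";
    assert(to_cxx03_literal(line) == "\"<BODY LINK=\\\"#0000ff\\\">\\n\"");
}
```

Every character added by this function is one that a raw string literal would not require.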

Implementation experience

Not yet.

Raw character literals?

As a deliberate design choice, the proposal does not include raw character (as opposed to string) literals because there is no apparent need; escape sequences do not pose the same practical problems in character literals that they do in string-literals.

The arguments in favor of raw character literals are symmetry and error-reduction. Knowing that raw string-literals are allowed, programmers are likely to assume raw character-literals are also available. Indeed, a committee member inadvertently made that assumption when reading a draft of this paper. Although the resulting compiler error is easy to fix, there is the argument that it is better to eliminate the possibility of the error by providing raw character-literals in the first place.

I will be happy to provide proposed wording if the committee desires to add raw character-literals.

Acknowledgements

This proposal was initiated in response to a posting on the LWG reflector from Thomas Witt, with comments from several other committee members. John Maddock provided insights about the string-literal needs of regular expressions. Robert Klarer provided examples and clarifications.

Proposed wording

Added text is shown in green and underlined. Deleted text is shown in ~~red with strikethrough~~. Commentary is shown in gray shading and is not part of the proposed wording.

Change 2.1 [lex.phases], paragraph 5:

Each source character set member~~, escape sequence, or universal-character-name~~ in character literals and string literals, and escape sequence or universal-character-name in character literals and regular string literals, is converted to the corresponding member of the execution character set (2.13.2, 2.13.4); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.¹⁷⁾

Change 2.13.4 [lex.string] :

string-literal:
    "s-char-sequence_opt"
    L"s-char-sequence_opt"
    R"d-char r-char-sequence_opt d-char"
    LR"d-char r-char-sequence_opt d-char"
    RL"d-char r-char-sequence_opt d-char"
    uR"d-char r-char-sequence_opt d-char"        Applies only if N2018 is accepted
    Ru"d-char r-char-sequence_opt d-char"        Applies only if N2018 is accepted
    UR"d-char r-char-sequence_opt d-char"       Applies only if N2018 is accepted
    RU"d-char r-char-sequence_opt d-char"       Applies only if N2018 is accepted

r-char-sequence:
    r-char
    r-char-sequence r-char

r-char:
    any member of the source character set, except the initial d-char when followed by ".

d-char:
    any member of the source character set for which std::ispunct is true;
    the terminating d-char is the same character as the initial d-char.
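
The grammar above can be read as a simple scanning rule: the character after R" is the delimiter, and the literal ends at the first later occurrence of that delimiter immediately followed by ". The following is a minimal sketch of that rule, not an implementation from any compiler. The names scan_raw_body and check_scan_raw_body are hypothetical, the input is assumed to begin just after R", and phase 2 line splicing is ignored.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper. `src` begins at the initial d-char, that is, just
// after R". On success the r-char-sequence is stored in `body` and the
// function returns true; otherwise `body` is left unchanged.
bool scan_raw_body(const std::string& src, std::string& body) {
    if (src.empty()) return false;
    const char d = src[0];
    // d-char must be a punctuation character.
    if (!std::ispunct(static_cast<unsigned char>(d))) return false;
    // The literal ends at the first d-char that is followed by ".
    for (std::string::size_type i = 1; i + 1 < src.size(); ++i) {
        if (src[i] == d && src[i + 1] == '"') {
            body = src.substr(1, i - 1);
            return true;
        }
    }
    return false;  // unterminated literal
}

// Usage: delimiters taken from the examples in this paper.
void check_scan_raw_body() {
    std::string body;
    assert(scan_raw_body("$A\\bC$\"", body) && body == "A\\bC");
    assert(scan_raw_body("\"abc\ndef\"\"", body) && body == "abc\ndef");
    assert(!scan_raw_body("xabcx\"", body));  // x is not punctuation
}
```

A d-char that is not followed by " is an ordinary r-char, so in this sketch the input $a$b$" yields the body a$b.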

Change 2.13.4 [lex.string] paragraph 1:

A string literal is a regular string literal or a raw string literal. A regular string literal does not have an R prefix. A raw string literal has an R prefix, as in R""..."", RL""..."" or LR""..."". A string literal is ~~a sequence of characters (as defined in 2.13.2) surrounded by double quotes,~~ optionally prefixed ~~beginning~~ with the letter L, as in L"...", RL""..."" or LR""..."". A string literal that does not ~~begin with~~ have an L prefix is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type “array of n const char” and static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that ~~begins with~~ has an L prefix, such as L"asdf" or RL"/\bgd/", is a wide string literal. A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.

[Example: Whether or not a source-file new-line in a raw string-literal results in a newline in the resulting execution string-literal is determined by the second phase of translation (2.1) rules for a trailing backslash:

const char * p1 = R""abc
def"";
assert(strcmp(p1, "abc\ndef") == 0); // assert will succeed
const char * p2 = R""abc\
def"";
assert(strcmp(p2, "abcdef") == 0); // assert will succeed
-- end example]
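
The phase 2 rule that this example relies on can be modelled in a few lines. The following is a sketch under stated assumptions rather than a description of any compiler: the names splice_lines and check_splice_lines are hypothetical, and the function does only what phase 2 does to a trailing backslash, namely delete the backslash together with the new-line that follows it.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of translation phase 2: each backslash immediately
// followed by a new-line is deleted together with that new-line. All other
// characters, including other backslashes, are copied unchanged.
std::string splice_lines(const std::string& source) {
    std::string out;
    for (std::string::size_type i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // skip the new-line as well as the backslash
            continue;
        }
        out += source[i];
    }
    return out;
}

// Usage: the two raw string bodies from the example above.
void check_splice_lines() {
    assert(splice_lines("abc\ndef") == "abc\ndef");   // new-line is kept
    assert(splice_lines("abc\\\ndef") == "abcdef");   // lines are spliced
}
```

Because splicing happens in phase 2, before the raw string literal is recognized, a trailing backslash keeps this one special meaning even though escape sequences do not apply.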

To 2.13.4 [lex.string] paragraph 4 add:

[Example:

const char * p1 = R"$A\bC$" "def" R"!GHI!";
const char * p2 = "A\\bCdefGHI";
assert(strcmp(p1, p2) == 0); // assert will succeed
-- end example]