Doc. no. WG21/N2053=06-0123
Project: Programming Language C++
Reply to: Beman Dawes <firstname.lastname@example.org>
Regular Expression motivating example
Markup motivating example
Raw character literals?
In recent years it has become more common for C++ to work with regular expressions and with markup languages such as HTML and XML.
Regular expressions use the same backslash escape sequence as C++ does in string literals. The resulting plethora of backslashes is very difficult to write correctly and impenetrable to read. See Regular expressions motivating example.
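To make the doubling concrete, here is a minimal sketch that stores the regular expression \d+\.\d+ in a C++03 string literal. The helper names decimal_pattern and count_backslashes are illustrative only and are not part of this proposal. The regex engine needs three backslashes, so the source code has to contain six.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The regex engine must receive the 8 characters  \d+\.\d+
// In a C++03 string literal every one of those backslashes is doubled.
const char* decimal_pattern() {
    return "\\d+\\.\\d+";
}

// Counts the backslashes actually stored in the null-terminated string s.
std::size_t count_backslashes(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '\\') {
            ++n;
        }
    }
    return n;
}
```

Calling count_backslashes(decimal_pattern()) reports the number of backslashes the regex engine sees, which is half the number typed in the source.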
Markup languages such as XML and HTML use a lot of quotation marks and newlines. The resulting escape sequences in string literals are irritating, cumbersome, and error prone. See Markup motivating example.
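As a small illustration of the markup case, the sketch below writes a single line of HTML as a C++03 string literal. The names body_tag and count_char are hypothetical helpers chosen for this example. Every attribute quote needs a \" escape and the line break needs \n, none of which appear in the text the program actually stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stored text:  <BODY LINK="#0000ff">  followed by a newline.
// The literal needs two \" escapes and one \n escape to express it.
const char* body_tag() {
    return "<BODY LINK=\"#0000ff\">\n";
}

// Counts occurrences of character c in the null-terminated string s.
std::size_t count_char(const char* s, char c) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (*s == c) {
            ++n;
        }
    }
    return n;
}
```

The stored string contains two quotation marks, one newline, and no backslashes at all; the backslashes exist only in the source code.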
Other programming languages, such as Perl, Python, and Lua, have addressed these issues by providing raw string literals in addition to regular string literals. A raw string literal is simply a string literal that does not recognize C++ escape sequences. Raw string literals are well accepted and used regularly (pun intended!) in languages that have them.
This document proposes adding raw string literals to C++0x.
The proposal is a pure extension. It will have no impact on any existing code.
The proposal has some minor interaction with proposal N2018 to add additional character types. See www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2018.html. If N2018 is accepted, four additional lines must be added to the grammar in addition to the two additional grammar lines proposed in N2018.
Here is an example of the concatenated string literals in an actual C++ program (by John Maddock):
"(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|" "(//[^\\n]*|/\\*.*?\\*/)|" "\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)" "?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|" "('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|" "\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import" "|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall" "|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool" "|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete" "|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto" "|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected" "|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast" "|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned" "|using|virtual|void|volatile|wchar_t|while)\\>"
Note in particular the line that reads:
"('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"
Are the five consecutive backslashes in [^\\\\\"] correct or not? Even experts become easily confused. Here is the equivalent line as a raw string:
('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\
Note that the five-backslash sequence has been reduced to a more manageable two-backslash sequence. And, yes, the original five-backslash sequence was both correct and necessary in C++03.
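One way to convince oneself that five backslashes are right is to decode the C++03 escapes left to right. The following sketch isolates the character class containing the five-backslash sequence; the function name quote_class is illustrative only.

```cpp
#include <cassert>
#include <cstring>

// The character class containing the five-backslash sequence, written as a
// C++03 string literal. Decoding the escapes from left to right:
//   \\  -> one backslash   (first half of the regex escape for a backslash)
//   \\  -> one backslash   (second half of that regex escape)
//   \"  -> one double quote
// The regex engine therefore receives the six characters  [^\\"]
// meaning "any character except a backslash or a double quote".
const char* quote_class() {
    return "[^\\\\\"]";
}
```

Four of the five backslashes survive as two stored backslashes, and the fifth only protects the quotation mark from ending the literal.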
Here is the complete example using the raw string proposal:
R""(^[[:blank:]]*#(?:[^\\\n]|\\[^\n[:punct:][:word:]]*[\n[:punct:][:word:]])*)|\ (//[^\n]*|/\*.*?\*/)|\ \<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)\ ?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>|\ ('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\ \<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import\ |__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall\ |__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool\ |break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete\ |do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto\ |if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected\ |public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast\ |struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned\ |using|virtual|void|volatile|wchar_t|while)\>""
Here is an example of the concatenated string literals in an actual C++ program (again by John Maddock):
"<HTML>\n" "<HEAD>\n" "<TITLE>Auto-generated html formated source</TITLE>\n" "<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=windows-1252\">\n" "</HEAD>\n" "<BODY LINK=\"#0000ff\" VLINK=\"#800080\" BGCOLOR=\"#ffffff\">\n" "<P> </P>\n" "<PRE>\n"
Here is the complete example using the raw string proposal:
R"$\ <HTML> <HEAD> <TITLE>Auto-generated html formated source</TITLE> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252"> </HEAD> <BODY LINK="#0000ff" VLINK="#800080" BGCOLOR="#ffffff"> <P> </P> <PRE> $"
There are several reasons the raw string version is preferred:
As a deliberate design choice, the proposal does not include raw character (as opposed to string) literals because there is no apparent need; escape sequences do not pose the same practical problems in character literals that they do in string-literals.
The arguments in favor of raw character literals are symmetry and error-reduction. Knowing that raw string-literals are allowed, programmers are likely to assume raw character-literals are also available. Indeed, a committee member inadvertently made that assumption when reading a draft of this paper. Although the resulting compiler error is easy to fix, there is the argument that it is better to eliminate the possibility of the error by providing raw character-literals in the first place.
I will be happy to provide proposed wording if the committee desires to add raw character-literals.
This proposal was initiated in response to a posting on the LWG reflector from Thomas Witt, with comments from several other committee members. John Maddock provided insights about the string-literal needs of regular expressions. Robert Klarer provided examples and clarifications.
Added text is shown in green and underlined. Deleted text is shown in red with strikethrough. Commentary is shown in gray shading and is not part of the proposed wording.
Change 2.1 [lex.phases], paragraph 5:
Each source character set member[deleted: , escape sequence, or universal-character-name] in character literals and string literals[added: , and escape sequence or universal-character name in character literals and regular string literals,] is converted to the corresponding member of the execution character set (2.13.2, 2.13.4); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.17)
Change 2.13.4 [lex.string] :
R"d-char r-char-sequenceopt d-char"
LR"d-char r-char-sequenceopt d-char"
RL"d-char r-char-sequenceopt d-char"
uR"d-char r-char-sequenceopt d-char" Applies only if N2018 is accepted
Ru"d-char r-char-sequenceopt d-char" Applies only if N2018 is accepted
UR"d-char r-char-sequenceopt d-char" Applies only if N2018 is accepted
UL"d-char r-char-sequenceopt d-char" Applies only if N2018 is accepted
any member of the source character set, except the initial d-char when followed by ".
any member of the source character set for which the terminating d-char is the same character as the initial d-char.
Change 2.13.4 [lex.string] paragraph 1:
[added: A string literal is a regular string literal or a raw string literal. A regular string literal does not have an R prefix. A raw string literal has an R prefix, as in LR""..."".] A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally [added: prefixed] [deleted: beginning] with the letter L, as in LR""..."". A string literal that does not [deleted: begin with] [added: have an] L [added: prefix] is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type “array of n const char” and static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that [deleted: begins with] [added: has an] L [added: prefix], such as RL"/\bgd/", is a wide string literal. A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
[Example: Whether or not a source-file new-line in a raw string-literal results in a newline in the resulting execution string-literal is determined by the second phase of translation (2.1) rules for a trailing backslash:
const char * p1 = R""abc
def"";
assert(strcmp(p1, "abc\ndef") == 0); // assert will succeed
const char * p2 = R""abc\
def"";
assert(strcmp(p2, "abcdef") == 0); // assert will succeed
-- end example]
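The trailing-backslash rule referred to in this example already applies to regular string literals in C++03, so its effect can be checked without any raw string support. A minimal sketch follows; the function name spliced is illustrative only.

```cpp
#include <cassert>
#include <cstring>

// Phase 2 of translation deletes each backslash that is immediately
// followed by a new-line, joining the two source lines before the
// literal is tokenized. The continuation line starts in column 1 here;
// any leading whitespace on it would become part of the string.
const char* spliced() {
    return "abc\
def";
}
```

Because the splice happens before string literals are recognized, the stored text is the six characters abcdef with no newline, which is the behavior the proposed wording relies on for p2.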
To 2.13.4 [lex.string] paragraph 4 add:
[Example:
const char * p1 = R"$A\bC$" "def" R"!GHI!";
const char * p2 = "A\\bCdefGHI";
assert(strcmp(p1, p2) == 0); // assert will succeed
-- end example]
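Current compilers implement the raw string syntax that C++11 later standardized, R"delimiter( ... )delimiter", rather than the single-delimiter syntax proposed here. As a rough illustration only, and as an assumption outside this proposal, the same concatenation can be tried in that form with empty delimiters; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Assumes a C++11 (or later) compiler. The parenthesized form below is the
// C++11 spelling and is NOT the syntax proposed in this paper.
const char* raw_concatenated() {
    return R"(A\bC)" "def" R"(GHI)";   // raw + regular + raw literal
}

// The same characters written as a single regular string literal.
const char* escaped_equivalent() {
    return "A\\bCdefGHI";
}
```

Adjacent raw and regular literals are concatenated just as adjacent regular literals are, and the backslash in the raw part is kept as an ordinary character.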
© Beman Dawes 2006