______________________________________________________________________
2 Lexical conventions [lex]
______________________________________________________________________
1 A C++ program need not all be translated at the same time. The text
of the program is kept in units called source files in this standard.
A source file together with all the headers (_lib.headers_) and source
files included (_cpp.include_) via the preprocessing directive
#include, less any source lines skipped by any of the conditional
inclusion (_cpp.cond_) preprocessing directives, is called a
translation unit. Previously translated translation units can be preserved
individually or in libraries. The separate translation units of a
program communicate (_basic.link_) by (for example) calls to functions
whose identifiers have external linkage, manipulation of objects whose
identifiers have external linkage, or manipulation of data files.
Translation units can be separately translated and then later linked
to produce an executable program (_basic.link_).
2.1 Phases of translation [lex.phases]
1 The precedence among the syntax rules of translation is specified by
the following phases.1)
1 Physical source file characters are mapped to the source character
set (introducing new-line characters for end-of-line indicators)
if necessary. Trigraph sequences (_lex.trigraph_) are replaced by
corresponding single-character internal representations.
2 Each instance of a new-line character and an immediately preceding
backslash character is deleted, splicing physical source lines to
form logical source lines. A source file that is not empty shall
end in a new-line character, which shall not be immediately
preceded by a backslash character.
3 The source file is decomposed into preprocessing tokens
(_lex.pptoken_) and sequences of white-space characters (including
comments). A source file shall not end in a partial preprocessing
token or partial comment2). Each comment is replaced by one space
character. New-line characters are retained. Whether each
nonempty sequence of white-space characters other than new-line is
retained or replaced by one space character is
implementation-defined. The process of dividing a source file's
characters into preprocessing tokens is context-dependent.
[Example: see the handling of < within a #include preprocessing
directive. ]
_________________________
1) Implementations must behave as if these separate phases occur,
although in practice different phases might be folded together.
2) A partial preprocessing token would arise from a source file ending
in one or more characters of a multi-character token followed by a
"line-splicing" backslash. A partial comment would arise from a
source file ending with an unclosed /* comment, or a // comment line
that ends with a "line-splicing" backslash.
4 Preprocessing directives are executed and macro invocations are
expanded. A #include preprocessing directive causes the named
header or source file to be processed from phase 1 through phase
4, recursively.
5 Each source character set member and escape sequence in character
literals and string literals is converted to a member of the
execution character set.
6 Adjacent character string literal tokens are concatenated.
Adjacent wide string literal tokens are concatenated.
7 White-space characters separating tokens are no longer
significant. Each preprocessing token is converted into a token. (See
_lex.token_). The resulting tokens are syntactically and
semantically analyzed and translated. The result of this process
starting from a single source file is called a translation unit.
8 The translation units are combined to form a program. All
external object and function references are resolved. Library
components are linked to satisfy external references to functions and
objects not defined in the current translation. All such
translator output is collected into a program image which contains
information needed for execution in its execution environment.
+------- BEGIN BOX 1 -------+
What about shared libraries?
+------- END BOX 1 -------+
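As an informal illustration of phases 2 and 6 (not part of the normative text), the following sketch relies on line splicing inside a macro definition and on concatenation of adjacent string literals. The names GREETING and greeting are hypothetical and were chosen only for this example.

```cpp
#include <string>

// Phase 2: a backslash immediately followed by a new-line is deleted,
// so the two physical lines below form one logical line.
#define GREETING "hello, " \
                 "world"

// Phase 6: the two adjacent string literal tokens produced by the macro
// are concatenated into a single string literal.
std::string greeting() { return GREETING; }
```

Under these assumptions greeting() yields the twelve-character string "hello, world".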
2.2 Trigraph sequences [lex.trigraph]
1 Before any other processing takes place, each occurrence of one of the
following sequences of three characters ("trigraph sequences") is
replaced by the single character indicated in Table 1.
Table 1--trigraph sequences
+-----------------------+------------------------+------------------------+
|trigraph replacement | trigraph replacement | trigraph replacement |
+-----------------------+------------------------+------------------------+
| ??= # | ??( [ | ??< { |
+-----------------------+------------------------+------------------------+
| ??/ \ | ??) ] | ??> } |
+-----------------------+------------------------+------------------------+
| ??' ^ | ??! | | ??- ~ |
+-----------------------+------------------------+------------------------+
2 [Example:
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
becomes
#define arraycheck(a,b) a[b] || b[a]
--end example]
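The replacement described by Table 1 can be sketched as an ordinary function over a string. This is only an illustration of the mapping, not a description of how any translator implements phase 1; the names trigraph_replacement and replace_trigraphs are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns the replacement for a trigraph whose third character is c,
// or '\0' if c does not complete a trigraph (see Table 1).
char trigraph_replacement(char c) {
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Scans src left to right and replaces every trigraph sequence with the
// single character it stands for; other characters are copied unchanged.
std::string replace_trigraphs(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (i + 2 < src.size() && src[i] == '?' && src[i + 1] == '?') {
            const char r = trigraph_replacement(src[i + 2]);
            if (r != '\0') {
                out += r;
                i += 3;
                continue;
            }
        }
        out += src[i];
        ++i;
    }
    return out;
}
```

When passing test input to replace_trigraphs from C++ source, it is safest to write one question mark of each pair as the escape sequence \? so that the translator does not perform the replacement itself.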
2.3 Preprocessing tokens [lex.pptoken]
+------- BEGIN BOX 2 -------+
We have deleted the non-terminal for 'digraph', because the
alternative representations are just alternative ways of expressing a
"first-class" preprocessing token. In C, # and ## are grouped with
operators, but that would involve more work in clause 13, and wouldn't
fit the "spirit of C++". Instead, we simply list under 'preprocessing
token' all the valid preprocessing tokens. They are not further
categorized until phase 7, in which they are actual tokens.
+------- END BOX 2 -------+
preprocessing-token:
header-name
identifier
pp-number
character-literal
string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
1 Each preprocessing token that is converted to a token (_lex.token_)
shall have the lexical form of a keyword, an identifier, a literal, an
operator, or a punctuator.
2 A preprocessing token is the minimal lexical element of the language
in translation phases 3 through 6. The categories of preprocessing
token are: header names, identifiers, preprocessing numbers, character
literals, string literals, preprocessing-op-or-punc, and single
non-white-space characters that do not lexically match the other
preprocessing token categories. If a ' or a " character matches the last
category, the behavior is undefined. Preprocessing tokens can be
separated by white space; this consists of comments (_lex.comment_),
or white-space characters (space, horizontal tab, new-line, vertical
tab, and form-feed), or both. As described in Clause _cpp_, in
certain circumstances during translation phase 4, white space (or the
absence thereof) serves as more than preprocessing token separation.
White space can appear within a preprocessing token only as part of a
header name or between the quotation characters in a character literal
or string literal.
3 If the input stream has been parsed into preprocessing tokens up to a
given character, the next preprocessing token is the longest sequence
of characters that could constitute a preprocessing token, even if
that would cause further lexical analysis to fail.
4 [Example: The program fragment 1Ex is parsed as a preprocessing number
token (one that is not a valid floating or integer literal token),
even though a parse as the pair of preprocessing tokens 1 and Ex might
produce a valid expression (for example, if Ex were a macro defined as
+1). Similarly, the program fragment 1E1 is parsed as a preprocessing
number (one that is a valid floating literal token), whether or not E
is a macro name. ]
5 [Example: The program fragment x+++++y is parsed as x ++ ++ + y,
which, if x and y are of built-in types, violates a constraint on
increment operators, even though the parse x ++ + ++ y might yield a
correct expression. ]
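A shorter relative of the x+++++y example compiles and makes the longest-token rule observable. The function names munch and spaced are hypothetical.

```cpp
// x+++y contains no white space, so the longest-token rule yields
// the tokens x ++ + y, that is, (x++) + y.
int munch(int x, int y) { return x+++y; }

// With explicit white space the programmer chooses the tokens:
// x ++ + ++ y, that is, (x++) + (++y).
int spaced(int x, int y) { return x ++ + ++ y; }
```

With these definitions munch(1, 2) returns 3 and spaced(1, 2) returns 4.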
2.4 Alternative tokens [lex.digraph]
1 Alternative token representations are provided for some operators and
punctuators3).
2 In all respects of the language, each alternative token behaves the
same, respectively, as its primary token, except for its spelling4).
The set of alternative tokens is defined in Table 2.
_________________________
3) These include "digraphs" and additional reserved words. The term
"digraph" (token consisting of two characters) is not perfectly
descriptive, since one of the alternative preprocessing-tokens is %:%:
and of course several primary tokens contain two characters.
Nonetheless, those alternative tokens that aren't lexical keywords are
colloquially known as "digraphs".
4) Thus [ and <: behave differently when "stringized"
(_cpp.stringize_), but can otherwise be freely interchanged.
Table 2--alternative tokens
+----------------------+-----------------------+-----------------------+
|alternative primary | alternative primary | alternative primary |
+----------------------+-----------------------+-----------------------+
| <% { | and && | and_eq &= |
+----------------------+-----------------------+-----------------------+
| %> } | bitor | | or_eq |= |
+----------------------+-----------------------+-----------------------+
| <: [ | or || | xor_eq ^= |
+----------------------+-----------------------+-----------------------+
| :> ] | xor ^ | not ! |
+----------------------+-----------------------+-----------------------+
| %: # | compl ~ | not_eq != |
+----------------------+-----------------------+-----------------------+
| %:%: ## | bitand & | |
+----------------------+-----------------------+-----------------------+
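The following sketch writes one small function with alternative tokens and once more with primary tokens, and uses stringizing to show the one place where the spelling is visible (footnote 4). All names in it (STRINGIZE, has_all, has_all_primary, third, spelled_alternative, spelled_primary) are hypothetical.

```cpp
#include <string>

#define STRINGIZE(x) #x

// Written with alternative tokens: <% %> for braces, <: :> for brackets,
// and the reserved words bitand, and, not_eq.
bool has_all(unsigned flags, unsigned mask) <%
    return (flags bitand mask) == mask and mask not_eq 0u;
%>

int third(const int* a) <% return a<:2:>; %>

// The first function again, written with primary tokens.
bool has_all_primary(unsigned flags, unsigned mask) {
    return (flags & mask) == mask && mask != 0u;
}

// Stringizing preserves the spelling, so the two forms differ here.
std::string spelled_alternative() { return STRINGIZE(<:); }
std::string spelled_primary() { return STRINGIZE([); }
```

Apart from the two stringized spellings, has_all and has_all_primary are expected to be interchangeable.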
2.5 Tokens [lex.token]
token:
identifier
keyword
literal
operator
punctuator
1 There are five kinds of tokens: identifiers, keywords, literals,5)
operators, and other separators. Blanks, horizontal and vertical
tabs, newlines, formfeeds, and comments (collectively, "white space"),
as described below, are ignored except as they serve to separate
tokens. Some white space is required to separate otherwise adjacent
identifiers, keywords, and literals.
2.6 Comments [lex.comment]
1 The characters /* start a comment, which terminates with the
characters */. These comments do not nest. The characters // start a
comment, which terminates with the next new-line character. If there is
a form-feed or a vertical-tab character in such a comment, only
white-space characters shall appear between it and the new-line that
terminates the comment; no diagnostic is required. The comment characters
//, /*, and */ have no special meaning within a // comment and are
treated just like other characters. Similarly, the comment characters
// and /* have no special meaning within a /* comment.
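A small sketch of these rules follows; the function name comment_demo is hypothetical. The case of /* appearing inside a /* comment is left out because some translators warn about it.

```cpp
int comment_demo() {
    int n = 0;
    n += 1; // inside a line comment, /* and */ are ordinary characters
    n += 2; /* inside this comment, // is ordinary too */ n += 4;
    return n;
}
```

All three additions are executed, so comment_demo() returns 7.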
_________________________
5) Literals include strings and character and numeric literals.
2.7 Preprocessing numbers [lex.ppnumber]
pp-number:
digit
. digit
pp-number digit
pp-number nondigit
pp-number e sign
pp-number E sign
pp-number .
1 Preprocessing number tokens lexically include all integral literal
tokens (_lex.icon_) and all floating literal tokens (_lex.fcon_).
2 A preprocessing number does not have a type or a value; it acquires
both after a successful conversion (as part of translation phase 7,
_lex.phases_) to an integral literal token or a floating literal
token.
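One way to observe that a preprocessing number is a single preprocessing token is to stringize it. The sketch below reuses the 1Ex fragment from _lex.pptoken_; the macros STR, XSTR, and Ex and the two functions are hypothetical and exist only for this illustration.

```cpp
#include <string>

#define STR(x)  #x
#define XSTR(x) STR(x)   // expands its argument before stringizing it
#define Ex +1            // hypothetical macro, as in the 1Ex example

// 1Ex is one preprocessing number, so the macro Ex is never recognized.
std::string glued() { return XSTR(1Ex); }

// 1 Ex is two preprocessing tokens, so Ex is expanded before stringizing.
std::string separated() { return XSTR(1 Ex); }
```

Here glued() yields "1Ex" unchanged, while the result of separated() contains the expansion of Ex. Because the fragment 1Ex is only stringized, it is never converted to a token in phase 7 and so causes no error.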
2.8 Identifiers [lex.name]
identifier:
nondigit
identifier nondigit
identifier digit
nondigit: one of
_ a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
digit: one of
0 1 2 3 4 5 6 7 8 9
1 An identifier is an arbitrarily long sequence of letters and digits.
The first character shall be a nondigit. Upper- and lower-case
letters are different. All characters are significant.
2.9 Keywords [lex.key]
1 The identifiers shown in Table 3 are reserved for use as keywords, and
shall not be used otherwise in phases 7 and 8:
Table 3--keywords
+--------------------------------------------------------------------------+
|asm do inline short typeid |
|auto double int signed typename |
|bool dynamic_cast long sizeof union |
|break else mutable static unsigned |
|case enum namespace static_cast using |
|catch explicit new struct virtual |
|char extern operator switch void |
|class false private template volatile |
|const float protected this wchar_t |
|const_cast for public throw while |
|continue friend register true |
|default goto reinterpret_cast try |
|delete if return typedef |
+--------------------------------------------------------------------------+
2 Furthermore, the alternative representations shown in Table 4 for
certain operators and punctuators (_lex.digraph_) are reserved and shall
not be used otherwise:
Table 4--alternative representations
+------------------------------------------------+
|and and_eq bitand bitor compl not |
|not_eq or or_eq xor xor_eq |
+------------------------------------------------+
3 In addition, identifiers containing a double underscore (__) or
beginning with an underscore and an upper-case letter are reserved for
use by C++ implementations and standard libraries and shall not be used
otherwise; no diagnostic is required.
4 The lexical representation of C++ programs includes a number of
preprocessing tokens which are used in the syntax of the preprocessor or
are converted into tokens for operators and punctuators:
preprocessing-op-or-punc: one of
{ } [ ] # ## ( )
<: :> <% %> %: %:%: ; : ...
new delete ? :: . .*
+ - * / % ^ & | ~
! = < > += -= *= /= %=
^= &= |= << >> >>= <<= == !=
<= >= && || ++ -- , ->* ->
and and_eq bitand bitor compl not not_eq or or_eq
xor xor_eq
After preprocessing, each preprocessing-op-or-punc is converted to a
single token in translation phase 7 (_lex.phases_).
5 [Note: Certain implementation-defined properties, such as the type of
a sizeof (_expr.sizeof_) expression, the ranges of fundamental types
(_basic.fundamental_), and the types of the most basic library
functions are defined in the standard headers <climits>, <cstddef>, and
<new> (_lib.language.support_). ]
2.10 Literals [lex.literal]
1 There are several kinds of literals.6)
literal:
integer-literal
character-literal
floating-literal
string-literal
boolean-literal
2.10.1 Integer literals [lex.icon]
integer-literal:
decimal-literal integer-suffixopt
octal-literal integer-suffixopt
hexadecimal-literal integer-suffixopt
decimal-literal:
nonzero-digit
decimal-literal digit
octal-literal:
0
octal-literal octal-digit
hexadecimal-literal:
0x hexadecimal-digit
0X hexadecimal-digit
hexadecimal-literal hexadecimal-digit
nonzero-digit: one of
1 2 3 4 5 6 7 8 9
octal-digit: one of
0 1 2 3 4 5 6 7
hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
integer-suffix:
unsigned-suffix long-suffixopt
long-suffix unsigned-suffixopt
unsigned-suffix: one of
u U
long-suffix: one of
l L
_________________________
6) The term "literal" generally designates, in this International
Standard, those tokens that are called "constants" in ISO C.
1 An integer literal consisting of a sequence of digits is taken to be
decimal (base ten) unless it begins with 0 (digit zero). A sequence
of octal digits7) starting with 0 is taken to be an octal integer
(base eight). A sequence of digits preceded by 0x or 0X is taken to
be a hexadecimal integer (base sixteen). The hexadecimal digits
include a or A through f or F with decimal values ten through fifteen.
[Example: the number twelve can be written 12, 014, or 0XC. ]
2 The type of an integer literal depends on its form, value, and suffix.
If it is decimal and has no suffix, it has the first of these types in
which its value can be represented: int, long int, unsigned long
int.8) If it is octal or hexadecimal and has no suffix, it has the
first of these types in which its value can be represented: int,
unsigned int, long int, unsigned long int. If it is suffixed by u or
U, its type is the first of these types in which its value can be
represented: unsigned int, unsigned long int. If it is suffixed by l or
L, its type is the first of these types in which its value can be
represented: long int, unsigned long int. If it is suffixed by ul, lu,
uL, Lu, Ul, lU, UL, or LU, its type is unsigned long int.
3 A program is ill-formed if it contains an integer literal that cannot
be represented by any of the allowed types.
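The rules for small values can be checked at translation time. The sketch below assumes a translator that provides static_assert, decltype, and constexpr, which are later additions and not part of this draft; it is restricted to values that fit in int, because later revisions of the language add further candidate types for large values. The function names are hypothetical.

```cpp
#include <type_traits>

// Decimal, octal, and hexadecimal spellings of the number twelve.
constexpr bool twelve_spellings_agree() {
    return 12 == 014 && 014 == 0XC;
}

// Suffixes select among the candidate types listed in paragraph 2.
static_assert(std::is_same<decltype(12), int>::value, "no suffix: int");
static_assert(std::is_same<decltype(12u), unsigned int>::value, "u: unsigned int");
static_assert(std::is_same<decltype(12L), long>::value, "L: long int");
static_assert(std::is_same<decltype(12uL), unsigned long>::value, "uL: unsigned long int");

// An unsuffixed decimal literal never has type unsigned int (footnote 8),
// so negating it gives a negative value.
static_assert(!std::is_same<decltype(50000), unsigned int>::value,
              "decimal without suffix is not unsigned int");
constexpr bool negation_is_negative() { return -50000 < 0; }
```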
2.10.2 Character literals [lex.ccon]
character-literal:
'c-char-sequence'
L'c-char-sequence'
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except
the single-quote ', backslash \, or new-line character
escape-sequence
escape-sequence:
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
simple-escape-sequence: one of
\' \" \? \\
\a \b \f \n \r \t \v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
_________________________
7) The digits 8 and 9 are not octal digits.
8) A decimal integer literal with no suffix never has type unsigned
int. Otherwise, for example, on an implementation where unsigned int
values have 16 bits and unsigned long values have strictly more than
17 bits, we would have -30000<0, -50000>0 (because 50000 would have
type unsigned int), and -70000<0 (because 70000 would have type long).
1 A character literal is one or more characters enclosed in single
quotes, as in 'x', optionally preceded by the letter L, as in L'x'.
Single character literals that do not begin with L have type char,
with value equal to the numerical value of the character in the
machine's character set. Multicharacter literals that do not begin
with L have type int and implementation-defined value.
2 A character literal that begins with the letter L, such as L'ab', is a
wide-character literal. Wide-character literals have type wchar_t.9)
Wide-character literals have implementation-defined values, regardless
of the number of characters in the literal.
3 Certain nongraphic characters, the single quote ', the double quote ",
the question mark ?, and the backslash \, can be represented according
to Table 5.
Table 5--escape sequences
+----------------------------------+
|new-line NL (LF) \n |
|horizontal tab HT \t |
|vertical tab VT \v |
|backspace BS \b |
|carriage return CR \r |
|form feed FF \f |
|alert BEL \a |
|backslash \ \\ |
|question mark ? \? |
|single quote ' \' |
|double quote " \" |
|octal number ooo \ooo |
|hex number hhh \xhhh |
+----------------------------------+
If the character following a backslash is not one of those specified,
the behavior is undefined. An escape sequence specifies a single
character.
4 The escape \ooo consists of the backslash followed by one, two, or
three octal digits that are taken to specify the value of the desired
character. The escape \xhhh consists of the backslash followed by x
followed by one or more hexadecimal digits that are taken to specify
the value of the desired character. There is no limit to the number
of digits in a hexadecimal sequence. A sequence of octal or
hexadecimal digits is terminated by the first character that is not an
octal digit or a hexadecimal digit, respectively. The value of a
character literal is implementation-defined if it exceeds that of the
largest char (for ordinary literals) or wchar_t (for wide literals).
_________________________
9) They are intended for character sets where a character does not fit
into a single byte.
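A few of these properties can be checked directly. The sketch assumes a translator that provides static_assert, decltype, and constexpr (later additions, not part of this draft), and it avoids multicharacter literals, whose values are implementation-defined. The function names are hypothetical.

```cpp
#include <type_traits>

static_assert(std::is_same<decltype('x'), char>::value, "ordinary literal: char");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "wide literal: wchar_t");

// Octal 101 and hexadecimal 41 both specify the value sixty-five.
constexpr bool escapes_agree() { return '\101' == '\x41' && '\x41' == 65; }

// A simple escape sequence specifies a single character.
constexpr bool escape_is_one_char() { return sizeof('\n') == 1 && '\?' == '?'; }
```

Whether the value sixty-five corresponds to the letter A depends on the execution character set, so the sketch deliberately does not compare against 'A'.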
2.10.3 Floating literals [lex.fcon]
floating-literal:
fractional-constant exponent-partopt floating-suffixopt
digit-sequence exponent-part floating-suffixopt
fractional-constant:
digit-sequenceopt . digit-sequence
digit-sequence .
exponent-part:
e signopt digit-sequence
E signopt digit-sequence
sign: one of
+ -
digit-sequence:
digit
digit-sequence digit
floating-suffix: one of
f l F L
1 A floating literal consists of an integer part, a decimal point, a
fraction part, an e or E, an optionally signed integer exponent, and
an optional type suffix. The integer and fraction parts both consist
of a sequence of decimal (base ten) digits. Either the integer part
or the fraction part (not both) can be missing; either the decimal
point or the letter e (or E) and the exponent (not both) can be
missing. The type of a floating literal is double unless explicitly
specified by a suffix. The suffixes f and F specify float, the suffixes l
and L specify long double.
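The permitted forms and the effect of the suffixes can be sketched as follows, again assuming a translator with static_assert, decltype, and constexpr (later additions, not part of this draft). The values are chosen to be exactly representable in binary floating point so that the comparisons are meaningful; forms_agree is a hypothetical name.

```cpp
#include <type_traits>

static_assert(std::is_same<decltype(1.0), double>::value, "no suffix: double");
static_assert(std::is_same<decltype(1.0f), float>::value, "f: float");
static_assert(std::is_same<decltype(1.0L), long double>::value, "L: long double");

// The integer part or the fraction part may be omitted, and the decimal
// point may be omitted when an exponent is present.
constexpr bool forms_agree() {
    return 1. == 1.0 && .5 == 0.5 && 1e3 == 1000.0 && 15e-1 == 1.5;
}
```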
2.10.4 String literals [lex.string]
string-literal:
"s-char-sequenceopt"
L"s-char-sequenceopt"
s-char-sequence:
s-char
s-char-sequence s-char
s-char:
any member of the source character set except
the double-quote ", backslash \, or new-line character
escape-sequence
1 A string literal is a sequence of characters (as defined in
_lex.ccon_) surrounded by double quotes, optionally beginning with the
letter L, as in "..." or L"...". A string literal that does not begin
with L has type "array of n char" and static storage duration
(_basic.stc_), where n is the size of the string as defined below, and
is initialized with the given characters. Whether all string literals
are distinct (that is, are stored in nonoverlapping objects) is
implementation-defined. The effect of attempting to modify a string
literal is undefined.
2 A string literal that begins with L, such as L"asdf", is a wide string
literal. A wide string literal has type "array of n wchar_t," where n
is the size of the string as defined below.
3 Adjacent string literals are concatenated. Adjacent wide string
literals are concatenated. If a string literal token is adjacent to a
wide string literal token, the behavior is undefined. Characters in
concatenated strings are kept distinct. [Example:
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not
the single hexadecimal character '\xAB'). ]
4 After any necessary concatenation '\0' is appended so that programs
that scan a string can find its end. The size of a string is the
number of its characters including this terminator. Within a string, the
double quote character " shall be preceded by a \.
5 Escape sequences in string literals have the same meaning as in
character literals (_lex.ccon_).
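The size rule and the concatenation example can be checked with sizeof. The sketch assumes a translator that provides constexpr (a later addition, not part of this draft); it only reads the literals, so it does not depend on whether the array elements are const. All function names are hypothetical.

```cpp
#include <cstddef>

// "\xA" "B" concatenates to two characters plus the terminating '\0'.
constexpr std::size_t concat_size() { return sizeof("\xA" "B"); }

constexpr char concat_first()  { return ("\xA" "B")[0]; }
constexpr char concat_second() { return ("\xA" "B")[1]; }

// The size of "asdf" counts the appended terminator.
constexpr std::size_t asdf_size() { return sizeof("asdf"); }

// A wide string literal is an array of wchar_t with the same element count.
constexpr std::size_t wide_asdf_elements() {
    return sizeof(L"asdf") / sizeof(wchar_t);
}
```

Under these assumptions concat_size() is 3, and both asdf_size() and wide_asdf_elements() are 5.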
2.10.5 Boolean literals [lex.bool]
boolean-literal:
false
true
1 The Boolean literals are the keywords false and true. Such literals
have type bool. They are not lvalues.