______________________________________________________________________

  2   Lexical conventions                                [lex]

  ______________________________________________________________________

1 A C++ program need not all be translated at the same time.   The  text
  of  the program is kept in units called source files in this standard.
  A source file together with all the headers (_lib.headers_) and source
  files   included   (_cpp.include_)  via  the  preprocessing  directive
  #include, less any source lines skipped  by  any  of  the  conditional
  inclusion  (_cpp.cond_) preprocessing directives, is called a transla­
  tion unit.  Previously translated translation units can  be  preserved
  individually  or  in  libraries.   The separate translation units of a
  program communicate (_basic.link_) by (for example) calls to functions
  whose identifiers have external linkage, manipulation of objects whose
  identifiers have external linkage,  or  manipulation  of  data  files.
  Translation  units  can be separately translated and then later linked
  to produce an executable program (_basic.link_).

  2.1  Phases of translation                                [lex.phases]

1 The precedence among the syntax rules of translation is  specified  by
  the following phases.1)

    1 Physical source file characters are mapped to the source character
      set  (introducing  new-line characters for end-of-line indicators)
      if necessary.  Trigraph sequences (_lex.trigraph_) are replaced by
      corresponding single-character internal representations.

    2 Each instance of a new-line character and an immediately preceding
      backslash character is deleted, splicing physical source lines  to
      form  logical source lines.  A source file that is not empty shall
      end in a new-line character, which shall not be  immediately  pre­
      ceded by a backslash character.

    3 The   source   file   is   decomposed  into  preprocessing  tokens
      (_lex.pptoken_) and sequences of white-space characters (including
      comments).  A source file shall not end in a partial preprocessing
      token or partial comment2).  Each comment is replaced by one space
      character.    New-line  characters  are  retained.   Whether  each
      nonempty sequence of white-space characters other than new-line is
      retained  or  replaced  by  one space character is implementation-
      defined.  The process of dividing a source file's characters  into
      preprocessing tokens is context-dependent.  [Example: see the
      handling of < within a #include preprocessing directive.  ]
  _________________________
  1)  Implementations must behave as if these separate phases occur,
  although in practice different phases might be folded together.
  2) A partial preprocessing token would arise from a source file ending
  in  one  or  more  characters of a multi-character token followed by a
  "line-splicing" backslash.  A  partial  comment  would  arise  from  a
  source  file  ending with an unclosed /* comment, or a // comment line
  that ends with a "line-splicing" backslash.

    4 Preprocessing directives are executed and  macro  invocations  are
      expanded.   A  #include  preprocessing  directive causes the named
      header or source file to be processed from phase 1  through  phase
      4, recursively.

    5 Each  source character set member and escape sequence in character
      literals and string literals is converted to a member of the  exe­
      cution character set.

    6 Adjacent  character string literal tokens are concatenated.  Adja­
      cent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer
      significant.  Each preprocessing token is converted into a token
      (_lex.token_).  The resulting tokens are syntactically and
      semantically analyzed and translated.  The result of this process,
      starting from a single source file, is called a translation unit.

    8 The translation units are combined to form a program.  All  exter­
      nal  object  and function references are resolved.  Library compo­
      nents are linked to satisfy external references to  functions  and
      objects not defined in the current translation.  All such transla­
      tor output is collected into a program image which contains infor­
      mation needed for execution in its execution environment.
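
  The effect of phases 2 and 6 can be observed from within a program.
  The following is a minimal sketch, not part of the normative text; the
  names GREETING, greeting, and demo_phases are hypothetical and chosen
  only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: the backslash-newline pair is deleted, so this directive
// occupies a single logical source line.
// Phase 6: the two adjacent string literals are concatenated.
#define GREETING "hello, " \
                 "world"

// Hypothetical helper, not part of the standard text.
inline const char* greeting() { return GREETING; }

inline void demo_phases() {
    assert(std::strcmp(greeting(), "hello, world") == 0);
    assert(sizeof(GREETING) == 13);  // 12 characters plus the '\0'
}
```

  Calling demo_phases() from any function runs the checks.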

  +-------                 BEGIN BOX 1                -------+
    What about shared libraries?
  +-------                  END BOX 1                 -------+

  2.2  Trigraph sequences                                 [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the
  following sequences of  three  characters  ("trigraph  sequences")  is
  replaced by the single character indicated in Table 1.

                       Table 1--trigraph sequences

  +-----------------------+------------------------+------------------------+
  |trigraph   replacement | trigraph   replacement | trigraph   replacement |
  +-----------------------+------------------------+------------------------+
  |  ??=           #      |   ??(           [      |   ??<           {      |
  +-----------------------+------------------------+------------------------+
  |  ??/           \      |   ??)           ]      |   ??>           }      |
  +-----------------------+------------------------+------------------------+
  |  ??'           ^      |   ??!           |      |   ??-           ~      |
  +-----------------------+------------------------+------------------------+

2 [Example:
          ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
  becomes
          #define arraycheck(a,b) a[b] || b[a]
   --end example]

3 [Note: no other trigraph sequence exists.  Each ? that does not begin
  one of the trigraphs listed above is not changed.  ]

  2.3  Preprocessing tokens                                [lex.pptoken]

  +-------                 BEGIN BOX 2                -------+
  We have deleted the non-terminal for 'digraph', because  the  alterna­
  tive representations are just alternative ways of expressing a "first-
  class" preprocessing token.  In C, # and ## are  grouped  with  opera­
  tors,  but that would involve more work in clause 13, and wouldn't fit
  the "spirit of C++".  Instead, we  simply  list  under  'preprocessing
  token' all the valid preprocessing tokens.  They are not further cate­
  gorized until phase 7, in which they are actual tokens.
  +-------                  END BOX 2                 -------+

          preprocessing-token:
                  header-name
                  identifier
                  pp-number
                  character-literal
                  string-literal
                  preprocessing-op-or-punc
                  each non-white-space character that cannot be one of the above

1 Each preprocessing token that is converted to  a  token  (_lex.token_)
  shall have the lexical form of a keyword, an identifier, a literal, an
  operator, or a punctuator.

2 A preprocessing token is the minimal lexical element of  the  language
  in  translation  phases  3 through 6.  The categories of preprocessing
  token are: header names, identifiers, preprocessing numbers, character
  literals, string literals, preprocessing-op-or-punc, and single
  non-white-space characters that do not lexically match the other
  preprocessing token categories.  If a ' or a " character matches the last
  category, the behavior is undefined.  Preprocessing tokens can be sep­
  arated  by  white space; this consists of comments (_lex.comment_), or
  white-space characters (space, horizontal tab, new-line, vertical tab,
  and  form-feed),  or  both.   As described in Clause _cpp_, in certain
  circumstances during translation phase 4, white space (or the  absence
  thereof)  serves  as  more than preprocessing token separation.  White
  space can appear within a preprocessing token only as part of a header
  name  or  between  the  quotation characters in a character literal or
  string literal.

3 If the input stream has been parsed into preprocessing tokens up to  a
  given  character, the next preprocessing token is the longest sequence
  of characters that could constitute a  preprocessing  token,  even  if
  that would cause further lexical analysis to fail.

4 [Example: The program fragment 1Ex is parsed as a preprocessing number
  token (one that is not a valid floating  or  integer  literal  token),
  even though a parse as the pair of preprocessing tokens 1 and Ex might
  produce a valid expression (for example, if Ex were a macro defined as
  +1).  Similarly, the program fragment 1E1 is parsed as a preprocessing
  number (one that is a valid floating literal token), whether or not  E
  is a macro name.  ]

5 [Example:  The  program  fragment  x+++++y  is  parsed as x ++ ++ + y,
  which, if x and y are of built-in  types,  violates  a  constraint  on
  increment  operators,  even though the parse x ++ + ++ y might yield a
  correct expression.  ]
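
  A well-formed counterpart of the fragment above may help.  In the
  sketch below (the function names are hypothetical), the characters
  x+++y are divided by the longest-sequence rule into x ++ + y, so the
  expression is a post-increment followed by an addition.

```cpp
#include <cassert>

// "x+++y" is split into the longest possible tokens: x ++ + y.
inline int sum_then_bump(int& x, int y) {
    return x+++y;   // same as (x++) + y
}

inline void demo_longest_token() {
    int x = 1;
    int r = sum_then_bump(x, 2);
    assert(r == 3);  // old value of x (1) plus y (2)
    assert(x == 2);  // x was incremented afterwards
}
```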

  2.4  Alternative tokens                                  [lex.digraph]

1 Alternative token representations are provided for some operators  and
  punctuators3).

2 In all respects of the language, each alternative  token  behaves  the
  same, respectively, as its primary token, except for  its  spelling4).
  The set of alternative tokens is defined in Table 2.

  _________________________
  3)  These  include "digraphs" and additional reserved words.  The term
  "digraph" (token consisting of two characters) is  not  perfectly  de­
  scriptive,  since  one of the alternative preprocessing-tokens is %:%:
  and of course several primary tokens contain two characters.  Nonethe­
  less, those alternative tokens that aren't lexical keywords are collo­
  quially known as "digraphs".
  4)    Thus   [   and   <:   behave   differently   when   "stringized"
  (_cpp.stringize_), but can otherwise be freely interchanged.

                       Table 2--alternative tokens

  +----------------------+-----------------------+-----------------------+
  |alternative   primary | alternative   primary | alternative   primary |
  +----------------------+-----------------------+-----------------------+
  |    <%           {    |     and         &&    |   and_eq        &=    |
  +----------------------+-----------------------+-----------------------+
  |    %>           }    |    bitor         |    |    or_eq        |=    |
  +----------------------+-----------------------+-----------------------+
  |    <:           [    |     or          ||    |   xor_eq        ^=    |
  +----------------------+-----------------------+-----------------------+
  |    :>           ]    |     xor          ^    |     not          !    |
  +----------------------+-----------------------+-----------------------+
  |    %:           #    |    compl         ~    |   not_eq        !=    |
  +----------------------+-----------------------+-----------------------+
  |   %:%:         ##    |   bitand         &    |                       |
  +----------------------+-----------------------+-----------------------+
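
  The sketch below is an illustration only, assuming a translator that
  implements all of Table 2; the names STR, first_or_zero, and
  demo_alternative_tokens are hypothetical.  It writes a small function
  with alternative tokens and then shows the one observable difference
  mentioned in footnote 4, namely the spelling produced by stringizing.

```cpp
#include <cassert>
#include <cstring>

#define STR(x) #x   // stringizes its argument without expanding it

// Written with alternative tokens; equivalent to
//   int first_or_zero(const int* a, bool ok)
//   { if (!ok || a[0] < 0) return 0; return a[0]; }
inline int first_or_zero(const int* a, bool ok) <%
    if (not ok or a<:0:> < 0) return 0;
    return a<:0:>;
%>

inline void demo_alternative_tokens() {
    int v<:2:> = <% 7, 9 %>;              // int v[2] = { 7, 9 };
    assert(first_or_zero(v, true) == 7);
    assert(first_or_zero(v, false) == 0);
    assert((6 bitand 3) == 2);
    assert((6 xor 3) == 5);
    // Only the spelling differs, which stringizing makes visible.
    assert(std::strcmp(STR(<:), "<:") == 0);
    assert(std::strcmp(STR([), "[") == 0);
}
```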

  2.5  Tokens                                                [lex.token]
          token:
                  identifier
                  keyword
                  literal
                  operator
                  punctuator

1 There  are  five  kinds  of tokens: identifiers, keywords, literals,5)
  operators, and other  separators.   Blanks,  horizontal  and  vertical
  tabs, newlines, formfeeds, and comments (collectively, "white space"),
  as described below, are ignored  except  as  they  serve  to  separate
  tokens.   Some  white space is required to separate otherwise adjacent
  identifiers, keywords, and literals.

  2.6  Comments                                            [lex.comment]

1 The characters /* start a comment, which terminates with  the  charac­
  ters  */.  These comments do not nest.  The characters // start a com­
  ment, which terminates with the next new-line character. If there is a
  form-feed  or  a vertical-tab character in such a comment, only white-
  space characters shall appear between it and the new-line that  termi­
  nates  the  comment;  no  diagnostic  is required.  [Note: The comment
  characters //, /*, and */ have no special meaning within a //  comment
  and  are  treated  just like other characters.  Similarly, the comment
  characters // and /* have no special meaning within a /* comment.  ]
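
  As an informal illustration of the note above (the function names are
  hypothetical), the following sketch places comment characters inside
  comments of the other kind and counts which statements remain.

```cpp
#include <cassert>

inline int count_statements() {
    int n = 0;
    n += 1;  // a /* here does not open a block comment
    n += 2;  /* a // here does not hide the rest of the line */ n += 4;
    return n;
}

inline void demo_comments() {
    assert(count_statements() == 7);  // all three additions were compiled
}
```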

  _________________________
  5) Literals include strings and character and numeric literals.

  2.7  Preprocessing numbers                              [lex.ppnumber]
          pp-number:
                  digit
                  . digit
                  pp-number digit
                  pp-number nondigit
                  pp-number e sign
                  pp-number E sign
                  pp-number .

1 Preprocessing number tokens lexically  include  all  integral  literal
  tokens (_lex.icon_) and all floating literal tokens (_lex.fcon_).

2 A  preprocessing  number  does not have a type or a value; it acquires
  both after a successful conversion (as part of  translation  phase  7,
  _lex.phases_)  to  an  integral  literal  token  or a floating literal
  token.
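
  Because a preprocessing number need not be a valid literal, the
  fragment 1Ex from _lex.pptoken_ can be observed as a single
  preprocessing token by stringizing it before phase 7.  This is a
  sketch only; STR, XSTR, one_plus_macro, and demo_pp_number are
  hypothetical names, and Ex is the macro assumed in that example.

```cpp
#include <cassert>
#include <cstring>

#define STR(x)  #x        // stringize without macro expansion
#define XSTR(x) STR(x)    // expand macros in the argument, then stringize
#define Ex +1             // the macro assumed in the 1Ex example

// With white space there are two tokens, 1 and Ex, so Ex is expanded.
inline int one_plus_macro() { return 1 Ex; }

inline void demo_pp_number() {
    // "1Ex" is one preprocessing number, so the macro Ex is never seen.
    assert(std::strcmp(XSTR(1Ex), "1Ex") == 0);
    assert(one_plus_macro() == 2);
}

#undef Ex
```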

  2.8  Identifiers                                            [lex.name]
          identifier:
                  nondigit
                  identifier nondigit
                  identifier digit
          nondigit: one of
                  _ a b c d e f g h i j k l m
                    n o p q r s t u v w x y z
                    A B C D E F G H I J K L M
                    N O P Q R S T U V W X Y Z
          digit: one of
                  0 1 2 3 4 5 6 7 8 9

1 An identifier is an arbitrarily long sequence of letters  and  digits.
  The  first  character shall be a nondigit.  Upper- and lower-case let­
  ters are different.  All characters are significant.

2 In addition, identifiers containing a double underscore (__) or begin­
  ning  with an underscore and an upper-case letter are reserved for use
  by C++ implementations and standard libraries and shall  not  be  used
  otherwise; no diagnostic is required.

  2.9  Keywords                                                [lex.key]

1 The identifiers shown in Table 3 are reserved for use as keywords, and
  shall not be used otherwise in phases 7 and 8:

                            Table 3--keywords

  +--------------------------------------------------------------------------+
  |asm          do             inline             short         typeid       |
  |auto         double         int                signed        typename     |
  |bool         dynamic_cast   long               sizeof        union        |
  |break        else           mutable            static        unsigned     |
  |case         enum           namespace          static_cast   using        |
  |catch        explicit       new                struct        virtual      |
  |char         extern         operator           switch        void         |
  |class        false          private            template      volatile     |
  |const        float          protected          this          wchar_t      |
  |const_cast   for            public             throw         while        |
  |continue     friend         register           true                       |
  |default      goto           reinterpret_cast   try                        |
  |delete       if             return             typedef                    |
  +--------------------------------------------------------------------------+

2 Furthermore, the alternative representations shown in Table 4 for cer­
  tain  operators and punctuators (_lex.digraph_) are reserved and shall
  not be used otherwise:

                   Table 4--alternative representations

            +------------------------------------------------+
            |and      and_eq   bitand   bitor   compl    not |
            |not_eq   or       or_eq    xor     xor_eq       |
            +------------------------------------------------+

3 The lexical representation of C++ programs includes a number  of  pre­
  processing  tokens which are used in the syntax of the preprocessor or
  are converted into tokens for operators and punctuators:
          preprocessing-op-or-punc: one of
          {       }       [       ]       #       ##      (       )
          <:      :>      <%      %>      %:      %:%:    ;       :       ...
          new     delete  ?       ::      .       .*
          +       -       *       /       %       ^       &       |       ~
          !       =       <       >       +=      -=      *=      /=      %=
          ^=      &=      |=      <<      >>      >>=     <<=     ==      !=
          <=      >=      &&      ||      ++      --      ,       ->*     ->
          and     and_eq  bitand  bitor   compl   not     not_eq  or      or_eq
          xor     xor_eq

  Each preprocessing-op-or-punc is converted to a single token in trans­
  lation phase 7 (_lex.phases_).

  2.10  Literals                                           [lex.literal]

1 There are several kinds of literals.6)
          literal:
                  integer-literal
                  character-literal
                  floating-literal
                  string-literal
                  boolean-literal

  2.10.1  Integer literals                                    [lex.icon]
          integer-literal:
                  decimal-literal integer-suffixopt
                  octal-literal integer-suffixopt
                  hexadecimal-literal integer-suffixopt
          decimal-literal:
                  nonzero-digit
                  decimal-literal digit
          octal-literal:
                  0
                  octal-literal octal-digit
          hexadecimal-literal:
                  0x hexadecimal-digit
                  0X hexadecimal-digit
                  hexadecimal-literal hexadecimal-digit
          nonzero-digit: one of
                  1  2  3  4  5  6  7  8  9
          octal-digit: one of
                  0  1  2  3  4  5  6  7
          hexadecimal-digit: one of
                  0  1  2  3  4  5  6  7  8  9
                  a  b  c  d  e  f
                  A  B  C  D  E  F
          integer-suffix:
                  unsigned-suffix long-suffixopt
                  long-suffix unsigned-suffixopt
          unsigned-suffix: one of
                  u  U
          long-suffix: one of
                  l  L

1 An integer literal is a sequence of digits that has no period or
  exponent part.  An integer literal may have a prefix that specifies its
  base and a suffix that specifies its type.  The lexically first  digit
  of  the sequence of digits is the most significant.  A decimal integer
  literal (base ten) begins with a digit other than 0 and consists of  a
  sequence  of  decimal  digits.   An octal integer literal (base eight)
  begins with the digit 0 and consists of a sequence of octal  digits.7)
  A hexadecimal integer literal (base sixteen) begins with 0x or 0X and
  consists of a sequence of hexadecimal digits, which include the decimal
  digits  and  the letters a or A through f or F with decimal values ten
  through fifteen.  [Example: the number twelve can be written 12,  014,
  or 0XC.  ]
  _________________________
  6)  The  term  "literal"  generally  designates, in this International
  Standard, those tokens that are called "constants" in ISO C.
  7) The digits 8 and 9 are not octal digits.

2 The type of an integer literal depends on its form, value, and suffix.
  If it is decimal and has no suffix, it has the first of these types in
  which  its  value  can  be  represented:  int, long int, unsigned long
  int.8)  If  it  is  octal or hexadecimal and has no suffix, it has the
  first of these types in which  its  value  can  be  represented:  int,
  unsigned  int, long int, unsigned long int.  If it is suffixed by u or
  U, its type is the first of these types in which its value can be rep­
  resented:  unsigned int, unsigned long int.  If it is suffixed by l or
  L, its type is the first of these types in which its value can be rep­
  resented:  long  int, unsigned long int.  If it is suffixed by ul, lu,
  uL, Lu, Ul, lU, UL, or LU, its type is unsigned long int.

3 A program is ill-formed if one of its translation  units  contains  an
  integer  literal  that  cannot  be  represented  by any of the allowed
  types.
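
  The rules above can be probed with overload resolution.  The sketch
  below is illustrative only: type_of and demo_integer_literals are
  hypothetical names, and only small values are used so that the first
  type in each list always applies.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helpers that report the type chosen for a literal.
inline const char* type_of(int)           { return "int"; }
inline const char* type_of(unsigned int)  { return "unsigned int"; }
inline const char* type_of(long)          { return "long int"; }
inline const char* type_of(unsigned long) { return "unsigned long int"; }

inline void demo_integer_literals() {
    assert(12 == 014 && 014 == 0XC);   // twelve in bases 10, 8, and 16
    assert(std::strcmp(type_of(12),   "int") == 0);
    assert(std::strcmp(type_of(12u),  "unsigned int") == 0);
    assert(std::strcmp(type_of(12L),  "long int") == 0);
    assert(std::strcmp(type_of(12uL), "unsigned long int") == 0);
    assert(std::strcmp(type_of(12LU), "unsigned long int") == 0);
}
```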

  2.10.2  Character literals                                  [lex.ccon]
          character-literal:
                  'c-char-sequence'
                  L'c-char-sequence'
          c-char-sequence:
                  c-char
                  c-char-sequence c-char
          c-char:
                  any member of the source character set except
                          the single-quote ', backslash \, or new-line character
                  escape-sequence
          escape-sequence:
                  simple-escape-sequence
                  octal-escape-sequence
                  hexadecimal-escape-sequence
          simple-escape-sequence: one of
                  \'  \"  \?  \\
                  \a  \b  \f  \n  \r  \t  \v
          octal-escape-sequence:
                  \ octal-digit
                  \ octal-digit octal-digit
                  \ octal-digit octal-digit octal-digit
          hexadecimal-escape-sequence:
                  \x hexadecimal-digit
                  hexadecimal-escape-sequence hexadecimal-digit
  _________________________
  8) A decimal integer literal with no suffix never  has  type  unsigned
  int.   Otherwise, for example, on an implementation where unsigned int
  values have 16 bits and unsigned long values have strictly  more  than
  17  bits,  we  would have -30000<0, -50000>0 (because 50000 would have
  type unsigned int), and -70000<0 (because 70000 would have type long).

1 A character literal is one  or  more  characters  enclosed  in  single
  quotes,  as  in  'x', optionally preceded by the letter L, as in L'x'.
  Single character literals that do not begin with  L  have  type  char,
  with  value  equal  to  the  numerical  value  of the character in the
  machine's character set.  Multicharacter literals that  do  not  begin
  with L have type int and implementation-defined value.

2 A character literal that begins with the letter L, such as L'ab', is a
  wide-character  literal.  Wide-character literals have type wchar_t.9)
  Wide-character literals have implementation-defined values, regardless
  of the number of characters in the literal.

3 Certain nongraphic characters, the single quote ', the double quote ",
  the question mark ?, and the backslash \, can be represented according
  to Table 5.

                        Table 5--escape sequences

                   +----------------------------------+
                   |new-line          NL (LF)   \n    |
                   |horizontal tab    HT        \t    |
                   |vertical tab      VT        \v    |
                   |backspace         BS        \b    |
                   |carriage return   CR        \r    |
                   |form feed         FF        \f    |
                   |alert             BEL       \a    |
                   |backslash         \         \\    |
                   |question mark     ?         \?    |
                   |single quote      '         \'    |
                   |double quote      "         \"    |
                   |octal number      ooo       \ooo  |
                   |hex number        hhh       \xhhh |
                   +----------------------------------+
  The double quote " and the question mark ? can be represented as
  themselves or by the escape sequences \" and \? respectively, but the
  single quote ' and the backslash \ shall be represented by the escape
  sequences \' and \\ respectively.  If the character following a
  backslash is not one of those specified, the behavior is undefined.  An
  escape sequence specifies a single character.

4 The escape \ooo consists of the backslash followed  by  one,  two,  or
  three  octal digits that are taken to specify the value of the desired
  character.  The escape \xhhh consists of the backslash followed  by  x
  followed  by  one or more hexadecimal digits that are taken to specify
  the value of the desired character.  There is no limit to  the  number
  of digits in a hexadecimal sequence.  A sequence of octal or
  hexadecimal digits is terminated by the first character that is not an
  octal digit or a hexadecimal digit, respectively.  The value of a
  character literal is implementation-defined if it falls outside of the
  implementation-defined range defined for char (for ordinary literals)
  or wchar_t (for wide literals).
  _________________________
  9) They are intended for character sets where a character does not fit
  into a single byte.
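
  A few of the properties described in this subclause can be checked
  directly.  The following is a sketch under the stated rules, with the
  hypothetical name demo_character_literals; it deliberately avoids
  depending on any particular character set.

```cpp
#include <cassert>

inline void demo_character_literals() {
    assert(sizeof('x') == 1);                  // type char
    assert(sizeof(L'x') == sizeof(wchar_t));   // type wchar_t
    assert('\101' == '\x41');                  // octal 101 and hex 41 are both 65
    assert('\'' != '\\');                      // quote and backslash need escapes
    assert('"' == '\"' && '?' == '\?');        // these two may be written either way
}
```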

  2.10.3  Floating literals                                   [lex.fcon]
          floating-literal:
                  fractional-constant exponent-partopt floating-suffixopt
                  digit-sequence exponent-part floating-suffixopt
          fractional-constant:
                  digit-sequenceopt . digit-sequence
                  digit-sequence .
          exponent-part:
                  e signopt digit-sequence
                  E signopt digit-sequence
          sign: one of
                  +  -
          digit-sequence:
                  digit
                  digit-sequence digit
          floating-suffix: one of
                  f  l  F  L

1 A floating literal consists of an integer part,  a  decimal  point,  a
  fraction  part,  an e or E, an optionally signed integer exponent, and
  an optional type suffix.  The integer and fraction parts both  consist
  of  a  sequence of decimal (base ten) digits.  Either the integer part
  or the fraction part (not both) can be  missing;  either  the  decimal
  point  or the letter e (or E) and the exponent (not both) can be miss­
  ing.  The integer part, the optional decimal point  and  the  optional
  fraction  part form the significant part of the floating literal.  The
  exponent, if present, indicates the power of 10 by which the  signifi­
  cant  part  is  to  be scaled.  If the scaled value is in the range of
  representable values for its type, the result is  either  the  nearest
  representable  value,  or  the  larger  or smaller representable value
  immediately adjacent to the nearest representable value, chosen in an
  implementation-defined manner.  The type of a floating literal is dou­
  ble unless explicitly specified by a suffix.  The  suffixes  f  and  F
  specify float, the suffixes l and L specify long double.
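
  The effect of the suffixes and of the optional parts can be sketched
  as follows.  The names type_of and demo_floating_literals are
  hypothetical, and the comparisons use only values assumed to be
  exactly representable, so that the implementation-defined choice of
  neighbouring value does not matter.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helpers that report the type chosen for a literal.
inline const char* type_of(float)       { return "float"; }
inline const char* type_of(double)      { return "double"; }
inline const char* type_of(long double) { return "long double"; }

inline void demo_floating_literals() {
    assert(std::strcmp(type_of(1.5),  "double") == 0);
    assert(std::strcmp(type_of(1.5f), "float") == 0);
    assert(std::strcmp(type_of(1.5L), "long double") == 0);
    assert(1. == 1.0 && .5 == 0.5);  // integer or fraction part may be missing
    assert(1e3 == 1000.0);           // the exponent scales by a power of 10
}
```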

  2.10.4  String literals                                   [lex.string]
          string-literal:
                  "s-char-sequenceopt"
                  L"s-char-sequenceopt"
          s-char-sequence:
                  s-char
                  s-char-sequence s-char
          s-char:
                  any member of the source character set except
                          the double-quote ", backslash \, or new-line character
                  escape-sequence

1 A   string  literal  is  a  sequence  of  characters  (as  defined  in
  _lex.ccon_) surrounded by double quotes, optionally beginning with the
  letter L, as in "..." or L"...".  A string literal that does not begin
  with L has  type  "array  of  n  char"  and  static  storage  duration
  (_basic.stc_), where n is the size of the string as defined below, and
  is initialized with the  given  characters.   A  string  literal  that
  begins  with  L,  such  as  L"asdf", is a wide string literal.  A wide
  string literal has type "array of n wchar_t" and  has  static  storage
  duration,  where  n is the size of the string as defined below, and is
  initialized with the given characters.

2 Whether all string literals are  distinct  (that  is,  are  stored  in
  nonoverlapping  objects)  is  implementation-defined.   The  effect of
  attempting to modify a string literal is undefined.

3 In translation phase 6 (_lex.phases_), adjacent  string  literals  are
  concatenated and adjacent wide string literals are concatenated.  If a
  string literal token is adjacent to a wide string literal  token,  the
  behavior  is  undefined.   Characters in concatenated strings are kept
  distinct.  [Example:
          "\xA" "B"
  contains the two characters '\xA' and 'B' after concatenation (and not
  the single hexadecimal character '\xAB').  ]

4 After any necessary concatenation, in translation phase 7
  (_lex.phases_), '\0' is appended to every string literal so that
  programs that scan a string can find its end.

5 Escape  sequences in string literals have the same meaning as in char­
  acter literals (_lex.ccon_), except that the single quote ' is  repre­
  sentable either by itself or by the escape sequence \', and the double
  quote " shall be preceded by a \.  The size of a string literal is the
  number  of  its  characters including the '\0' terminator, except that
  each escape sequence specifies a single character.
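
  The size rule and the concatenation example from paragraph 3 can be
  checked with sizeof.  This is an informal sketch; the name
  demo_string_literals is hypothetical.

```cpp
#include <cassert>

inline void demo_string_literals() {
    assert(sizeof("asdf") == 5);                     // four characters plus '\0'
    assert(sizeof(L"asdf") == 5 * sizeof(wchar_t));  // wide string literal
    assert(sizeof("\xA" "B") == 3);                  // '\xA', 'B', '\0'
    const char* s = "\xA" "B";
    assert(s[0] == '\xA' && s[1] == 'B' && s[2] == '\0');
    assert(sizeof("a\tb") == 4);                     // the escape is one character
    assert(sizeof("") == 1);                         // only the terminator
}
```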

  2.10.5  Boolean literals                                    [lex.bool]
          boolean-literal:
                  false
                  true

1 The Boolean literals are the keywords false and true.   Such  literals
  have type bool.  They are not lvalues.