  ______________________________________________________________________

  2   Lexical conventions                                [lex]

  ______________________________________________________________________

1 A C++ program need not all be translated at the same time.  The text
  of the program is kept in units called source files in this standard.
  A source file together with all the headers (_lib.headers_) and source
  files included (_cpp.include_) via the preprocessing directive
  #include, less any source lines skipped by any of the conditional
  inclusion (_cpp.cond_) preprocessing directives, is called a
  translation unit.  Previously translated translation units can be
  preserved individually or in libraries.  The separate translation
  units of a program communicate (_basic.link_) by (for example) calls
  to functions whose identifiers have external linkage, manipulation of
  objects whose identifiers have external linkage, or manipulation of
  data files.  Translation units can be separately translated and then
  later linked to produce an executable program (_basic.link_).

  2.1  Phases of translation                                [lex.phases]

1 The precedence among the syntax rules of translation is  specified  by
  the following phases.1)

    1 Physical source file characters are mapped to the source character
      set  (introducing  new-line characters for end-of-line indicators)
      if necessary.  Trigraph sequences (_lex.trigraph_) are replaced by
      corresponding single-character internal representations.

    2 Each instance of a new-line character and an immediately preceding
      backslash character is deleted, splicing physical source lines to
      form logical source lines.  A source file that is not empty shall
      end in a new-line character, which shall not be immediately
      preceded by a backslash character.

    3 The source file is decomposed into preprocessing tokens
      (_lex.pptoken_) and sequences of white-space characters (including
      comments).  A source file shall not end in a partial preprocessing
      token or partial comment2).  Each comment is replaced by one space
      character.  New-line characters are retained.  Whether each
      nonempty sequence of white-space characters other than new-line is
      retained or replaced by one space character is
      implementation-defined.  The process of dividing a source file's
      characters into preprocessing tokens is context-dependent.
      [Example: see the handling of < within a #include preprocessing
      directive.  ]

    4 Preprocessing directives are executed and macro invocations are
      expanded.  A #include preprocessing directive causes the named
      header or source file to be processed from phase 1 through phase
      4, recursively.

    5 Each source character set member and escape sequence in character
      literals and string literals is converted to a member of the
      execution character set.

    6 Adjacent character string literal tokens are concatenated.
      Adjacent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer
      significant.  Each preprocessing token is converted into a token
      (_lex.token_).  The resulting tokens are syntactically and
      semantically analyzed and translated.  The result of this process
      starting from a single source file is called a translation unit.

    8 The translation units are combined to form a program.  All
      external object and function references are resolved.  Library
      components are linked to satisfy external references to functions
      and objects not defined in the current translation.  All such
      translator output is collected into a program image which contains
      information needed for execution in its execution environment.
  _________________________
  1) Implementations must behave as if these separate phases occur,
  although in practice different phases might be folded together.
  2) A partial preprocessing token would arise from a source file ending
  in one or more characters of a multi-character token followed by a
  "line-splicing" backslash.  A partial comment would arise from a
  source file ending with an unclosed /* comment, or a // comment line
  that ends with a "line-splicing" backslash.
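
  The effect of phases 2, 3, and 6 can be observed in ordinary code.
  The following is a minimal sketch, not normative text; the names
  GREETING, greeting, and spliced_sum are hypothetical and chosen only
  for illustration.

```cpp
#include <cassert>
#include <cstring>

// Phase 2: the backslash-newline pair below is deleted, so the macro
// definition forms a single logical source line.
#define GREETING "hello, " \
                 "world"

// Phase 3: each comment is replaced by one space character, so the
// comments below separate tokens without contributing any text.
inline int spliced_sum() { return 1/**/+/**/2; }

// Phase 6: the two adjacent string literals produced by GREETING are
// concatenated into one string literal.
inline const char* greeting() { return GREETING; }
```

  Calling greeting() yields the single string "hello, world", and
  spliced_sum() evaluates the expression 1 + 2.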

  +-------                 BEGIN BOX 1                -------+
    What about shared libraries?
  +-------                  END BOX 1                 -------+

  2.2  Trigraph sequences                                 [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the
  following sequences of  three  characters  ("trigraph  sequences")  is
  replaced by the single character indicated in Table 1.

                       Table 1--trigraph sequences

  +-----------------------+------------------------+------------------------+
  |trigraph   replacement | trigraph   replacement | trigraph   replacement |
  +-----------------------+------------------------+------------------------+
  |  ??=           #      |   ??(           [      |   ??<           {      |
  +-----------------------+------------------------+------------------------+
  |  ??/           \      |   ??)           ]      |   ??>           }      |
  +-----------------------+------------------------+------------------------+
  |  ??'           ^      |   ??!           |      |   ??-           ~      |
  +-----------------------+------------------------+------------------------+

2 [Example:
          ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
  becomes
          #define arraycheck(a,b) a[b] || b[a]
   --end example]
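
  The replacement in Table 1 can be modelled at run time.  The sketch
  below is an illustration only and assumes nothing about how a
  translator implements phase 1; the function names
  trigraph_replacement and replace_trigraphs are hypothetical.  The
  code deliberately avoids spelling two adjacent question marks, so it
  reads the same whether or not the translator itself replaces
  trigraphs.

```cpp
#include <cassert>
#include <string>

// Maps the third character of a candidate trigraph to its replacement
// from Table 1, or returns '\0' when no trigraph is formed.
inline char trigraph_replacement(char third) {
    switch (third) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Scans left to right and replaces each three-character sequence made
// of two question marks and a Table 1 character by a single character.
inline std::string replace_trigraphs(const std::string& in) {
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            const char r = trigraph_replacement(in[i + 2]);
            if (r != '\0') {
                out += r;
                i += 2;  // skip the two remaining trigraph characters
                continue;
            }
        }
        out += in[i];
    }
    return out;
}
```

  When passing test input as a string literal, the second question mark
  of each pair can be written with the escape \? so that the literal is
  not itself subject to trigraph replacement.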

  2.3  Preprocessing tokens                                [lex.pptoken]

  +-------                 BEGIN BOX 2                -------+
  We have deleted the non-terminal for 'digraph', because the
  alternative representations are just alternative ways of expressing a
  "first-class" preprocessing token.  In C, # and ## are grouped with
  operators, but that would involve more work in clause 13, and wouldn't
  fit the "spirit of C++".  Instead, we simply list under 'preprocessing
  token' all the valid preprocessing tokens.  They are not further
  categorized until phase 7, in which they are actual tokens.
  +-------                  END BOX 2                 -------+

          preprocessing-token:
                  header-name
                  identifier
                  pp-number
                  character-literal
                  string-literal
                  preprocessing-op-or-punc
                  each non-white-space character that cannot be one of the above

1 Each  preprocessing  token  that is converted to a token (_lex.token_)
  shall have the lexical form of a keyword, an identifier, a literal, an
  operator, or a punctuator.

2 A preprocessing token is the minimal lexical element of the language
  in translation phases 3 through 6.  The categories of preprocessing
  token are: header names, identifiers, preprocessing numbers, character
  literals, string literals, preprocessing-op-or-punc, and single
  non-white-space characters that do not lexically match the other
  preprocessing token categories.  If a ' or a " character matches the
  last category, the behavior is undefined.  Preprocessing tokens can be
  separated by white space; this consists of comments (_lex.comment_),
  or white-space characters (space, horizontal tab, new-line, vertical
  tab, and form-feed), or both.  As described in Clause _cpp_, in
  certain circumstances during translation phase 4, white space (or the
  absence thereof) serves as more than preprocessing token separation.
  White space can appear within a preprocessing token only as part of a
  header name or between the quotation characters in a character literal
  or string literal.

3 If  the input stream has been parsed into preprocessing tokens up to a
  given character, the next preprocessing token is the longest  sequence
  of  characters  that  could  constitute a preprocessing token, even if
  that would cause further lexical analysis to fail.

4 [Example: The program fragment 1Ex is parsed as a preprocessing number
  token  (one  that  is  not a valid floating or integer literal token),
  even though a parse as the pair of preprocessing tokens 1 and Ex might
  produce a valid expression (for example, if Ex were a macro defined as
  +1).  Similarly, the program fragment 1E1 is parsed as a preprocessing
  number  (one that is a valid floating literal token), whether or not E
  is a macro name.  ]

5 [Example: The program fragment x+++++y is parsed  as  x  ++  ++  +  y,
  which,  if  x  and  y  are of built-in types, violates a constraint on
  increment operators, even though the parse x ++ + ++ y might  yield  a
  correct expression.  ]
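
  The longest-match rule can be checked with a compilable variant of the
  two examples above.  This is a sketch under stated assumptions: the
  macro Ex mirrors the hypothetical macro of the first example, and the
  helper names STRINGIZE_RAW, STRINGIZE, one_token, two_tokens, and
  longest_match are invented for illustration.  The fragment x+++y is
  used in place of x+++++y because the latter does not compile for
  built-in types.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical macro from the example text: Ex expands to +1.
#define Ex +1

// Two-level stringizing helper, so that macro arguments are expanded
// before their spelling is captured.
#define STRINGIZE_RAW(x) #x
#define STRINGIZE(x) STRINGIZE_RAW(x)

// 1Ex is a single preprocessing number, so Ex is never seen as a macro
// name and the spelling survives unchanged.
inline const char* one_token() { return STRINGIZE(1Ex); }

// With white space, 1 and Ex are separate preprocessing tokens and Ex
// is expanded, giving the expression 1 +1.
inline int two_tokens() { return 1 Ex; }

// x+++y is tokenized as x ++ + y: x is post-incremented, then y added.
inline int longest_match() {
    int x = 1, y = 10;
    int sum = x+++y;      // (x++) + y == 11, and x becomes 2
    return sum * 10 + x;  // 11 * 10 + 2
}
```

  Here one_token() returns the spelling "1Ex", two_tokens() returns 2,
  and longest_match() returns 112.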

  2.4  Alternative tokens                                  [lex.digraph]

1 Alternative  token representations are provided for some operators and
  punctuators3).

2 In  all  respects  of the language, each alternative token behaves the
  same,  respectively,  as its primary token, except for its spelling4).
  The set of alternative tokens is defined in Table 2.

  _________________________
  3) These include "digraphs" and additional reserved words.  The term
  "digraph" (token consisting of two characters) is not perfectly
  descriptive, since one of the alternative preprocessing-tokens is %:%:
  and of course several primary tokens contain two characters.
  Nonetheless, those alternative tokens that aren't lexical keywords are
  colloquially known as "digraphs".
  4) Thus [ and <: behave differently when "stringized"
  (_cpp.stringize_), but can otherwise be freely interchanged.

                       Table 2--alternative tokens

  +----------------------+-----------------------+-----------------------+
  |alternative   primary | alternative   primary | alternative   primary |
  +----------------------+-----------------------+-----------------------+
  |    <%           {    |     and         &&    |   and_eq        &=    |
  +----------------------+-----------------------+-----------------------+
  |    %>           }    |    bitor         |    |    or_eq        |=    |
  +----------------------+-----------------------+-----------------------+
  |    <:           [    |     or          ||    |   xor_eq        ^=    |
  +----------------------+-----------------------+-----------------------+
  |    :>           ]    |     xor          ^    |     not          !    |
  +----------------------+-----------------------+-----------------------+
  |    %:           #    |    compl         ~    |   not_eq        !=    |
  +----------------------+-----------------------+-----------------------+
  |   %:%:         ##    |   bitand         &    |                       |
  +----------------------+-----------------------+-----------------------+
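
  A short sketch of Table 2 in use follows.  The function names in_range
  and alt_demo are hypothetical.  Some implementations may need a
  conformance option before they accept the keyword-like spellings; that
  is an assumption about particular translators, not something this
  clause states.

```cpp
#include <cassert>

// <% and %> stand for { and }; "and" stands for &&.
inline bool in_range(int v, int lo, int hi) <%
    return v >= lo and v <= hi;
%>

// <: and :> stand for [ and ]; bitor, bitand, and compl stand for
// |, &, and ~ respectively.
inline int alt_demo() <%
    int a<:3:> = <% 1, 2, 4 %>;
    int all = a<:0:> bitor a<:1:> bitor a<:2:>;  // 1 | 2 | 4 == 7
    return all bitand compl 2;                   // 7 & ~2 == 5
%>
```

  Each function behaves exactly as if it had been written with the
  primary tokens; only the spelling differs.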

  2.5  Tokens                                                [lex.token]
          token:
                  identifier
                  keyword
                  literal
                  operator
                  punctuator

1 There are five kinds of  tokens:  identifiers,  keywords,  literals,5)
  operators,  and  other  separators.   Blanks,  horizontal and vertical
  tabs, newlines, formfeeds, and comments (collectively, "white space"),
  as  described  below,  are  ignored  except  as they serve to separate
  tokens.  Some white space is required to separate  otherwise  adjacent
  identifiers, keywords, and literals.

  2.6  Comments                                            [lex.comment]

1 The characters /* start a comment, which terminates with the
  characters */.  These comments do not nest.  The characters // start a
  comment, which terminates with the next new-line character.  If there
  is a form-feed or a vertical-tab character in such a comment, only
  white-space characters shall appear between it and the new-line that
  terminates the comment; no diagnostic is required.  The comment
  characters //, /*, and */ have no special meaning within a // comment
  and are treated just like other characters.  Similarly, the comment
  characters // and /* have no special meaning within a /* comment.
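
  The interaction of the two comment forms can be sketched as follows.
  The function name comment_demo is hypothetical; the example only
  exercises the rules stated above.

```cpp
#include <cassert>

inline int comment_demo() {
    int n = 0;
    /* a // inside this comment has no special meaning */ n += 1;
    // a /* inside this comment does not open a second comment
    n += 2;
    return n /* a comment may sit between two tokens */ + 4;
}
```

  Both statements that follow a comment are still executed, so
  comment_demo() returns 7.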

  _________________________
  5) Literals include strings and character and numeric literals.

  2.7  Preprocessing numbers                              [lex.ppnumber]
          pp-number:
                  digit
                  . digit
                  pp-number digit
                  pp-number nondigit
                  pp-number e sign
                  pp-number E sign
                  pp-number .

1 Preprocessing  number  tokens  lexically  include all integral literal
  tokens (_lex.icon_) and all floating literal tokens (_lex.fcon_).

2 A preprocessing number does not have a type or a  value;  it  acquires
  both  after  a  successful conversion (as part of translation phase 7,
  _lex.phases_) to an integral  literal  token  or  a  floating  literal
  token.

  2.8  Identifiers                                            [lex.name]
          identifier:
                  nondigit
                  identifier nondigit
                  identifier digit
          nondigit: one of
                  _ a b c d e f g h i j k l m
                    n o p q r s t u v w x y z
                    A B C D E F G H I J K L M
                    N O P Q R S T U V W X Y Z
          digit: one of
                  0 1 2 3 4 5 6 7 8 9

1 An identifier is an arbitrarily long sequence of letters and digits.
  The first character shall be a nondigit.  Upper- and lower-case
  letters are different.  All characters are significant.

  2.9  Keywords                                                [lex.key]

1 The identifiers shown in Table 3 are reserved for use as keywords, and
  shall not be used otherwise in phases 7 and 8:

                            Table 3--keywords

  +--------------------------------------------------------------------------+
  |asm          do             inline             short         typeid       |
  |auto         double         int                signed        typename     |
  |bool         dynamic_cast   long               sizeof        union        |
  |break        else           mutable            static        unsigned     |
  |case         enum           namespace          static_cast   using        |
  |catch        explicit       new                struct        virtual      |
  |char         extern         operator           switch        void         |
  |class        false          private            template      volatile     |
  |const        float          protected          this          wchar_t      |
  |const_cast   for            public             throw         while        |
  |continue     friend         register           true                       |
  |default      goto           reinterpret_cast   try                        |
  |delete       if             return             typedef                    |
  +--------------------------------------------------------------------------+

2 Furthermore, the alternative representations shown in Table 4 for
  certain operators and punctuators (_lex.digraph_) are reserved and
  shall not be used otherwise:

                   Table 4--alternative representations

            +------------------------------------------------+
            |and      and_eq   bitand   bitor   compl    not |
            |not_eq   or       or_eq    xor     xor_eq       |
            +------------------------------------------------+

3 In addition, identifiers containing a double underscore (__) or
  beginning with an underscore and an upper-case letter are reserved for
  use by C++ implementations and standard libraries and shall not be
  used otherwise; no diagnostic is required.

4 The lexical representation of C++ programs includes a number of
  preprocessing tokens which are used in the syntax of the preprocessor
  or are converted into tokens for operators and punctuators:
          preprocessing-op-or-punc: one of
          {       }       [       ]       #       ##      (       )
          <:      :>      <%      %>      %:      %:%:    ;       :       ...
          new     delete  ?       ::      .       .*
          +       -       *       /       %       ^       &       |       ~
          !       =       <       >       +=      -=      *=      /=      %=
          ^=      &=      |=      <<      >>      >>=     <<=     ==      !=
          <=      >=      &&      ||      ++      --      ,       ->*     ->
          and     and_eq  bitand  bitor   compl   not     not_eq  or      or_eq
          xor     xor_eq

  After preprocessing, each preprocessing-op-or-punc is converted to a
  single token in translation phase 7 (_lex.phases_).

5 [Note: Certain implementation-defined properties, such as the type of
  a sizeof (_expr.sizeof_) expression, the ranges of fundamental types
  (_basic.fundamental_), and the types of the most basic library
  functions, are defined in the standard headers <climits>, <cstddef>,
  and <new> (_lib.language.support_).  ]

  2.10  Literals                                           [lex.literal]

1 There are several kinds of literals.6)
          literal:
                  integer-literal
                  character-literal
                  floating-literal
                  string-literal
                  boolean-literal

  2.10.1  Integer literals                                    [lex.icon]
          integer-literal:
                  decimal-literal integer-suffix_opt
                  octal-literal integer-suffix_opt
                  hexadecimal-literal integer-suffix_opt
          decimal-literal:
                  nonzero-digit
                  decimal-literal digit
          octal-literal:
                  0
                  octal-literal octal-digit
          hexadecimal-literal:
                  0x hexadecimal-digit
                  0X hexadecimal-digit
                  hexadecimal-literal hexadecimal-digit
          nonzero-digit: one of
                  1  2  3  4  5  6  7  8  9
          octal-digit: one of
                  0  1  2  3  4  5  6  7
          hexadecimal-digit: one of
                  0  1  2  3  4  5  6  7  8  9
                  a  b  c  d  e  f
                  A  B  C  D  E  F
          integer-suffix:
                  unsigned-suffix long-suffix_opt
                  long-suffix unsigned-suffix_opt
          unsigned-suffix: one of
                  u  U
          long-suffix: one of
                  l  L

  _________________________
  6) The term "literal"  generally  designates,  in  this  International
  Standard, those tokens that are called "constants" in ISO C.

1 An  integer  literal consisting of a sequence of digits is taken to be
  decimal (base ten) unless it begins with 0 (digit zero).   A  sequence
  of  octal  digits7)  starting  with  0 is taken to be an octal integer
  (base eight).  A sequence of digits preceded by 0x or 0X is  taken  to
  be  a  hexadecimal  integer  (base  sixteen).   The hexadecimal digits
  include a or A through f or F with decimal values ten through fifteen.
  [Example: the number twelve can be written 12, 014, or 0XC.  ]

2 The type of an integer literal depends on its form, value, and suffix.
  If it is decimal and has no suffix, it has the first of these types in
  which its value can be represented: int, long int, unsigned long
  int.8)  If it is octal or hexadecimal and has no suffix, it has the
  first of these types in which its value can be represented: int,
  unsigned int, long int, unsigned long int.  If it is suffixed by u or
  U, its type is the first of these types in which its value can be
  represented: unsigned int, unsigned long int.  If it is suffixed by l
  or L, its type is the first of these types in which its value can be
  represented: long int, unsigned long int.  If it is suffixed by ul,
  lu, uL, Lu, Ul, lU, UL, or LU, its type is unsigned long int.

3 A program is ill-formed if it contains an integer literal that  cannot
  be represented by any of the allowed types.
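
  The rules above can be checked at translation time.  The sketch below
  is an illustration under an explicit assumption: it uses static_assert,
  decltype, and <type_traits>, which come from later revisions of C++
  and are not described in this clause.  The function name
  twelve_three_ways is hypothetical.  Only small values are used, so
  every literal fits in the first type of its list.

```cpp
#include <cassert>
#include <type_traits>

// The number twelve in decimal, octal, and hexadecimal notation.
static_assert(12 == 014 && 014 == 0XC, "three spellings of twelve");

// The suffix selects the list of candidate types; the value 12 fits in
// the first candidate of each list.
static_assert(std::is_same<decltype(12), int>::value, "no suffix");
static_assert(std::is_same<decltype(12u), unsigned int>::value, "u");
static_assert(std::is_same<decltype(12L), long>::value, "L");
static_assert(std::is_same<decltype(12uL), unsigned long>::value, "uL");
static_assert(std::is_same<decltype(0xCu), unsigned int>::value, "hex, u");

inline int twelve_three_ways() { return 12 + 014 + 0XC; }
```

  The function adds the same value three times, so it returns 36.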

  2.10.2  Character literals                                  [lex.ccon]
          character-literal:
                  'c-char-sequence'
                  L'c-char-sequence'
          c-char-sequence:
                  c-char
                  c-char-sequence c-char
          c-char:
                  any member of the source character set except
                          the single-quote ', backslash \, or new-line character
                  escape-sequence
          escape-sequence:
                  simple-escape-sequence
                  octal-escape-sequence
                  hexadecimal-escape-sequence
          simple-escape-sequence: one of
                  \'  \"  \?  \\
                  \a  \b  \f  \n  \r  \t  \v
          octal-escape-sequence:
                  \ octal-digit
                  \ octal-digit octal-digit
                  \ octal-digit octal-digit octal-digit
          hexadecimal-escape-sequence:
                  \x hexadecimal-digit
                  hexadecimal-escape-sequence hexadecimal-digit
  _________________________
  7) The digits 8 and 9 are not octal digits.
  8) A decimal integer literal with no suffix never has type unsigned
  int.  Otherwise, for example, on an implementation where unsigned int
  values have 16 bits and unsigned long values have strictly more than
  17 bits, we would have -30000<0, -50000>0 (because 50000 would have
  type unsigned int), and -70000<0 (because 70000 would have type long).

1 A  character  literal  is  one  or  more characters enclosed in single
  quotes, as in 'x', optionally preceded by the letter L,  as  in  L'x'.
  Single  character  literals  that  do not begin with L have type char,
  with value equal to the  numerical  value  of  the  character  in  the
  machine's  character  set.   Multicharacter literals that do not begin
  with L have type int and implementation-defined value.

2 A character literal that begins with the letter L, such as L'ab', is a
  wide-character literal.  Wide-character literals have type  wchar_t.9)
  Wide-character literals have implementation-defined values, regardless
  of the number of characters in the literal.

3 Certain nongraphic characters, the single quote ', the double quote ",
  the question mark ?, and the backslash \, can be represented according
  to Table 5.

                        Table 5--escape sequences

                   +----------------------------------+
                   |new-line          NL (LF)   \n    |
                   |horizontal tab    HT        \t    |
                   |vertical tab      VT        \v    |
                   |backspace         BS        \b    |
                   |carriage return   CR        \r    |
                   |form feed         FF        \f    |
                   |alert             BEL       \a    |
                   |backslash         \         \\    |
                   |question mark     ?         \?    |
                   |single quote      '         \'    |
                   |double quote      "         \"    |
                   |octal number      ooo       \ooo  |
                   |hex number        hhh       \xhhh |
                   +----------------------------------+
  If the character following a backslash is not one of those  specified,
  the  behavior  is  undefined.   An  escape sequence specifies a single
  character.

4 The escape \ooo consists of the backslash followed by one, two, or
  three octal digits that are taken to specify the value of the desired
  character.  The escape \xhhh consists of the backslash followed by x
  followed by one or more hexadecimal digits that are taken to specify
  the value of the desired character.  There is no limit to the number
  of digits in a hexadecimal sequence.  A sequence of octal or
  hexadecimal digits is terminated by the first character that is not an
  octal digit or a hexadecimal digit, respectively.  The value of a
  character literal is implementation-defined if it exceeds that of the
  largest char (for ordinary literals) or wchar_t (for wide literals).
  _________________________
  9) They are intended for character sets where a character does not fit
  into a single byte.
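
  A few of the statements about character literals can be verified
  mechanically.  This is a sketch only: static_assert, decltype, and
  <type_traits> are borrowed from later revisions of C++ as a checking
  device, and the function name escapes_are_single_chars is
  hypothetical.  Multicharacter literals are left out because their
  value is implementation-defined.

```cpp
#include <cassert>
#include <type_traits>

// An ordinary single-character literal has type char; a literal that
// begins with L has type wchar_t.
static_assert(std::is_same<decltype('x'), char>::value, "ordinary");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "wide");

// Octal 101 and hexadecimal 41 both specify the value sixty-five.
static_assert('\101' == '\x41', "same value, two escape forms");

// An escape sequence specifies a single character, so each literal
// below occupies exactly one char.
inline bool escapes_are_single_chars() {
    return sizeof('\n') == 1 && sizeof('\x41') == 1 && sizeof('\\') == 1;
}
```

  Whether the value sixty-five names the letter A depends on the
  execution character set, so the sketch does not assume it.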

  2.10.3  Floating literals                                   [lex.fcon]
          floating-literal:
                  fractional-constant exponent-part_opt floating-suffix_opt
                  digit-sequence exponent-part floating-suffix_opt
          fractional-constant:
                  digit-sequence_opt . digit-sequence
                  digit-sequence .
          exponent-part:
                  e sign_opt digit-sequence
                  E sign_opt digit-sequence
          sign: one of
                  +  -
          digit-sequence:
                  digit
                  digit-sequence digit
          floating-suffix: one of
                  f  l  F  L

1 A floating literal consists of an integer part, a decimal point, a
  fraction part, an e or E, an optionally signed integer exponent, and
  an optional type suffix.  The integer and fraction parts both consist
  of a sequence of decimal (base ten) digits.  Either the integer part
  or the fraction part (not both) can be missing; either the decimal
  point or the letter e (or E) and the exponent (not both) can be
  missing.  The type of a floating literal is double unless explicitly
  specified by a suffix.  The suffixes f and F specify float, the
  suffixes l and L specify long double.
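
  The permitted forms and the effect of the suffixes can be sketched as
  follows.  As an assumption for checking purposes only, the example
  uses static_assert, decltype, and <type_traits> from later revisions
  of C++; the function name forms_sum is hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Without a suffix the type is double; f or F gives float, and l or L
// gives long double.
static_assert(std::is_same<decltype(1.5), double>::value, "no suffix");
static_assert(std::is_same<decltype(1.5f), float>::value, "f suffix");
static_assert(std::is_same<decltype(1.5L), long double>::value, "L suffix");
static_assert(std::is_same<decltype(1e3), double>::value, "exponent only");

// Fraction part missing (1.), integer part missing (.5), decimal point
// missing (1e3), and a signed exponent (25E-2).
inline double forms_sum() { return 1. + .5 + 1e3 + 25E-2; }
```

  The four operands are 1, 0.5, 1000, and 0.25, so forms_sum() returns
  1001.75.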

  2.10.4  String literals                                   [lex.string]
          string-literal:
                  "s-char-sequence_opt"
                  L"s-char-sequence_opt"
          s-char-sequence:
                  s-char
                  s-char-sequence s-char
          s-char:
                  any member of the source character set except
                          the double-quote ", backslash \, or new-line character
                  escape-sequence

1 A string literal is a sequence of characters (as defined in
  _lex.ccon_) surrounded by double quotes, optionally beginning with the
  letter L, as in "..." or L"...".  A string literal that does not begin
  with L has type "array of n char" and static storage duration
  (_basic.stc_), where n is the size of the string as defined below, and
  is initialized with the given characters.  Whether all string literals
  are distinct (that is, are stored in nonoverlapping objects) is
  implementation-defined.  The effect of attempting to modify a string
  literal is undefined.

2 A string literal that begins with L, such as L"asdf", is a wide string
  literal.  A wide string literal has type "array of n wchar_t," where n
  is the size of the string as defined below.

3 Adjacent string literals are concatenated.  Adjacent wide string
  literals are concatenated.  If a string literal token is adjacent to a
  wide string literal token, the behavior is undefined.  Characters in
  concatenated strings are kept distinct.  [Example:
          "\xA" "B"
  contains the two characters '\xA' and 'B' after concatenation (and not
  the single hexadecimal character '\xAB').  ]

4 After any necessary concatenation '\0' is appended so that programs
  that scan a string can find its end.  The size of a string is the
  number of its characters including this terminator.  Within a string,
  the double quote character " shall be preceded by a \.

5 Escape sequences in string literals have the same meaning as in
  character literals (_lex.ccon_).
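
  The size rules and the concatenation example can be checked with
  sizeof.  This is a minimal sketch; the function names plain_size,
  concat_size, wide_count, and concat_kept_distinct are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Four characters plus the appended '\0' terminator.
inline std::size_t plain_size() { return sizeof("asdf"); }

// '\xA' and 'B' stay distinct after concatenation, so the array holds
// two characters plus the terminator.
inline std::size_t concat_size() { return sizeof("\xA" "B"); }

// A wide string literal is an array of wchar_t; count its elements.
inline std::size_t wide_count() {
    return sizeof(L"asdf") / sizeof(wchar_t);
}

// Inspects the concatenated string character by character.
inline bool concat_kept_distinct() {
    const char* s = "\xA" "B";
    return s[0] == '\xA' && s[1] == 'B' && s[2] == '\0';
}
```

  The sizes are 5, 3, and 5 respectively, each counting the terminator.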

  2.10.5  Boolean literals                                    [lex.bool]
          boolean-literal:
                  false
                  true

1 The  Boolean  literals are the keywords false and true.  Such literals
  have type bool.  They are not lvalues.