 
 JTC1/SC22/WG14 
N717
WG14/N717
J11/97-080
1997-06-23
Thomas Plum
Wording for "Extended Identifiers"     [Revision #4, after voting]
In the text below, lines that start with 6 spaces are quoted
intact from C9X draft 9.  The lines at the left margin are
the proposed words to incorporate extended identifiers, 
taken generally verbatim from the second C++ CD 14882.
        5.1.1.2  Translation phases
        [#1] The precedence among the syntax rules of translation is
        specified by the following phases.
         1.  Physical source file  characters  are  mapped  to  the
             source  character set (introducing new-line characters
             for end-of-line indicators)  if  necessary.   Trigraph
             sequences   are   replaced  by  corresponding  single-
             character internal representations.
Any source file character not in the basic source character set
is replaced by the universal-character-name that designates that
character.*)
---------------
*) The process of handling extended characters is specified in terms 
of mapping to an encoding that uses only the basic source character 
set, and, in the case of character literals and strings, further 
mapping to the execution character set. In practical terms, however, 
any internal encoding may be used, so long as an actual extended 
character encountered in the input, and the same extended character 
expressed in the input as a universal-character-name (i.e. using the
\uXXXX notation), are handled equivalently.
---------------
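[Illustration only, not proposed wording: a minimal sketch of how a
translator might spell an extended character as a
universal-character-name in phase 1.  The function name spell_ucn and
the choice of the short \uNNNN form whenever the value fits in four
hexadecimal digits are assumptions made for this example; the text
above does not prescribe either.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the universal-character-name spelling of
   a character short identifier into buf.  Returns the number of
   characters written (not counting the terminating null), or -1 if
   the value does not fit in eight hexadecimal digits or the buffer is
   too small. */
static int spell_ucn(unsigned long code, char *buf, size_t size)
{
    int n;

    assert(buf != NULL && size > 0);
    if (code > 0xFFFFFFFFUL)
        return -1;                      /* more than eight hex digits */
    if (code <= 0xFFFFUL)
        n = snprintf(buf, size, "\\u%04lX", code);   /* \uNNNN form */
    else
        n = snprintf(buf, size, "\\U%08lX", code);   /* \UNNNNNNNN form */
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

[As the footnote says, an implementation is free to use any internal
encoding instead, provided the actual character and its
universal-character-name spelling are handled equivalently.]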
             [...]
         4.  Preprocessing   directives   are    executed,    macro
             invocations  are  expanded,  and pragma unary operator
             expressions are executed.   
If a character sequence that matches the syntax of a
universal-character-name is produced by token concatenation
(16.3.3), the behavior is undefined.
             A  #include  preprocessing
             directive causes the named header or source file to be
             processed from phase 1 through phase  4,  recursively.
             All preprocessing directives are then deleted.
         5.  Each source character set member,
escape  sequence, and universal-character-name
             in   character   constants   and  string  literals  is
             converted to a member of the execution character set.
         [etc as-is]
Constraints
A universal-character-name shall not specify a character short identifier
in the ranges 0000 through 0020 or 007F through 009F, inclusive.  A
universal-character-name shall not designate a character in the basic
source character set.
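[Illustration only, not proposed wording: a sketch of a check for the
two constraints above.  The function names are hypothetical.  The
second check assumes that the basic source character set is encoded as
in ISO/IEC 646 (ASCII), where the only graphic characters outside the
basic set are $, @ and `; an implementation with a different source
encoding would need a different test.]

```c
#include <stdbool.h>

/* First constraint: the character short identifier shall not lie in
   0000 through 0020 or 007F through 009F, inclusive. */
static bool ucn_in_forbidden_range(unsigned long code)
{
    return code <= 0x0020UL || (code >= 0x007FUL && code <= 0x009FUL);
}

/* Both constraints together.  ASSUMPTION: ASCII-compatible encoding of
   the basic source character set, so every value in 0021 through 007E
   except 0024 ($), 0040 (@) and 0060 (`) names a basic character. */
static bool ucn_value_permitted(unsigned long code)
{
    if (ucn_in_forbidden_range(code))
        return false;
    if (code >= 0x0021UL && code <= 0x007EUL)
        return code == 0x0024UL || code == 0x0040UL || code == 0x0060UL;
    return true;
}
```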
       5.2  Environmental considerations
       5.2.1  Character sets
       [#1] Two sets of characters and their  associated  collating
       sequences  shall  be defined:  the set in which source files
       are written,  and  the  set  interpreted  in  the  execution
       environment.   The  values  of  the members of the execution
       character set  are  implementation-defined;  any  additional
       members  beyond those required by this subclause are locale-
       specific.
[etc as-is, to the last paragraph of 5.2.1, then add...]
The universal-character-name construct provides a way to name other
characters.
hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad
The character designated by the universal-character-name \UNNNNNNNN
is that character whose character short identifier is NNNNNNNN
specified by ISO/IEC 10646 pDAM-9; the character designated by the
universal-character-name \uNNNN is that character whose character
short identifier is 0000NNNN specified by ISO/IEC 10646 pDAM-9.
[This wording reflects comments from Japan about C++ CD2.]
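[Illustration only, not proposed wording: a minimal sketch of a
recognizer for the syntax above, returning the character short
identifier as a number.  The names hex_value and parse_ucn are
hypothetical.  The \uNNNN form yields the same value as \U0000NNNN,
as the text requires.]

```c
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal-digit, or -1 if c is not one. */
static int hex_value(char c)
{
    static const char digits[] = "0123456789abcdefABCDEF";
    const char *p = (c != '\0') ? strchr(digits, c) : NULL;
    int v;

    if (p == NULL)
        return -1;
    v = (int)(p - digits);
    return v < 16 ? v : v - 6;          /* map A-F onto 10-15 */
}

/* If s begins with a universal-character-name, store its character
   short identifier in *code and return the number of characters
   consumed (6 for \u hex-quad, 10 for \U hex-quad hex-quad).
   Otherwise return 0 and leave *code unchanged. */
static size_t parse_ucn(const char *s, unsigned long *code)
{
    size_t ndigits, i;
    unsigned long value = 0;

    if (s[0] != '\\')
        return 0;
    if (s[1] == 'u')
        ndigits = 4;
    else if (s[1] == 'U')
        ndigits = 8;
    else
        return 0;
    for (i = 0; i < ndigits; i++) {
        int d = hex_value(s[2 + i]);
        if (d < 0)
            return 0;                   /* too few hexadecimal digits */
        value = (value << 4) | (unsigned long)d;
    }
    *code = value;
    return 2 + ndigits;
}
```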
        Forward   references:    character   constants    (6.1.3.4),
        preprocessing  directives  (6.8),  string  literals (6.1.4),
        comments (6.1.9).
        
        [...]
        6.1.2  Identifiers
        Syntax
        [#1]
                identifier:
                        nondigit
                        identifier nondigit
                nondigit: one of
universal-character-name
                         _  a  b  c  d  e  f  g  h  i  j  k  l  m
                            n  o  p  q  r  s  t  u  v  w  x  y  z
                            A  B  C  D  E  F  G  H  I  J  K  L  M
                            N  O  P  Q  R  S  T  U  V  W  X  Y  Z
        [#2] An identifier is  a  sequence  of  nondigit  characters
        (including  the underscore _ and the lowercase and uppercase
        letters)  and  digits.   
Each universal-character-name in an identifier shall designate
a character whose encoding in ISO 10646
falls into one of the ranges specified in Annex xxx.*)
-----------------
*) On systems in which linkers cannot accept extended characters,
an encoding of the universal-character-name may be used in forming
valid external identifiers. For example, some otherwise unused
character or sequence of characters may be used to encode the \u in
a universal-character-name. Extended characters may produce a long
external identifier.
-----------------
        The  first  character  shall  be  a  nondigit character.
             [...]
       6.1.3.4  Character constants
       Syntax
       [#1]
               c-char:
                       any member of the source character set except
                               the single-quote ', backslash \, or 
                               new-line character
                       escape-sequence
universal-character-name
        6.1.4  String literals
        Syntax
        [#1]
                s-char:
                        any member of the source character set except
                                the double-quote ", backslash \, or 
                                new-line character
                        escape-sequence
universal-character-name
  ___________________________________________________________________
  Annex xxx (normative)
  Universal-character-names for identifiers
  ___________________________________________________________________
1 This Clause lists the hexadecimal code values that are valid  in  uni-
  versal-character-names in identifiers.
2 This  table  is reproduced unchanged from ISO/IEC PDTR 10176, produced
  by ISO/IEC  JTC1/SC22/WG20,  except  that  the  ranges  0041-005a  and
  0061-007a  designate the upper and lower case English alphabets, which
  are part of the basic source character set, and are  not  repeated  in
  the table below.*)
--------------
*) If PDTR 10176 is changed during its balloting
  and adoption as a TR, then this table should be changed to  match  its
  changes.
--------------
  Latin:   00c0-00d6,   00d8-00f6,   00f8-01f5,   01fa-0217,  0250-02a8,
  1e00-1e9a, 1ea0-1ef9
  Greek:  0384, 0388-038a, 038c, 038e-03a1, 03a3-03ce, 03d0-03d6,  03da,
  03dc,   03de,   03e0,   03e2-03f3,  1f00-1f15,  1f18-1f1d,  1f20-1f45,
  1f48-1f4d,  1f50-1f57,  1f59,  1f5b,   1f5d,   1f5f-1f7d,   1f80-1fb4,
  1fb6-1fbc,  1fc2-1fc4,  1fc6-1fcc,  1fd0-1fd3,  1fd6-1fdb,  1fe0-1fec,
  1ff2-1ff4, 1ff6-1ffc
  Cyrillic:  0401-040d,  040f-044f,  0451-045c,  045e-0481,   0490-04c4,
  04c7-04c8, 04cb-04cc, 04d0-04eb, 04ee-04f5, 04f8-04f9
  Armenian:  0531-0556, 0561-0587
  Hebrew:  05d0-05ea, 05f0-05f4
  Arabic:    0621-063a,   0640-0652,  0670-06b7,  06ba-06be,  06c0-06ce,
  06e5-06e7
  Devanagari:  0905-0939, 0958-0962
  Bengali:  0985-098c, 098f-0990, 0993-09a8, 09aa-09b0, 09b2, 09b6-09b9,
  09dc-09dd, 09df-09e1, 09f0-09f1
  Gurmukhi:   0a05-0a0a,  0a0f-0a10,  0a13-0a28,  0a2a-0a30,  0a32-0a33,
  0a35-0a36, 0a38-0a39, 0a59-0a5c, 0a5e
  Gujarati:    0a85-0a8b,   0a8d,   0a8f-0a91,   0a93-0aa8,   0aaa-0ab0,
  0ab2-0ab3, 0ab5-0ab9, 0ae0
  Oriya:    0b05-0b0c,   0b0f-0b10,   0b13-0b28,  0b2a-0b30,  0b32-0b33,
  0b36-0b39, 0b5c-0b5d, 0b5f-0b61
  Tamil:  0b85-0b8a, 0b8e-0b90, 0b92-0b95, 0b99-0b9a,  0b9c,  0b9e-0b9f,
  0ba3-0ba4, 0ba8-0baa, 0bae-0bb5, 0bb7-0bb9
  Telugu:    0c05-0c0c,   0c0e-0c10,  0c12-0c28,  0c2a-0c33,  0c35-0c39,
  0c60-0c61
  Kannada:   0c85-0c8c,  0c8e-0c90,  0c92-0ca8,  0caa-0cb3,   0cb5-0cb9,
  0ce0-0ce1
  Malayalam:  0d05-0d0c, 0d0e-0d10, 0d12-0d28, 0d2a-0d39, 0d60-0d61
  Thai:  0e01-0e30, 0e32-0e33, 0e40-0e46, 0e4f-0e5b
  Lao:   0e81-0e82,  0e84, 0e87, 0e88, 0e8a, 0e8d, 0e94-0e97, 0e99-0e9f,
  0ea1-0ea3, 0ea5,  0ea7,  0eaa,  0eab,  0ead-0eb0,  0eb2,  0eb3,  0ebd,
  0ec0-0ec4, 0ec6
  Georgian:  10a0-10c5, 10d0-10f6
  Hiragana:  3041-3094, 309b-309e
  Katakana:  30a1-30fe
  Bopomofo:  3105-312c
  Hangul:  1100-1159, 1161-11a2, 11a8-11f9
  CJK   Unified   Ideographs:   f900-fa2d,  fb1f-fb36,  fb38-fb3c, fb3e,
  fb40-fb41,  fb42-fb44,  fb46-fbb1,  fbd3-fd3f,  fd50-fd8f,  fd92-fdc7,
  fdf0-fdfb,   fe70-fe72,   fe74,   fe76-fefc,   ff21-ff3a,   ff41-ff5a,
  ff66-ffbe, ffc2-ffc7, ffca-ffcf, ffd2-ffd7, ffda-ffdc, 4e00-9fa5
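[Illustration only, not proposed wording: a sketch of how the table
above might be consulted when checking a universal-character-name in
an identifier.  The type and function names are hypothetical, and only
the Latin, Armenian and Hebrew rows are transcribed here for brevity;
a real implementation would carry every row of the Annex.]

```c
#include <stdbool.h>
#include <stddef.h>

struct ucn_range {
    unsigned long first;    /* lowest code value in the range  */
    unsigned long last;     /* highest code value in the range */
};

/* Partial transcription of the Annex (three rows only). */
static const struct ucn_range ident_ranges[] = {
    /* Latin */
    {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x01f5},
    {0x01fa, 0x0217}, {0x0250, 0x02a8}, {0x1e00, 0x1e9a},
    {0x1ea0, 0x1ef9},
    /* Armenian */
    {0x0531, 0x0556}, {0x0561, 0x0587},
    /* Hebrew */
    {0x05d0, 0x05ea}, {0x05f0, 0x05f4}
};

/* True if code falls in one of the transcribed ranges. */
static bool ucn_valid_in_identifier(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof ident_ranges / sizeof ident_ranges[0]; i++)
        if (code >= ident_ranges[i].first && code <= ident_ranges[i].last)
            return true;
    return false;
}
```

[A linear scan keeps the sketch short; with the full table, sorting
the ranges and using a binary search would be a natural refinement.]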
[Denmark (Keld Simonsen) commented re C++ CD2: 
Due to the change in ISO/IEC 10646 of the encoding of Hangul characters,
we propose to change the allowable characters defined for extended
identifiers as follows:
Remove the range U3400..U4DFF
insert the range UAC00..UD7AF
This change has also been processed to DTR 10176.]