WG14/N717 J11/97-080 1997-06-23 Thomas Plum Wording for "Extended Identifiers" [Revision #4, after voting] In the text below, lines that start with 6 spaces are quoted intact from C9X draft 9. The lines at the left margin are the proposed words to incorporate extended identifiers, taken generally verbatim from the second C++ CD 14882. 5.1.1.2 Translation phases [#1] The precedence among the syntax rules of translation is specified by the following phases. 1. Physical source file characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single- character internal representations. Any source file character not in the basic source character set is replaced by the universal­character­name that designates that character.*) --------------- *) The process of handling extended characters is specified in terms of mapping to an encoding that uses only the basic source character set, and, in the case of character literals and strings, further mapping to the execution character set. In practical terms, however, any internal encoding may be used, so long as an actual extended character encountered in the input, and the same extended character expressed in the input as a universal­character­name (i.e. using the notation), are handled equivalently. --------------- [...] 4. Preprocessing directives are executed, macro invocations are expanded, and pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal­character­name is produced by token concatenation (16.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted. 5. Each source character set member, escape sequence, and universal-character-name in character constants and string literals is converted to a member of the execution character set. [etc as-is] Constraints A universal-character-name shall not specify a character short identifier in the ranges 0000 through 0020 or 007F through 009F, inclusive. A universal-character-name shall not designate a character in the basic source character set. 5.2 Environmental considerations 5.2.1 Character sets [#1] Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written, and the set interpreted in the execution environment. The values of the members of the execution character set are implementation-defined; any additional members beyond those required by this subclause are locale- specific. [etc as-is, to the last paragraph of 5.2.1, then add...] The universal­character­name construct provides a way to name other characters. hex­quad: hexadecimal­digit hexadecimal­digit hexadecimal­digit hexadecimal­digit universal­character­name: \u hex­quad \U hex­quad hex­quad The character designated by the universal­character­name \UNNNNNNNN is that character whose character short identifier is NNNNNNNN specified by ISO/IEC 10646 pDAM-9; the character designated by the universal­character­name \uNNNN is that character whose character short identifier is 0000NNNN specified by ISO/IEC 10646 pDAM-9. [This wording reflects comments from Japan about C++ CD2.] Forward references: character constants (6.1.3.4), preprocessing directives (6.8), string literals (6.1.4), comments (6.1.9). [...] 6.1.2 Identifiers Syntax [#1] identifier: nondigit identifier nondigit nondigit: one of universal­character­name _ a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [#2] An identifier is a sequence of nondigit characters (including the underscore _ and the lowercase and uppercase letters) and digits. Each universal­character­name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Annex xxx.*) ----------------- *) On systems in which linkers cannot accept extended characters, an encoding of the universal­character­name may be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal­character­name. Extended characters may produce a long external identifier. ----------------- The first character shall be a nondigit character. [...] 6.1.3.4 Character constants Syntax [#1] c-char: any member of the source character set except the single-quote ', backslash \, or new-line character escape-sequence universal-character-name 6.1.4 String literals Syntax [#1] s-char: any member of the source character set except the double-quote ", backslash \, or new-line character escape-sequence universal-character-name ___________________________________________________________________ Annex xxx (normative) Universal-character-names for identifiers ___________________________________________________________________ 1 This Clause lists the hexadecimal code values that are valid in uni- versal-character-names in identifiers. 2 This table is reproduced unchanged from ISO/IEC PDTR 10176, produced by ISO/IEC JTC1/SC22/WG20, except that the ranges 0041-005a and 0061-007a designate the upper and lower case English alphabets, which are part of the basic source character set, and are not repeated in the table below.*) -------------- *) If PDTR 10176 is changed during its balloting and adoption as a TR, then this table should be changed to match its changes. -------------- Latin: 00c0-00d6, 00d8-00f6, 00f8-01f5, 01fa-0217, 0250-02a8, 1e00-1e9a, 1ea0-1ef9 Greek: 0384, 0388-038a, 038c, 038e-03a1, 03a3-03ce, 03d0-03d6, 03da, 03dc, 03de, 03e0, 03e2-03f3, 1f00-1f15, 1f18-1f1d, 1f20-1f45, 1f48-1f4d, 1f50-1f57, 1f59, 1f5b, 1f5d, 1f5f-1f7d, 1f80-1fb4, 1fb6-1fbc, 1fc2-1fc4, 1fc6-1fcc, 1fd0-1fd3, 1fd6-1fdb, 1fe0-1fec, 1ff2-1ff4, 1ff6-1ffc Cyrilic: 0401-040d, 040f-044f, 0451-045c, 045e-0481, 0490-04c4, 04c7-04c8, 04cb-04cc, 04d0-04eb, 04ee-04f5, 04f8-04f9 Armenian: 0531-0556, 0561-0587 Hebrew: 05d0-05ea, 05f0-05f4 Arabic: 0621-063a, 0640-0652, 0670-06b7, 06ba-06be, 06c0-06ce, 06e5-06e7 Devanagari: 0905-0939, 0958-0962 Bengali: 0985-098c, 098f-0990, 0993-09a8, 09aa-09b0, 09b2, 09b6-09b9, 09dc-09dd, 09df-09e1, 09f0-09f1 Gurmukhi: 0a05-0a0a, 0a0f-0a10, 0a13-0a28, 0a2a-0a30, 0a32-0a33, 0a35-0a36, 0a38-0a39, 0a59-0a5c, 0a5e Gujarati: 0a85-0a8b, 0a8d, 0a8f-0a91, 0a93-0aa8, 0aaa-0ab0, 0ab2-0ab3, 0ab5-0ab9, 0ae0 Oriya: 0b05-0b0c, 0b0f-0b10, 0b13-0b28, 0b2a-0b30, 0b32-0b33, 0b36-0b39, 0b5c-0b5d, 0b5f-0b61 Tamil: 0b85-0b8a, 0b8e-0b90, 0b92-0b95, 0b99-0b9a, 0b9c, 0b9e-0b9f, 0ba3-0ba4, 0ba8-0baa, 0bae-0bb5, 0bb7-0bb9 Telugu: 0c05-0c0c, 0c0e-0c10, 0c12-0c28, 0c2a-0c33, 0c35-0c39, 0c60-0c61 Kannada: 0c85-0c8c, 0c8e-0c90, 0c92-0ca8, 0caa-0cb3, 0cb5-0cb9, 0ce0-0ce1 Malayalam: 0d05-0d0c, 0d0e-0d10, 0d12-0d28, 0d2a-0d39, 0d60-0d61 Thai: 0e01-0e30, 0e32-0e33, 0e40-0e46, 0e4f-0e5b Lao: 0e81-0e82, 0e84, 0e87, 0e88, 0e8a, 0e0d, 0e94-0e97, 0e99-0e9f, 0ea1-0ea3, 0ea5, 0ea7, 0eaa, 0eab, 0ead-0eb0, 0eb2, 0eb3, 0ebd, 0ec0-0ec4, 0ec6 Georgian: 10a0-10c5, 10d0-10f6 Hiragana: 3041-3094, 309b-309e Katakana: 30a1-30fe Bopmofo: 3105-312c Hangul: 1100-1159, 1161-11a2, 11a8-11f9 CJK Unified Ideographs: f900-fa2d, fb1f-fb36, fb38-fb3c, fb3e, fb40-fb41, fb42-fb44, fb46-fbb1, fbd3-fd3f, fd50-fd8f, fd92-fdc7, fdf0-fdfb, fe70-fe72, fe74, 5e76-fefc, ff21-ff3a, ff41-ff5a, ff66-ffbe, ffc2-ffc7, ffca-ffcf, ffd2-ffd7, ffda-ffdc, 4e00-9fa5 [Denmark (Keld Simonsen) commented re C++ CD2: Due to the change in ISO/IEC 10646 of the encoding of Hangul characters, we propose to change the allowable characters defined for extended identifiers as follows: Remove the range U3400..U4DFF insert the range UAC00..UD7AF This change has also been processed to DTR 10176.]