Document Number: WG14 N831 / J11 98-030

C9X Revision Proposal
=====================

Title: UCN Revision
Author: Douglas A. Gwyn
Author Affiliation: U.S. Army Research Laboratory
Postal Address: 6449 Tauler Ct., Columbia, MD 21045-4530 US
E-mail Address: gwyn@arl.mil
Telephone Number: (301)394-2287
Fax Number: (301)394-3591
Sponsor: J11: Douglas A. Gwyn
Date: 29 May, 1998
Document History: new proposal based on on-line discussion
Proposal Category:
    X Correction
Area of Standard Affected:
    X Language
    X Preprocessor
Prior Art: Plan 9 C compiler, C++ standard, C9x CD1
Target Audience: programmers who need to use extended characters in their C sources, especially string literals
Related Documents (if any): C9x CD1
Proposal Attached: X Yes

Abstract:

The C9x draft requires replacement of extended multibyte source characters by universal character names. This is unwise in two situations:

(a) string literals contain multibyte characters in some encoding that doesn't map onto ISO 10646 in a one-to-one manner, e.g. a shift encoding;

(b) extended characters are used that have no code value assigned in ISO 10646, e.g. distinctive Chinese and Japanese ideographs for which ISO 10646 assigns the same code.

It is better to leave explicit extended multibyte characters as they were written.

Proposal:

The basic idea is to not require extended multibyte source characters to be mapped into anything else, until translation phase 5 (where execution-time codes are created). Apart from this improvement, an attempt is made to preserve the useful properties and characteristics of the previous UCN-based specification.

The implementation of this proposal is given with respect to C9x WD 1997-11-21 as modified by previous editorial changes, but it should be easy enough to apply to any more recent working draft:

In 5.1.1.2 Translation phases, change phase 1 to read:

    1.
    Physical source file multibyte characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences, then universal character names, are replaced by corresponding single members of the source character set.

(Delete the footnote about handling extended characters.)

Note: Trigraphs have to be processed first, since UCNs use \ which is not defined in the ISO 646 invariant codeset.

In 5.1.1.2 Translation phases, phase 4, delete the sentence concerning token concatenation producing a character sequence that matches the syntax of a UCN.

In 5.1.1.2 Translation phases, change phase 5 to read:

    5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.

In 5.2.1 Character sets, change paragraph 2 by attaching a footnote * at the end of the first sentence (about \ escape sequences):

    * Backslash characters introducing universal character names have already been replaced in translation phase 1.

Note: If the syntax spec for UCNs has been moved to around 6.1.2 Identifiers, it should be returned to around 5.2.1 (it probably should have its own subsection after 5.2.1.1 Trigraphs). This is required by the change in translation phase. See the following change:

In 6.1.2 Identifiers, Syntax, replace the expansion of nondigit as universal-character-name by:

    extended-source-character

and in 6.1.2 Identifiers, Description, replace the sentence about universal character names with:

    Each extended source character in an identifier shall designate a character whose encoding in ISO 10646-1 falls into one of the ranges specified in Annex I.*

(Footnote unchanged.)

Note: It is essential that Annex I exclude the basic source characters; otherwise, this spec needs further constraints.

I see in the latest draft the sentence preceding the one just mentioned has been changed to mention UCNs; that part must be changed to read:

    ...
    and certain extended source characters

which is immediately explained by the replacement sentence given above.

Also in the latest draft, the sentence following the one replaced above has been changed to mention UCNs; that sentence must be changed to read:

    The initial nondigit character shall not be an extended source character designating a digit.

In 6.1.3.4 Character constants, delete the expansion of c-char as universal-character-name.

In 6.1.4 String literals, delete the expansion of s-char as universal-character-name.

Note: "Any member of the source character set" includes extended source characters, including those resulting from UCN replacement in translation phase 1.

(Annex I Universal character names for identifiers seems OK as is.)

In K.2 Undefined behavior, delete the second item (about token concatenation producing a character sequence matching the syntax of a UCN).

And that's it. The effect is to allow UCNs to be used to denote characters not in the local source character set, to allow extended source characters in identifiers, and to map multibyte source characters and extended source characters resulting from UCN replacement, when they appear in character constants and string literals, only at the last possible moment, so that shift codes etc. remain intact.
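
Illustration (not part of the proposed wording): the ordering in the revised phase 1, trigraphs first and then UCNs, can be sketched in C as below. This is only a sketch under stated assumptions. The names phase1, replace_trigraphs, replace_ucns, trigraph and hex_value are hypothetical; the 256-byte working buffer is arbitrary; each resulting source character is modelled as an unsigned long holding the code value named by the UCN, or the plain character value otherwise; and a doubled backslash is not treated specially, since the proposal text does not address that case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    static const char digits[] = "0123456789abcdefABCDEF";
    const char *p = (c != '\0') ? strchr(digits, c) : NULL;
    if (p == NULL)
        return -1;
    int i = (int)(p - digits);
    return i < 16 ? i : i - 6;   /* 'A'..'F' sit at indices 16..21 */
}

/* Hypothetical helper: replacement for a trigraph whose third character is
   `third`, or '\0' if two question marks plus `third` is not a trigraph. */
static char trigraph(char third)
{
    static const char from[] = "=(/)'<!>-";
    static const char to[]   = "#[\\]^{|}~";
    const char *p = (third != '\0') ? strchr(from, third) : NULL;
    return p ? to[p - from] : '\0';
}

/* First step: copy src to dst with trigraph sequences replaced.
   dst must have room for strlen(src) + 1 characters. */
static void replace_trigraphs(const char *src, char *dst)
{
    while (*src != '\0') {
        char r = '\0';
        if (src[0] == '?' && src[1] == '?')
            r = trigraph(src[2]);
        if (r != '\0') {
            *dst++ = r;
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

/* Second step: replace each complete UCN (backslash, u, 4 hex digits, or
   backslash, U, 8 hex digits) by a single value; every other character is
   passed through as its own value. Returns the number of values stored,
   which never exceeds cap. */
static size_t replace_ucns(const char *src, unsigned long *out, size_t cap)
{
    size_t n = 0;
    while (*src != '\0' && n < cap) {
        size_t ndigits = 0;
        if (src[0] == '\\' && src[1] == 'u')
            ndigits = 4;
        else if (src[0] == '\\' && src[1] == 'U')
            ndigits = 8;

        unsigned long value = 0;
        size_t i = 0;
        while (i < ndigits && hex_value(src[2 + i]) >= 0) {
            value = value * 16 + (unsigned long)hex_value(src[2 + i]);
            i++;
        }
        if (ndigits != 0 && i == ndigits) {
            out[n++] = value;            /* complete UCN: one character */
            src += 2 + ndigits;
        } else {
            out[n++] = (unsigned char)*src++;
        }
    }
    return n;
}

/* Sketch of the revised phase 1: trigraphs are replaced before UCNs, so a
   backslash spelled as a trigraph can still introduce a UCN. */
size_t phase1(const char *src, unsigned long *out, size_t cap)
{
    char buf[256];                       /* arbitrary limit for the sketch */
    assert(src != NULL && out != NULL);
    assert(strlen(src) < sizeof buf);
    replace_trigraphs(src, buf);
    return replace_ucns(buf, out, cap);
}
```

Called on the ten characters a??/u00C0b (written "a?\?/u00C0b" in a C string literal so that the compiler's own trigraph handling does not interfere), the sketch stores three values: the code for a, 0xC0, and the code for b. Had UCNs been replaced before trigraphs, the backslash produced by ??/ would have appeared too late to introduce the UCN.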