Document Number: WG14 N831 / J11 98-030
C9X Revision Proposal
Title: UCN Revision
Author: Douglas A. Gwyn
Author Affiliation: U.S. Army Research Laboratory
Postal Address: 6449 Tauler Ct., Columbia, MD 21045-4530 US
E-mail Address: firstname.lastname@example.org
Telephone Number: (301)394-2287
Fax Number: (301)394-3591
Sponsor: J11: Douglas A. Gwyn
Date: 29 May, 1998
Document History: new proposal based on on-line discussion
Area of Standard Affected:
Prior Art: Plan 9 C compiler, C++ standard, C9x CD1
Target Audience: programmers who need to use extended
characters in their C sources, especially string literals
Related Documents (if any): C9x CD1
Proposal Attached: X Yes
Abstract: The C9x draft requires replacement of extended
multibyte source characters by universal character names.
This is unwise in two situations: (a) string literals
contain multibyte characters in an encoding that doesn't
map onto ISO 10646 in a one-to-one manner, e.g. a shift
encoding; (b) extended characters are used that have no
distinct code value in ISO 10646, e.g. distinctive Chinese
and Japanese ideographs to which ISO 10646 assigns the same
code. It is better to leave explicit extended multibyte
characters as they were written.
Proposal: The basic idea is to not require extended
multibyte source characters to be mapped into anything else,
until translation phase 5 (where execution-time codes are
created). Apart from this improvement, an attempt is made
to preserve the useful properties and characteristics of the
previous UCN-based specification. The implementation of
this proposal is given with respect to C9x WD 1997-11-21 as
modified by previous editorial changes, but it should be
easy enough to apply to any more recent working draft:
In 5.1.1.2 Translation phases, change phase 1 to read:
1. Physical source file multibyte characters are mapped
to the source character set (introducing new-line
characters for end-of-line indicators) if necessary.
Trigraph sequences, then universal character names,
are replaced by corresponding single members of the
source character set.
(Delete the footnote about handling extended characters.)
Note: Trigraphs have to be processed first, since UCNs use
\ which is not defined in the ISO 646 invariant codeset.
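The ordering constraint in the note above can be sketched in C.
This is only an illustration, not part of the proposed wording:
replace_trigraphs and count_ucns are hypothetical helper names,
and the UCN scanner is deliberately simplified (it does not
treat a doubled backslash specially). It shows that a UCN
spelled with the ??/ trigraph is invisible until trigraph
replacement has produced the backslash.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: map the third character of a trigraph to its
   replacement, or return 0 if the sequence is not a trigraph. */
static char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing each trigraph by its single character.
   dst must have room for at least strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char r;
        if (src[0] == '?' && src[1] == '?' &&
            (r = trigraph_char(src[2])) != 0) {
            *dst++ = r;
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

/* Return 1 if the next n characters of s are all hexadecimal digits.
   The scan stops at the terminating null, which is not a hex digit. */
static int hex_run(const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Count sequences of the form backslash-u plus 4 hex digits or
   backslash-U plus 8 hex digits (simplified UCN syntax). */
size_t count_ucns(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (s[0] == '\\' && s[1] == 'u' && hex_run(s + 2, 4)) {
            n++;
            s += 6;
        } else if (s[0] == '\\' && s[1] == 'U' && hex_run(s + 2, 8)) {
            n++;
            s += 10;
        } else {
            s++;
        }
    }
    return n;
}
```

Applied to the eight source characters ? ? / u 0 0 C 4,
count_ucns finds nothing before replace_trigraphs runs and one
UCN afterwards. When writing such test input as a C string
literal, spell it "?\?/u00C4" so that the compiler does not
itself replace the trigraph.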
In 5.1.1.2 Translation phases, phase 4, delete the sentence
concerning token concatenation producing a character
sequence that matches the syntax of a UCN.
In 5.1.1.2 Translation phases, change phase 5 to read:
5. Each source character set member and escape sequence
in character constants and string literals is
converted to a member of the execution character
set.
In 5.2.1 Character sets, change paragraph 2 by attaching a
footnote * at the end of the first sentence (about \ escape
sequences):
* Backslash characters introducing universal character
names have already been replaced in translation
phase 1.
Note: If the syntax spec for UCNs has been moved to around
6.1.2 Identifiers, it should be returned to around 5.2.1 (it
probably should have its own subsection after 5.2.1.1
Trigraphs). This is required by the change in translation
phase. See the following change:
In 6.1.2 Identifiers, Syntax, replace the expansion of
nondigit as universal-character-name by:
and in 6.1.2 Identifiers, Description, replace the sentence
about universal character names with:
Each extended source character in an identifier
shall designate a character whose encoding in ISO
10646-1 falls into one of the ranges specified in
Annex I.
Note: It is essential that Annex I exclude the basic source
characters; otherwise, this spec needs further constraints.
I see in the latest draft the sentence preceding the one
just mentioned has been changed to mention UCNs; that part
must be changed to read:
... and certain extended source characters
which will immediately be explained by the sentence above.
Also in the latest draft, the sentence following the one
replaced above has been changed to mention UCNs; that
sentence must be changed to read:
The initial nondigit character shall not be an
extended source character designating a digit.
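The identifier rules above can be sketched as a check over ISO
10646 code values. This is a minimal sketch under stated
assumptions: valid_identifier is a hypothetical name, and the
two tables hold only a small illustrative subset of ranges (a
few Latin letters and one digit block), not the full Annex I
list, which remains the authority.

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long lo, hi; };

/* ASSUMPTION: illustrative subset only; Annex I is authoritative. */
static const struct range letter_ranges[] = {
    { 0x00C0UL, 0x00D6UL }, { 0x00D8UL, 0x00F6UL }
};
static const struct range digit_ranges[] = {
    { 0x0660UL, 0x0669UL }
};

static int in_ranges(unsigned long c, const struct range *r, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (c >= r[i].lo && c <= r[i].hi)
            return 1;
    return 0;
}

/* Basic source characters, given by their ISO 10646 code values. */
static int is_basic_nondigit(unsigned long c)
{
    return (c >= 0x41 && c <= 0x5A) || (c >= 0x61 && c <= 0x7A) ||
           c == 0x5F;
}

static int is_basic_digit(unsigned long c)
{
    return c >= 0x30 && c <= 0x39;
}

static int is_ext_letter(unsigned long c)
{
    return in_ranges(c, letter_ranges,
                     sizeof letter_ranges / sizeof letter_ranges[0]);
}

static int is_ext_digit(unsigned long c)
{
    return in_ranges(c, digit_ranges,
                     sizeof digit_ranges / sizeof digit_ranges[0]);
}

/* Return 1 if the n code values in cp form a valid identifier:
   every extended character lies in a listed range, and the first
   character is neither a basic digit nor an extended digit. */
int valid_identifier(const unsigned long *cp, size_t n)
{
    size_t i;
    assert(cp != NULL);
    if (n == 0)
        return 0;
    if (!is_basic_nondigit(cp[0]) && !is_ext_letter(cp[0]))
        return 0;
    for (i = 1; i < n; i++)
        if (!is_basic_nondigit(cp[i]) && !is_basic_digit(cp[i]) &&
            !is_ext_letter(cp[i]) && !is_ext_digit(cp[i]))
            return 0;
    return 1;
}
```

With these tables, an identifier beginning with code value 00E9
is accepted, one beginning with 0661 (a digit) is rejected, and
0661 is accepted in any later position.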
In 6.1.3.4 Character constants, delete the expansion of
c-char as universal-character-name.
In 6.1.4 String literals, delete the expansion of s-char as
universal-character-name.
Note: "Any member of the source character set" includes
extended source characters, including those resulting from
UCN replacement in translation phase 1.
(Annex I Universal character names for identifiers seems OK.)
In K.2 Undefined behavior, delete the second item (about
token concatenation producing a character sequence matching
the syntax of a UCN).
And that's it -- the effect is to allow UCNs to be used to
denote characters not in the local source character set, to
allow extended source characters in identifiers, and to map
multibyte and extended-resulting-from-UCN-replacement source
characters in character constants and string literals only
at the last possible moment, so shift codes etc. are intact.
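Why early mapping can damage shift codes may be easier to see
with a toy model. The sketch below assumes an entirely
hypothetical stateful encoding (SO and SI as shift bytes, and
shifted bytes mapped to 3000 + byte as a stand-in for ISO 10646
values); toy_decode and toy_encode are illustrative names, not
any real library's API. Decoding to code values and re-encoding
does not reproduce the bytes as written, because redundant
shift sequences have no code value of their own.

```c
#include <assert.h>
#include <stddef.h>

enum { SO = 0x0E, SI = 0x0F };   /* shift-out, shift-in bytes */

/* Decode n bytes of the toy encoding into code values.
   out must have room for n values. Returns the count written. */
size_t toy_decode(const unsigned char *in, size_t n, unsigned long *out)
{
    size_t i, m = 0;
    int shifted = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] == SO)
            shifted = 1;
        else if (in[i] == SI)
            shifted = 0;
        else
            out[m++] = shifted ? 0x3000UL + in[i] : in[i];
    }
    return m;
}

/* Re-encode code values, emitting a shift byte only when the state
   must change. out must have room for 2 * n + 1 bytes. */
size_t toy_encode(const unsigned long *in, size_t n, unsigned char *out)
{
    size_t i, m = 0;
    int shifted = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        int want = in[i] >= 0x3000UL;
        if (want != shifted) {
            out[m++] = (unsigned char)(want ? SO : SI);
            shifted = want;
        }
        out[m++] = (unsigned char)(want ? in[i] - 0x3000UL : in[i]);
    }
    if (shifted)
        out[m++] = SI;   /* return to the initial shift state */
    return m;
}
```

For the six bytes SO 'A' SI SO 'B' SI, toy_decode yields two
code values and toy_encode gives back only four bytes, SO 'A'
'B' SI. A string literal passed through such a round trip would
no longer contain what the programmer wrote, which is the loss
that deferring the mapping until phase 5 avoids.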