CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Referencing of characters

Identification of characters for migration to the UCS

Since a character is an abstract concept, the question of whether two characters are or are not "the same" is not a trivial one. If we recall the definition:

character: A member of a set of elements used for the organisation, control, or representation of data (taken from ISO/IEC 10646-1:1993)

then there is no problem within a single coded character set, since any such set must clearly specify its members. But frequently, in the transmission or processing of character data, that data needs to be converted from one coded representation to another. This is a particular problem in the migration of existing applications from other character set codes to the UCS. The question then arises of identifying "the same character" in two different character sets.

One cannot just look at the characters, in a visual representation, since distinct characters may have the same glyph. The question should be whether they are both used in the same way in the organisation, control or representation of data. But the specification of a coded character set does not specify how its characters should be used; that is outside of its scope. It merely makes the characters available for use. The "sameness" of characters from different coded character sets is therefore ultimately a matter of convention or of definition.

One of the largest resources of coded character sets is the International Register of Coded Character Sets to be used with Escape Sequences, maintained and published by the Registration Authority for ISO 2375 in accordance with the procedures of that standard. Those procedures specify how to compare two coded character sets, as follows. Two sets are deemed to be identical if

the number of characters is the same;
the names of the characters are the same according to the terminology of the Registration Authority;
the same positions are used for the same characters;
both sets are of the same type, in particular both are C0 sets or both are C1 sets;
the definitions of control characters are functionally equivalent (a more restricted definition is not considered equivalent);
graphic characters have the same geometric shape apart from aesthetic variations between fonts;
any non-spacing graphic characters are in the same positions.

If we abstract from this those aspects which compare individual characters, rather than their code positions or overall aspects of the complete set, we see that two graphic characters are regarded as identical if

they have the same name according to the terminology of the Registration Authority;
they may be represented by the same glyph;
for characters intended for combining with other characters, the rules for creating combinations are the same (in ISO 2375 the only recognized form of combination is the use of non-spacing characters).

The first of these requirements permits the Registration Authority to change the name (for example, from that used in the source standard whose code is being registered) to bring it into a standardized form. It is the policy of SC2, the ISO/IEC JTC1 sub-committee responsible for coded character set standards, to align the names of characters in its published standards with those used in ISO/IEC 10646. When necessary, renaming will take place when standards are next revised. Such renaming will ensure that two characters are given distinct names if they have distinct glyphs or distinct combining procedures. It follows that

two graphic characters, from different coded character sets, should be regarded as the same if they have the same name according to the character naming guidelines of ISO/IEC 10646.
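As an illustration of this rule, the following minimal sketch builds a correspondence between a legacy code table and the UCS purely by matching names. It assumes that the character names held in Python's unicodedata module may stand in for the names of ISO/IEC 10646; the table LEGACY_TABLE, its code positions and the function name map_to_ucs are hypothetical and are not taken from any registered set.

```python
import unicodedata

# Hypothetical excerpt of a legacy code table: code position -> character name.
# Positions and contents are illustrative only.
LEGACY_TABLE = {
    0x41: "LATIN CAPITAL LETTER A",
    0xC6: "LATIN CAPITAL LETTER AE",
    0xD5: "EIGHTH NOTE",
}


def map_to_ucs(table):
    """Return {legacy position: UCS code position or None}, matching by name."""
    mapping = {}
    for position, name in table.items():
        try:
            # lookup() returns the character that carries exactly this name
            mapping[position] = ord(unicodedata.lookup(name))
        except KeyError:
            # no character of that name is known: needs manual resolution
            mapping[position] = None
    return mapping


for position, code in map_to_ucs(LEGACY_TABLE).items():
    target = "unresolved" if code is None else f"U+{code:04X}"
    print(f"{position:02X} -> {target}")
```

Positions that come back as None are the cases in which the name in the source set has to be examined, and perhaps standardized, before a correspondence can be made.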

Naming guidelines of the UCS

The naming guidelines of the UCS are given in annex K of ISO/IEC 10646. They include the following:

By convention, only Latin capital letters A to Z, space, and hyphen shall be used for writing the names of characters.
NOTE - Names of ideographic characters may also include digits 0 to 9 provided that a digit is not the first character in a word.
In some cases, the name of a character can be followed by additional explanatory statements that are not part of the name. These statements shall be in parentheses and not in capital Latin letters, except for the initial letters of words where required.
The name of a character shall wherever possible denote its customary meaning, for example PLUS SIGN. Where this is not possible, names should describe shapes, not usage; for example: UPWARDS ARROW.
The names shall be constructed from an appropriate set of the applicable terms of the following grid and ordered in the sequence of this grid
1. Script (e.g. Latin, Cyrillic, Arabic - letters that are elements of more than one script are considered different even if their shape is the same. This is not necessarily true of non-letter characters such as punctuation marks, even where the usage differs between scripts. In such cases the name reflects the most customary use, with alternative names in parentheses)
2. Case (e.g. capital, small).
3. Type (e.g. letter, ligature, digit).
4. Language (e.g. Ukrainian - only included to remove ambiguity, such as between CYRILLIC CAPITAL LETTER I and CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I, which are distinguished by having different glyphs).
5. Attribute (e.g. final, sharp, subscript, vulgar).
6. Designation (e.g. customary name, name of letter - the names of letters of all scripts other than Latin are represented by their transcription in the language of the first published International Standard).
7. Mark(s) (e.g. acute, ogonek, ring above, diaeresis).
8. Qualifier (e.g. sign, symbol).
There are exceptions to the above rules. Traditional names such as APOSTROPHE and COMMERCIAL AT shall be acceptable as names, and alphanumeric identifiers shall be used for ideographic characters.

Not all of the eight terms in this numbered list need be present. Examples of character names, with term numbers added after each name element, are

LATIN (1) SMALL (2) LETTER (3) DOTLESS (5) J (6) WITH STROKE (7)
DIGIT (3) FIVE (6)
LEFT (5) CURLY (5) BRACKET (6)
COMBINING (5) ACUTE (7) ACCENT (8)
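The first guideline above, on the characters permitted in a name, is mechanical enough to check by program. The sketch below is one possible reading of it, not a normative test: it assumes that a "word" is a run of characters delimited by spaces, so that a hyphenated element counts as one word. The function name is_valid_name is hypothetical, and the ideographic name in the usage note is given only as an illustration of an alphanumeric identifier.

```python
import string

LETTERS = set(string.ascii_uppercase)
DIGITS = set(string.digits)


def is_valid_name(name, ideographic=False):
    """Check a name against the permitted repertoire of the naming guidelines.

    Assumption: words are separated by single spaces; hyphens join
    elements within one word.
    """
    allowed = LETTERS | {" ", "-"}
    if ideographic:
        allowed |= DIGITS  # digits are permitted only for ideographic names
    if not name or any(ch not in allowed for ch in name):
        return False
    # A digit may not be the first character of a word; an empty word
    # (doubled, leading or trailing space) is also rejected.
    return all(word and word[0] not in DIGITS for word in name.split(" "))
```

With this reading, is_valid_name("CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I") holds, while a name such as "CJK UNIFIED IDEOGRAPH-4E00" is accepted only when ideographic=True is given. The check covers the repertoire only; it does not verify the order of terms in the grid.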

These guidelines are sufficiently clear that there are very few cases in which it is unclear whether two characters from different coded character sets should have the same name under them. Here are some examples of naming problems.

The changes in Technical Corrigendum 1 concerning Æ were said to be changes of name, for example
- from LATIN CAPITAL LIGATURE AE
- to LATIN CAPITAL LETTER AE.
  Both names are constructed according to the naming guidelines and they differ in one name element, namely Type. They are therefore names for two different characters with the same glyph. The Technical Corrigendum changed the characters allocated to the six code positions affected; it did not rename six characters. This was, however, a correction of an editorial error and not a change of coding from the intentions of the first edition.
The second edition (1994) of ISO/IEC 6937 contains a character MUSIC NOTE (with a musical quaver as the illustrative glyph). This is not present in the UCS, but the UCS does contain the two characters QUARTER NOTE and EIGHTH NOTE, which have more specific names and illustrative glyphs of a musical crotchet and quaver respectively. Comparison of the glyphs in the two standards shows that MUSIC NOTE should be treated as the character EIGHTH NOTE, not as QUARTER NOTE.
The first edition (1983) of ISO/IEC 6937, which preceded current naming guidelines, contained a character with the name "small letter g with acute accent". In 8.3, the second edition states that this character has been renamed as LATIN SMALL LETTER G WITH CEDILLA in order to align with ISO/IEC 10367 (the cedilla being placed above the g for presentation purposes). However, the UCS contains both LATIN SMALL LETTER G WITH CEDILLA (in the collection LATIN EXTENDED-A) and LATIN SMALL LETTER G WITH ACUTE (in the collection LATIN EXTENDED-B). The justification for the name change is that the original name was in error; the character concerned was always intended to be the small letter corresponding to "capital letter g with cedilla" but was named erroneously due to the positioning of the diacritical mark.

Linguistic translation of character names

Because of the significance of the names of characters in constructing correspondences between the UCS and other coded character sets, there has been controversy within the relevant sub-committee, ISO/IEC JTC1/SC2, over whether the names of characters may be translated when the text of ISO/IEC 10646-1 is translated into another language. It has recently been agreed that the names of characters may be translated.

One effect of this decision is that names will no longer serve as language-independent unique identifiers of characters. They retain their central role in determining whether characters from different coded character sets are or are not the same, but the comparison of names must take place in a common language.

Unique identifiers for characters

If names of characters are to be translatable, some other form of unique identifier for characters is needed that is language independent. Since the aim of the UCS is to include all the world's characters, the coding of a character in the UCS can be used as an identifier of that character in all situations, including in the specification of other coded character sets. Such a scheme would solve, for the future, the problem of comparing characters from different coded character sets. However, in order to add such identifiers to existing character sets as they are revised, it is first necessary to create a correspondence between the set concerned and the UCS by means of names as described above.

Amendment 9 to ISO/IEC 10646-1 proposes several alternative forms for unique identifiers constructed from UCS code positions. These have the following constructions, in which hhhhhhhh represents the eight hexadecimal digits that represent the code position in the UCS and kkkk represents the last four of these digits for characters of the Basic Multilingual Plane (BMP):

hhhhhhhh or -hhhhhhhh or T-hhhhhhhh or U-hhhhhhhh;
kkkk or +kkkk or T+kkkk or U+kkkk.

The significance of the optional prefixes is as follows:

a minus sign indicates that the numeric form is the eight-digit form, a plus sign indicates that it is the four-digit form;
a letter T indicates that the identifier refers to the character at the specified code position before the application of Amendment 5 to the first edition of ISO/IEC 10646-1, this amendment being the reallocation of Hangul syllabic characters from the A-zone to the O-zone;
a letter U indicates that the identifier refers to the character at the specified code position after the application of Amendment 5.

If there is no prefix letter then the relevant amendment level is unspecified. The three forms (no prefix letter, T prefix, U prefix) coincide unless hhhhhhhh lies in the range 00003400 to 00004DFF inclusive. For this range, the correspondence between the T and U forms is given by the mapping table in the annex to Amd.5. As an example:

T+340F and U+AC19 identify the same character.

The prefix letters, and the letters A to F used as hexadecimal letters, may be written either as capital letters or as small letters.
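The eight forms listed above can be recognized with two regular expressions. The following is a minimal sketch under the assumption that a prefix letter is always followed by its sign, as in the forms listed; the names UcsIdentifier and parse_ucs_identifier are hypothetical and do not come from the amendment.

```python
import re
from typing import NamedTuple, Optional


class UcsIdentifier(NamedTuple):
    level: Optional[str]  # "T", "U", or None when no prefix letter is given
    code: int             # code position as an integer


# Eight-digit forms: hhhhhhhh, -hhhhhhhh, T-hhhhhhhh, U-hhhhhhhh
_EIGHT = re.compile(r"(?:([TU])?-)?([0-9A-F]{8})")
# Four-digit (BMP) forms: kkkk, +kkkk, T+kkkk, U+kkkk
_FOUR = re.compile(r"(?:([TU])?\+)?([0-9A-F]{4})")


def parse_ucs_identifier(text: str) -> UcsIdentifier:
    """Parse one identifier; letters may be capital or small."""
    s = text.upper()  # prefix and hexadecimal letters are case-insensitive
    match = _EIGHT.fullmatch(s) or _FOUR.fullmatch(s)
    if match is None:
        raise ValueError(f"not a recognised identifier form: {text!r}")
    return UcsIdentifier(match.group(1), int(match.group(2), 16))
```

For instance, parse_ucs_identifier("t+340f") yields level "T" and code position 340F (hexadecimal), while a mixed form such as "U+00000041" is rejected. Converting a T identifier in the range 00003400 to 00004DFF into its U equivalent requires the mapping table of Amd.5, which this sketch does not reproduce.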

Unique identifiers for glyphs

The unique identifiers described above for characters are based on the International Standard ISO/IEC 10646. There is also an internationally agreed assignment of unique identifiers to glyphs, but this is instead based on an International Registration Authority. The registrar is the Association for Font Information Interchange and the register operates under procedures laid down in ISO/IEC 10036.

Glyphs registered under ISO/IEC 10036 are assigned an identifier by the Registration Authority that is a hexadecimal number in the range from 0 to FFFFFFFF. This is the same range of values as that used for identifiers of characters in accordance with ISO/IEC 10646. For each character of ASCII, one possible glyph has been assigned the same value as the character has in the ASCII code, and therefore also in the UCS. For example, the character LATIN CAPITAL LETTER A has the character identifier U+0041 and is represented by the glyph "A" which has the glyph identifier 41 (hexadecimal). However, certain characters of the ASCII code have had their interpretation refined as coded character sets have developed over time. This has led to departures from a strict correspondence even for the ASCII code. In particular:

U+0060 is the character GRAVE ACCENT but glyph identifier 0060 is a left single quotation mark (the glyph identifier for a grave accent is 00C1).

The use of code positions 27 (U+0027 is APOSTROPHE) and 60 for right and left single quotation marks was an allowed alternative in the original ASCII code. The glyph for a right single quotation mark is acceptable also for an apostrophe, but that for a left single quotation mark is not acceptable as a grave accent. These ASCII alternatives are still present in the registration entry under ISO 2375 for the ASCII code, namely ISO IR-6 in the International Register of Coded Character Sets to be used with Escape Sequences, as this entry dates from 1975. Register entries, once made, cannot be revised (other than in exceptional circumstances and if the possibility of revision was stated in the original entry). However, these alternatives are not present in the international standard equivalent to ASCII, namely the International Reference Version (IRV) of ISO/IEC 646:1991. Nevertheless, that standard states explicitly that its IRV may be identified as ISO IR-6.

For use in a wider context, ISO/IEC 9541-1 specifies a structured-name form for the identification of glyphs registered under ISO/IEC 10036. These have the form

ISO/IEC 10036/RA/Glyphs::nnnn

where nnnn is a sequence of decimal digits, beginning with a non-zero digit, which expresses in decimal notation the value of the hexadecimal glyph identifier assigned by the Registration Authority. The concept of a structured-name is specified normatively in ISO/IEC 9541-2, which gives both ASN.1 and SGML forms for such names.
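The conversion from a hexadecimal glyph identifier to this structured-name form is a change of number base. A minimal sketch follows; the function name glyph_structured_name is hypothetical, and the rejection of the value zero is an assumption made here because the digit sequence must begin with a non-zero digit.

```python
PREFIX = "ISO/IEC 10036/RA/Glyphs::"


def glyph_structured_name(glyph_id_hex: str) -> str:
    """Build the structured name for a glyph identifier written in hexadecimal."""
    value = int(glyph_id_hex, 16)
    # Assumption: zero is excluded, since "0" does not begin with a non-zero digit.
    if not 0 < value <= 0xFFFFFFFF:
        raise ValueError(f"glyph identifier out of range: {glyph_id_hex!r}")
    # str() of a positive integer never has leading zeros
    return PREFIX + str(value)


print(glyph_structured_name("41"))    # glyph "A", identifier 41 (hexadecimal)
print(glyph_structured_name("00C1"))  # glyph identifier quoted above for a grave accent
```

Under these assumptions, glyph identifier 41 (hexadecimal) gives the digit sequence 65, and 00C1 gives 193.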
