CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Visual representation of characters

Combining and non-combining characters

The concept of a combining character was discussed above in the section of this guide on precomposed and decomposed characters. This section describes their use in more detail.

The definition given in ISO/IEC 10646-1 is:

combining character: A member of an identified subset of the coded character set of ISO/IEC 10646 intended for combination with the preceding non-combining character, or with a sequence of combining characters preceded by a non-combining character.

Because combining characters are coded after the base non-combining character, they differ in principle from the non-spacing diacritical marks used in the variable-length code of ISO/IEC 6937. A non-spacing mark, whose behaviour reflects the capabilities of electromechanical printing devices, has to be received and printed by such a device before the base character with which it is combined. Combining characters in the UCS behave in a more intuitive way: it is more natural to apply a diacritical mark to a character that is already known or written than to one that has yet to arrive.
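The difference in ordering can be illustrated with a minimal sketch that moves each non-spacing mark from before its base letter to after it. The function name iso6937_to_ucs and the three-entry table are hypothetical; the byte values shown for the non-spacing grave, acute and diaeresis are assumptions that should be checked against ISO/IEC 6937 itself, and all other bytes are simply treated as ASCII.

```python
# Hypothetical partial table: ISO/IEC 6937 non-spacing mark -> UCS combining character.
# The byte values are an assumption to be verified against the standard.
NON_SPACING_TO_COMBINING = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC8: "\u0308",  # diaeresis
}


def iso6937_to_ucs(data: bytes) -> str:
    """Reorder 'mark, base' pairs into 'base, combining character' pairs."""
    out = []
    pending = None  # combining character waiting for its base letter
    for byte in data:
        if byte in NON_SPACING_TO_COMBINING:
            # The mark arrives first in ISO/IEC 6937; hold it back.
            pending = NON_SPACING_TO_COMBINING[byte]
        elif byte < 0x80:
            out.append(chr(byte))
            if pending is not None:
                # In the UCS the combining character follows its base.
                out.append(pending)
                pending = None
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch")
    return "".join(out)


converted = iso6937_to_ucs(b"caf\xc2e")
print([f"{ord(ch):04X}" for ch in converted])
```

In the converted string the combining acute accent follows the letter e, whereas in the input the non-spacing acute preceded it.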

Combining characters are not an essential part of the coding of the Latin script. They are modifiers of letters, and all combinations of base letter and diacritical mark in normal use in languages written in the Latin script are separately coded as precomposed characters. Some examples that illustrate how far the coding of precomposed Latin characters has been taken are:

0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE
01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
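The four examples can be inspected with a short sketch. It assumes that Python's standard unicodedata module, which follows the Unicode Standard rather than ISO/IEC 10646-1 directly, uses the same code positions and names as the UCS; the helper name describe is hypothetical. The decomposition field lists the code positions of the simpler characters from which each precomposed character can be built.

```python
import unicodedata


def describe(code_point):
    """Return (name, decomposition) for a UCS code position."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.decomposition(ch)


for code_point in (0x0149, 0x0171, 0x01E0, 0x1E1C):
    name, decomposition = describe(code_point)
    print(f"{code_point:04X} {name}: {decomposition}")
```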

However, diacritical marks are also used in the Latin script for reasons other than as part of normal spelling, for example to indicate stress positions in pronunciation. This may require combinations of letter and mark that are not used in normal language. The entire range of diacritical marks used with Latin (and Greek) scripts is available in collection 7, COMBINING DIACRITICAL MARKS.

A similar situation holds for the Greek script, both for its monotonic and polytonic forms. However, many other scripts such as Arabic and the Indic scripts (Devanagari, Gujarati, etc.) are written in such a manner that combining characters are an essential part of coding. The Indic scripts, for example, represent vowels by combining marks. The Cyrillic script falls between these two situations; whether combining characters are or are not essential depends on the language being represented.

The use of combining characters can add significantly to the difficulties of processing encoded text. For this reason, ISO/IEC 10646-1 defines three distinct levels of implementation in which none (level 1), some (level 2), or all (level 3) of the combining characters of the UCS are permitted to be encoded. These levels are described in more detail elsewhere in this guide. At any specified level of implementation, the defined character collections of the UCS have to be interpreted as if any combining characters whose use is not permitted at that level have been removed.

These differences between scripts lead to three distinct situations regarding the character collections of the UCS:

Collections not containing combining characters at any level of implementation: These are the collections not marked with either * or † in the tables given above. They include all the collections of graphic characters for the Latin and Greek scripts and also the Basic Hebrew collection. Text requiring only these collections can be encoded at implementation level 1.
Collections containing combining characters when used at level 3 but not at levels 1 or 2: These are the collections marked with † in the tables given above. At present there are four collections of this nature, but three of them (those containing the word COMBINING in their name) are empty at levels 1 and 2. The only collection of this nature available for use at levels 1 and 2 is the CYRILLIC collection. Much text written in the Cyrillic script does not make use of the combining characters and so can be encoded at implementation level 1. When they are required, it is necessary to use level 3.
Collections containing combining characters when used at either of levels 2 and 3: These are the collections marked with * in the tables given above. They include the HEBREW EXTENDED collection and all the collections for the Indic scripts. Much text requiring these collections can be encoded at level 2 but use of level 1 is not normally practicable.

It is the general intention of the UCS that for most purposes it will not be necessary to use a level 3 implementation but that the choice between levels 1 and 2 will depend on the character collections to be used.
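As a rough illustration of what a level 1 restriction means in practice, the following sketch reports whether a string contains any combining characters at all. It uses the Unicode general categories Mn, Mc and Me as a stand-in for the list of combining characters in ISO/IEC 10646-1; this is an assumption, and the two lists may not coincide exactly. The function names are hypothetical. Level 2 is not attempted, because it depends on the standard's own list of permitted combining characters.

```python
import unicodedata


def combining_characters(text):
    """Return the characters that Unicode classifies as marks (Mn, Mc or Me)."""
    return [ch for ch in text if unicodedata.category(ch).startswith("M")]


def fits_level_1(text):
    """True if the text contains no combining characters (approximation)."""
    return not combining_characters(text)


samples = {
    "precomposed Latin": "caf\u00e9",
    "decomposed Latin": "cafe\u0301",
    "Devanagari ka + vowel sign i": "\u0915\u093f",
}
for label, text in samples.items():
    print(label, fits_level_1(text))
```

The Devanagari sample shows why level 1 is not normally practicable for the Indic scripts: even a single consonant with a vowel sign needs a combining character.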

Composite sequences

The definitions clause of ISO/IEC 10646-1 includes the following terms:

composite sequence: A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters.
graphic symbol: The visual representation of a graphic character or of a composite sequence.
repertoire: A specified set of characters that are represented in a coded character set.

It also contains the following notes:

A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each character in the sequence.
A composite sequence is not a character and therefore is not a member of the repertoire of ISO/IEC 10646.

The term glyph is not used in ISO/IEC 10646 but its use enables us to divide the formation of a graphic symbol into two stages:

The non-combining character or composite sequence is mapped to a glyph, which is an abstract graphic symbol;
An image of the glyph is then formed according to some presentation process, to form a real graphic symbol.

This division separates all aspects such as the font to be used, its size and emphasis, etc. from the selection of the appropriate glyph. The process of creating the image of the appropriate glyph will in general be very complex, but it is entirely separate from the process of selecting the appropriate glyph to represent the character data. The mapping from the character data (single character or composite sequence) to the glyph is from one abstract entity to another, devoid of all complications arising from the actual presentation process.

The definitions quoted above give rise to the following features of the mapping from character data to glyphs:

A single non-combining character (which is in the repertoire of the UCS) and a composite sequence (which is not in the repertoire) may map to the same glyph, e.g. the character LATIN SMALL LETTER E WITH ACUTE and the composite sequence LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
A composite sequence may map to a glyph which represents a character that is not present in the UCS, e.g. LATIN SMALL LETTER G followed by COMBINING GRAVE ACCENT; there is no character LATIN SMALL LETTER G WITH GRAVE in the UCS, although LATIN SMALL LETTER G WITH ACUTE is present. Characters which can be represented in this way by the UCS, but which are not in the UCS, are not considered to be part of the repertoire of the UCS.
A given piece of text may have many equally valid representations by a string of characters, depending on whether precomposed or decomposed characters are used.
As there are no restrictions on the use of combining characters (in the levels of implementation at which they are permitted at all), many composite sequences will not map to any meaningful glyph. This applies in particular to composite sequences in which non-combining characters from one script are followed by combining characters from another script. This is not forbidden but it is unlikely to be meaningful.
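The first three features can be observed with a minimal sketch. It relies on Unicode normalisation form NFC as implemented in Python's unicodedata module; normalisation is a concept of the Unicode Standard that is not described in this guide, and it is used here only as a convenient way of asking whether a precomposed character exists for a given composite sequence.

```python
import unicodedata

e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
e_seq = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
g_grave = "g\u0300"  # LATIN SMALL LETTER G + COMBINING GRAVE ACCENT
g_acute = "g\u0301"  # LATIN SMALL LETTER G + COMBINING ACUTE ACCENT

# Two different codings of the same text compare unequal as raw strings...
print(e_acute == e_seq)
# ...but the composite sequence composes to the precomposed character.
print(unicodedata.normalize("NFC", e_seq) == e_acute)

# g with acute has a precomposed form, so the sequence shrinks to one character.
print(len(unicodedata.normalize("NFC", g_acute)))
# g with grave has none, so the composite sequence is left as two characters.
print(len(unicodedata.normalize("NFC", g_grave)))
```

Software that compares or searches text therefore has to decide whether the precomposed and decomposed codings are to be treated as the same.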

These features make the UCS very different from ISO/IEC 6937, despite the similarity at first sight between the combining characters of the UCS and the non-spacing diacritical marks of ISO/IEC 6937. This latter standard has the following features which may be compared with the list given above for the UCS:

The letters which can be formed from a base letter preceded by a non-spacing diacritical mark are part of the repertoire of ISO/IEC 6937.
A given piece of text has a unique coding in terms of ISO/IEC 6937 since accented characters are available only in decomposed form (i.e. coded with the aid of non-spacing diacritical marks).
There is a normative list of the characters comprising the repertoire of the standard, so constructions that have no meaningful glyph, such as NON-SPACING GRAVE ACCENT followed by POUND SIGN, are not conformant to ISO/IEC 6937. Unfortunately this has resulted in certain combinations of grave and acute accents, and of diaeresis, with the letters W and Y, used in the Welsh language, being absent from the repertoire of that standard even though they have natural representations in terms of the available non-spacing diacritical marks.

Use of multiple combining characters

The UCS permits more than one combining character to be applied to a single non-combining character, and lays down rules for interpreting the result. In outline, these are as follows:

When two combining characters can interact, then by default they are to be positioned in the sequence in which they are received, working from the base character outward. For example, if a combining macron is followed by a combining diaeresis then the diaeresis would appear above the macron, but if the order were reversed then the macron would be above the diaeresis.
Some specific combining characters override the default behaviour by being positioned horizontally, rather than vertically, in relation to one another or by forming a ligature with an adjacent combining character. In this case they are positioned in the sequence in which they are received, working in the same direction (left-to-right or right-to-left) as the script is written.
When two combining characters do not interact, such as when one is positioned above the base character and the other below it, then the order in which they are coded does not affect the resulting glyph.
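The distinction between interacting and non-interacting combining characters can be sketched with the canonical combining class of the Unicode Standard, as exposed by Python's unicodedata module. This is an assumption made for illustration: the combining class and normalisation form NFD are Unicode concepts, not part of the rules quoted from the UCS above. Marks of the same class are taken to interact, so their order is preserved; marks of different classes are put into a fixed order, so either coding gives the same result.

```python
import unicodedata

MACRON = "\u0304"     # above the base character
DIAERESIS = "\u0308"  # above the base character
ACUTE = "\u0301"      # above the base character
DOT_BELOW = "\u0323"  # below the base character


def canonical(text):
    """Return the text in normalisation form NFD (marks in canonical order)."""
    return unicodedata.normalize("NFD", text)


for mark in (MACRON, DIAERESIS, ACUTE, DOT_BELOW):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Two marks above the base interact: the coding order is significant.
print(canonical("a" + MACRON + DIAERESIS) == canonical("a" + DIAERESIS + MACRON))

# One mark above and one below do not interact: the coding order is not significant.
print(canonical("a" + DOT_BELOW + ACUTE) == canonical("a" + ACUTE + DOT_BELOW))
```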
