CEN Guide to the Use of Character Sets in Europe - TC 304

UCS - Visual representation of characters


Combining and non-combining characters

The concept of a combining character was discussed above in the section of this guide on precomposed and decomposed characters. This section describes their use in more detail.

The definition given in ISO/IEC 10646-1 is:

combining character
A member of an identified subset of the coded character set of ISO/IEC 10646 intended for combination with the preceding non-combining character, or with a sequence of combining characters preceded by a non-combining character.

Because combining characters are coded after the base non-combining character, they differ in principle from the non-spacing diacritical marks used in the variable-length code of ISO/IEC 6937. A non-spacing mark, whose design reflects the capabilities of electromechanical printing devices, has to be received and printed by such a device before the base character with which it is combined. Combining characters in the UCS behave more intuitively: it is more natural to apply a diacritical mark to a character that is already known or written than to one that has yet to arrive.
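The difference in ordering can be illustrated with a minimal Python sketch. It assumes that the Unicode general categories Mn, Mc and Me (as exposed by the standard `unicodedata` module) are an acceptable stand-in for the identified subset of combining characters, and it models a mark-before-base stream with Unicode characters rather than real ISO/IEC 6937 bytes. The function name `marks_first_to_base_first` is hypothetical.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: general categories Mn/Mc/Me approximate the
    # combining characters identified by ISO/IEC 10646.
    return unicodedata.category(ch).startswith("M")


def marks_first_to_base_first(text: str) -> str:
    """Move each mark that precedes its base so that it follows the base."""
    out = []
    pending = []  # marks received but not yet attached to a base
    for ch in text:
        if is_combining(ch):
            pending.append(ch)
        else:
            out.append(ch)       # the base character comes first in the UCS
            out.extend(pending)  # followed by the marks that preceded it
            pending.clear()
    out.extend(pending)  # marks with no following base are kept as received
    return "".join(out)


# Mark-before-base order: the acute accent arrives before the "e".
legacy_order = "caf\u0301e"
ucs_order = marks_first_to_base_first(legacy_order)
```

In this sketch `ucs_order` holds the letters c, a, f, e followed by COMBINING ACUTE ACCENT, which is the order the UCS definition requires.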

Combining characters are not an essential part of the coding of the Latin script. They are modifiers of letters, and every combination of base letter and diacritical mark in normal use in languages written in the Latin script is separately coded as a precomposed character. Some examples which illustrate the level to which the coding of precomposed Latin characters has been taken are:

However, diacritical marks are also used in the Latin script for purposes other than normal spelling, for example to indicate stress positions in pronunciation. This may require combinations of letter and mark that are not used in normal language and so have no precomposed form. The entire range of diacritical marks used with the Latin (and Greek) scripts is available in collection 7, COMBINING DIACRITICAL MARKS.
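The contrast between combinations that have a precomposed form and those that do not can be observed with Unicode normalization, as a rough illustration. The sketch below uses Python's `unicodedata.normalize` with form NFC, which is a Unicode mechanism rather than something defined by ISO/IEC 10646-1; the pairing of "x" with an acute accent is an arbitrary example chosen only because no precomposed character exists for it.

```python
import unicodedata


def compose(text: str) -> str:
    # NFC replaces base + mark pairs with a precomposed character
    # wherever one is available, and leaves other pairs untouched.
    return unicodedata.normalize("NFC", text)


# "e" + COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
e_acute = compose("e\u0301")

# "x" + COMBINING ACUTE ACCENT has none, so the combining
# character remains necessary to represent it.
x_acute = compose("x\u0301")

print(len(e_acute), len(x_acute))
```

The first result is a single coded character; the second remains a base letter followed by a combining character.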

A similar situation holds for the Greek script, both for its monotonic and polytonic forms. However, many other scripts such as Arabic and the Indic scripts (Devanagari, Gujarati, etc.) are written in such a manner that combining characters are an essential part of coding. The Indic scripts, for example, represent vowels by combining marks. The Cyrillic script falls between these two situations; whether combining characters are or are not essential depends on the language being represented.

The use of combining characters can add significantly to the difficulties of processing encoded text. For this reason, ISO/IEC 10646-1 defines three distinct levels of implementation in which none (level 1), some (level 2), or all (level 3) of the combining characters of the UCS are permitted to be encoded. These levels are described in more detail elsewhere in this guide. At any specified level of implementation, the defined character collections of the UCS have to be interpreted as if any combining characters whose use is not permitted at that level had been removed.

These differences between scripts lead to three distinct situations regarding the character collections of the UCS:

Collections not containing combining characters at any level of implementation
These are the collections not marked with either * or † in the tables given above. They include all the collections of graphic characters for the Latin and Greek scripts and also the Basic Hebrew collection. Text requiring only these collections can be encoded at implementation level 1.
Collections containing combining characters when used at level 3 but not at levels 1 or 2
These are the collections marked with † in the tables given above. At present there are four collections of this nature, but three of them (those containing the word COMBINING in their name) are empty at levels 1 and 2. The only collection of this nature available for use at levels 1 and 2 is the CYRILLIC collection. Much text written in the Cyrillic script does not make use of the combining characters and so can be encoded at implementation level 1. When they are required, it is necessary to use level 3.
Collections containing combining characters when used at either of levels 2 and 3
These are the collections marked with * in the tables given above. They include the HEBREW EXTENDED collection and all the collections for the Indic scripts. Much text requiring these collections can be encoded at level 2 but use of level 1 is not normally practicable.

It is the general intention of the UCS that for most purposes it will not be necessary to use a level 3 implementation but that the choice between levels 1 and 2 will depend on the character collections to be used.
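As a rough illustration of the level 1 restriction discussed in this section, the following Python sketch reports whether a piece of text contains any combining characters at all. It assumes that the Unicode general categories Mn, Mc and Me approximate the combining characters of the UCS; the function names are hypothetical. It does not attempt to distinguish level 2 from level 3, since that depends on the specific list of characters given in the standard.

```python
import unicodedata

COMBINING_CATEGORIES = ("Mn", "Mc", "Me")  # assumption: stand-in for the UCS subset


def combining_characters_in(text: str) -> list:
    """Return the combining characters found in text, in order."""
    return [ch for ch in text if unicodedata.category(ch) in COMBINING_CATEGORIES]


def fits_level_1(text: str) -> bool:
    # Level 1 permits no combining characters to be encoded.
    return not combining_characters_in(text)


precomposed = "caf\u00e9"        # uses the precomposed e with acute
decomposed = "cafe\u0301"        # uses a combining acute accent
devanagari = "\u0939\u093f"      # letter HA followed by vowel sign I

for sample in (precomposed, decomposed, devanagari):
    print(ascii(sample), fits_level_1(sample))
```

Only the precomposed Latin sample passes the check, which matches the observation that Latin text can be encoded at level 1 while Indic text normally cannot.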

Composite sequences

The definitions clause of ISO/IEC 10646-1 includes the following terms:

composite sequence
A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters.
graphic symbol
The visual representation of a graphic character or of a composite sequence.
repertoire
A specified set of characters that are represented in a coded character set.

It also contains the following notes:

The term glyph is not used in ISO/IEC 10646 but its use enables us to divide the formation of a graphic symbol into two stages:

  1. The non-combining character or composite sequence is mapped to a glyph, which is an abstract graphic symbol;
  2. An image of the glyph is then formed according to some presentation process, to form a real graphic symbol.

This division separates all aspects such as the font to be used, its size and emphasis, etc. from the selection of the appropriate glyph. The process of creating the image of the appropriate glyph will in general be very complex, but it is entirely separate from the process of selecting the appropriate glyph to represent the character data. The mapping from the character data (single character or composite sequence) to the glyph is from one abstract entity to another, devoid of all complications arising from the actual presentation process.
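The units of character data that are mapped to glyphs (a single non-combining character or a composite sequence) can be sketched in Python as follows. This is an illustration only: it assumes that the Unicode general categories Mn, Mc and Me identify combining characters, the function name `composite_units` is hypothetical, and the selection of an actual glyph for each unit is left out entirely.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: general categories Mn/Mc/Me approximate the
    # combining characters identified by ISO/IEC 10646.
    return unicodedata.category(ch).startswith("M")


def composite_units(text: str) -> list:
    """Split text into non-combining characters and composite sequences."""
    units = []
    for ch in text:
        if is_combining(ch) and units:
            units[-1] += ch   # extend the current composite sequence
        else:
            units.append(ch)  # start a new unit with this character
    return units


# e + acute, a + macron + diaeresis, then a plain b
sample = "e\u0301a\u0304\u0308b"
units = composite_units(sample)
print([ascii(u) for u in units])
```

Each element of `units` is one item of character data in the sense used above; a combining character with no preceding base is kept as a unit of its own in this sketch.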

The definitions quoted above give rise to the following features of the mapping from character data to glyphs:

These features make the UCS very different from ISO/IEC 6937, despite the similarity at first sight between the combining characters of the UCS and the non-spacing diacritical marks of ISO/IEC 6937. This latter standard has the following features which may be compared with the list given above for the UCS:

Use of multiple combining characters

The UCS permits more than one combining character to be applied to a single non-combining character. For this case, the standard lays down rules for their interpretation. In outline, these are as follows:

  1. When two combining characters can interact, then by default they are to be positioned in the sequence in which they are received, working from the base character outward. For example, if a combining macron is followed by a combining diaeresis then the diaeresis would appear above the macron, but if the order were reversed then the macron would be above the diaeresis.
  2. Some specific combining characters override the default behaviour by being positioned horizontally, rather than vertically, in relation to one another or by forming a ligature with an adjacent combining character. In this case they are positioned in the sequence in which they are received, working in the same direction (left-to-right or right-to-left) as the script is written.
  3. When two combining characters do not interact, such as when one is positioned above the base character and the other below it, then the order in which they are coded does not affect the resulting glyph.
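Rules 1 and 3 have a close counterpart in Unicode canonical ordering, which can serve as a rough illustration. The Python sketch below uses normalization form NFD from the standard `unicodedata` module: marks that do not interact are put into a fixed order, so both codings compare equal, while marks that interact keep the order in which they were coded. The function name `same_after_reordering` is hypothetical, and rule 2 (horizontal positioning and ligatures) is not covered.

```python
import unicodedata

MACRON = "\u0304"     # positioned above the base
DIAERESIS = "\u0308"  # positioned above the base
ACUTE = "\u0301"      # positioned above the base
DOT_BELOW = "\u0323"  # positioned below the base


def canonical(text: str) -> str:
    # NFD decomposes the text and sorts non-interacting marks
    # into a fixed order; interacting marks are left as coded.
    return unicodedata.normalize("NFD", text)


def same_after_reordering(base: str, first: str, second: str) -> bool:
    """True if swapping the two marks gives canonically equivalent text."""
    return canonical(base + first + second) == canonical(base + second + first)


# Both marks sit above the base, so their order is significant (rule 1).
above_above = same_after_reordering("a", MACRON, DIAERESIS)

# One mark above and one below do not interact (rule 3).
above_below = same_after_reordering("a", ACUTE, DOT_BELOW)

print(above_above, above_below)
```

The first comparison yields `False` and the second `True`, in line with the rules outlined above.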
