TC 304
UCS - Visual representation of characters
The concept of a combining character was discussed above in the
section of this guide on precomposed and decomposed characters.
This section describes the use of combining characters in more detail.
The definition given in ISO/IEC 10646-1 is:
- combining character
- A member of an identified subset of the coded character set
of ISO/IEC 10646 intended for combination with the preceding non-combining
character, or with a sequence of combining characters preceded
by a non-combining character.
The fact that combining characters are coded following the base
non-combining character separates them in principle from the non-spacing
diacritical marks used for coding purposes in the variable-length
code of ISO/IEC 6937. The nature of a non-spacing symbol, based
on the capabilities of electromechanical printing devices, requires
that it be received and printed by such a device before the base
character with which it is combined. Combining characters in the
UCS behave in a more intuitive way; it is more natural to apply
diacritical marks to a character that is already known or written
than to one that has yet to be received.
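The difference in ordering can be illustrated with a minimal Python sketch. It assumes a byte stream in which a non-spacing mark precedes its base letter, as in ISO/IEC 6937, and rewrites it in UCS order. The function name `marks_first_to_ucs` is hypothetical, and the three byte values in the table are an assumption that should be checked against the standard itself.

```python
import unicodedata

# Assumed subset of ISO/IEC 6937 non-spacing marks, mapped to UCS
# combining characters. Illustrative only; verify against the standard.
NON_SPACING_TO_COMBINING = {
    0xC1: "\u0300",  # non-spacing grave     -> COMBINING GRAVE ACCENT
    0xC2: "\u0301",  # non-spacing acute     -> COMBINING ACUTE ACCENT
    0xC8: "\u0308",  # non-spacing diaeresis -> COMBINING DIAERESIS
}


def marks_first_to_ucs(data: bytes) -> str:
    """Reorder 'mark, base' pairs into UCS 'base, combining mark' order.

    Hypothetical helper: only ASCII base letters (bytes below 0x80) and
    the marks in the table above are handled.
    """
    out = []
    pending = None  # combining character waiting for its base letter
    for byte in data:
        if byte in NON_SPACING_TO_COMBINING:
            pending = NON_SPACING_TO_COMBINING[byte]
        else:
            out.append(chr(byte))
            if pending is not None:
                out.append(pending)  # the mark now follows its base
                pending = None
    return "".join(out)


text = marks_first_to_ucs(b"caf\xc2e")
# Names of the last two UCS characters: the base letter, then the mark.
last_two = [unicodedata.name(ch) for ch in text[-2:]]
print(last_two)
```

In the input the acute accent arrives before the letter e; in the result the letter e comes first and the combining character follows it.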
Combining characters are not an essential part of the coding of
the Latin script. They are modifiers of letters and all combinations
of base letter and diacritical mark in normal use in languages
written in the Latin script are separately coded as precomposed
characters. Some examples which illustrate the level to which
the coding of precomposed Latin characters has been taken are:
- 0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
- 0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE
- 01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
- 1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
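The four code positions listed above can be inspected with Python's standard `unicodedata` module. This is only a sketch: it relies on the Unicode character database shipped with Python, which is assumed here to agree with ISO/IEC 10646 for these characters, and it uses Unicode canonical decomposition (NFD), a notion that the definitions quoted in this section do not mention.

```python
import unicodedata

code_points = (0x0149, 0x0171, 0x01E0, 0x1E1C)

# Character name for each precomposed character.
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

# Fully decomposed form: a base letter followed by combining characters,
# or the character itself when it has no canonical decomposition.
decomposed = {cp: unicodedata.normalize("NFD", chr(cp)) for cp in code_points}

for cp in code_points:
    parts = ", ".join(unicodedata.name(ch) for ch in decomposed[cp])
    print(f"{cp:04X} {names[cp]} -> {parts}")
```

U+0149 is left unchanged by NFD because it has only a compatibility decomposition; the other three decompose into a base letter followed by one or two combining characters.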
However, diacritical marks are also used in the Latin script for
reasons other than as part of normal spelling, for example to
indicate stress positions in pronunciation. This may require combinations
of letter and mark that are not used in normal language. The entire
range of diacritical marks used with Latin (and Greek) scripts
is available in collection 7, COMBINING DIACRITICAL MARKS.
A similar situation holds for the Greek script, both for its monotonic
and polytonic forms. However, many other scripts such as Arabic
and the Indic scripts (Devanagari, Gujarati, etc.) are written
in such a manner that combining characters are an essential part
of coding. The Indic scripts, for example, represent vowels by
combining marks. The Cyrillic script falls between these two situations;
whether combining characters are or are not essential depends
on the language being represented.
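As a rough illustration of the scripts mentioned above, the following Python sketch classifies a few characters. It assumes that the Unicode general categories beginning with "M" (marks) are a reasonable stand-in for the combining characters of ISO/IEC 10646; the two definitions are not guaranteed to coincide, and the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (name, general category, is_mark) for one character."""
    category = unicodedata.category(ch)
    return unicodedata.name(ch), category, category.startswith("M")


samples = {
    "devanagari_letter": "\u0915",  # DEVANAGARI LETTER KA (base)
    "devanagari_vowel": "\u093f",   # DEVANAGARI VOWEL SIGN I
    "arabic_vowel": "\u064e",       # ARABIC FATHA
    "cyrillic_mark": "\u0483",      # COMBINING CYRILLIC TITLO
    "cyrillic_letter": "\u0436",    # CYRILLIC SMALL LETTER ZHE (base)
}

report = {label: describe(ch) for label, ch in samples.items()}
for label, (name, category, is_mark) in report.items():
    print(f"{label:18} {category} {'mark' if is_mark else 'base'}  {name}")
```

The Devanagari vowel sign and the Arabic vowel mark are needed for ordinary text in those scripts, whereas the Cyrillic titlo is needed only for some uses of that script.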
The use of combining characters can add significantly to the difficulties
of processing encoded text. For this reason, ISO/IEC 10646-1 defines
three distinct levels of implementation in which either none (level
1), or some (level 2), or all (level 3) of the combining characters
of the UCS are permitted to be encoded. These levels are described
in more detail elsewhere in this guide. At any specified level
of implementation, the defined character collections of the UCS
have to be interpreted as if any combining characters whose use
is not permitted at that level have been removed.
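The three levels can be pictured as a filter over coded text. The sketch below is an assumption-laden illustration, not the conformance rule of the standard: it treats Unicode general categories beginning with "M" as "combining characters", and the set of characters permitted at level 2 is supplied by the caller because the normative list is not reproduced in this guide. The names `is_combining` and `permitted_at_level` are hypothetical.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: Unicode mark categories (Mn, Mc, Me) approximate the
    # combining characters of ISO/IEC 10646.
    return unicodedata.category(ch).startswith("M")


def permitted_at_level(text: str, level: int,
                       level2_allowed: frozenset = frozenset()) -> bool:
    """Return True if text uses only what the given level permits.

    level 1: no combining characters at all
    level 2: only the combining characters in level2_allowed
    level 3: any combining character
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    for ch in text:
        if not is_combining(ch):
            continue
        if level == 1:
            return False
        if level == 2 and ch not in level2_allowed:
            return False
    return True


# Hypothetical level 2 subset containing a single Devanagari vowel sign.
indic_subset = frozenset({"\u093f"})

print(permitted_at_level("caf\u00e9", 1))                     # precomposed
print(permitted_at_level("cafe\u0301", 1))                    # composite
print(permitted_at_level("\u0915\u093f", 2, indic_subset))    # Devanagari
```

Precomposed text passes at level 1, while the same word coded with a composite sequence needs a level at which the combining acute accent is permitted.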
These differences between scripts lead to three distinct situations
regarding the character collections of the UCS:
- Collections not containing combining characters at any level
of implementation
- These are the collections not marked with either * or
in the tables given above. They include all the collections of
graphic characters for the Latin and Greek scripts and also the
Basic Hebrew collection. Text requiring only these collections
can be encoded at implementation level 1.
- Collections containing combining characters when used at level
3 but not at levels 1 or 2
- These are the collections marked with in the tables
given above. At present there are four collections of this nature,
but three of them (those containing the word COMBINING in their
name) are empty at levels 1 and 2. The only collection of this
nature available for use at levels 1 and 2 is the CYRILLIC collection.
Much text written in the Cyrillic script does not make use of
the combining characters and so can be encoded at implementation
level 1. When they are required, it is necessary to use level
3.
- Collections containing combining characters when used at either
of levels 2 and 3
- These are the collections marked with * in the tables given
above. They include the HEBREW EXTENDED collection and all the
collections for the Indic scripts. Much text requiring these collections
can be encoded at level 2 but use of level 1 is not normally practicable.
It is the general intention of the UCS that for most purposes
it will not be necessary to use a level 3 implementation but that
the choice between levels 1 and 2 will depend on the character
collections to be used.
The definitions clause of ISO/IEC 10646-1 includes the following
terms:
- composite sequence
- A sequence of graphic characters consisting of a non-combining
character followed by one or more combining characters.
- graphic symbol
- The visual representation of a graphic character or of a composite
sequence.
- repertoire
- A specified set of characters that are represented in a coded
character set.
It also contains the following notes:
- A graphic symbol for a composite sequence generally consists
of the combination of the graphic symbols of each character in
the sequence.
- A composite sequence is not a character and therefore is not
a member of the repertoire of ISO/IEC 10646.
The term glyph is not used in ISO/IEC 10646 but its use enables
us to divide the formation of a graphic symbol into two stages:
- The non-combining character or composite sequence is mapped
to a glyph, which is an abstract graphic symbol;
- An image of the glyph is then formed according to some presentation
process, to form a real graphic symbol.
This division separates all aspects such as the font to be used,
its size and emphasis, etc. from the selection of the appropriate
glyph. The process of creating the image of the appropriate glyph
will in general be very complex, but it is entirely separate from
the process of selecting the appropriate glyph to represent the
character data. The mapping from the character data (single character
or composite sequence) to the glyph is from one abstract entity
to another, devoid of all complications arising from the actual
presentation process.
The definitions quoted above give rise to the following features
of the mapping from character data to glyphs:
- A single non-combining character (which is in the repertoire
of the UCS) and a composite sequence (which is not in the repertoire)
may map to the same glyph, e.g. the character LATIN SMALL LETTER
E WITH ACUTE and the composite sequence LATIN SMALL LETTER E followed
by COMBINING ACUTE ACCENT.
- A composite sequence may map to a glyph which represents a
character that is not present in the UCS, e.g. LATIN SMALL LETTER
G followed by COMBINING GRAVE ACCENT; there is no character LATIN
SMALL LETTER G WITH GRAVE in the UCS, although LATIN SMALL LETTER
G WITH ACUTE is present. Characters that can be represented only
by a composite sequence, and are not themselves coded in the UCS,
are not considered to be part of the repertoire of the UCS.
- A given piece of text may have many equally valid representations
by a string of characters, depending on whether precomposed or
decomposed characters are used.
- As there are no restrictions on the use of combining characters
(in the levels of implementation at which they are permitted at
all), many composite sequences will not map to any meaningful
glyph. This applies in particular to composite sequences in which
non-combining characters from one script are followed by combining
characters from another script. Such sequences are not forbidden,
but they are unlikely to be meaningful.
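The first three features in the list above can be checked with a short Python sketch. It uses Unicode normalization form NFC from the standard `unicodedata` module as a way of asking whether a composite sequence has a precomposed equivalent; normalization is a Unicode mechanism and is assumed here, not taken from the definitions quoted in this section.

```python
import unicodedata

# One glyph, two codings: precomposed character and composite sequence.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
same_text = unicodedata.normalize("NFC", sequence) == precomposed

# g + COMBINING ACUTE ACCENT has a precomposed form (U+01F5) ...
g_acute = unicodedata.normalize("NFC", "g\u0301")

# ... but g + COMBINING GRAVE ACCENT does not, so it stays a sequence.
g_grave = unicodedata.normalize("NFC", "g\u0300")

print(same_text, len(g_acute), len(g_grave))
```

Software that compares strings therefore has to decide whether the precomposed and decomposed codings of the same text count as equal; a plain code-by-code comparison treats them as different.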
These features make the UCS very different from ISO/IEC 6937,
despite the similarity at first sight between the combining characters
of the UCS and the non-spacing diacritical marks of ISO/IEC 6937.
This latter standard has the following features which may be compared
with the list given above for the UCS:
- The letters which can be formed from a base letter preceded
by a non-spacing diacritical mark are part of the repertoire of
ISO/IEC 6937.
- A given piece of text has a unique coding in terms of ISO/IEC
6937 since accented characters are available only in decomposed
form (i.e. coded with the aid of non-spacing diacritical marks).
- There is a normative list of the characters comprising the
repertoire of the standard, so constructions that have no meaningful
glyph, such as NON-SPACING GRAVE ACCENT followed by POUND SIGN,
are not conformant to ISO/IEC 6937. Unfortunately this has resulted
in certain combinations of grave and acute accents, and of diaeresis,
with the letters W and Y, used in the Welsh language, being absent
from the repertoire of that standard even though they have natural
representations in terms of the available non-spacing diacritical
marks.
The UCS permits more than one combining character to be applied
to a single non-combining character. For this case the standard
lays down rules for their interpretation. In outline, these are as
follows:
- When two combining characters can interact, then by default
they are to be positioned in the sequence in which they are received,
working from the base character outward. For example, if a combining
macron is followed by a combining diaeresis then the diaeresis
would appear above the macron, but if the order were reversed
then the macron would be above the diaeresis.
- Some specific combining characters override the default behaviour
by being positioned horizontally, rather than vertically, in relation
to one another or by forming a ligature with an adjacent combining
character. In this case they are positioned in the sequence in
which they are received, working in the same direction (left-to-right
or right-to-left) as the script is written.
- When two combining characters do not interact, such as when
one is positioned above the base character and the other below
it, then the order in which they are coded does not affect the
resulting glyph.
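The distinction between interacting and non-interacting combining characters has a counterpart in Unicode canonical ordering, which the following Python sketch uses as an illustration. This is an assumption brought in from the Unicode Standard and is not part of the rules outlined above: marks with the same canonical combining class are treated as interacting, so their order is preserved, while marks with different classes are put into a fixed order by normalization.

```python
import unicodedata

MACRON = "\u0304"      # COMBINING MACRON, positioned above
DIAERESIS = "\u0308"   # COMBINING DIAERESIS, positioned above
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, positioned below


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


# Two marks above the base interact: the coded order is significant,
# so the two sequences stay distinct after normalization.
above_order_matters = nfd("a" + MACRON + DIAERESIS) != nfd("a" + DIAERESIS + MACRON)

# A mark above and a mark below do not interact: both codings
# normalize to the same sequence.
above_below_equal = nfd("a" + MACRON + DOT_BELOW) == nfd("a" + DOT_BELOW + MACRON)

classes = {
    "macron": unicodedata.combining(MACRON),
    "diaeresis": unicodedata.combining(DIAERESIS),
    "dot_below": unicodedata.combining(DOT_BELOW),
}
print(above_order_matters, above_below_equal, classes)
```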