CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Repertoires and subsets

The concept of repertoire

There is really only one concept of a repertoire: a repertoire is a specified set of characters. However, the concept is defined slightly differently in different character set standards, and it is interpreted in ways that may differ from one's expectations. Two particular definitions are:

repertoire: A specified set of characters that are represented by one or more bit combinations of a coded character set. [ISO/IEC 6937]
repertoire: A specified set of characters that are represented in a coded character set. [ISO/IEC 10646-1]

For completeness, a coded character set also has a formal definition:

coded character set: A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their bit combinations. [ISO/IEC 6937]
coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. [ISO/IEC 10646-1]

It is instructive to see how these two standards differ in their use of the concept of repertoire. Recall that ISO/IEC 6937 is a standard that bases a variable-length encoding of characters from the Latin script on forming combinations of non-spacing diacritical marks with unaccented letters. It is based on two separate 7-bit coded character sets that are separately registered under ISO 2375. The primary set of ISO/IEC 6937 is the left-hand set, coded in an 8-bit code as 20 to 7E. This is precisely the ASCII set registered as ISO-IR 6. The supplementary set of ISO/IEC 6937 is the right-hand set, coded as A0 to FF, which contains both the non-spacing diacritical marks and other (spacing) characters.

The repertoire of ISO/IEC 6937 is specified separately, as a list of characters together with their (variable length) coded representations. It consists of 333 characters, including SPACE. Its characters include the accented characters that are coded by two octets, the first representing a non-spacing diacritical mark from the supplementary set and the second representing an unaccented letter from the primary set. The repertoire of ISO/IEC 6937 does not include the non-spacing diacritical marks as characters in their own right.

This is entirely consistent with the definition of a repertoire. The repertoire of ISO/IEC 6937 is established by that standard as a specific list of characters, each of which is represented by one or more bit combinations. It is quite separate from the union of the repertoires of the primary and supplementary sets of ISO/IEC 6937, which consists of the 191 characters, including SPACE, each coded by one octet. That repertoire does include, say, NON-SPACING ACUTE ACCENT, but it does not include LATIN SMALL LETTER E WITH ACUTE, while the repertoire of ISO/IEC 6937 includes the latter character but not the former one.
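
The difference between the two repertoires can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the octet C2 is assumed to be the non-spacing acute accent of the supplementary set (this guide does not give the code table), only that one mark is mapped, and Unicode normalization stands in for the character list of ISO/IEC 6937. The function name is hypothetical.

```python
import unicodedata

# Assumption: octet 0xC2 of the supplementary set is the non-spacing acute accent.
# Only this one mark is mapped; the standard itself defines several more.
NON_SPACING = {0xC2: "\u0301"}  # shown here as COMBINING ACUTE ACCENT of the UCS

# Sizes of the two single-octet sets as described in the text.
PRIMARY_SIZE = 0x7E - 0x20 + 1        # positions 20 to 7E
SUPPLEMENTARY_SIZE = 0xFF - 0xA0 + 1  # positions A0 to FF


def decode_6937_sketch(data: bytes) -> str:
    """Decode octets in which a diacritical mark precedes its base letter."""
    out = []
    i = 0
    while i < len(data):
        octet = data[i]
        if octet in NON_SPACING:
            if i + 1 >= len(data):
                raise ValueError("diacritical mark without a following letter")
            base = chr(data[i + 1])
            # Two octets together represent ONE character of the repertoire.
            out.append(unicodedata.normalize("NFC", base + NON_SPACING[octet]))
            i += 2
        elif 0x20 <= octet <= 0x7E:
            out.append(chr(octet))  # primary set: one octet, one character
            i += 1
        else:
            raise ValueError(f"octet {octet:#04x} is not handled by this sketch")
    return "".join(out)


text = decode_6937_sketch(b"caf\xc2e")  # five octets, four characters
```

In this sketch the five octets decode to four characters, the last being LATIN SMALL LETTER E WITH ACUTE. The accent octet never appears as a character of the result, which mirrors the way the repertoire of ISO/IEC 6937 excludes the non-spacing marks themselves.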

The concept of repertoire as used in ISO/IEC 10646 corresponds in the context of ISO/IEC 6937 to that of the union of its primary and supplementary sets, not to that of ISO/IEC 6937 itself. The repertoire of ISO/IEC 10646 consists of the characters that are assigned to code positions within the 31-bit coding space of the UCS. It therefore includes combining characters (which are the nearest equivalent in ISO/IEC 10646 to the non-spacing diacritical marks of ISO/IEC 6937) but does not include either composite sequences or characters, such as LATIN SMALL LETTER G WITH GRAVE, which have glyphs that can be represented by composite sequences.

There is a faint indication of this difference in the definitions given in these two standards. In ISO/IEC 6937 the definition refers to characters represented by bit combinations; in ISO/IEC 10646-1 it refers to characters represented in the coded character set. There is no conflict, since it is the definition of a coded character set that is crucial. A coded character set is first required to establish a character set, before it assigns coding. That character set is then the repertoire of the coded character set. A repertoire, composed of characters, is therefore whatever the relevant standard says it is. It is, in principle, quite distinct from the set of glyphs that may be represented by the characters of the repertoire. For many purposes it is this set of glyphs that is relevant, not the set of characters used to represent them. But describing or specifying this set of glyphs is outside of the scope of standards for coded character sets.

Levels of implementation of the UCS

There are three levels of implementation specified in ISO/IEC 10646, distinguished from one another by limitations on the characters that may be encoded at the level concerned. They are as follows:

Implementation level 1: At level 1, the prohibited characters are those from the HANGUL JAMO block and all combining characters.
Implementation level 2: At level 2, the prohibited characters are those from the HANGUL JAMO block and a specific subset of combining characters listed in annex B of the standard.
Implementation level 3: At level 3, there are no restrictions on the characters that may be used.
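
The three levels can be expressed as a simple check on a string. The following Python sketch is an approximation, not a statement of the standard: it assumes that the HANGUL JAMO block occupies 1100 to 11FF, it treats the Unicode general categories Mn, Mc and Me as "combining characters", and it takes the level 2 list from annex B as a parameter because that list is not reproduced in this guide. All names are hypothetical.

```python
import unicodedata

# Assumption: the HANGUL JAMO block covers code positions 1100 to 11FF.
HANGUL_JAMO = range(0x1100, 0x1200)


def is_combining(ch: str) -> bool:
    """Approximate 'combining character' by the Unicode mark categories."""
    return unicodedata.category(ch).startswith("M")


def permitted_at_level(text: str, level: int,
                       level2_prohibited: frozenset = frozenset()) -> bool:
    """Return True if every character of text may be used at the given level.

    level2_prohibited stands in for the subset of combining characters
    listed in annex B; the caller must supply it.
    """
    if level not in (1, 2, 3):
        raise ValueError("implementation level must be 1, 2 or 3")
    if level == 3:
        return True  # no restrictions at level 3
    for ch in text:
        if ord(ch) in HANGUL_JAMO:
            return False  # prohibited at levels 1 and 2
        if level == 1 and is_combining(ch):
            return False  # level 1 prohibits all combining characters
        if level == 2 and ch in level2_prohibited:
            return False  # level 2 prohibits only the listed subset
    return True
```

For example, a precomposed LATIN SMALL LETTER E WITH ACUTE passes at level 1, whereas the letter e followed by a combining acute accent passes only at a level that permits that combining character.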

Hangul Jamo characters are used in the Hangul syllable composition method. A sequence of two or three Hangul Jamo characters has a glyph that represents a syllable. Hangul syllables also have precomposed coding in the HANGUL EXTENDED block of the I-zone of the BMP. The relationship between coding in terms of Hangul Jamo and that as a single syllabic character is similar to that between the precomposed and decomposed forms of Latin characters with diacritical marks. However, there is no distinction for the Hangul Jamo characters corresponding to that between the non-combining and combining characters of a composite sequence. No Hangul Jamo characters have a meaning in isolation within the Hangul script. For this reason it is specifically stated that the characters of the HANGUL JAMO block are not combining characters. Note that the Hangul syllabic characters of the HANGUL EXTENDED block are permitted at all levels of implementation.
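
The relationship between a Jamo sequence and the corresponding syllabic character can be computed arithmetically. The sketch below uses the code positions of current Unicode (Jamo from 1100, precomposed syllables from AC00), which are an assumption here and may differ from the block allocation of the edition of the standard described in this guide. The function name is hypothetical.

```python
import unicodedata

# Assumed code positions, taken from current Unicode.
L_BASE, V_BASE, T_BASE, S_BASE = 0x1100, 0x1161, 0x11A7, 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_syllable(lead: str, vowel: str, trail: str = "") -> str:
    """Map two or three Hangul Jamo to the single precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    if not 0 <= l_index < L_COUNT:
        raise ValueError("not a leading consonant")
    if not 0 <= v_index < V_COUNT:
        raise ValueError("not a vowel")
    if trail and not 1 <= t_index < T_COUNT:
        raise ValueError("not a trailing consonant")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = "\u1112\u1161\u11ab"              # three Jamo characters
syllable = compose_syllable(*jamo)       # one syllabic character
same = syllable == unicodedata.normalize("NFC", jamo)
```

The comparison with normalization form NFC is included only as a cross-check that the arithmetic agrees with the composition implemented in the Python standard library.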

The chapter on visual representation of characters gives more information about the scripts that can be represented at the different levels of implementation.

Collections and subsets

A collection of characters consists of the characters of the UCS that are allocated to code positions lying within one of the ranges specified for this purpose in annex A of ISO/IEC 10646-1. Each collection is assigned both a number and a name. There is a collection associated with, and frequently identical to, each block into which the BMP is divided. These collections, together with their names and numbers, are listed in the chapter of this guide on the Basic Multilingual Plane (BMP). It should be noted that, as a collection is defined by a range, it may include code positions which have not been assigned characters. An amendment to the standard may allocate characters to such code points. Thus the repertoire defined in a collection may change over time. This is not always desirable, so the notion of a fixed collection was introduced in Corrigendum No.2. As a consequence the definition of a fixed collection has to be much more precise in that no range can contain unassigned code points.
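
The distinction between a collection defined by range and a fixed collection can be sketched as follows. The example is illustrative only: it uses the character database shipped with Python to decide which code positions are assigned, so the result depends on the Unicode version of the interpreter, and the function names are hypothetical.

```python
import unicodedata


def collection_positions(ranges):
    """All code positions covered by a collection given as (first, last) ranges."""
    return [cp for first, last in ranges for cp in range(first, last + 1)]


def assigned_positions(ranges):
    """Only those positions that currently have a character allocated.

    General category 'Cn' marks an unassigned position in Python's database.
    """
    return [cp for cp in collection_positions(ranges)
            if unicodedata.category(chr(cp)) != "Cn"]


def freeze(ranges):
    """Snapshot of the assigned characters, in the spirit of a fixed collection."""
    return frozenset(assigned_positions(ranges))


basic_latin = [(0x0020, 0x007E)]   # fully assigned range
greek_range = [(0x0370, 0x03CF)]   # illustrative range with unassigned positions

open_size = len(collection_positions(greek_range))
fixed_size = len(freeze(greek_range))
```

A later amendment may assign a character to one of the gaps, which changes the result of assigned_positions but not a snapshot taken earlier with freeze. This is the property that motivates fixed collections.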

Two different collections of characters may overlap, but of those associated with specific blocks the only overlap is that two of the four characters comprising the collection ZERO-WIDTH BOUNDARY INDICATORS are also present in collections for a number of specific scripts. A number of other specialized collections are defined in annex A which put together selections of characters that are also present in other collections. These consist of script-specific formatting characters and alternate forms. There are also two collections related to the permitted levels of implementation. One consists of all combining characters and the other of those combining characters that are not permitted in an implementation at level 2. Finally there are five large collections (two of which are fixed collections) defined as follows:

Large collections of the UCS
Number	Name	Range or definition
299	BMP FIRST EDITION	Fixed collection containing only the characters of the first edition prior to any amendments. See ISO/IEC 10646-1 A.3
300	BMP	0000-D7FF, E000-FFFD
301	BMP-AMD.7	Fixed collection containing the characters of the first edition as amended by amendments 1 to 7. See ISO/IEC 10646-1 A.3
400	PRIVATE USE PLANES	G=00, P=0F, 10, E0-FF
500	PRIVATE USE GROUPS	G=60-7F

The specifications of collections 300 and 400 were changed by Amendment 1 consequent on the introduction of the S-zone and its reservation for the use of UCS Transformation Format 16.

A subset is a more general term that refers to any identified set of characters from the entire repertoire of the UCS. Two alternative means of specifying subsets are recognized within ISO/IEC 10646-1:

Limited subset: A limited subset is specified by giving explicitly a list of the graphic characters in the subset. They may be listed by their names or their code positions in the UCS.
Selected subset: A selected subset is specified as a list of collections. A selected subset shall always include the BASIC LATIN collection.
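
The two ways of specifying a subset correspond to two simple data structures. The Python sketch below is a minimal illustration: the class names are hypothetical, and the collection number 1 for BASIC LATIN is an assumption taken from the numbering of annex A rather than from the text above.

```python
from dataclasses import dataclass

BASIC_LATIN = 1  # assumption: collection number of BASIC LATIN in annex A


@dataclass(frozen=True)
class LimitedSubset:
    """A subset given as an explicit list of characters (by code position)."""
    code_positions: frozenset

    def __contains__(self, ch: str) -> bool:
        return ord(ch) in self.code_positions


@dataclass(frozen=True)
class SelectedSubset:
    """A subset given as a list of collection numbers."""
    collections: frozenset

    def __post_init__(self):
        # The one structural rule stated for selected subsets.
        if BASIC_LATIN not in self.collections:
            raise ValueError("a selected subset shall include BASIC LATIN")


digits = LimitedSubset(frozenset(range(0x30, 0x3A)))  # the ten decimal digits
latin_only = SelectedSubset(frozenset({BASIC_LATIN}))
```

The limited subset is free to contain any characters at all, at the cost of listing each one. The selected subset is described by a handful of numbers but must always contain the BASIC LATIN collection.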

A selected subset is more restricted than a limited subset in its permitted content, but it has two great advantages. It is much more concise to list collections rather than individual characters. Also, annex M of ISO/IEC 10646-1 specifies by algorithm an ASN.1 object identifier that may be used to identify a selected subset of the UCS within any context in which OSI protocols are used.

A limited subset may be assigned an ASN.1 object identifier, but only by means outside the scope of ISO/IEC 10646-1. The following European pre-standard:

ENV 1973:1995 Information technology - European Subsets of ISO/IEC 10646-1

contains the definition of a limited subset (the Minimum European Subset) and assigns an ASN.1 object identifier to it. It also describes a selected subset (the Extended European Subset) that has an ASN.1 object identifier assigned in accordance with the algorithm of ISO/IEC 10646-1.

Significance of subsets for conformance to the UCS

Because of the size and open-ended nature of the repertoire of the UCS, conformance to ISO/IEC 10646-1 does not require the ability to handle all of the characters in the repertoire. Instead, a claim of conformance for information interchange is required to identify:

a specific method of coding (see the chapter on coding methods of the UCS);
a specific subset of characters (see above);
a specific level of implementation (see above).

A separate definition of conformance is given for conformance of a device. For this purpose a device is a component of information processing equipment which can transmit and/or receive coded information, such as an input/output device, an application program or a gateway function. A claim of conformance for a device is required to specify the above three items and, in addition:

a specific selection of control functions that are used in conjunction with the UCS (see the chapter on control functions).
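
The items that make up a claim of conformance can be pictured as a small record. The sketch below is only a convenient way of holding the items listed above; the class and field names are hypothetical, and the example values are illustrative rather than taken from any actual claim.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ConformanceClaim:
    """The items a claim of conformance is required to identify."""
    coding_method: str   # a specific method of coding
    subset: str          # a specific subset of characters
    level: int           # a specific level of implementation
    # Required only for a device: the control functions used with the UCS.
    control_functions: Optional[Tuple[str, ...]] = None

    def __post_init__(self):
        if self.level not in (1, 2, 3):
            raise ValueError("implementation level must be 1, 2 or 3")

    def is_device_claim(self) -> bool:
        return self.control_functions is not None


# Illustrative values only.
interchange = ConformanceClaim("UCS-2", "Minimum European Subset", 1)
device = ConformanceClaim("UCS-2", "Minimum European Subset", 1,
                          control_functions=("CR", "LF"))
```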

The precise meaning of conformance to ISO/IEC 10646 is specified in the standard itself and will not be reproduced here. The important aspect here is that conformance only requires support of the UCS within the limits determined by these specified items.

Subsets as an aid to migration from 8-bit codes

The ability to conform to ISO/IEC 10646 while supporting only a subset of its characters is a great aid to migration from other coded character sets. In particular it permits support to be developed collection by collection. Only in a few cases is there a direct correspondence between the collections defined in ISO/IEC 10646-1 and the repertoires of other standardized coded character sets. However, expanding support one collection at a time substantially eases the effort required, for example in the development of glyphs for additional characters.

The assignment by ISO/IEC 10646 of an ASN.1 object identifier for any selected subset provides a means within OSI protocols for an application to notify its peer, in any communication, of the collections that it supports. The Extended European Subset (EES) specified in ENV 1973 consists of the collections numbered 1-11, 27-28, 30-48, 63, 65 and 70. These contain 4013 code positions, of which 3095 are currently assigned to characters. These are all the collections that contain characters of the Latin, Greek, Cyrillic, Armenian and Georgian scripts together with other characters of the International Phonetic Alphabet and a wide range of symbols used for academic, commercial and scientific purposes within Europe. This subset is defined as guidance for product developers, but it in no way restricts the ability of any developer to extend support to either a smaller or a larger range of collections than that of the EES.
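
The list of collection numbers quoted for the EES can be expanded mechanically, which is one reason why a selected subset is so concise to specify. The following Python sketch parses the list exactly as it is written above; the function name is hypothetical, and the check for collection 1 assumes that this is the number of the BASIC LATIN collection.

```python
def expand_collection_list(spec: str) -> list:
    """Expand a list such as '1-11, 27-28, 63 and 70' into collection numbers."""
    numbers = []
    for part in spec.replace(" and ", ", ").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            numbers.extend(range(first, last + 1))  # ranges are inclusive
        else:
            numbers.append(int(part))
    return numbers


EES = expand_collection_list("1-11, 27-28, 30-48, 63, 65 and 70")
collection_count = len(EES)

# Assumption: collection 1 is BASIC LATIN, which a selected subset must include.
includes_basic_latin = 1 in EES
```

Expanded in this way, the specification of the EES amounts to 35 collection numbers, compared with several thousand individual code positions.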
