CEN Guide to the Use of Character Sets in Europe, TC 304

UCS - Basic Multilingual Plane (BMP)


Relationship to 8-bit codes

For reasons of compatibility, the row of the BMP with R = 00 has been given the structure of an 8-bit code according to ISO/IEC 2022. This requires that

This enables the coded representation of a control function to be obtained by a simple algorithm from its coded representation in an 8-bit code in accordance with ISO/IEC 2022. The algorithm is described elsewhere in this guide.

The graphic characters in the remaining 190 code positions of row 00 are allocated in accordance with the 8-bit code specified in

That code, and therefore row 00 of the BMP, contains graphic characters used for general purpose applications in typical office environments in at least the following languages:

This incorporation of ISO/IEC 8859-1 in particular makes the cells 21-7E of row 00 have the same allocations as the graphic characters of ASCII, which in its internationally standardized form is also known as the International Reference Version (IRV) of:
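As a rough illustration of this compatibility, the sketch below checks that each graphic character of ISO/IEC 8859-1 lands in the row-00 cell with the same number. It relies on Python's standard `latin-1` and `ascii` codecs; the helper name `row00_cell` is hypothetical and not part of any standard.

```python
def row00_cell(octet: int) -> int:
    """Return the BMP code position of an ISO/IEC 8859-1 graphic character."""
    if not (0x20 <= octet <= 0x7E or 0xA0 <= octet <= 0xFF):
        raise ValueError("not a graphic character position of ISO/IEC 8859-1")
    # The 'latin-1' codec decodes each octet to the code position of equal value.
    return ord(bytes([octet]).decode("latin-1"))


# Cells 21-7E carry the ASCII graphic characters, so the ASCII codec agrees there.
ascii_ok = all(
    row00_cell(b) == ord(bytes([b]).decode("ascii")) for b in range(0x21, 0x7F)
)
# Every graphic position of the 8-bit code maps to the same cell number in row 00.
latin1_ok = all(
    row00_cell(b) == b for b in list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))
)
```

Both flags evaluate to true, which is what makes conversion between the 8-bit code and row 00 a matter of padding with a zero octet rather than a table lookup.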

The 5 zones of the BMP

To aid its interpretation and development, the Basic Multilingual Plane is divided into five zones corresponding to the following code positions:

The R-zone terminates at FFFD as positions FFFE and FFFF are reserved; see the section of this guide on the 4-octet code structure of the UCS.
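The zone structure can be sketched as a simple range lookup. The boundaries used below are assumptions inferred from the block tables later in this guide (A-zone rows 00-4D, I-zone 4E00-9FFF, O-zone from A000, R-zone E000-FFFD) and from the usual position of the S-zone at D800-DFFF; the function name `bmp_zone` is hypothetical.

```python
# Inclusive zone boundaries; assumed from the block tables of this guide.
ZONES = [
    ("A", 0x0000, 0x4DFF),
    ("I", 0x4E00, 0x9FFF),
    ("O", 0xA000, 0xD7FF),
    ("S", 0xD800, 0xDFFF),
    ("R", 0xE000, 0xFFFD),
]


def bmp_zone(cell: int) -> str:
    """Return the zone letter for a BMP code position, or 'reserved'."""
    if not 0x0000 <= cell <= 0xFFFF:
        raise ValueError("not a code position of the BMP")
    for name, first, last in ZONES:
        if first <= cell <= last:
            return name
    # Only FFFE and FFFF fall through: they are reserved in every plane.
    return "reserved"
```

For example, `bmp_zone(0x0041)` gives `"A"` and `bmp_zone(0xFFFE)` gives `"reserved"`.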

Each zone has a distinctive use:

The transformation format UTF-16 was introduced by Amendment 1 to the first edition of ISO/IEC 10646-1, which also created the S-zone by splitting the O-zone. Prior to that amendment the O-zone extended to code position DFFF. UTF-16 extends the two-octet coding of the BMP into a variable-length coding. In that coding the characters of all zones of the BMP (P=00) other than the S-zone are encoded in two octets, while characters of any of the sixteen planes P=01 to P=10 (10 here being a hexadecimal value) are encoded in four octets.
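A minimal sketch of the UTF-16 transformation follows. It assumes the standard split of the S-zone into a high half (D800-DBFF) and a low half (DC00-DFFF), which this section does not spell out; the function name `utf16_code_units` is hypothetical.

```python
from typing import List


def utf16_code_units(code_position: int) -> List[int]:
    """Return the 16-bit UTF-16 code units for a UCS code position."""
    if 0x0000 <= code_position <= 0xFFFF:
        if 0xD800 <= code_position <= 0xDFFF:
            raise ValueError("S-zone positions do not represent characters")
        # BMP characters outside the S-zone keep their two-octet form.
        return [code_position]
    if 0x10000 <= code_position <= 0x10FFFF:
        # Planes 01 to 10 (hex): split the 20-bit offset across two S-zone units.
        offset = code_position - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        return [high, low]
    raise ValueError("code position is not representable in UTF-16")
```

The first position of plane 01, `0x10000`, becomes the pair D800 DC00, and the last position of plane 10, `0x10FFFF`, becomes DBFF DFFF, so the four-octet forms never collide with two-octet BMP characters.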

Alphabetic and syllabic scripts of the A-zone

The A-zone is structured into named blocks, each consisting of a consecutive range of cells. Each block is allocated to a related set of characters, although a block may contain individual cells that are currently unallocated. The characters in the UCS from a particular script may be grouped together in a single block (such as BENGALI) or they may be divided among several blocks (such as BASIC ARABIC and ARABIC EXTENDED). The characters of the Latin script occupy the first four named blocks (BASIC LATIN, LATIN-1-SUPPLEMENT, LATIN EXTENDED-A and LATIN EXTENDED-B); in addition there is one further block of Latin characters, LATIN EXTENDED ADDITIONAL, which occurs further into the code table.

Separate from the block structure, but closely related to it, is the concept of a collection of characters. A collection is the subset of characters allocated to a specified range of cells. The difference between a block and a collection is that the cells of a collection need not be consecutive and two collections may overlap. Collections are assigned both a name and a number. Blocks divide the code space into separate areas that are allocated for a coherent purpose. Collections put blocks and/or individual characters together to form subsets of practical significance. A user may then put several collections together to form a subset meeting a particular need, such as communication in English and Hebrew.
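The relationship between blocks, collections and user subsets can be sketched with plain ranges. The cell ranges and collection numbers below are taken from the table of rows 00-08 in this guide; the data layout and the name `in_subset` are illustrative assumptions, not anything defined by ISO/IEC 10646.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (first cell, last cell)

# A block is one consecutive range of cells.
BLOCKS: Dict[str, Range] = {
    "BASIC LATIN": (0x0020, 0x007E),
    "HEBREW EXTENDED-A": (0x0590, 0x05CF),
    "BASIC HEBREW": (0x05D0, 0x05EA),
    "HEBREW EXTENDED-B": (0x05EB, 0x05FF),
}

# A collection is numbered and may be built from non-consecutive ranges.
COLLECTIONS: Dict[int, List[Range]] = {
    1: [BLOCKS["BASIC LATIN"]],
    12: [BLOCKS["BASIC HEBREW"]],
    13: [BLOCKS["HEBREW EXTENDED-A"], BLOCKS["HEBREW EXTENDED-B"]],
}


def in_subset(cell: int, collection_numbers: List[int]) -> bool:
    """True if the cell belongs to any of the listed collections."""
    return any(
        first <= cell <= last
        for number in collection_numbers
        for first, last in COLLECTIONS[number]
    )


# A user subset for communication in English and Hebrew.
ENGLISH_AND_HEBREW = [1, 12]
```

With this layout, `in_subset(0x05D0, ENGLISH_AND_HEBREW)` is true for HEBREW LETTER ALEF, while a Cyrillic cell such as `0x0416` is rejected.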

The following table shows the blocks and collections of the first nine rows of the A-zone, comprising cells 0000-08FF. It gives both the name and the range of cells that comprise the block. With the exception of the collection HEBREW EXTENDED, which is formed from two blocks, there is a one-to-one correspondence between blocks and collections for the characters in these nine rows. The table also gives the number assigned to the collection in the first column; the collection name is the same as that of the block.

Blocks and Collections of rows 00-08 of the UCS

(collection = block, except for collection 13; *,† = contains combining characters; see the section below on combining characters for the significance of these markings)

No.   Name                                    Cells
1     BASIC LATIN                             0020-007E
2     LATIN-1-SUPPLEMENT                      00A0-00FF
3     LATIN EXTENDED-A                        0100-017F
4     LATIN EXTENDED-B                        0180-024F
5     IPA EXTENSIONS                          0250-02AF
6     SPACING MODIFIER LETTERS                02B0-02FF
7†    COMBINING DIACRITICAL MARKS             0300-036F
8     BASIC GREEK                             0370-03CF
9     GREEK SYMBOLS AND COPTIC                03D0-03FF
10†   CYRILLIC                                0400-04FF
      (Reserved for future standardization)   0500-052F
11    ARMENIAN                                0530-058F
      HEBREW EXTENDED-A                       0590-05CF
          (31 further Hebrew characters have been allocated to previously reserved cells in this block by Amd. 7)
12    BASIC HEBREW                            05D0-05EA
      HEBREW EXTENDED-B                       05EB-05FF
13*   HEBREW EXTENDED
          (This collection comprises the two blocks HEBREW EXTENDED-A and HEBREW EXTENDED-B)
14*   BASIC ARABIC                            0600-065F
15*   ARABIC EXTENDED                         0660-06FF
85    SYRIAC                                  0700-074F
          (added by Amd.27, hence the out-of-sequence number)
      (Reserved for future standardization)   0750-077F
86*   THAANA                                  0780-07BF
          (added by Amd.24, hence the out-of-sequence number)
      (Reserved for future standardization)   07C0-08FF

Certain characters in the blocks LATIN-1-SUPPLEMENT and LATIN EXTENDED-B have had their names changed by Technical Corrigendum 1 (1996) since the publication of the first edition of the standard in 1993. In the first of these blocks the characters affected are:

In the other block the affected characters are these same characters with added diacritical marks MACRON or ACUTE. The same name changes will be made in the next editions of the parts of ISO/IEC 8859 in which these characters appear.

The next five rows, 09-0D, are allocated to scripts that require the two special characters ZERO WIDTH NON-JOINER (200C) and ZERO WIDTH JOINER (200D) in the coding of languages written in those scripts. As with rows 00-08, there is a collection corresponding to each block, but for these rows the collection consists of the characters allocated to that block together with these two special characters.

The following table shows the blocks and collections of rows 09-0D of the A-zone, comprising cells 0900-0DFF. It gives both the name and the range of cells that comprise the block. The table also gives the number assigned to the collection that consists of the characters allocated to the block together with the additional characters at positions 200C and 200D. The collection name is the same as that of the block on which it is based.

Blocks and Collections of Rows 09-0D of the UCS

(collection = block + 200C + 200D; * = contains combining characters)

No.   Name            Cells
16*   DEVANAGARI      0900-097F
17*   BENGALI         0980-09FF
18*   GURMUKHI        0A00-0A7F
19*   GUJARATI        0A80-0AFF
20*   ORIYA           0B00-0B7F
21*   TAMIL           0B80-0BFF
22*   TELUGU          0C00-0C7F
23*   KANNADA         0C80-0CFF
24*   MALAYALAM       0D00-0D7F
84*   SINHALA         0D80-0DFF
          (added by Amd.21, hence the out-of-sequence number)
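Because each of these collections is a block plus two cells from row 20, membership cannot be tested with a single range. A minimal sketch, using the DEVANAGARI range from the table above and a hypothetical helper name `in_indic_collection`:

```python
# The two special characters shared by the collections of rows 09-0D.
JOINERS = {0x200C, 0x200D}

DEVANAGARI_BLOCK = (0x0900, 0x097F)  # collection 16 is this block plus JOINERS


def in_indic_collection(cell: int, block_first: int, block_last: int) -> bool:
    """True if the cell is in the block or is one of the two added characters."""
    return block_first <= cell <= block_last or cell in JOINERS
```

A cell such as `0x200D` therefore belongs to every collection of rows 09-0D, although it lies in none of their blocks.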

The remainder of the first 32 rows, namely rows 0E-1F, are either reserved or allocated to further scripts that correspond to collections on a one-to-one basis without additional characters. These are shown in the following table:

Blocks and Collections of Rows 0E-1F

(collection = block; * = contains combining characters)

No.   Name                                    Cells
25*   THAI                                    0E00-0E7F
26*   LAO                                     0E80-0EFF
72*   BASIC TIBETAN                           0F00-0FBF
          (added by Amd.6, hence the out-of-sequence number)
      (Reserved for future standardization)   0FC0-109F
28    GEORGIAN EXTENDED                       10A0-10CF
          (note that the collection number is out of sequence)
27    BASIC GEORGIAN                          10D0-10FF
29    HANGUL JAMO                             1100-11FF
73    ETHIOPIC                                1200-137F
          (added by Amd.10, hence the out-of-sequence number)
      (Reserved for future standardization)   1380-139F
75    CHEROKEE                                13A0-13FF
          (added by Amd.12, hence the out-of-sequence number)
74    UNIFIED CANADIAN ABORIGINAL SYLLABICS   1400-167F
          (added by Amd.11, hence the out-of-sequence number)
82    OGHAM                                   1680-169F
          (added by Amd.20, hence the out-of-sequence number)
83    RUNIC                                   16A0-16FF
          (added by Amd.19, hence the out-of-sequence number)
87*   BURMESE                                 1700-177F
          (added by Amd.26, hence the out-of-sequence number)
88*   KHMER                                   1780-17FF
          (added by Amd.25, hence the out-of-sequence number)
      (Reserved for future standardization)   1800-1DFF
30    LATIN EXTENDED ADDITIONAL               1E00-1EFF
          (one additional Latin character has been allocated to a previously reserved cell in this block by Amd.7)
31    GREEK EXTENDED                          1F00-1FFF

The next nine rows of the A-zone contain symbols of various sorts and for various scripts, including technical and special purpose symbols. These take up rows 20-28 and they are followed by a further seven rows that are at present unallocated. This area of the A-zone is structured as follows:

Blocks and Collections of Rows 20-2F

(collection = block; † = contains combining characters)

No.   Name                                                Cells
32    GENERAL PUNCTUATION                                 2000-206F
33    SUPERSCRIPTS AND SUBSCRIPTS                         2070-209F
34    CURRENCY SYMBOLS                                    20A0-20CF
35†   COMBINING DIACRITICAL MARKS FOR SYMBOLS             20D0-20FF
36    LETTERLIKE SYMBOLS                                  2100-214F
37    NUMBER FORMS                                        2150-218F
38    ARROWS                                              2190-21FF
39    MATHEMATICAL OPERATORS                              2200-22FF
40    MISCELLANEOUS TECHNICAL                             2300-23FF
41    CONTROL PICTURES                                    2400-243F
42    OPTICAL CHARACTER RECOGNITION                       2440-245F
43    ENCLOSED ALPHANUMERICS                              2460-24FF
44    BOX DRAWING                                         2500-257F
45    BLOCK ELEMENTS                                      2580-259F
46    GEOMETRIC SHAPES                                    25A0-25FF
47    MISCELLANEOUS SYMBOLS                               2600-26FF
48    DINGBATS                                            2700-27BF
      (Reserved for future standardization)               27C0-27FF
80    BRAILLE PATTERNS                                    2800-28FF
          (added by Amd.16)
      (7 more rows reserved for future standardization)   2900-2FFF

The next 30 rows contain alphabetic scripts and symbols that are used by languages that also make use of ideographic scripts. The reference to CJK in the titles of some of the blocks of these rows is to unified Chinese/Japanese/Korean characters; see the section on ideographic scripts for more information. The blocks and collections of these rows are as follows:

Blocks and Collections of Rows 30-4D

(collection = block; * = contains combining characters)

No.   Name                                    Cells
49*   CJK SYMBOLS AND PUNCTUATION             3000-303F
50*   HIRAGANA                                3040-309F
51    KATAKANA                                30A0-30FF
52    BOPOMOFO                                3100-312F
53    HANGUL COMPATIBILITY JAMO               3130-318F
54    CJK MISCELLANEOUS                       3190-319F
55    ENCLOSED CJK LETTERS AND MONTHS         3200-32FF
56    CJK COMPATIBILITY                       3300-33FF
81    CJK UNIFIED IDEOGRAPHS EXTENSION A      3400-4DBF
          (Amd.17)
      (Reserved for future standardization)   4DC0-4DFF

The CJK COMPATIBILITY block includes many symbols for scientific units that have been coded in Chinese national standards as if they were ideographs. Examples, together with their coding, are

mm³ (cubic millimetres)
    SQUARE MM CUBED (coded at 33A3)
µs (microsecond)
    SQUARE MU S (coded at 33B2)
rad/s² (radians per second per second, a unit of angular acceleration)
    SQUARE RAD OVER S SQUARED (coded at 33AF)
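These codings can be cross-checked against a character database. The sketch below uses Python's standard `unicodedata` module and assumes that its character names agree with those of ISO/IEC 10646 for these cells; the helper name `check_names` is hypothetical.

```python
import unicodedata
from typing import Dict

# Cell and name pairs as quoted in the examples above.
EXAMPLES: Dict[int, str] = {
    0x33A3: "SQUARE MM CUBED",
    0x33B2: "SQUARE MU S",
    0x33AF: "SQUARE RAD OVER S SQUARED",
}


def check_names(examples: Dict[int, str]) -> Dict[int, bool]:
    """Report, per cell, whether the database name matches the quoted name."""
    return {
        cell: unicodedata.name(chr(cell)) == name
        for cell, name in examples.items()
    }
```

Each of the three cells is a single coded character, even though its glyph looks like a short string of letters and digits.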

The last 26 rows of the A-zone, rows 34-4D, now contain CJK Unified Ideographs Extension A (Amendment 17). However, these rows were allocated in the first edition of ISO/IEC 10646-1 to the Hangul syllabic script, divided into three blocks and corresponding collections numbered 57-59. Amendment 5 to this first edition deleted these allocations and created instead an allocation for a substantially larger set of Hangul syllabic characters in the O-zone. This was accepted as a violation of the principle that published allocations would not be changed, but there were compelling reasons to adopt this change. It will not be taken as a precedent for future changes of a similar nature.

Unified ideographs of the I-zone

The I-zone of the BMP is allocated as a single block to Chinese/Japanese/Korean unified ideographs, and it correspondingly forms a single collection. For completeness this is shown in the following table:

The one Block and Collection of the I-zone
60    CJK UNIFIED IDEOGRAPHS                  4E00-9FFF

Amendment 8 added an informative Annex S to ISO/IEC 10646-1, which describes the unification procedure. This section of the guide is based on that annex.

The I-zone contains 20992 code positions, of which 20902 are currently allocated to specific ideographs. These ideographs were derived from over 54000 ideographs which are found in various different national and regional standards for coded character sets. A process of unification was applied in which single ideographs from two or more of the source standards were associated together and assigned to a single code position in the I-zone. The ideographs that are thus associated are described, for the purposes of the UCS, as unified. To preserve data integrity, any ideographs that are separately encoded in any one of the source standards were not unified. Also ideographs that are unrelated in historical derivation are not unified. However, some ideographs encoded in two different standards for the same language may have been unified.

The unification process is based on the shapes of the ideographs, analyzed according to a systematic procedure. Any ideograph is composed of geometric elements which may themselves be composite structures and possibly ideographs in their own right. This enables the structure of an ideograph to be described by a component tree, where the top node is the ideograph itself and the bottom nodes are primitive elements. When two ideographs are compared, their component trees are compared to see if they agree in all of the following aspects:

If all of these aspects agree then the ideographs are considered to have the same abstract shape and are therefore unified. Annex S to ISO/IEC 10646-1 contains a listing of pairs or triples of ideographs that would have been unified under these rules except for the criteria concerning historical derivation or separate encoding in an existing standard.

Unified ideographs are named and listed in the code pages of ISO/IEC 10646-1 in a manner separate from that used for other scripts. For each unified ideograph, the listing reproduces all (which may only be one) of the graphic symbols (source ideographs) that have been unified into that code position. For each graphic symbol it specifies the source standard from which the graphic symbol is taken and the coded representation of the symbol in that standard. The name assigned to each unified ideograph is algorithmically generated by appending its two-octet coded representation to "CJK UNIFIED IDEOGRAPH-", for example CJK UNIFIED IDEOGRAPH-4E00.
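The naming rule and the size of the I-zone can be sketched in a few lines. The function name `cjk_unified_name` is hypothetical, and the comparison with Python's `unicodedata` module assumes that module follows the same naming rule for this zone.

```python
I_ZONE_FIRST, I_ZONE_LAST = 0x4E00, 0x9FFF

# Number of code positions in the I-zone: 20992, as stated in this section.
I_ZONE_SIZE = I_ZONE_LAST - I_ZONE_FIRST + 1


def cjk_unified_name(cell: int) -> str:
    """Generate the name of a unified ideograph from its code position."""
    if not I_ZONE_FIRST <= cell <= I_ZONE_LAST:
        raise ValueError("code position is outside the I-zone")
    # Append the two-octet coded representation as four hexadecimal digits.
    return "CJK UNIFIED IDEOGRAPH-%04X" % cell
```

With 20902 positions allocated, 90 of the 20992 cells of the zone remain unallocated under the figures given in this section.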

The information concerning CJK unified ideographs has now been replaced by Amd.13.

The Hangul syllabics of the O-zone and Yi

Amendment 5 to the first edition of ISO/IEC 10646-1 specified a change in the encoding of Hangul syllabic script. Prior to that Amendment, the last 26 rows of the A-zone (row numbers 34-4D) were allocated to the Hangul syllabic script and the entire O-zone was reserved for future standardization. Due to a major revision of the corresponding Korean national standard shortly after the final text of the first edition was agreed, it became necessary to accommodate substantially more syllabic characters into the UCS. To include these additional characters, the total space required would be almost 44 rows.
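The figure of almost 44 rows can be reproduced with a little arithmetic. The sketch assumes the HANGUL EXTENDED range AC00-D7A3 given in the O-zone table of this guide and 256 cells per row; the function name `rows_needed` is hypothetical.

```python
CELLS_PER_ROW = 256


def rows_needed(first: int, last: int) -> float:
    """Return the number of 256-cell rows spanned by an inclusive cell range."""
    return (last - first + 1) / CELLS_PER_ROW


# Range of the enlarged Hangul syllable set (taken from the O-zone table).
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
hangul_cells = HANGUL_LAST - HANGUL_FIRST + 1          # 11172 characters
hangul_rows = rows_needed(HANGUL_FIRST, HANGUL_LAST)   # about 43.64 rows

# The original allocation, rows 34-4D of the A-zone, offered only 26 rows.
original_rows = 0x4D - 0x34 + 1
```

The enlarged set of 11172 characters needs about 43.64 rows, well beyond the 26 rows originally available in the A-zone.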

It was decided that this circumstance was sufficiently exceptional to merit violating the principle that code positions, once allocated, should not be changed. The Hangul syllabic characters already encoded would be moved from the A-zone to the O-zone, where there was sufficient space to include both the original and the additional characters in a single block, with a corresponding single collection. The amendment contains the statement that this change is not intended to be regarded as a precedent for other changes of allocation in future editions. This statement will itself be incorporated into future editions.

Amendment 14 has added the syllables and radicals of the Yi script to the O-zone.

Following these amendments, the O-zone has the structure shown in the following table:

The Blocks and Collections of the O-zone
No.   Name                                    Cells
76    YI SYLLABLES                            A000-A48F
77    YI RADICALS                             A490-A4CF
      (Reserved for future standardization)   A4D0-ABFF
71    HANGUL EXTENDED                         AC00-D7A3
      (Reserved for future standardization)   D7A4-D7FF

Amendment 5 contains a mapping table giving the correspondence between the code positions before and after this amendment for the characters originally allocated to rows 34-4D.

The Hangul syllabic characters are assigned names that follow the naming rules used for alphabetic scripts, e.g. HANGUL SYLLABLE GEOLH (KEOLH), rather than the algorithmic name structure used for the CJK unified ideographs of the I-zone.

The restricted use R-zone

The R-zone is distinguished from the remainder of the BMP in that its code positions are allocated for use only in special circumstances. There are three distinct uses for the R-zone:

Private use characters
    These may be specific user-defined characters or may be dynamically-redefinable characters. In either case an agreement is necessary between sender and recipient, outside the scope of ISO/IEC 10646, if these are to be exchanged meaningfully between two communicating parties.
Presentation forms of characters
    A presentation form is an alternative form, for use in a particular context, to the nominal form of a character or sequence of characters from the other zones of graphic characters. The transformation from the nominal form to the presentation forms may involve substitution, superimposition or combination. The rules for such transformations are outside the scope of ISO/IEC 10646.
    Presentation forms are not normally intended to be used as a substitute for the nominal forms, but specific applications may use them in this way for particular purposes such as compatibility with existing devices.
    The specification of presentation forms within ISO/IEC 10646, an example of which is LATIN SMALL LIGATURE FI at code position FB01, blurs the distinction between characters and glyphs discussed elsewhere in this guide.
Compatibility characters
    Compatibility characters are included in the UCS primarily for compatibility with existing coded character sets to allow two-way code conversion without loss of information.
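The relation between a presentation form and its nominal form can be illustrated with the ligature at FB01. ISO/IEC 10646 leaves the transformation rules out of scope, so the sketch below borrows Unicode compatibility normalization (NFKC) from Python's standard `unicodedata` module as a stand-in; that choice, and the function name `nominal_form`, are assumptions made for illustration only.

```python
import unicodedata

LIGATURE_FI = chr(0xFB01)  # LATIN SMALL LIGATURE FI, a presentation form


def nominal_form(text: str) -> str:
    """Replace presentation forms by nominal characters using NFKC.

    NFKC is a Unicode mechanism, used here only as an approximation of the
    mapping that ISO/IEC 10646 itself does not define.
    """
    return unicodedata.normalize("NFKC", text)
```

Applied to the ligature followed by "nd", the function returns the four nominal letters of "find", which shows why an application that stores presentation forms needs an agreed mapping for searching and comparison.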

As with the other zones, the R-zone is divided into blocks and collections, but the block for private use consists, by its very nature, only of unallocated code positions. The structure of this zone is as follows:

The Blocks and Collections of the R-zone

(collection = block; *,† = contains combining characters)

No.   Name                                    Cells
61    PRIVATE USE AREA                        E000-F8FF
62    CJK COMPATIBILITY IDEOGRAPHS            F900-FAFF
63*   ALPHABETIC PRESENTATION FORMS           FB00-FB4F
64    ARABIC PRESENTATION FORMS-A             FB50-FDFF
      (Reserved for future standardization)   FE00-FE1F
65†   COMBINING HALF MARKS                    FE20-FE2F
66    CJK COMPATIBILITY FORMS                 FE30-FE4F
67    SMALL FORM VARIANTS                     FE50-FE6F
68    ARABIC PRESENTATION FORMS-B             FE70-FEFE
                                              FEFF
          (The single character at code position FEFF is not in any of the blocks into which the BMP is divided. Its significance is explained in the chapter of this guide on Serial Transmission of the UCS)
69    HALFWIDTH AND FULLWIDTH FORMS           FF00-FFEF
70    SPECIALS                                FFF0-FFFD

Recall that the final two positions FFFE, FFFF are required to be left unused in every plane of the UCS. The collection numbered 200 is one of a number of special-purpose collections that have been assigned numbers in the range 200-299. See the chapter of this guide on repertoires and subsets for more information.

