CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Coding of character data

Fixed and variable length codes

A coded character set provides an unambiguous relationship between the characters of a specified set and the sequences of binary digits (bits) that are used to represent these characters. One of the best known coded character sets is ASCII, the American Standard Code for Information Interchange. This represents 128 characters and uses all possible combinations of 7 bits. There is, however, no reason other than convenience why each character should be represented by the same number of bits, provided that the structure of the code permits the boundaries between one character coding and the next to be distinguished.

Modern codes used for the interchange and processing of information encode each character by one or more octets, an octet being a sequence of 8 bits. A code is fixed length if each character of the code is represented by the same number of octets and is of variable length if this is not the case. One standardized code of variable length for the Latin script is that specified in

ISO/IEC 6937, Information technology - Coded graphic character set for text communication - Latin alphabet.

In this code, letters without diacritical marks, and other symbols, are encoded by a single octet. Letters with diacritical marks are considered to be characters in their own right but they are encoded by a sequence of two octets. The first octet identifies the diacritical mark and the second identifies the base letter. This ordering originates with electromechanical printing equipment in which diacritical marks are non-spacing characters. The following character is then superposed on the diacritical mark to form an accented character.

The use of non-spacing diacritical marks in the variable length encoding of ISO/IEC 6937 differs in principle from the use of combining characters in the UCS, but the practical effect is similar. In ISO/IEC 6937 the non-spacing diacritical marks are not considered to be characters in their own right. The octet that represents a particular non-spacing diacritical mark is not a valid encoding on its own; instead, it carries an implicit signal that it must be followed by a second octet to complete the encoding.
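The boundary rule for such a variable length code can be sketched in Python. This is only an illustration: the octet range C0 to CF used here for the non-spacing diacritical marks is an assumption that is not stated in this guide, and the name split_6937 is hypothetical.

```python
# Assumption (not taken from this guide): octets 0xC0-0xCF are the
# non-spacing diacritical marks, each of which must be followed by a
# second octet identifying the base letter.
LEAD_OCTETS = range(0xC0, 0xD0)


def split_6937(data: bytes) -> list:
    """Split an octet string into one-octet and two-octet character codings."""
    codings = []
    i = 0
    while i < len(data):
        if data[i] in LEAD_OCTETS:
            # A diacritical mark octet is not a valid encoding on its own.
            if i + 1 >= len(data):
                raise ValueError("diacritical mark octet without a base letter")
            codings.append(data[i:i + 2])
            i += 2
        else:
            codings.append(data[i:i + 1])
            i += 1
    return codings


print(split_6937(b"caf\xc2e"))  # [b'c', b'a', b'f', b'\xc2e']
```

Because the first octet alone determines whether a second octet follows, the boundaries between character codings can be found in a single pass from the start of the data.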

In contrast to ISO/IEC 6937, the combining characters of the UCS are characters in their own right but they do not have a visual representation on their own. A glyph, giving a visual representation, is only associated with a complete composite sequence in which combining characters form only a part. A code with such a structure is capable of representing more glyphs than there are characters in the code.
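The last point can be illustrated with a small Python sketch. The repertoire of three base letters and two combining characters is chosen purely for illustration, and the test for a combining character (a non-zero canonical combining class in Python's unicodedata module) is a simplification that happens to hold for the two marks used here.

```python
import unicodedata
from itertools import product

base_letters = ["a", "e", "o"]
# COMBINING GRAVE ACCENT and COMBINING ACUTE ACCENT
combining_marks = ["\u0300", "\u0301"]


def is_combining(ch: str) -> bool:
    # Simplification: treat a non-zero canonical combining class as
    # "combining". Both marks above have class 230.
    return unicodedata.combining(ch) != 0


# Each composite sequence is a base letter followed by one combining character.
sequences = [base + mark for base, mark in product(base_letters, combining_marks)]

coded_characters = len(base_letters) + len(combining_marks)
visual_forms = len(base_letters) + len(sequences)

print(coded_characters)  # 5
print(visual_forms)      # 9
```

Five coded characters give nine distinct visual forms here, even though only sequences with a single combining character are counted.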

Inadequacy of single-octet codes

The ASCII 7-bit code reserved the first 32 of its 128 code positions for control characters. Of the remaining 96 positions, one was used for the SPACE character and another for a DELETE character, so only 94 positions were left for graphic characters.

Due to the influence of ASCII on the development of coded character sets, early 8-bit codes were structured as two 7-bit codes, conceptually with a left-hand and a right-hand code table distinguished from one another by the eighth bit. Each of the 7-bit codes had the first 32 code positions reserved for control characters, but SPACE and DELETE were not required a second time, leaving 96 positions in the right-hand code table for graphic characters.

Such an 8-bit code is very limiting, as even with the use of combining characters it does not contain sufficient space for the base letters of more than one or two alphabetic scripts. A number of single-octet codes with this construction are defined in the multi-part standard ISO/IEC 8859 and further parts are still under development. Each part contains the specification of a single-octet code that includes the graphic characters of ASCII and which makes no use of combining characters. Although an improvement on the 7-bit ASCII code, each part covers the characters required for only a small selection of the world's languages.

Limitations of two-octet codes

As there is no possibility of using a single-octet code for an ideographic script, the 8-bit code structure with its ASCII inheritance was extended in the simplest manner that would permit the coding of such scripts. A sequence of two or more octets, each corresponding to a code position for a graphic character from the same half (left-hand or right-hand) of the 8-bit code table, could be taken together to encode a character. A sequence of two octets would then permit 94 x 94 (i.e. 8836), or 96 x 96 (i.e. 9216), characters to be encoded instead of the 94 or 96 permitted by single-octet coding. The structure for such codes, together with various code extension techniques, is specified in

ISO/IEC 2022, Information technology - Character code structure and extension techniques.

This provides sufficient space for the encoding of the most commonly used Chinese characters in a single code for use in either the left-hand or right-hand area of such a code structure. The full requirements of Chinese, however, are well illustrated by the Chinese Standard Interchange Code CNS 11643 (1992). This defines 7 such character sets which between them provide for the coding of 48027 Chinese characters. The code extension techniques of ISO/IEC 2022 include code switching mechanisms which permit all such tables to be used in conjunction with one another.
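The arithmetic behind these figures can be checked with a short Python sketch. The variable names are illustrative only, and the final figure is a simple lower bound on the number of 94 x 94 sets needed for 48027 characters; it says nothing about how CNS 11643 actually allocates its characters.

```python
import math

code_positions_7bit = 2 ** 7                      # 128
control_positions = 32

# Right-hand table: all positions outside the control area are graphic.
graphic_right = code_positions_7bit - control_positions   # 96
# Left-hand table: SPACE and DELETE take two of those positions.
graphic_left = graphic_right - 2                           # 94

two_octet_94 = graphic_left ** 2                  # 8836
two_octet_96 = graphic_right ** 2                 # 9216

chinese_characters = 48027
min_sets_needed = math.ceil(chinese_characters / two_octet_94)

print(two_octet_94, two_octet_96)  # 8836 9216
print(min_sets_needed)             # 6
```

No single two-octet set of this kind can hold the whole repertoire, which is why several sets and the code switching mechanisms of ISO/IEC 2022 are needed.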

By breaking away from the ISO/IEC 2022 code structure, the full 65536 code positions of a two-octet code become available. There is still the need to provide for the coding of control characters, but this is minimal in comparison with the available space. However, if 48027 of these code positions are required for Chinese characters, a single two-octet code becomes inadequate to cover all the languages of the world.
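A minimal Python sketch of this space calculation follows, using only the figures quoted above; the variable names are illustrative.

```python
two_octet_positions = 2 ** 16          # 65536
chinese_characters = 48027

# Code positions left for every other script, symbol and control function.
remaining = two_octet_positions - chinese_characters
share_for_chinese = chinese_characters / two_octet_positions

print(remaining)                      # 17509
print(round(share_for_chinese, 3))    # 0.733
```

Roughly three quarters of a two-octet code would be consumed by this one repertoire alone.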

The four-octet structure of the UCS

Since the intention of the UCS is to have the capability of covering all characters of all languages, a four-octet structure has been adopted. The most significant bit of the most significant octet is constrained to be zero, which permits its use for private internal purposes in a data processing system. The remaining 31 bits allow for over two thousand million code positions, which should be more than enough for all future needs. For reference the four octets are named, in order from the most significant to the least significant,

the Group-octet, G-octet or simply G;
the Plane-octet, P-octet or simply P;
the Row-octet, R-octet or simply R;
the Cell-octet, C-octet or simply C.

The entire code space is correspondingly viewed as a four-dimensional structure composed of

128 groups, each specified by a value for G;
256 planes in each group, each plane specified by a value for P;
256 rows in each plane, each row specified by a value for R;
256 cells in each row, each cell specified by a value for C.

The values of any octet are specified by two hexadecimal digits 0-9, A-F, in which A through F correspond to the decimal values 10 through 15 respectively. The value of G is restricted to the range 00-7F.
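The four-dimensional view can be sketched in Python by splitting a 31-bit code value into its G, P, R and C octets. The function names split_ucs4 and join_ucs4 are hypothetical and are used only for this illustration.

```python
def split_ucs4(code: int) -> tuple:
    """Return the (G, P, R, C) octets of a four-octet code value."""
    # The most significant bit must be zero, so G lies in the range 00-7F.
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code value outside the 31-bit code space")
    g = (code >> 24) & 0xFF
    p = (code >> 16) & 0xFF
    r = (code >> 8) & 0xFF
    c = code & 0xFF
    return g, p, r, c


def join_ucs4(g: int, p: int, r: int, c: int) -> int:
    """Rebuild the code value from its four octets."""
    return (g << 24) | (p << 16) | (r << 8) | c


g, p, r, c = split_ucs4(0x00011234)
print(f"G={g:02X} P={p:02X} R={r:02X} C={c:02X}")  # G=00 P=01 R=12 C=34

# 128 groups x 256 planes x 256 rows x 256 cells
print(128 * 256 * 256 * 256)  # 2147483648
```

The product of the four dimensions equals 2 to the power 31, the "over two thousand million" code positions mentioned earlier.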

A cell within a plane may be described by four hexadecimal digits giving its R and C values, thus 1234 corresponds to R=12, C=34. In every plane the cells FFFE and FFFF are left unused; FFFE has a special use in signatures for coding identification (see the chapter on serial transmission of the UCS) and FFFF is available whenever a value is required that is guaranteed not to be a valid character code.
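As a brief illustration in Python, the four-digit cell notation and the two unused cells can be expressed as follows. The function names are hypothetical.

```python
def cell_label(r: int, c: int) -> str:
    """Four hexadecimal digits giving the R and C values of a cell."""
    return f"{r:02X}{c:02X}"


def is_unused_cell(r: int, c: int) -> bool:
    """True for cells FFFE and FFFF, which are left unused in every plane."""
    return r == 0xFF and c in (0xFE, 0xFF)


print(cell_label(0x12, 0x34))        # 1234
print(is_unused_cell(0xFF, 0xFE))    # True
print(is_unused_cell(0x12, 0x34))    # False
```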

The plane with G = 00, P = 00 is known as the Basic Multilingual Plane (BMP). The 34 planes P = 0F, 10, E0-FF in Group 00 and the whole of the 32 groups G = 60-7F are designated as available for private use, outside the scope of standardization. Planes P = 0F, 10 in Group 00 were added by Amendment 1 to those reserved for private use, to enable two private use planes to be accessed by the UTF-16 coding methods specified in that Amendment. There is also a block reserved for private use that lies within the BMP; see the chapter on the Basic Multilingual Plane.
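The allocation described above can be sketched as a pair of Python predicates on the G and P octets. The names are hypothetical, and the sketch deliberately ignores the private use block that lies within the BMP, since it works only at the level of whole planes and groups.

```python
# Private use planes in Group 00: 0F, 10 and E0-FF.
PRIVATE_PLANES_GROUP_00 = {0x0F, 0x10} | set(range(0xE0, 0x100))
# Private use groups: 60-7F.
PRIVATE_GROUPS = set(range(0x60, 0x80))


def is_bmp(g: int, p: int) -> bool:
    """True for the Basic Multilingual Plane (G = 00, P = 00)."""
    return g == 0x00 and p == 0x00


def is_private_use_plane(g: int, p: int) -> bool:
    """True if the whole plane is designated as available for private use."""
    return g in PRIVATE_GROUPS or (g == 0x00 and p in PRIVATE_PLANES_GROUP_00)


print(len(PRIVATE_PLANES_GROUP_00))        # 34
print(len(PRIVATE_GROUPS))                 # 32
print(is_private_use_plane(0x00, 0x10))    # True
print(is_private_use_plane(0x00, 0x00))    # False
```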

The UCS is one of two major codes that have developed outside of the constraints of ISO/IEC 2022. The other is the commercially-developed UNICODE™, which was developed strictly as a two-octet code. In the interests of compatibility, ISO and the Unicode Consortium cooperated during the development of both codes to ensure that the BMP coincides with the UNICODE™ code table. In addition, the UCS standard specifies more than one form for the coded representation of characters. One of these is the two-octet BMP form which, as its name implies, provides for the encoding of the characters of the BMP by the values for their R and C octets alone. This coding is identical to that provided by UNICODE™.
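The relationship between the two-octet BMP form and the four-octet form can be sketched in Python. The function names are hypothetical, and the sketch assumes that octets are written most significant first; octet order in transmission is a separate matter that is not settled by this chapter.

```python
def encode_four_octet_form(text: str) -> bytes:
    """Encode each character as its G, P, R and C octets."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(4, "big")   # assumption: most significant octet first
    return bytes(out)


def encode_bmp_form(text: str) -> bytes:
    """Encode each BMP character by its R and C octets alone."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # G or P is non-zero, so the character is not in the BMP.
            raise ValueError("character lies outside the BMP")
        out += bytes([(code >> 8) & 0xFF, code & 0xFF])
    return bytes(out)


text = "A\u00e9"   # LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE
print(encode_four_octet_form(text).hex())  # 00000041000000e9
print(encode_bmp_form(text).hex())         # 004100e9
```

For a character of the BMP the G and P octets are both 00, so dropping them loses no information.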

The pressure to ensure that all languages in current use can be represented by UNICODE™, so requiring coding within the BMP, has led to compromises in design that would not have been necessary in a pure four-octet code. Once again it is the space required by the ideographic scripts that is the source of the difficulties. This is explained in more detail in the chapter of this guide on the Basic Multilingual Plane.
