CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Coding methods of the UCS

The coding alternatives

At present there are four coding methods specified within ISO/IEC 10646-1, any one of which can be specified in a claim of conformance to that standard. These methods have been assigned acronyms for easy reference, as follows:

UCS-2 is the Two-octet BMP form of coding;

UCS-4 is the Four-octet canonical form of coding;

UTF-16 is UCS Transformation Format 16, which was added to ISO/IEC 10646-1 by Amendment 1;

UTF-8 is UCS Transformation Format 8, which was added to ISO/IEC 10646-1 by Amendment 2.

The first edition of ISO/IEC 10646-1 also specified a transformation format UTF-1, but this was deleted from the standard by Amendment 4 and is therefore not available as a coding method in a claim of conformance to ISO/IEC 10646-1.

UCS-2: Two-octet BMP form

The two-octet BMP form of coding permits the use of characters from the BMP with each character represented by two octets. The BMP is specified by the G-octet and P-octet both being 00. In this form of coding a character is represented by the R-octet and C-octet of its code position. When expressed as a four-digit hexadecimal number the R-octet gives the most significant two digits and the C-octet gives the least significant two digits. UCS-2 provides a fixed-length coding for all the characters of the BMP.
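As an illustration of the split into R-octet and C-octet, the following is a minimal Python sketch. The function name ucs2_octets is hypothetical and is not taken from the standard; the sample value 20AC is the BMP code position of the EURO SIGN.

```python
def ucs2_octets(code_position: int) -> bytes:
    """Return the R-octet followed by the C-octet of a BMP code position.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_position <= 0xFFFF:
        raise ValueError("UCS-2 can only represent code positions in the BMP")
    r_octet = code_position >> 8     # most significant two hexadecimal digits
    c_octet = code_position & 0xFF   # least significant two hexadecimal digits
    return bytes([r_octet, c_octet])


print(ucs2_octets(0x20AC).hex())  # prints 20ac
```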

UCS-4: Four-octet canonical form

The four-octet canonical form permits the use of all characters of ISO/IEC 10646 with each character represented by the G-octet, P-octet, R-octet and C-octet of its code position. These are taken in decreasing order of significance in the expression of a code position as an eight-digit hexadecimal number. UCS-4 provides a fixed-length coding for all the characters of the UCS.
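The four-octet form can be sketched in the same way. The function name ucs4_octets is hypothetical, and the upper bound 7FFFFFFF assumes the 31-bit code space of ISO/IEC 10646 (G-octet values 00-7F), which is not stated in the text above.

```python
def ucs4_octets(code_position: int) -> bytes:
    """Return the G-, P-, R- and C-octets in decreasing order of significance.

    Hypothetical helper; the 31-bit upper bound is an assumption.
    """
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the assumed 31-bit UCS code space")
    return code_position.to_bytes(4, "big")


g, p, r, c = ucs4_octets(0x000020AC)
print(f"G={g:02X} P={p:02X} R={r:02X} C={c:02X}")  # prints G=00 P=00 R=20 C=AC
```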

UTF-16: UCS Transformation Format 16

Once characters start to be allocated outside of the BMP, it will no longer be possible to use UCS-2 to encode all the allocations that have been made. However, a transition to UCS-4 instantly halves the rate of transfer of data through a communication link or the amount that can be stored on a given storage medium. This effect occurs even if the transition to UCS-4 has been made in order to accommodate only very occasional characters coded outside of the BMP.

The transformation format UTF-16 has been designed to avoid this halving of capacity, by means of a variable-length coding. It provides a means of coding any character within the first 17 planes P=00-10 of Group 00 such that the coding of any character within the BMP (Plane 00) is unchanged from its UCS-2 form. This multiplies the number of available code positions by 17 when compared with the BMP, but the number of octets used for coding is increased only for the (occasional) characters that are allocated to the planes outside the BMP. The capacity of a transmission link or storage device will therefore be little affected.
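The effect on capacity can be illustrated with a small octet count. This is only a sketch; the function names and the sample text (99 BMP characters and one character from Plane 01) are invented for the example.

```python
def octets_in_ucs4(code_positions):
    """Every character takes four octets in UCS-4."""
    return 4 * len(code_positions)


def octets_in_utf16(code_positions):
    """Two octets for a BMP character, four for a character in planes 01-10."""
    return sum(2 if cp <= 0xFFFF else 4 for cp in code_positions)


# Hypothetical sample: 99 BMP characters plus one character outside the BMP.
sample = [0x0041] * 99 + [0x10000]
print(octets_in_ucs4(sample))   # prints 400
print(octets_in_utf16(sample))  # prints 202
```

For comparison, 100 characters drawn only from the BMP would occupy 200 octets in UCS-2.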

This has been achieved by reserving the S-zone of the BMP, consisting of the 8 rows D8-DF, for the exclusive use of UTF-16. These R-octet values can therefore never occur in an encoding within UCS-2. They are used instead to provide an escape mechanism into the 16 planes G=00, P=01-10. Amendment 1, which specifies UTF-16, also amended the planes of Group 00 which are reserved for private use. It added planes P=0F, 10 to the planes P=E0-FF which were already reserved for this purpose in the first edition of ISO/IEC 10646-1. The effect of this change is to include two private use planes in the 16 additional planes accessible by use of UTF-16.

The UTF-16 coding for a character coded within Planes 00-10 of Group 00 is constructed as follows:

If P=00 then the coding is in two octets and is as for UCS-2, i.e. the R-octet followed (i.e. with lower significance) by the C-octet;
If P=01-10 then the coding is in four octets constructed from the UCS-2 coding of two code positions from the S-zone. The first (most significant) two octets are from the range D800-DBFF, the second (least significant) two octets are from the range DC00-DFFF. The code space P=01-10 is divided into blocks of 400 (hexadecimal value) cells for the purpose of determining the coding. The first two octets determine in which block the code position lies. The second two octets determine the position within the block.

In more detail, the correspondence between the UCS code position and the pair of S-zone positions is as follows:

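The first two octets are D800 if P=01, R=00-03; D801 if P=01, R=04-07, …, D840 if P=02, R=00-03, …, DBFF if P=10, R=FC-FF. Each block of 400 (hexadecimal) cells spans four rows, so each plane uses 40 (hexadecimal) values of the first two octets.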
The second two octets run from DC00 to DFFF as the position within the block of cells runs from the first to the last position.

The UTF-16 encoding D800 DC00 and the UCS-4 encoding 00010000 therefore represent the same character.
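The construction can be sketched in Python as follows. The function name utf16_octets is hypothetical, and the octets are written most significant first, which is an assumption of this sketch rather than a requirement stated above.

```python
def utf16_octets(code_position: int) -> bytes:
    """Return the UTF-16 coding of a code position in Planes 00-10 of Group 00.

    Hypothetical helper; octets are emitted most significant first.
    """
    if not 0 <= code_position <= 0x10FFFF:
        raise ValueError("UTF-16 reaches only Planes 00-10 of Group 00")
    if code_position <= 0xFFFF:
        if 0xD800 <= code_position <= 0xDFFF:
            raise ValueError("the S-zone is reserved for UTF-16 itself")
        return code_position.to_bytes(2, "big")      # same as UCS-2
    offset = code_position - 0x10000                 # position within planes 01-10
    first = 0xD800 + (offset >> 10)                  # selects the block of 0x400 cells
    second = 0xDC00 + (offset & 0x3FF)               # position within that block
    return first.to_bytes(2, "big") + second.to_bytes(2, "big")


print(utf16_octets(0x00010000).hex())  # prints d800dc00
print(utf16_octets(0x0010FFFF).hex())  # prints dbffdfff
```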

UTF-8: UCS Transformation Format 8

The aim of UTF-8 is entirely different from that of UTF-16. The transformation format UTF-8 is intended for the transmission of data through communication systems which treat the data stream as a sequence of octets from a coding system conforming to the 8-bit code structure laid down in ISO/IEC 4873. This code structure is specific as to the interpretation of octet values in the range 00-7F but octets in the range 80-FF have a variable interpretation that requires agreement between the communicating parties. A communication channel expecting data to conform to ISO/IEC 4873 may therefore only presume to know the interpretation of octets 00-7F.

In particular, the communication system may interpret any octet in the range 00-1F as a control character as specified in ISO/IEC 4873, any octet in the range 20-7E as the ASCII character with this coding, and octet 7F as the DELETE character. To comply with this, UTF-8 encodes BMP code positions 0000-007F inclusive by means of their final octet only. This range of positions includes those reserved for control characters and the DELETE function, and the scheme relies on the positioning of the ASCII graphic character set in positions 0020-007E of the BMP.

All other code positions in the UCS are represented in UTF-8 by a sequence of 2, 3, 4, 5 or 6 octets. The first octet of such a sequence is in the range C0-FD. Continuing octets are in the range 80-BF. Octets FE and FF are not used.
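A receiver can tell from the first octet how long a sequence is. The sketch below uses the sub-ranges of the first octet from the original UTF-8 definition (C0-DF, E0-EF, F0-F7, F8-FB, FC-FD); these sub-ranges are not given in the text above, and the function name utf8_sequence_length is hypothetical.

```python
def utf8_sequence_length(first_octet: int) -> int:
    """Return the number of octets in the sequence that starts with first_octet.

    Hypothetical helper; sub-ranges follow the original six-octet UTF-8 scheme.
    """
    if 0x00 <= first_octet <= 0x7F:
        return 1                      # code positions 0000-007F, sent as one octet
    if 0xC0 <= first_octet <= 0xDF:
        return 2
    if 0xE0 <= first_octet <= 0xEF:
        return 3
    if 0xF0 <= first_octet <= 0xF7:
        return 4
    if 0xF8 <= first_octet <= 0xFB:
        return 5
    if 0xFC <= first_octet <= 0xFD:
        return 6
    raise ValueError("continuing octet (80-BF) or unused octet (FE, FF)")


print(utf8_sequence_length(0xE2))  # prints 3
```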

There is no concept of most significant and least significant octets in UTF-8 encoding. It is a conversion of UCS characters into an ordered sequence of octets, for transmission or other processing in this form. The terms first octet and continuing octets refer to the order in which the octets occur in the sequence. This order must be maintained, even in transmission systems which serialize 16-bit words as octets by sending the least significant octet before the most significant octet.
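The difference can be seen with Python's built-in codecs, used here purely as an illustration: the two-octet coding of a BMP character changes with the serialization order of the 16-bit word, whereas the UTF-8 sequence has a single fixed order.

```python
euro = "\u20ac"  # EURO SIGN, BMP code position 20AC

big_endian = euro.encode("utf-16-be")     # most significant octet first
little_endian = euro.encode("utf-16-le")  # least significant octet first
utf8 = euro.encode("utf-8")               # first octet, then continuing octets

print(big_endian.hex())     # prints 20ac
print(little_endian.hex())  # prints ac20
print(utf8.hex())           # prints e282ac
```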

The details of the transformation from UCS-4 coding to UTF-8 coding are complex and are not given here in detail. The transformation is such that a code position within the BMP takes at most 3 octets and a code position in planes P=01-1F of Group 00 takes at most 4 octets. The code positions that take at most 4 octets therefore include, and extend beyond, those that can be encoded in UTF-16.
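For readers who want a concrete picture, the following is a minimal Python sketch of the transformation as originally defined, with sequences of up to six octets. It is not a substitute for the specification in the standard; the function name utf8_octets and the table layout are hypothetical, and the range limits are taken from the original UTF-8 definition rather than from the text above.

```python
def utf8_octets(code_position: int) -> bytes:
    """Transform a UCS-4 code position into its UTF-8 octet sequence.

    Hypothetical helper following the original one-to-six-octet scheme.
    """
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the assumed 31-bit UCS code space")
    if code_position <= 0x7F:
        return bytes([code_position])     # final octet only
    # (highest code position, number of octets, marker bits of the first octet)
    table = (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),        # rest of the BMP
        (0x1FFFFF, 4, 0xF0),      # planes 01-1F of Group 00
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    )
    for limit, length, marker in table:
        if code_position <= limit:
            break
    octets = []
    for _ in range(length - 1):
        # Each continuing octet carries six bits and lies in the range 80-BF.
        octets.append(0x80 | (code_position & 0x3F))
        code_position >>= 6
    octets.append(marker | code_position)  # first octet, in the range C0-FD
    return bytes(reversed(octets))


print(utf8_octets(0x20AC).hex())      # prints e282ac
print(utf8_octets(0x00010000).hex())  # prints f0908080
```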
