CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Serial transmission of the UCS

Octet ordering

For many purposes, character data encoded according to any of the encoding methods of the UCS will need to be transmitted in serial form as a sequence of octets. Each of the encoding methods of the UCS, other than UTF-8, specifies its encodings in terms of octets ranked from a most significant octet to a least significant octet. This corresponds to writing the code value as a hexadecimal number with two digits per octet (four digits for a UCS-2 value, eight for a UCS-4 value). This description in terms of orders of significance is entirely separate from the order in which the octets are transmitted down a serial communication channel. UTF-8 coding is distinct in that it directly codes UCS data as an ordered sequence of octets.

Clause 6.3 of ISO/IEC 10646-1 lays down that, in any transmission method, the sequence order of octets from most to least significant shall be preserved, and the most significant and least significant ends of the sequence shall be identifiable. Furthermore, when character data is serialized as octets, a more significant octet shall precede a less significant octet. When the data is not serialized as octets, the order of octets is a matter of agreement between sender and recipient. For example, if UCS-4 encoding is transmitted along a 16-bit data path, the most significant 16-bit word must be composed of the G (group) and P (plane) octets and the least significant word of the R (row) and C (cell) octets, but the ordering of the two octets within a 16-bit word is a matter of private agreement. The use of signatures for coding identification, explained below, enables a receiving implementation to determine the octet ordering within a 16-bit word in such cases.
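
The octet ordering rule can be illustrated with a minimal Python sketch. The function names ucs4_octets and ucs4_words are hypothetical, chosen only for this illustration; they are not taken from ISO/IEC 10646-1. The sketch serializes a UCS-4 code value as four octets, most significant first, and splits the same value into the two 16-bit words that a 16-bit data path would carry.

```python
def ucs4_octets(code_value):
    """Serialize a UCS-4 code value as four octets, most significant first."""
    # byteorder="big" places the most significant octet at index 0.
    return code_value.to_bytes(4, byteorder="big")


def ucs4_words(code_value):
    """Split a UCS-4 code value into its two 16-bit words.

    The first word holds the G and P octets, the second the R and C octets.
    The order of the two octets inside each word is NOT decided here; on a
    16-bit data path that is left to agreement between sender and recipient.
    """
    return (code_value >> 16) & 0xFFFF, code_value & 0xFFFF


octets = ucs4_octets(0x0000FEFF)
group, plane, row, cell = octets            # iterating bytes yields integers
high_word, low_word = ucs4_words(0x0000FEFF)

print([f"{o:02X}" for o in octets])         # octets in transmission order
print(f"{high_word:04X} {low_word:04X}")    # the two 16-bit words
```

For the code value 0000FEFF this gives the octets 00, 00, FE, FF in transmission order and the 16-bit words 0000 and FEFF.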

A serial data stream of octets may be transmitted through a data path that is 8 bits wide, but it may also be transmitted in turn as a serial stream of single bits. Such representation of the octet stream as a stream of single bits is outside the scope of ISO/IEC 10646-1, so there is no requirement placed by that standard on the order in which the individual bits of an octet are transmitted. The requirements concerning octet order apply at a higher level of protocol at which the data exists in the form of complete octets.

Signatures for coding identification

There is one character of the BMP that is not part of any of the blocks into which the BMP is otherwise divided. This is the character

ZERO WIDTH NO-BREAK SPACE (U-0000FEFF)

It is not normally required for linguistic purposes and is not present in any of the collections associated with particular scripts. It is given a special significance, by a convention laid down in annex F of ISO/IEC 10646-1, that enables a receiving implementation to determine without prior knowledge what octet ordering is being adopted by the originating implementation. The coded form of this character, in each of the coding methods of the UCS, is known as the signature of that coding method.

Under this convention, an originating implementation sends the character ZERO WIDTH NO-BREAK SPACE at the beginning of a stream of characters. A receiving implementation that is unaware of the convention may simply ignore this character as it has no semantic meaning when sent as an initial character. But an implementation aware of the convention may use the received coded form to determine the ordering of octets being used by the sender.

The coded forms that may be received are:

UCS-2 signature: FEFF
UCS-4 signature: 0000 FEFF
UTF-16 signature: FEFF
UTF-8 signature: EF BB BF
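
As a rough cross-check, the signatures listed above can be reproduced with the codecs in the Python standard library. This sketch rests on an assumption: Python has no codecs named UCS-2 or UCS-4, so "utf-16-be" and "utf-32-be" are used as stand-ins. For the single BMP character involved they yield the same octets, most significant octet first.

```python
SIGNATURE_CHAR = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

# Assumption: the big-endian UTF-16 and UTF-32 codecs stand in for
# UCS-2 and UCS-4, which Python does not offer under those names.
signatures = {
    "UCS-2": SIGNATURE_CHAR.encode("utf-16-be"),
    "UCS-4": SIGNATURE_CHAR.encode("utf-32-be"),
    "UTF-16": SIGNATURE_CHAR.encode("utf-16-be"),
    "UTF-8": SIGNATURE_CHAR.encode("utf-8"),
}

for name, octets in signatures.items():
    # Print each signature as space-separated hexadecimal octets.
    print(name, " ".join(f"{o:02X}" for o in octets))
```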

The UTF-8 coding method is a special case, in that this method in itself specifies an octet ordering. The octets of the UTF-8 signature should therefore never be received in any other order. For the other three coded forms, if the data received is interpreted as containing the 16-bit word FFFE then it indicates that the order of octets within the word should be reversed. Recall that FFFE is prohibited from allocation to a character within any plane of the UCS, specifically to allow this method of signatures to operate without any possible ambiguity. If the word value FEFF (or FFFE) is preceded by the word value 0000 then it also serves to indicate that UCS-4 coding is being used. It is not possible to distinguish between UCS-2 and UTF-16 coding by signature alone, as they give identical encodings for every character of the BMP.
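
The decision procedure described in the paragraph above might be sketched as follows. The function identify_signature and its return values are hypothetical, introduced only for illustration. The sketch applies only the rules stated above: it assumes the 16-bit words of a UCS-4 signature arrive in the correct order, with at most the octets inside each word reversed, and it does not attempt to recognize any other ordering.

```python
def identify_signature(data):
    """Inspect the first octets of a received stream.

    Returns (coding, octets_reversed), or None when no signature is found.
    octets_reversed is True when the octets within each 16-bit word must
    be swapped before the data is interpreted.
    """
    # UTF-8 fixes its own octet order, so only one form can occur.
    if data[:3] == b"\xEF\xBB\xBF":
        return ("UTF-8", False)

    # A word of 0000 followed by FEFF (or FFFE) indicates UCS-4.
    if data[:2] == b"\x00\x00":
        if data[2:4] == b"\xFE\xFF":
            return ("UCS-4", False)
        if data[2:4] == b"\xFF\xFE":
            return ("UCS-4", True)

    # UCS-2 and UTF-16 cannot be told apart by signature alone.
    if data[:2] == b"\xFE\xFF":
        return ("UCS-2 or UTF-16", False)
    if data[:2] == b"\xFF\xFE":
        return ("UCS-2 or UTF-16", True)

    return None


print(identify_signature(b"\xFF\xFE\x41\x00"))
```

A receiver that obtains a result with octets_reversed set to True would swap the two octets of every subsequent 16-bit word before decoding. A result of None means that no signature was sent and the ordering must be known by other means.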
