CEN Guide to the Use of Character Sets in Europe

TC 304

UCS - Use of control functions with the UCS

The coding of control functions in 7-bit and 8-bit codes

The general structure of 7-bit and 8-bit codes for character sets is given in:

ISO/IEC 2022 Information technology - Character code structure and extension techniques.

In particular, this standard provides the overall rules governing the use of control functions with such character sets. When control functions are used in conjunction with the UCS, their coded representation is algorithmically derived from their ISO/IEC 2022 representations. In the first edition of ISO/IEC 10646-1 this algorithm was based exclusively on the ISO/IEC 2022 representation for a 7-bit code, but Amendment 3 changed this to permit also coded representation of control functions based on their 8-bit code representation.

The following definitions are taken from ISO/IEC 2022:

control character: A control function the coded representation of which consists of a single bit combination.
control function: An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more bit combinations.

In this context a bit combination is either a 7-bit or an 8-bit byte. In a 7-bit code the hexadecimal values 00-1F are reserved for control characters; this is known as the CL area of the code table. In an 8-bit code there are two such reserved areas, the CL area comprising values 00-1F and the CR area comprising 80-9F.

The coding of any control function either consists of a single control character from either the CL or CR areas, or such a character followed by one or more bit combinations. These following bit combinations are unrestricted in number and value. In particular, they may be (and usually are) values that on their own would be coded representations of graphic characters. The syntax for the use of any control character must provide a way of determining the end of the coding of each control function coded with its use.

ISO/IEC 2022 specifies the assignment of only one control character:

ESCAPE (acronym ESC) is required to be coded at 1B in the CL area.

The semantics specified for this control character are simple - it causes the meaning of a limited number of bytes following it to be changed. A sequence consisting of the ESCAPE character and such following bytes is known as an escape sequence.

ISO/IEC 2022 requires an escape sequence to have the following structure:

The first character is the ESCAPE character represented by byte value 1B;
the last character is a Final Byte with value in the range 30-7E;
between the first and last character there may be zero, one or more Intermediate Bytes, each with a value in the range 20-2F.

This structure enables the escape sequence to be delimited without knowledge of the semantics of the control functions represented by such sequences. ISO/IEC 2022 also specifies a more detailed structure for the use of escape sequences. These uses are all for code extension purposes, such as changing the sets of control characters or graphic characters that are in use, but that standard does not specify the semantics of individual escape sequences. These are subject to registration in accordance with:

ISO 2375 Data processing - Procedure for registration of escape sequences.

The register set up in accordance with ISO 2375 is:

International Register of Coded Character Sets to be used with Escape Sequences.

C0 and C1 sets of control characters

The ISO 2375 register includes escape sequences that invoke individual control functions, but it also includes escape sequences that invoke sets of control characters. Each such set is either a C0-set or a C1-set. One set of each type may be invoked simultaneously in either a 7-bit or an 8-bit code. In both cases a C0-set is mapped into the CL area of the code table. A C1-set may be mapped into the CR area of an 8-bit code, but an alternative coding is necessary in a 7-bit code and this alternative is also available for use with 8-bit codes.

This alternative coding of a C1-set is by means of escape sequences. Control characters that would be coded as values in the CR area are represented by a two-byte escape sequence consisting of ESCAPE followed by a Final Byte in the range 40-5F. These values 40-5F correspond sequentially to the values 80-9F of the CR area. This coding is known as the ESC Fe representation of the control character. This term is used in the first edition of ISO/IEC 10646-1 in its explanation of the coding of control functions, but the permitted coding of control functions was extended by Amendment 3 and references to ESC Fe sequences have been deleted.

Control functions for purposes other than code extension are specified in

ISO/IEC 6429 Information technology - Control functions for coded character sets.

This standard provides a repertoire of control functions from which particular functions may be selected to meet particular needs. Conformance to ISO/IEC 6429 does not require any particular control functions to be supported. It does, however, require that when any of its control functions are used then they shall be encoded as specified in that standard, and that encodings so specified shall not be used for any other purpose.

A particular type of coding specified in ISO/IEC 6429 is a control sequence. Control sequences are similar in nature to the escape sequences of ISO/IEC 2022 but they are less restricted in their use. In particular they permit the coding of control functions that take one or more numeric values as parameters. Control sequences start with the C1 control character

CONTROL SEQUENCE INTRODUCER (acronym CSI)

which may be coded either as 9B in the CR area or as the ESC Fe escape sequence ESC 5B. This character occupies the same relative location in a C1 set as does the ESCAPE character (1B) in a C0 set. A control sequence has the following structure:

It starts with the CSI function in either of its two coded representations;
the last character is a Final Byte with value in the range 40-7E;
between the first and last character there may be zero, one or more Parameter Bytes each with a value in the range 30-3F, followed by zero, one or more Intermediate Bytes, each with a value in the range 20-2F.

The Parameter Bytes used to represent a sequence of numeric parameters consist of the decimal representation of those numbers, with digits 0-9 coded by hexadecimal values 30-39, any separator symbol within a parameter (e.g. decimal point) coded as 3A and with distinct parameters separated from one another by value 3B.

The Final Byte value 6D is reserved in ISO/IEC 6429 for use by ISO/IEC 10646-1, in which it is used to encode the control function IDENTIFY UNIVERSAL CHARACTER SUBSET.

The use of control functions with the UCS

The coded representation of a control function for use with the UCS is obtained very simply from its coded representation for use with an 8-bit code. For an 8-bit code the coded representation consists of a sequence of one or more 8-bit bytes, of which the first is in one of the ranges 00-1F or 80-9F. Each of these bytes is converted to a UCS code position by taking the byte value as the C-octet value and setting the G-octet, P-octet and R-octet all to zero. Each UCS code position is then encoded according to the adopted coding method of the UCS, namely one of UCS-2, UCS-4, UTF-8 and UTF-16. Since code positions 0000-001F and 0080-009F of the BMP are reserved for the use of control characters, this gives an unambiguous coded representation of any control function without conflicting with the coding of graphic characters in the UCS.

In the first edition of ISO/IEC 10646-1 there was an additional requirement that control characters from a C1-set should be represented as ESC Fe escape sequences before this algorithm is applied. This requirement was removed by Amendment 3. As an example the control function CONTROL SEQUENCE INTRODUCER, which has coded representation 9B in the CR area of an 8-bit code, would have coded representations in a UCS-2 coding of

001B 005B prior to Amendment 3;
either 001B 005B or more simply 009B after Amendment 3.

Identification of UCS subsets by use of control functions

One particular control function is defined within ISO/IEC 10646-1 for identifying selected subsets of the UCS. This is coded by means of an ISO/IEC 6429 control sequence, using a Final Byte value reserved in that standard for the use of ISO/IEC 10646-1. The control function is

IDENTIFY UNIVERSAL CHARACTER SUBSET (acronym IUCS)

It takes as parameters the collection numbers of the collections that comprise the selected subset. The ISO/IEC 6429 coded representation of this control function consists of :

the control function CONTROL SEQUENCE INTRODUCER (CSI)
followed by parameter bytes that encode the sequence of collection numbers
followed by the Intermediate Byte 20
followed by the (reserved) Final Byte 6E.

This coded representation is then converted to the form appropriate to the UCS coding method in use, following the transformation rules explained above.

As an example,

the subset comprising the BASIC LATIN and LATIN-1 SUPPLEMENT collections
which are collections numbered 1 and 2
may be identified by the control function IUCS 1, 2
which has ISO/IEC 6429 coded representation CSI 31 3B 32 20 6E
and which has UCS-2 encoding 001B 005B 0031 003B 0032 0020 006E.

This UCS-2 coding uses the ESC Fe coding form for CSI; following Amendment 3 to ISO/IEC 10646-1 the code values 001B 005B can be replaced by the single code value 009B.

Invocation of the UCS from an 8-bit code

The character code structure and extension techniques specified in ISO/IEC 2022 includes provision for control functions that

designate and invoke an identified coding system different from that of ISO/IEC 2022, and
return to the coding system of ISO/IEC 2022 following such an invocation.

The coded representation of such a control function is by means of an escape sequence in which the first Intermediate Byte has value 25. If there is a second Intermediate Byte with value 2F, it signifies that the new coding system either has no means of return to ISO/IEC 2022 coding or that it provides its own means for so doing. If the second Intermediate Byte is either absent or has any other value then it signifies that the new coding system uses the escape sequence

ESC 25 40

to return to the coding system of ISO/IEC 2022 and that on such return, the state of the coding system (which sets of control and graphic characters are invoked, etc.) is restored to the state at the time that the other coding system was invoked. This is known as the standard return.

When a new coding system is designated and invoked by means of an escape sequence that signifies that the invocation is without standard return, then any return to the coding system of ISO/IEC 2022 is (by 15.4.2 of ISO/IEC 2022) a return to that coding system in an unspecified state. No announcements, designations and invocations of sets of control and graphic characters, or of other features of ISO/IEC 2022, survive from the state in which the new coding system was invoked. This is true even if the new coding system in fact specifies that return shall be by means of the escape sequence ESC 25 40; it is the escape sequence used to invoke the new coding method, not that used to return from it, that determines the state of the ISO/IEC 2022 coding system upon return. This has particular significance for the UCS, for reasons that will be seen below.

The assignment of particular escape sequences of these forms to particular coding methods is subject to registration in accordance with ISO 2375. The escape sequences so allocated are then published in the International Register of Coded Character Sets to be used with Escape Sequences. The following escape sequences have been registered to designate and invoke the coding methods of the UCS. They are given along with their registration numbers:

ISO-IR 162 allocates ESC 25 2F 40 to designate and invoke UCS-2 coding at implementation level 1 without standard return;
ISO-IR 163 allocates ESC 25 2F 41 to designate and invoke UCS-4 coding at implementation level 1 without standard return;
ISO-IR 174 allocates ESC 25 2F 43 to designate and invoke UCS-2 coding at implementation level 2 without standard return;
ISO-IR 175 allocates ESC 25 2F 44 to designate and invoke UCS-4 coding at implementation level 2 without standard return;
ISO-IR 176 allocates ESC 25 2F 45 to designate and invoke UCS-2 coding at implementation level 3 without standard return;
ISO-IR 177 allocates ESC 25 2F 46 to designate and invoke UCS-4 coding at implementation level 3 without standard return;
ISO-IR 178 allocates ESC 25 42 to designate and invoke UTF-1 coding (now withdrawn from ISO/IEC 10646-1 by Amendment 4) with standard return - the register entry includes the complete specification of UTF-1;
ISO-IR 190 allocates ESC 25 2F 47 to designate and invoke UTF-8 coding at implementation level 1 without standard return;
ISO-IR 191 allocates ESC 25 2F 48 to designate and invoke UTF-8 coding at implementation level 2 without standard return;
ISO-IR 192 allocates ESC 25 2F 49 to designate and invoke UTF-8 coding at implementation level 3 without standard return;
ISO-IR 193 allocates ESC 25 2F 4A to designate and invoke UTF-16 coding at implementation level 1 without standard return;
ISO-IR 194 allocates ESC 25 2F 4B to designate and invoke UTF-16 coding at implementation level 2 without standard return;
ISO-IR 195 allocates ESC 25 2F 4C to designate and invoke UTF-16 coding at implementation level 3 without standard return;
ISO-IR 196 allocates ESC 25 47 to designate and invoke UTF-8 coding at an unspecified implementation level but with standard return.

Note that of the coding methods defined for the UCS, only UTF-8 and (the now withdrawn) UTF-1 have the ability to make use of the standard return, since they are the only coding methods in which the escape sequence for the standard return is not transformed when used with the coding method concerned.

ISO/IEC 10646-1 does specify a means to return to the ISO/IEC 2022 coding system when it has been invoked without standard return. The specified method is by means of the escape sequence ESC 25 40 transformed to the appropriate coding method of the UCS as described above, e.g. 001B 0025 0040 designates a return from UCS-2. This is the same escape sequence as that used for a standard return, but except when used with UTF-8 its coded representation will include additional padding octets with value zero. Even when used with UTF-8, when that coding method has been invoked by means of ISO-IR 190, 191 or 192, it is interpreted for the purposes of ISO/IEC 2022 as a non-standard return, with consequent loss of the state of that coding system as described above.

Top of UCS Guide