EUROPEAN STANDARD
NORME EUROPÉENNE
EUROPÄISCHE NORM

prEN ???? E
October 1997

ICS: 35.040

Key words: coded character sets, character set conversion, transformation

DRAFT 5

English version

Information technology - Character repertoire and coding transformations - Part 1: General model for character set transformations

Technologies de l'information - Transformations des jeux des charactères et des codes des charactères - Part 1: Modelle generelle pour la transformation des jeux des charactères

Informationstechnik - Transformation des Zeigenvorraten und -kodierungen - Teil 1: Allgemeine Modelle für das Transformation des Zeigenvorraten

CEN
European Committee for Standardization
Comité Européen de Normalisation
Europäisches Komitee für Normung

Central Secretariat: rue de Stassart, 36 B-1050 Brussels

Copyright © CEN 1997 Ref. No prEN ???? E
Copyright reserved to all CEN members.

FOREWORD
This European Standard has been prepared by the Technical Committee CEN/TC 304 "Character Set Technology" of which the secretariat is held by STRÍ.
The Standard is only available in English.
Annex A is an informative annex with a bibliography.
CONTENTS

FOREWORD
1 Scope
2 Normative references
3 Definitions
4 Relation between a character repertoire and its encoding
5 Transformation model
5.1 Staged description
5.2 Layered description
6 Character transformation schemes

Annexes:
A Bibliography (informative)
character: A member of a set of elements used for the organization, control, or representation of data. (10646)
control function: An element of a character set that controls the recording, processing, transmission or interpretation of data, and that has a coded representation as one or more bit combinations. (6937)
graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed or displayed (10646).
character set: A specified set of characters.
coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. (10646)
NOTE 2. In the ISO 8859 series a CCS can consist of, for example, two CCSs registered under ISO 2375. For example, ISO 8859-1 is defined as ISO 2375 registration number 6 and registration number 100; it does not have control characters. ISO 646 IRV (US-ASCII) is defined as the graphical CCS registration number 6 and the control CCS registration number 1. ISO 6937 is defined as ISO 2375 registration number 2 and then registration number 101, with some specific rules for combining two octets into one character. Such a set could be called a composite coded character set, as it may consist of several simple coded character sets.
NOTE 3. In the ISO 9945 (POSIX) standards, a CCS is taken to be a description of the whole data stream, that is, including both the control characters and the graphic characters. This is close to the MIME "charset" definition, but POSIX does not (yet) have specification techniques to define state-dependent encodings such as ISO 2022-based encodings. POSIX can handle ISO 6937, Shift-JIS and UTF-8 with its charmap specification technique. Such a set could be called a fully coded character set, as all bit combinations are defined.
encoding: The relation from the binary representation, via coded character sets, to (abstract) characters. The encoding defines the meaning of a binary data stream. It can consist of more than one coded character set, and an encoding scheme can be applied to regulate how these coded character sets are encoded. Symbolic characters can also be encoded in the encoding.
encoding scheme: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation, by combining simpler coded character sets or by encoding a simple coded character set in another way. Examples are ISO 2022 and UTF-8; the ISO 2022 coding system works on one or more simple CCSs as registered with ISO 2375. The procedure for forming characters in ISO 6937 is not a coding system; it is a (composite) coded character set.
repertoire: A specified set of characters that are represented in a coded character set. (10646)
symbolic character: A character represented by a name or number. Examples are HTML/SGML character entities like &auml;, RFC 1345 mnemonics like &a:, and TeX and groff character names.
transfer-encoding (MIME): A general transformation applicable to a binary stream, applied to give the byte stream some specific properties. Examples are base64, uuencode and quoted-printable. UTF-8 is not a transfer-encoding.
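The distinction between an encoding scheme and a transfer-encoding can be illustrated with a minimal Python sketch. It assumes the standard-library codecs `utf-8` and `iso8859_1` and the modules `base64` and `quopri`; the variable names are illustrative only and are not part of this standard.

```python
import base64
import quopri

text = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, one abstract character

# Encoding: maps characters to bytes, so it must know about characters.
utf8_bytes = text.encode("utf-8")          # two octets
latin1_bytes = text.encode("iso8859_1")    # one octet

# Transfer-encoding: maps bytes to bytes and is unaware of characters.
b64 = base64.b64encode(utf8_bytes)
qp = quopri.encodestring(utf8_bytes)

# Undoing the transfer-encoding yields the same bytes again; the character
# is only recovered by decoding with the encoding that produced the bytes.
assert base64.b64decode(b64) == utf8_bytes
assert quopri.decodestring(qp) == utf8_bytes
assert base64.b64decode(b64).decode("utf-8") == text
```

The transfer-encodings would work equally well on `latin1_bytes`, which is why UTF-8 belongs with the encoding schemes and base64 does not.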
1. The (simple, composite or full) coded character sets, for example ISO/IEC 8859-1 combined with the control character set of ISO/IEC 6429.
2. The rules for combining or coding one or more simple coded character sets, for example ISO/IEC 2022 or UTF-8.
3. A symbolic character notation, like SGML entities of the type &aacute;.
On top of the encoding, general transformation schemes that are applicable to any binary representation may be applied. These generally applicable schemes include:
Figure 1: Transformation model
At this layer, a character string is transferred from the sender to the receiver, passing through a transformation process during the transfer. The transformation process is determined by the receiver, is well defined but need not be reversible. It also need not be on a character-by-character basis; it is a transformation of character strings and context dependence is permitted. At each end the character string has a coded representation according to some internal code of the end system concerned.
This layer is sufficiently general to cover both transliteration and transcription. The sender and receiver may in fact be within the same system, so it also covers local transliteration and transcription.
Specification at this layer is of the transformation process. The internal coding used by each system has no external visibility and does not form part of the specification. No negotiation is required at layer 4 as the transformation is solely of concern to the receiver.
Examples 1 and 2 (the two examples are identical at layer 4)
Sent string is Greek capital letters Epsilon Lambda Omicron Tau, internally represented in the sending system in ISO 8859-7.
Received string is Latin capital letters E L O T, internally represented in the receiving system in EBCDIC.
The layer 4 transformation is according to some standard for transliteration from Greek to Latin.
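The layer 4 example can be sketched in Python as follows. The four-entry table `GREEK_TO_LATIN` is a hypothetical stand-in for a real transliteration standard, and the codec `cp037` is assumed here as one possible EBCDIC code page; neither choice is prescribed by this model.

```python
# Hypothetical, deliberately tiny transliteration table. A real table would
# come from a transliteration standard and may be context dependent.
GREEK_TO_LATIN = {
    "\u0395": "E",  # GREEK CAPITAL LETTER EPSILON
    "\u039b": "L",  # GREEK CAPITAL LETTER LAMDA (Unicode spelling)
    "\u039f": "O",  # GREEK CAPITAL LETTER OMICRON
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
}

def transliterate(source: str) -> str:
    """Layer 4 transformation, chosen by the receiver; not reversible in general."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in source)

source = "\u0395\u039b\u039f\u03a4"

# Internal codes of the two end systems; neither is externally visible.
sender_internal = source.encode("iso8859_7")
received = transliterate(source)
receiver_internal = received.encode("cp037")  # assumed EBCDIC code page

print(received)  # ELOT
```

Only `transliterate` is subject to specification at layer 4; the two internal byte strings differ but play no part in it.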
At this layer a character string is transferred unchanged from the sender to the receiver. At each end the character string has a coded representation according to some internal code of the end system concerned, generally but not necessarily the same code as used at layer 4. The character string concerned is obtained from the sender's source string of layer 4 by a reversible character transformation, the purpose of which is to change the repertoire for the string to one that is acceptable to both sender and receiver. The character transformation needs to be agreed by both parties and may be the subject of negotiation at layer 3.
Specification at this layer is of the transformation from the source string to the layer 3 string. Both ends need to be aware of this transformation. The sender applies it directly to the source string in passing from layer 4 to layer 3. The receiver, in passing from layer 3 to layer 4, applies a composite constructed from the inverse of the layer 3 transformation followed by the layer 4 transformation. These two transformations may of course be implemented sequentially, but the model considers only the composite as otherwise the receiver would be required to be able to represent internally the original source string.
Example 1:
Sent string is the Latin sequence &EPSILON&LAMBDA&OMICRON&TAU, internally represented in the sending system in ISO 8859-7.
Received string is the same, but internally represented in EBCDIC.
The layer 3 transformation is an SGML-style representation of the source string (the notation is illustrative only and is not claimed to be valid SGML).
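A minimal Python sketch of Example 1 at layer 3 is given here. The tables `LETTER_NAMES` and `NAME_TO_LATIN` are hypothetical and cover only the four letters of the example; the `&NAME` notation follows the example and is not real SGML. The function `receive` illustrates the composite (inverse of the layer 3 transformation followed by the layer 4 transformation) applied without ever representing the Greek source string.

```python
# Hypothetical name table for the four letters of the example only.
LETTER_NAMES = {
    "\u0395": "EPSILON",
    "\u039b": "LAMBDA",
    "\u039f": "OMICRON",
    "\u03a4": "TAU",
}
NAME_TO_LETTER = {name: ch for ch, name in LETTER_NAMES.items()}

# Hypothetical composite table: symbolic name -> transliterated Latin letter.
NAME_TO_LATIN = {"EPSILON": "E", "LAMBDA": "L", "OMICRON": "O", "TAU": "T"}

def split_names(layer3: str) -> list[str]:
    """Split '&A&B' into ['A', 'B']; every name must be introduced by '&'."""
    if layer3 and not layer3.startswith("&"):
        raise ValueError("layer 3 string must start with '&'")
    return layer3.split("&")[1:]

def to_layer3(source: str) -> str:
    """Sender: reversible transformation into a Latin-only repertoire."""
    return "".join("&" + LETTER_NAMES[ch] for ch in source)

def from_layer3(layer3: str) -> str:
    """Inverse of to_layer3, shown only to demonstrate reversibility."""
    return "".join(NAME_TO_LETTER[name] for name in split_names(layer3))

def receive(layer3: str) -> str:
    """Receiver: composite of the inverse and the layer 4 transliteration."""
    return "".join(NAME_TO_LATIN[name] for name in split_names(layer3))

source = "\u0395\u039b\u039f\u03a4"
layer3 = to_layer3(source)
print(layer3)           # &EPSILON&LAMBDA&OMICRON&TAU
print(receive(layer3))  # ELOT
```

The sketch assumes the source contains only the four tabulated letters; a complete scheme would also need a rule for a literal "&" in the source string to remain reversible.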
Example 2:
The sent string is the source string, internally represented in ISO 8859-7.
Received string is the source string, internally represented in UCS-2 of ISO 10646.
The layer 3 transformation is null in this example.
At this layer a byte sequence is transferred unchanged from the sender to the receiver. It is the coded representation of the layer 3 character string in some code that is acceptable to both sender and receiver. The code to be used needs to be agreed by both parties and may be the subject of negotiation at layer 2.
In going from layer 3 to layer 2, the sender applies a code transformation from its internal code to the agreed external code; conversely, in going from layer 2 to layer 3, the receiver applies a code transformation from the external code to its internal code. Any problems associated with the code transformation, such as ambiguity in the selection of corresponding characters in the two codes, are a matter solely for the end system concerned. There is no external visibility of the code transformations, only of the transfer code itself.
Example 1:
The byte string transferred is the coding of the layer 3 string in 7-bit ISO 646.
Example 2:
The byte string transferred is the coding of the layer 3 string (i.e. the source string in this case) in UTF-8 of ISO 10646.
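The layer 2 step of both examples can be sketched in Python as follows. The codec names are assumptions of this sketch: `ascii` stands in for 7-bit ISO 646 (IRV), `cp037` for one EBCDIC code page, and `utf-16-be` for UCS-2, with which it coincides for the characters used here.

```python
# Layer 3 strings of the two examples.
layer3_example1 = "&EPSILON&LAMBDA&OMICRON&TAU"
layer3_example2 = "\u0395\u039b\u039f\u03a4"

# Sender: internal code -> agreed external (transfer) code.
wire1 = layer3_example1.encode("ascii")  # assumed stand-in for 7-bit ISO 646
wire2 = layer3_example2.encode("utf-8")

# Receiver: external code -> its own internal code.
recv1 = wire1.decode("ascii").encode("cp037")      # assumed EBCDIC code page
recv2 = wire2.decode("utf-8").encode("utf-16-be")  # UCS-2 for these characters

print(wire2.hex())  # ce95ce9bce9fcea4
print(recv2.hex())  # 0395039b039f03a4
```

Only `wire1` and `wire2` are externally visible; the internal codes at either end could be changed without affecting the agreement made at layer 2.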
The data transferred at layer 1 is obtained from the byte sequence of layer 2 by a transformation of binary data that is independent of the semantic content of the data. This represents compression, encryption or similar transformations at the binary layer and may be the subject of negotiation at layer 1.
Examples 1 and 2:
Each semi-octet is converted into the ISO 646 representation of the corresponding hexadecimal digit.
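The layer 1 transformation of the examples can be sketched in Python as follows. The function names are illustrative, and the use of upper-case hexadecimal digits is an assumption of this sketch, since the examples do not state the case.

```python
def to_layer1(data: bytes) -> bytes:
    """Replace each semi-octet by the ISO 646 code of its hexadecimal digit."""
    return data.hex().upper().encode("ascii")  # upper case is an assumption

def from_layer1(data: bytes) -> bytes:
    """Inverse transformation: pairs of hexadecimal digits back to octets."""
    return bytes.fromhex(data.decode("ascii"))

layer2 = bytes.fromhex("ce95ce9bce9fcea4")  # UTF-8 byte string of Example 2
layer1 = to_layer1(layer2)

print(layer1)  # b'CE95CE9BCE9FCEA4'
assert from_layer1(layer1) == layer2
```

The transformation never inspects what the octets mean, which is what places it at layer 1 rather than layer 2.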