Character repertoire and coding transformations -- Part 1: General model for character set transformations

Source: Keld Simonsen
Status: Project 9.1, working draft 5
Date: 1997-10-02

EUROPEAN STANDARD                                                prEN ???? E
NORME EUROPÉENNE
EUROPÄISCHE NORM                                             October 1997
___________________________________________________________________________



ICS : 35.040

Key words: coded character sets, character set conversion, transformation




DRAFT 5

English version
                                          
Information technology -
Character repertoire and coding transformations -
Part 1: General model for character set transformations

Technologies de l'information -             Informationstechnik -
Transformations des jeux de caractères      Transformation von
et des codes de caractères -                Zeichenvorräten und -kodierungen -
Partie 1: Modèle général pour la            Teil 1: Allgemeines Modell für
transformation des jeux de caractères       die Transformation von Zeichenvorräten


                                          





CEN

European Committee for Standardization
Comité Européen de Normalisation
Europäisches Komitee für Normung


Central Secretariat : rue de Stassart, 36 B-1050 Brussels
______________________________________________________________________________
Copyright © CEN 1997                                    Ref. No prEN ???? E
Copyright reserved to all CEN members.

FOREWORD

This European Standard has been prepared by the Technical Committee CEN/TC 304 "Character Set Technology" of which the secretariat is held by STRÍ.

The Standard is only available in English.

Annex A is an informative annex with a bibliography.

CONTENTS


      FOREWORD

1     Scope

2     Normative references

3     Definitions

4     Relation between a character repertoire and its encoding

5     Transformation model
      5.1 Staged description
      5.2 Layered description

6     Character transformation schemes

Annexes:

A     Bibliography (informative)

1 Scope

This Standard specifies the model to be used for coded character transformation schemes, and outlines the need for several transformation schemes to exist.

2 Normative References

This Standard incorporates, by dated or undated reference, provisions from other publications. These normative references are cited at the appropriate places in the text and the publications are listed hereafter. For dated references, subsequent amendments to or revisions of any of these publications apply to this Standard only when incorporated by amendment or revision. For undated references the latest edition of the publication referred to applies.

3 Definitions

For the purpose of this Standard the following definitions apply.

character: A member of a set of elements used for the organization, control, or representation of data (10646).

control character: A control function the coded representation of which consists of a single bit combination (6937).

control function: An element of a character set that controls the recording, processing, transmission or interpretation of data, and that has a coded representation as one or more bit combinations (6937).

graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed or displayed (10646).

character set: A specified set of characters.

coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. (10646)

encoding: The relation from the binary representation via coded character sets to (abstract) characters. The encoding defines the meaning of a binary data stream. It can consist of more than one coded character set, and an encoding scheme can be applied to regulate how these coded character sets are encoded. Symbolic characters can also be encoded in the encoding.

encoding scheme: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation, by combining simpler coded character sets or by encoding a simple coded character set in another way. Examples are ISO/IEC 2022 and UTF-8; the ISO/IEC 2022 coding system works on one or more simple coded character sets (CCS) as registered with ISO 2375. The procedure for forming characters in ISO/IEC 6937 is not a coding system; ISO/IEC 6937 is a (composite) coded character set.

repertoire: A specified set of characters that are represented in a coded character set. (10646)

symbolic character: A character represented by a name or number. Examples are HTML/SGML character entities like &auml;, RFC 1345 mnemonics like &a:, and TeX and groff character names.

transfer-encoding (MIME): a general transformation applicable to a binary stream, to obtain some specific properties of the byte stream. Examples are base64, uuencode, quoted-printable. UTF-8 is not a transfer-encoding.
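
The difference between an encoding scheme and a transfer-encoding can be illustrated with a minimal Python sketch. It is not part of the model; the sample character and the use of the Python standard library are assumptions made for illustration only. UTF-8 maps characters to bytes, whereas base64 maps any byte stream to another byte stream without regard to its meaning.

```python
import base64

text = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, sample chosen for illustration

# Encoding scheme: relates abstract characters to a coded representation.
utf8_bytes = text.encode("utf-8")

# Transfer-encoding: applies to any binary stream, whatever the bytes mean.
transfer = base64.b64encode(utf8_bytes)

# Both steps are reversible, but only the first one involves characters.
assert base64.b64decode(transfer) == utf8_bytes
assert utf8_bytes.decode("utf-8") == text
```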

4 Relation between a character repertoire and its encoding

The encoding of a character repertoire can consist of at least three parts:

1. The (simple, composite or full) coded character sets, for example ISO/IEC 8859-1 combined with the control character set of ISO/IEC 6429.

2. The rules for combining or coding one or more simple coded character sets, for example ISO/IEC 2022 or UTF-8.

3. A symbolic character notation, such as SGML entities of the type &aacute;.
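
The three parts listed above can be seen side by side in the following Python sketch, given for illustration only. The choice of sample character is an assumption, and the HTML entity table of the Python standard library stands in for an SGML entity set.

```python
import html

ch = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE, sample chosen for illustration

# 1. Coded character set: ISO/IEC 8859-1 assigns one bit combination.
in_8859_1 = ch.encode("iso-8859-1")

# 2. Encoding scheme: UTF-8 codes the same character in another way.
in_utf8 = ch.encode("utf-8")

# 3. Symbolic character notation: the character is represented by a name.
symbolic = "&aacute;"

# All three representations denote the same abstract character.
assert in_8859_1.decode("iso-8859-1") == ch
assert in_utf8.decode("utf-8") == ch
assert html.unescape(symbolic) == ch
```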

On top of the encoding, general transformation schemes that are applicable to any binary representation may be applied. These generally applicable schemes include:

5 Transformation model

The general transformation model is described in the following in two ways: as a sequence of stages (5.1) and as a set of layers (5.2). The model can also be seen as a number of processes, with a well-defined conceptual interface for data between them.

Figure 1: Transformation model

5.1 Staged description

The intention of the model is to distinguish the following aspects of character transformation: The character transformation model is based on a communication path where abstract characters are sent from the originating side to the receiving side. The abstract character data are then transformed in a number of stages before the receiving side can interpret them as abstract characters; some of the stages may be optional.

5.1.1 Transmitting side:
5.1.2 Receiving side:

5.2 Layered description

The general models can also be described in layers, each responsible for transforming from one representation of data to another. Starting from the top, the layers are as follows.

5.2.1 Layer 4: (top) abstract string layer.

At this layer, a character string is transferred from the sender to the receiver, passing through a transformation process during the transfer. The transformation process is determined by the receiver, is well defined but need not be reversible. It also need not be on a character-by-character basis; it is a transformation of character strings and context dependence is permitted. At each end the character string has a coded representation according to some internal code of the end system concerned.

This layer is sufficiently general to cover both transliteration and transcription. The sender and receiver may in fact be within the same system, so it also covers local transliteration and transcription.

Specification at this layer is of the transformation process. The internal coding used by each system has no external visibility and does not form part of the specification. No negotiation is required at layer 4 as the transformation is solely of concern to the receiver.

Examples 1 and 2 (the two examples are identical at layer 4)

Sent string is Greek capital letters Epsilon Lambda Omicron Tau, internally represented in the sending system in ISO 8859-7

Received string is Latin capital letters E L O T, internally represented in the receiving system in EBCDIC.

The layer 4 transformation is according to some standard for transliteration from Greek to Latin.
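
The layer 4 example can be sketched in Python as follows. The sketch is illustrative only: the four-entry table is hypothetical and stands in for a real transliteration standard, the mapping is character by character although the model permits context dependence, and cp037 is just one EBCDIC code page picked for the receiving system.

```python
# Hypothetical transliteration table covering only the four letters of the
# example; a real transliteration standard defines the complete mapping.
GREEK_TO_LATIN = {
    "\u0395": "E",  # GREEK CAPITAL LETTER EPSILON
    "\u039B": "L",  # GREEK CAPITAL LETTER LAMDA
    "\u039F": "O",  # GREEK CAPITAL LETTER OMICRON
    "\u03A4": "T",  # GREEK CAPITAL LETTER TAU
}


def layer4_transform(source: str) -> str:
    """Transliterate a Greek string to Latin, one character at a time."""
    return "".join(GREEK_TO_LATIN[ch] for ch in source)


# Sender: the source string as held internally in ISO 8859-7.
sent_internal = "\u0395\u039B\u039F\u03A4".encode("iso8859_7")

# Receiver: the transformed string as held internally in EBCDIC
# (cp037 is an assumption; any EBCDIC code page would do).
received = layer4_transform(sent_internal.decode("iso8859_7"))
received_internal = received.encode("cp037")

print(received)  # ELOT
```

Neither internal coding appears in the specification of the transformation itself, which is expressed purely on character strings.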

5.2.2 Layer 3

At this layer a character string is transferred unchanged from the sender to the receiver. At each end the character string has a coded representation according to some internal code of the end system concerned, generally but not necessarily the same code as used at layer 4. The character string concerned is obtained from the sender's source string of layer 4 by a reversible character transformation, the purpose of which is to change the repertoire for the string to one that is acceptable to both sender and receiver. The character transformation needs to be agreed by both parties and may be the subject of negotiation at layer 3.

Specification at this layer is of the transformation from the source string to the layer 3 string. Both ends need to be aware of this transformation. The sender applies it directly to the source string in passing from layer 4 to layer 3. The receiver, in passing from layer 3 to layer 4, applies a composite constructed from the inverse of the layer 3 transformation followed by the layer 4 transformation. These two transformations may of course be implemented sequentially, but the model considers only the composite as otherwise the receiver would be required to be able to represent internally the original source string.

Example 1:

Sent string is the Latin sequence &EPSILON&LAMBDA&OMICRON&TAU, internally represented in the sending system in ISO 8859-7

Received string is the same, but internally represented in EBCDIC.

The layer 3 transformation is an SGML-style symbolic representation of the source string (the notation shown is illustrative and is not actual SGML syntax).

Example 2:

The sent string is the source string, internally represented in ISO 8859-7.

The received string is the source string, internally represented in UCS-2 of ISO/IEC 10646.

The layer 3 transformation is null in this example.
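
The reversible layer 3 transformation of example 1, and the composite applied by the receiver, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the name table and the layer 4 table are hypothetical and cover only the four letters of the example, and the `&NAME` notation is the illustrative one used above, not real SGML. A complete scheme would also need a rule for a literal "&" in the source string in order to stay reversible.

```python
import re

# Hypothetical symbolic names for the four letters of the example.
NAMES = {
    "\u0395": "EPSILON",
    "\u039B": "LAMBDA",
    "\u039F": "OMICRON",
    "\u03A4": "TAU",
}
CHARS = {name: ch for ch, name in NAMES.items()}

# Hypothetical layer 4 transliteration table (Greek to Latin).
GREEK_TO_LATIN = {"\u0395": "E", "\u039B": "L", "\u039F": "O", "\u03A4": "T"}


def to_layer3(source: str) -> str:
    """Sender: replace each source character by its symbolic name."""
    return "".join("&" + NAMES[ch] for ch in source)


def from_layer3(layer3: str) -> str:
    """Inverse of to_layer3: recover the source characters from the names."""
    return "".join(CHARS[name] for name in re.findall(r"&([A-Z]+)", layer3))


def receiver_composite(layer3: str) -> str:
    """Receiver: inverse of the layer 3 transformation, then layer 4."""
    return "".join(GREEK_TO_LATIN[ch] for ch in from_layer3(layer3))


source = "\u0395\u039B\u039F\u03A4"
layer3 = to_layer3(source)

print(layer3)                      # &EPSILON&LAMBDA&OMICRON&TAU
print(receiver_composite(layer3))  # ELOT
```

The sketch implements the composite as two sequential steps for readability; the model itself considers only the composite.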

5.2.3 Layer 2:

At this layer a byte sequence is transferred unchanged from the sender to the receiver. It is the coded representation of the layer 3 character string in some code that is acceptable to both sender and receiver. The code to be used needs to be agreed by both parties and may be the subject of negotiation at layer 2.

In going from layer 3 to layer 2, the sender is applying a code transformation from internal code to the agreed external code, and conversely in going from layer 2 to layer 3 the receiver is applying a code transformation from external code to internal code. Any problems associated with the code transformation, such as ambiguity in the selection of corresponding characters in the two codes, are a matter solely for the end system concerned. There is no external visibility of the code transformations, only of the transfer code itself.

Example 1:

The byte string transferred is the coding of the layer 3 string in 7-bit ISO 646.

Example 2:

The byte string transferred is the coding of the layer 3 string (i.e. the source string in this case) in UTF-8 of ISO/IEC 10646.
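
The two layer 2 examples can be sketched in Python as follows, for illustration only. Several stand-ins are assumptions: the "ascii" codec represents 7-bit ISO 646 (its international reference version), cp037 represents the EBCDIC internal code of the receiver in example 1, and UTF-16 big-endian represents UCS-2, which coincides with it for the characters used here.

```python
# Layer 3 strings of the two examples.
layer3_ex1 = "&EPSILON&LAMBDA&OMICRON&TAU"
layer3_ex2 = "\u0395\u039B\u039F\u03A4"

# Sender: internal code to the agreed external (transfer) code.
wire_ex1 = layer3_ex1.encode("ascii")   # stand-in for 7-bit ISO 646
wire_ex2 = layer3_ex2.encode("utf-8")

# Receiver: external code to its own internal code.
internal_ex1 = wire_ex1.decode("ascii").encode("cp037")      # EBCDIC stand-in
internal_ex2 = wire_ex2.decode("utf-8").encode("utf-16-be")  # UCS-2 stand-in

# Only wire_ex1 and wire_ex2 are externally visible; the layer 3 string
# itself arrives unchanged at the receiver.
assert internal_ex1.decode("cp037") == layer3_ex1
assert internal_ex2.decode("utf-16-be") == layer3_ex2
```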

5.2.4 Layer 1: Binary layer

The data transferred at layer 1 is obtained from the byte sequence of layer 2 by a transformation of binary data that is independent of the semantic content of the data. This represents compression, encryption or similar transformations at the binary layer and may be the subject of negotiation at layer 1.

Examples 1 and 2:

Each semi-octet is converted into the ISO 646 representation of the corresponding hexadecimal digit.
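
The layer 1 transformation of the examples can be sketched in Python as follows. The function names are hypothetical, and the use of upper-case hexadecimal digits is an assumption, since the example does not state the case. The transformation operates on bytes only and is independent of what they encode.

```python
def to_layer1(data: bytes) -> bytes:
    """Convert each semi-octet to the ISO 646 (ASCII) code of its hex digit."""
    return data.hex().upper().encode("ascii")


def from_layer1(data: bytes) -> bytes:
    """Inverse transformation: pairs of hex digits back to octets."""
    return bytes.fromhex(data.decode("ascii"))


layer2 = "\u0395\u039B\u039F\u03A4".encode("utf-8")  # layer 2 bytes of example 2
layer1 = to_layer1(layer2)

print(layer1)  # b'CE95CE9BCE9FCEA4'
assert from_layer1(layer1) == layer2
```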

6 Character transformation schemes

A representation of characters can be transformed into another representation in a number of ways, each having its own qualities and drawbacks. A number of transformation schemes exist on the market, and to facilitate transformation to, from and between these schemes, separate specifications of the individual schemes are needed.

Annex A: Bibliography (informative)