General model for character set transformations

Character repertoire and coding transformations -- Part 1: General model for character set transformations

Source: Keld Simonsen
Status: Project 9.1, working draft 5
Date: 1997-10-02


EUROPEAN STANDARD                                                prEN ???? E
NORME EUROPÉENNE
EUROPÄISCHE NORM                                             October 1997
___________________________________________________________________________



ICS : 35.040

Key words: coded character sets, character set conversion, transformation




DRAFT 5

English version
                                          
Information technology -
Character repertoire and coding transformations -
Part 1: General model for character set transformations

Technologies de l'information -             Informationstechnik -  
Transformations des jeux de caractères      Transformation von
et des codes de caractères -             Zeichenvorräten und -kodierungen -
Partie 1: Modèle général pour la            Teil 1: Allgemeines Modell für
transformation des jeux de caractères       die Transformation von Zeichenvorräten


                                          





CEN

European Committee for Standardization
Comité Européen de Normalisation
Europäisches Komitee für Normung


Central Secretariat : rue de Stassart, 36 B-1050 Brussels
______________________________________________________________________________
Copyright © CEN 1997                                    Ref. No prEN ???? E
Copyright reserved to all CEN members.

FOREWORD

This European Standard has been prepared by the Technical Committee CEN/TC 304 "Character Set Technology" of which the secretariat is held by STRÍ.

The Standard is only available in English.

Annex A is an informative annex with a bibliography.

CONTENTS


      FOREWORD

1     Scope

2     Normative references

3     Definitions

4     Relation between a character repertoire and its encoding

5     Transformation model
      5.1 Staged description
      5.2 Layered description

6     Character transformation schemes

Annexes:

A     Bibliography (informative)

1 Scope

This Standard specifies the model to be used for coded character transformation schemes, and outlines why several transformation schemes need to exist.

2 Normative references

This Standard incorporates, by dated or undated reference, provisions from other publications. These normative references are cited at the appropriate places in the text and the publications are listed hereafter. For dated references, subsequent amendments to or revisions of any of these publications apply to this Standard only when incorporated by amendment or revision. For undated references, the latest edition of the publication referred to applies.

ISO/IEC 2022 Information Technology - ISO 7-bit and 8-bit coded character sets - Code extension techniques.
ISO 2375 Data processing - Procedure for registration of escape sequences
ISO/IEC 6429 Information Technology - Control functions for coded character sets
ISO/IEC 6937 Information Technology - Coded graphic character set for text communication - Latin alphabet
ISO/IEC 8859 Information Technology - 8-bit single-byte coded graphic character sets (all parts)
ISO/IEC 9945 Information Technology - Portable Operating System Interface (POSIX) (all parts)
ISO/IEC 10646 Information technology - Universal Multiple-Octet Coded Character Set (UCS) (all parts)
ISO/IEC CD 14652 Information technology - Cultural Conventions Specification Standard
CEN ENV 12005:1996: Information Technology - Procedures for European Registration of Cultural Elements.

3 Definitions

For the purpose of this Standard the following definitions apply.

character: a member of a set of elements used for the organization, control, or representation of data. (10646)

NOTE 1. Characters convey some sound or other symbolic representation of some meaning; they include special characters and control characters. Characters have no inherent encoding, and are therefore sometimes called abstract characters.

control character: A control function the coded representation of which consists of a single bit combination (6937).

control function: An element of a character set that controls the recording, processing, transmission or interpretation of data, and that has a coded representation as one or more bit combinations (6937).

graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed or displayed (10646).

character set: A specified set of characters.

coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. (10646)

NOTE 2. In the ISO 8859 series a CCS can consist of, for example, two CCSs registered under ISO 2375. For example, ISO 8859-1 is defined as ISO 2375 registration number 6 and number 100; it does not have control characters. ISO 646 IRV (US-ASCII) is defined as the graphic CCS registration number 6 and the control CCS registration number 1. ISO 6937 is defined as ISO 2375 registration number 2 and then registration number 101, with some specific rules for combinations of two octets into one character. This could be called a composite coded character set, as it may consist of several simple coded character sets.

NOTE 3. In the ISO 9945 (POSIX) standards, a CCS is taken to be a description of the whole data stream, that is, including both the control characters and the graphic characters. This is close to the MIME "charset" definition, but POSIX does not (yet) have specification techniques to define state-dependent encodings such as ISO/IEC 2022-based encodings. POSIX can handle ISO 6937, Shift-JIS and UTF-8 with its charmap specification technique. This could be called a fully coded character set, as all bit combinations are defined.

encoding: the relation from the binary representation via coded character sets to (abstract) characters. The encoding defines the meaning of a binary data stream. It can consist of more than one coded character set, and an encoding scheme can be applied to regulate how these coded character sets are encoded. Symbolic characters can also be encoded in the encoding.

encoding scheme: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation, by combining simpler coded character sets or by encoding a simple coded character set in another way. Examples are ISO/IEC 2022 and UTF-8; the ISO/IEC 2022 coding system works on one or more simple CCSs as registered under ISO 2375. The procedure for forming characters in ISO 6937 is not a coding system; ISO 6937 is a (composite) coded character set.

repertoire: A specified set of characters that are represented in a coded character set. (10646)

finite

symbolic character: a character represented by a name or number. Examples are HTML/SGML character entities like &auml;, RFC 1345 mnemonics like &a:, and TeX and groff character names.

transfer-encoding (MIME): a general transformation applicable to a binary stream, to obtain some specific properties of the byte stream. Examples are base64, uuencode, quoted-printable. UTF-8 is not a transfer-encoding.

4 Relation between a character repertoire and its encoding

The encoding of a character repertoire can consist of at least three parts:

1. The (simple, composite or full) coded character sets, for example ISO/IEC 8859-1 combined with the control character set of ISO/IEC 6429.

2. The rules for combining or coding one or more simple coded character sets, for example ISO/IEC 2022 or UTF-8.

3. A symbolic character notation, like SGML entities of the type &aacute;

On top of the encoding, general transformation schemes that are applicable to any binary representation may be applied. These generally applicable schemes include:

compression schemes (examples: zip, gzip)
encryption schemes (example: PGP)
safer passage schemes, such as avoiding the 8th bit or byte values 0-31 (examples: base64, uuencode)
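
The separation between the encoding and the generally applicable schemes can be sketched as follows. This is a minimal illustration in Python and not part of the model: it assumes ISO/IEC 8859-1 as the coded character set, numeric character references as the symbolic notation, zlib as the compression scheme and base64 as the safer passage scheme. All function names are hypothetical.

```python
import base64
import html
import zlib

def encode_text(text: str) -> bytes:
    """Encoding: coded character set plus a symbolic notation for the rest."""
    # Characters outside ISO/IEC 8859-1 become numeric references (&#NNNN;).
    return text.encode("iso8859_1", errors="xmlcharrefreplace")

def decode_text(data: bytes) -> str:
    # Caveat of this sketch: a literal "&#...;" in the original text is not
    # escaped, so it would also be expanded here.
    return html.unescape(data.decode("iso8859_1"))

def wrap(data: bytes) -> bytes:
    """General schemes: they apply to any byte stream, whatever it means."""
    return base64.b64encode(zlib.compress(data))

def unwrap(data: bytes) -> bytes:
    return zlib.decompress(base64.b64decode(data))

original = "blåbærgrød 5 €"
encoded = encode_text(original)      # the euro sign is carried symbolically
wire = wrap(encoded)                 # only 7-bit printable bytes remain
restored = decode_text(unwrap(wire))
```

The wrapping functions never inspect the characters, which is why they sit on top of the encoding rather than inside it.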

5 Transformation model

The general transformation model is described in the following in two ways:

a staged approach, where each step in the process from the sender to the receiver is described, and
a layered approach, where the equivalences on the sending and receiving sides are explained.

The model can also be seen as a number of processes, with a well defined conceptual interface for data between them.

Figure 1: Transformation model

5.1 Staged description

The intention of the model is to distinguish the following aspects of character transformation:

Transformations for human use that need not be reversible (horizontal, on the receiving side)
Transformations for repertoire reduction that need to be reversible (vertical)
Repertoire-preserving transformations of encoding method
Repertoire-independent (in fact, semantic independent) transformations of binary data

The character transformation model is based on a communication path where abstract characters are sent from the originating side to the receiving side. The abstract character data are transformed in a number of stages before the receiving side can use them as abstract characters; some of the stages may be optional.

5.1.1 Transmitting side:

stage 1: abstract characters (repertoire) to abstract characters with symbolic character names
stage 2: abstract characters with symbolic character names to coded characters
stage 3: coded characters to encoding system (e.g. UTF-8 or UTF-16)
stage 4: encoding system to bits for transmission, applying transformations like compression and encryption (outside this model)

5.1.2 Receiving side:

stage 5: decryption, decompression (outside this model)
stage 6: decoding the encoding system
stage 7: converting from one coded character set to another (e.g. ISO 8859-1 to CP850), including error processing for unrecognized characters and representing characters symbolically
stage 8: converting into abstract strings with symbolic character names
stage 9: converting abstract characters to other abstract characters using a limited repertoire (e.g. because of limited hardware capabilities, or a transliteration or transcription scheme employed by a user)
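
The stages can be read as a chain of functions. The following Python sketch is illustrative only: it assumes UTF-8 as the encoding system, zlib as the binary transformation, CP850 as the receiver's code, and numeric character references as the receiver's symbolic notation; the names send and receive are hypothetical.

```python
import zlib

def send(text: str) -> bytes:
    coded = text                      # stages 1-2: a Python str already holds coded characters
    encoded = coded.encode("utf-8")   # stage 3: coded characters to encoding system
    return zlib.compress(encoded)     # stage 4: binary transformation (outside the model)

def receive(wire: bytes, local_code: str = "cp850") -> str:
    encoded = zlib.decompress(wire)   # stage 5: undo the binary transformation
    coded = encoded.decode("utf-8")   # stage 6: decode the encoding system
    # stage 7: convert to the local coded character set; characters it
    # lacks are represented symbolically instead of being dropped.
    local = coded.encode(local_code, errors="xmlcharrefreplace")
    return local.decode(local_code)   # stage 8: abstract string with symbolic names

received = receive(send("Århus 5 €"))
```

Here the letter Å survives the conversion because CP850 contains it, while the euro sign arrives as the symbolic reference &#8364;.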

5.2 Layered description

The general models can also be described in layers, each responsible for transforming from one representation of data to another. Starting from the top, the layers are as follows.

5.2.1 Layer 4: (top) abstract string layer.

At this layer, a character string is transferred from the sender to the receiver, passing through a transformation process during the transfer. The transformation process is determined by the receiver, is well defined but need not be reversible. It also need not be on a character-by-character basis; it is a transformation of character strings and context dependence is permitted. At each end the character string has a coded representation according to some internal code of the end system concerned.

This layer is sufficiently general to cover both transliteration and transcription. The sender and receiver may in fact be within the same system, so it also covers local transliteration and transcription.

Specification at this layer is of the transformation process. The internal coding used by each system has no external visibility and does not form part of the specification. No negotiation is required at layer 4 as the transformation is solely of concern to the receiver.

Examples 1 and 2 (the two examples are identical at layer 4)

Sent string is Greek capital letters Epsilon Lambda Omicron Tau, internally represented in the sending system in ISO 8859-7

Received string is Latin capital letters E L O T, internally represented in the receiving system in EBCDIC.

The layer 4 transformation is according to some standard for transliteration from Greek to Latin.
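
As an illustration of the layer 4 example, the Python sketch below uses a four-entry table in place of a real transliteration standard, and assumes code page 037 as the receiver's EBCDIC code. Both the table and the choice of code page are assumptions made for this sketch.

```python
# Hypothetical excerpt of a Greek-to-Latin transliteration table.
GREEK_TO_LATIN = {
    "\u0395": "E",  # GREEK CAPITAL LETTER EPSILON
    "\u039b": "L",  # GREEK CAPITAL LETTER LAMDA
    "\u039f": "O",  # GREEK CAPITAL LETTER OMICRON
    "\u03a4": "T",  # GREEK CAPITAL LETTER TAU
}

def transliterate(text: str) -> str:
    """Layer 4 transformation: string to string, not required to be reversible."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)

# Sender's internal representation (ISO 8859-7) of the source string.
sent_internal = "\u0395\u039b\u039f\u03a4".encode("iso8859_7")

# Receiver's result, stored in its own internal code (EBCDIC, cp037 assumed).
received_string = transliterate(sent_internal.decode("iso8859_7"))
received_internal = received_string.encode("cp037")
```

Neither internal code appears in the specification of the transformation itself; only the mapping from the sent string to the received string does.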

5.2.2 Layer 3

At this layer a character string is transferred unchanged from the sender to the receiver. At each end the character string has a coded representation according to some internal code of the end system concerned, generally but not necessarily the same code as used at layer 4. The character string concerned is obtained from the sender's source string of layer 4 by a reversible character transformation, the purpose of which is to change the repertoire for the string to one that is acceptable to both sender and receiver. The character transformation needs to be agreed by both parties and may be the subject of negotiation at layer 3.

Specification at this layer is of the transformation from the source string to the layer 3 string. Both ends need to be aware of this transformation. The sender applies it directly to the source string in passing from layer 4 to layer 3. The receiver, in passing from layer 3 to layer 4, applies a composite constructed from the inverse of the layer 3 transformation followed by the layer 4 transformation. These two transformations may of course be implemented sequentially, but the model considers only the composite as otherwise the receiver would be required to be able to represent internally the original source string.

Example 1:

Sent string is the Latin sequence &EPSILON&LAMBDA&OMICRON&TAU, internally represented in the sending system in ISO 8859-7

Received string is the same, but internally represented in EBCDIC.

The layer 3 transformation is an SGML-style symbolic representation of the source string; the notation is illustrative only and is not actual SGML syntax.

Example 2:

The sent string is the source string, internally represented in ISO 8859-7

Received string is the source string, internally represented in UCS-2 of ISO/IEC 10646.

The layer 3 transformation is null in this example.
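
The layer 3 transformation of example 1, and the composite applied by the receiver, can be sketched in Python as follows. The &NAME notation follows the example above; the tables cover only the four letters of the example and all names are hypothetical. The sketch does not escape a literal "&" in the source string, which a complete reversible scheme would have to do.

```python
import re

NAME_OF = {"\u0395": "EPSILON", "\u039b": "LAMBDA",
           "\u039f": "OMICRON", "\u03a4": "TAU"}
CHAR_OF = {name: ch for ch, name in NAME_OF.items()}
LATIN_OF = {"EPSILON": "E", "LAMBDA": "L", "OMICRON": "O", "TAU": "T"}

# Matches one symbolic name, e.g. "&EPSILON"; longer names are tried first.
TOKEN = re.compile("&(" + "|".join(sorted(CHAR_OF, key=len, reverse=True)) + ")")

def to_layer3(source: str) -> str:
    """Sender: reduce the repertoire by naming the Greek letters."""
    return "".join("&" + NAME_OF[ch] if ch in NAME_OF else ch for ch in source)

def from_layer3(layer3: str) -> str:
    """Inverse of the layer 3 transformation."""
    return TOKEN.sub(lambda m: CHAR_OF[m.group(1)], layer3)

def receive_composite(layer3: str) -> str:
    """Receiver: inverse of layer 3 composed with the layer 4 transliteration."""
    return TOKEN.sub(lambda m: LATIN_OF[m.group(1)], layer3)

source = "\u0395\u039b\u039f\u03a4"
layer3 = to_layer3(source)
```

The function receive_composite never builds the Greek string, which matches the remark that the receiver need not be able to represent the original source string internally.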

5.2.3 Layer 2:

At this layer a byte sequence is transferred unchanged from the sender to the receiver. It is the coded representation of the layer 3 character string in some code that is acceptable to both sender and receiver. The code to be used needs to be agreed by both parties and may be the subject of negotiation at layer 2.

In going from layer 3 to layer 2, the sender applies a code transformation from its internal code to the agreed external code; conversely, in going from layer 2 to layer 3, the receiver applies a code transformation from the external code to its internal code. Any problems associated with the code transformation, such as ambiguity in the selection of corresponding characters in the two codes, are a matter solely for the end system concerned. There is no external visibility of the code transformations, only of the transfer code itself.

Example 1:

The byte string transferred is the coding of the layer 3 string in 7-bit ISO 646.

Example 2:

The byte string transferred is the coding of the layer 3 string (i.e. the source string in this case) in UTF-8 of ISO/IEC 10646.
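
Both examples can be traced in a short Python sketch. The helper names are hypothetical, UTF-16 big-endian stands in for UCS-2 (the two coincide for these characters), and code page 037 is assumed as the EBCDIC code.

```python
def to_transfer_code(internal: bytes, internal_code: str, transfer_code: str) -> bytes:
    """Sender, layer 3 to layer 2: internal code to agreed external code."""
    return internal.decode(internal_code).encode(transfer_code)

def from_transfer_code(wire: bytes, transfer_code: str, internal_code: str) -> bytes:
    """Receiver, layer 2 to layer 3: external code to internal code."""
    return wire.decode(transfer_code).encode(internal_code)

# Example 1: symbolic string, ISO 8859-7 -> 7-bit ISO 646 -> EBCDIC.
internal1 = "&EPSILON&LAMBDA&OMICRON&TAU".encode("iso8859_7")
wire1 = to_transfer_code(internal1, "iso8859_7", "ascii")
received1 = from_transfer_code(wire1, "ascii", "cp037")

# Example 2: Greek source string, ISO 8859-7 -> UTF-8 -> UCS-2.
internal2 = b"\xc5\xcb\xcf\xd4"  # Epsilon Lambda Omicron Tau in ISO 8859-7
wire2 = to_transfer_code(internal2, "iso8859_7", "utf-8")
received2 = from_transfer_code(wire2, "utf-8", "utf-16-be")
```

Only wire1 and wire2 are externally visible; the internal codes on either side could be changed without affecting the other party.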

5.2.4 Layer 1: Binary layer

The data transferred at layer 1 is obtained from the byte sequence of layer 2 by a transformation of binary data that is independent of the semantic content of the data. This represents compression, encryption or similar transformations at the binary layer and may be the subject of negotiation at layer 1.

Examples 1 and 2:

Each semi-octet is converted into the ISO 646 representation of the corresponding hexadecimal digit.
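
A minimal Python sketch of this layer 1 transformation is given below. The function names are hypothetical, and upper-case hexadecimal digits are an assumption, since the example does not state the case.

```python
def to_layer1(layer2: bytes) -> bytes:
    """Write each semi-octet as one hexadecimal digit coded in ISO 646."""
    return layer2.hex().upper().encode("ascii")

def from_layer1(wire: bytes) -> bytes:
    """Inverse transformation: pairs of hexadecimal digits back to octets."""
    return bytes.fromhex(wire.decode("ascii"))

layer2 = b"\xce\x95\xce\x9b\xce\x9f\xce\xa4"  # the UTF-8 bytes of example 2
wire = to_layer1(layer2)
```

The transformation doubles the length of the data and works on any byte sequence, independent of what the bytes mean.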

6 Character transformation schemes

A representation of characters can be transformed into another representation in a number of ways, each having separate qualities and drawbacks:

an information-preserving transformation, possibly with some symbolic character representation. An example of this is the RFC 1345 mnemonic transformation.
an information-losing transformation optimized for reading. Examples of this are the RFC 1345 mnemonic transformation without an intro-sequence, and C3 level 2 conversion. Many transliteration schemes can also be seen as examples of this.
an information-losing transformation preserving the length of the string. An example of this is C3 level 1 conversion.
a transformation guaranteeing roundtrip integrity. Examples of this are the IBM CDRA conversions.
other forms of transformation
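
The difference between the first three kinds of transformation can be sketched in Python. The tables below are small hypothetical excerpts in the style of RFC 1345 mnemonics and are not the actual RFC 1345 or C3 tables; the intro character "&" and its doubling as an escape are assumptions of this sketch.

```python
MNEMONIC = {"ä": "a:", "ö": "o:", "ü": "u:"}   # two-character mnemonics
BASE_LETTER = {"ä": "a", "ö": "o", "ü": "u"}   # one-for-one fallbacks
INTRO = "&"

def preserving(text: str) -> str:
    """Information-preserving: mnemonics are marked by an intro character."""
    return "".join(INTRO * 2 if ch == INTRO
                   else INTRO + MNEMONIC[ch] if ch in MNEMONIC
                   else ch for ch in text)

def restore(text: str) -> str:
    """Inverse of preserving(); assumes well-formed input."""
    reverse = {m: ch for ch, m in MNEMONIC.items()}
    out, i = [], 0
    while i < len(text):
        if text[i] != INTRO:
            out.append(text[i]); i += 1
        elif text[i + 1:i + 2] == INTRO:          # doubled intro: literal "&"
            out.append(INTRO); i += 2
        else:                                     # intro plus two-character mnemonic
            out.append(reverse[text[i + 1:i + 3]]); i += 3
    return "".join(out)

def readable(text: str) -> str:
    """Information-losing, optimized for reading: no intro character."""
    return "".join(MNEMONIC.get(ch, ch) for ch in text)

def same_length(text: str) -> str:
    """Information-losing, but the string length is preserved."""
    return "".join(BASE_LETTER.get(ch, ch) for ch in text)
```

Only preserving() has an inverse; after readable() a receiver cannot tell whether "a:" was one character or two, and after same_length() the distinction between "ä" and "a" is gone.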

A number of transformation schemes exist on the market, and to facilitate transformation with and between the schemes, separate specifications of the individual schemes are needed.

Annex A: Bibliography (informative)

RFC 1345: Character Mnemonics and Character Sets. K. Simonsen, July 1992.
RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. N. Freed, N. Borenstein, November 1996.
IBM CDRA
C3