SC22/WG20 N958

Problem statement for expressing DIN 5007 in 14651

Status: Expert contribution

Author: Marc Wilhelm Küster

Date: 2002-06-11

Action: For discussion

Umlaut and trema

DIN 5007, the long-established German standard for ordering, as well as current

practice in German libraries, distinguish between two diacritics of very

similar - in fact today mostly identical - appearance:

- the umlaut

- the trema

Both diacritics are encoded in the UCS as U+0308, COMBINING DIAERESIS.

However, in a number of traditional library coding schemes and software with

a long history, such as the *Tübingen System für TextVerarbeitungs Programme

(TUSTEP)*, these two are distinguished in their encoding.

The two diacritics have very different roots, and traditional German

typography visually set them apart through the relative distance

of the two dots and sometimes through their diameter. That distinction has,

however, largely disappeared with the advent of PostScript and is now

almost obsolete.

Traditional ordering

This analysis may sound like a plea for the encoding of two separate

diacritics. This is not the case. A disunification of umlaut and trema

would, for a variety of reasons, be undesirable.

However, the current unification poses a problem for German ordering, as

DIN 5007 treats letters with an umlaut differently from letters with a trema. In

the ordering of entities that are not names, letters with an umlaut come

directly after the respective base letter, whereas letters with a trema follow

after many of the remaining diacritics.

Hence, the sequence is a ä á if ä is an a with umlaut,

but a á ä if ä is an a with trema. This distinction is mandatory

in DIN 5007.

Both versions of ä would, from the point of view of the UCS, be encoded

identically, namely as U+00E4 or as the sequence U+0061 U+0308. Tailoring in 14651 can only

be on the level of individual characters and character / diacritic

combinations. For this reason, there is no way to express DIN 5007 as a

profile of 14651. This is unfortunate and causes problems in German

libraries, especially in large research libraries.
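
The encoding point can be checked directly. The following is a minimal Python
sketch that uses only the standard unicodedata module; the variable names are
illustrative and not part of any standard. It shows that the precomposed and
the decomposed form of ä are canonically equivalent, so the encoded text alone
offers nothing that could tell an umlaut from a trema.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# Normalization maps each form onto the other, so they are canonically equivalent.
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# The combining mark has a single name; it does not say "umlaut" or "trema".
mark_name = unicodedata.name("\u0308")
```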

Analogous problems exist for the ordering of names.

Desired guidance

The author would like guidance from WG20 on how to handle this problem within

the 14651 framework. Such guidance could take the form of either:

* a note in 14651 stating the best practice or

* some other WG20 best practice document that can be readily referenced or

* the resolution that WG20 has no views on this matter and leaves it entirely

up to the national ordering standards to make provisions for this and similar

cases.

A technical recommendation could work along the

following lines:

In order to maintain the difference between a letter with umlaut and the same

letter with trema:

* Mark up the distinction between the two diacritics through a higher-level

protocol if this distinction is deemed necessary in a particular context

* Decompose the string at least with regard to the ambiguous letters

* Map the markup + combining trema combination to a character in the private

use area and treat that character as a combining diacritic for ordering

purposes

* Tailor the template on that assumption.
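
The steps above can be sketched in Python as follows. This is only an
illustration under stated assumptions, not an implementation of 14651 or of
DIN 5007: the <trema>...</trema> markup, the private-use code point U+E000,
the function names and the three-entry rank table are all hypothetical
choices made for this sketch. A real tailoring would express the same idea in
the 14651 template rather than in a hand-written key function.

```python
import re
import unicodedata

COMBINING_DIAERESIS = "\u0308"   # read as umlaut when no markup is present
COMBINING_ACUTE = "\u0301"
PUA_TREMA = "\ue000"             # hypothetical private-use stand-in for the trema

# Hypothetical higher-level markup that flags a diaeresis as a trema.
TREMA_MARKUP = re.compile(r"<trema>(.*?)</trema>")

# Illustrative secondary ranks only: umlaut directly after the bare letter,
# trema after the other diacritic. Not the actual DIN 5007 table.
DIACRITIC_RANK = {COMBINING_DIAERESIS: 1, COMBINING_ACUTE: 2, PUA_TREMA: 3}


def to_internal(marked_up: str) -> str:
    """Decompose the string, then map markup + U+0308 to the private-use trema."""
    decomposed = unicodedata.normalize("NFD", marked_up)
    return TREMA_MARKUP.sub(
        lambda m: m.group(1).replace(COMBINING_DIAERESIS, PUA_TREMA),
        decomposed,
    )


def sort_key(internal: str):
    """Two-level key: base letters first, then the diacritic rank per letter."""
    bases, ranks = [], []
    for ch in internal:
        if ch in DIACRITIC_RANK:
            if ranks:                      # attach the mark to the preceding base
                ranks[-1] = DIACRITIC_RANK[ch]
        else:
            bases.append(ch)
            ranks.append(0)
    return ("".join(bases), tuple(ranks))


def order(entries):
    """Sort marked-up entries using the tailored key."""
    return sorted(entries, key=lambda e: sort_key(to_internal(e)))
```

With this sketch, order(["á", "ä", "a"]) yields a ä á, whereas
order(["<trema>ä</trema>", "á", "a"]) yields a á ä, which matches the two
sequences described earlier. Combining marks that are not in the rank table
are treated as ordinary characters here, which a fuller tailoring would need
to handle.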

The author is open to any other suggestions.

Ideally, such suggestions should be generic enough to be applicable in comparable

cases such as may arise with regard to other cultural practices.