Problem statement for expressing DIN 5007 in 14651
Status: Expert contribution
Author: Marc Wilhelm Küster
Action: For discussion
Umlaut and trema
DIN 5007, the long-established German standard for ordering, and current
practice in German libraries both distinguish between two diacritics of very
similar - in fact today mostly identical - appearance:
- the umlaut
- the trema
Both diacritics are encoded in the UCS as U0308, the COMBINING DIAERESIS.
However, a number of traditional library coding schemes and software with
a long history such as the *Tübingen System für TextVerarbeitungs Programme
(TUSTEP)* distinguish the two in their encoding.
The two diacritics have very different roots, and traditional German
typography visually set both diacritics apart through the relative distance
of the two dots and sometimes through their diameter. That distinction has,
however, largely disappeared with the advent of PostScript and is now
This analysis may sound like a plea for the encoding of two separate
diacritics. This is not the case. A disunification of umlaut and trema
would, for a variety of reasons, be undesirable.
However, the current unification poses a problem for German ordering, as
DIN 5007 treats letters with an umlaut differently from letters with a trema.
In the ordering of entities which are not names, letters with umlaut come
directly after the respective base letter, whereas letters with trema follow
after many of the remaining diacritics.
Hence, the sequence is a ä á if ä is an a with umlaut, but a á ä if the ä
is an a with trema. This distinction is mandatory in DIN 5007.
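The two sequences can be illustrated with a minimal Python sketch. The weight
tables below are hypothetical and only reproduce the relative order described
above; they are not taken from DIN 5007 or from the 14651 template.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS (umlaut and trema alike)
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT

# Hypothetical secondary weights; only their relative order matters here.
UMLAUT_READING = {DIAERESIS: 1, ACUTE: 2}  # a < ä < á
TREMA_READING = {ACUTE: 1, DIAERESIS: 2}   # a < á < ä


def sort_key(text, weights):
    """Build a two-level key: base letters first, diacritic weights second."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the mark's weight to the preceding base letter.
            # Marks missing from the table sort after the known ones.
            if secondary:
                secondary[-1] = weights.get(ch, 99)
        else:
            primary.append(ch)
            secondary.append(0)  # 0 = no diacritic
    return (primary, secondary)


letters = ["\u00e1", "\u00e4", "a"]  # á, ä, a
as_umlaut = sorted(letters, key=lambda s: sort_key(s, UMLAUT_READING))
as_trema = sorted(letters, key=lambda s: sort_key(s, TREMA_READING))
```

The same input list yields a ä á under the umlaut reading and a á ä under the
trema reading; nothing in the encoded text itself says which table applies.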
Both versions of ä would, from the point of view of the UCS, be encoded
identically, namely as U00E4 or U0061 + U0308. Tailoring in 14651 can only
operate at the level of individual characters and character / diacritic
combinations, so it cannot assign two different weights to one and the same
coded sequence. For this reason, there is no way to express DIN 5007 as a
profile of 14651. This is unfortunate and causes problems in German
libraries, especially in large research libraries.
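That the two encodings are canonically equivalent, and that neither carries
any umlaut / trema information, can be checked with Python's standard
unicodedata module. This is only an illustrative sketch; the variable names
are chosen for this example.

```python
import unicodedata

precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"  # LATIN SMALL LETTER A + COMBINING DIAERESIS

# Both forms normalize to one another, so they are the same text to the UCS.
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# The decomposition contains a single diacritic, whatever its function.
code_points = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", precomposed)]
mark_name = unicodedata.name("\u0308")
```

Any collation tailoring that starts from these code points therefore sees one
diacritic only, whatever its function in the original text was.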
Analogous problems exist for the ordering of names.
The author would like guidance from WG20 on how to handle this problem within
the 14651 framework. Such guidance could take the form of either:
* a note in 14651 stating the best practice or
* some other WG20 best practice document that can be readily referenced or
* the resolution that WG20 has no views on this matter and leaves it entirely
up to the national ordering standards to make provisions for this and similar
cases.
A technical solution could work along the following lines.
In order to maintain the difference between a letter with umlaut and the same
letter with trema:
* Mark up the distinction between the two diacritics through a higher level
protocol if this distinction is deemed necessary in a particular context
* Decompose the string at least with regard to the ambiguous letters
* Map the markup + combining trema combination to a character in the private
use area and treat that character as a combining diacritic for ordering
* Tailor the template on that assumption.
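The steps above can be sketched in Python as follows. Everything specific in
this sketch is an assumption made for illustration: the `<trema>...</trema>`
tag stands in for whatever higher level protocol is used, U+E000 is an
arbitrarily chosen private use code point, and the weight table is a
hypothetical tailoring rather than the 14651 template.

```python
import re
import unicodedata

DIAERESIS = "\u0308"
ACUTE = "\u0301"
PUA_TREMA = "\uE000"  # hypothetical private use stand-in for the trema

# Hypothetical markup: <trema>...</trema> wraps letters whose dots are a trema.
TREMA_TAG = re.compile(r"<trema>(.*?)</trema>")

# Hypothetical tailoring: umlaut directly after the base letter, trema late.
MARK_WEIGHTS = {DIAERESIS: 1, ACUTE: 2, PUA_TREMA: 3}


def to_internal(marked_up):
    """Decompose the text and map marked-up diaereses to the PUA character."""
    def mark_trema(match):
        inner = unicodedata.normalize("NFD", match.group(1))
        return inner.replace(DIAERESIS, PUA_TREMA)

    replaced = TREMA_TAG.sub(mark_trema, marked_up)
    # Private use characters are left untouched by normalization.
    return unicodedata.normalize("NFD", replaced)


def sort_key(internal):
    """Two-level key that treats the PUA character as a combining diacritic."""
    primary, secondary = [], []
    for ch in internal:
        is_mark = ch == PUA_TREMA or unicodedata.combining(ch) != 0
        if is_mark and secondary:
            secondary[-1] = MARK_WEIGHTS.get(ch, 99)
        elif not is_mark:
            primary.append(ch.lower())
            secondary.append(0)
    return (primary, secondary)


entries = ["<trema>\u00e4</trema>", "\u00e1", "\u00e4", "a"]
ordered = sorted(entries, key=lambda s: sort_key(to_internal(s)))
```

With these assumptions the unmarked ä is read as an umlaut and sorts directly
after a, while the marked-up ä sorts after á. The private use character exists
only inside the ordering process and never needs to be interchanged.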
The author is open to any other suggestion.
Ideally, such suggestions should be generic enough to be applicable in comparable
cases such as may arise with regard to other cultural practices.