SC22/WG20 N1037

L2/03-081

Hangul Collation Requirements

Draft 2003-02-28, MED

The following represent proposed requirements on collation for Korean text in ISO/IEC 14651 and in the UCA (UTS #10: Unicode Collation Algorithm). The definitions, requirements, and desired features are numbered for reference.

Notation and Definitions

The following abbreviations are used:

L any leading Jamo
V any vowel Jamo
T any trailing Jamo
Lv any composite Hangul syllable character in the range U+AC00..U+D7A3 that is canonically equivalent to a sequence L + V
Lvt any composite Hangul syllable character in the range U+AC00..U+D7A3 that is canonically equivalent to a sequence L + V + T
S any standard Korean syllable-block (whether a composite character or a sequence of jamo or a mixture)
X any character that is not of the above

In examples, subscripts will be used to indication specific letters, such as L1 vs L2.

Abbreviated Examples. We give examples below such as "V1 < V2". This is the common way to cite the ordering among these characters. However, these examples are really abbreviations, since the V's alone do not form complete syllables. What such examples stand for is really "xV1 < xV2", where x is the identical string on both sides, and contains characters such that the string forms a standard Korean syllable-block. It would make the exposition somewhat clumsier to always spell this out, which is why the variable x and the explanation of its contents are omitted in such examples.

D1. Cluster. A sequence consisting of one or more L's, one or more V's, or one or more T's is called a cluster. A cluster of all L's is a leading cluster; similarly there are vowel clusters and trailing clusters. Thus L1L2 is a leading cluster, and V1 is a vowel cluster, but L1L2V1 is not a cluster; it is a sequence of a leading cluster (L1L2) followed by a vowel cluster (V1).

D2. Standard Korean syllable-block. This is a sequence of letters that consists of a leading cluster followed by a vowel cluster, optionally then followed by a trailing cluster. In regular expression syntax, this is of the form: L+ V+ T*. It also includes any canonically equivalent sequence. For details on canonical equivalence, see the Unicode Standard, Chapter 3. Examples:

D3. Existing Syllables are those that actually occur in modern or ancient practice, which is a small subset of all theoretically possible syllables. The existing syllables include all syllables encoded as the modern composite 11,172 Hangul Syllables, plus documented ancient syllables. See Desired Features.

D4. CTT. In this document, this refers to the current weighting table for UCA and the current Common Tailorable Template Table (CTT) for ISO/IEC 14651, which are kept in synchronization by the respective organizations.

Note: Contractions, Thai reversal, and canonical equivalence are the only mechanisms that are stateful in the UCA architecture; otherwise each character is simply mapped to a sequence of weights without regard to context. ISO/IEC 14651 allows for implementations to use canonical equivalence but does not require it. It is silent on the issue of Thai reversal.

Note: There may be intermediate versions of the UCA for which there is no corresponding ISO/IEC 14651 table, because implementers require tables that are kept up to date with the latest version of Unicode and ISO/IEC 10646.


Requirements

R1. Tailorability. It must be possible to tailor the UCA so as to meet the following requirements for Korean.

R2. Canonical Equivalence. Whenever text is canonically equivalent, it must sort the same. In particular, one string made up of HANGUL SYLLABLE characters will sort the same as a string made up of the equivalent JAMO characters. Example: the following sequences must sort the same:

R3. Performance/Size. Whatever approach is taken, it must be possible to sort strings containing the composite 11,172 Hangul Syllables (which form the vast majority of Korean text) without undo performance/storage costs. In particular, when restricted to such text, sort keys must be reasonably short and incremental comparison reasonably fast, in comparison to current good implementations of Korean sorting. A reasonable limit is no more than about 3 times worse. See Performance.

R4. Existing Korean Syllable-Blocks. It is only important that existing standard Korean syllable-blocks and equivalent text sorts correctly, as long as for other characters the ordering is still determinate. That is:

R3a. There is no requirement on "degenerate" syllables (such as L1T1).

R3b. It is only a requirement to sort existing syllables, not all theoretically possible syllables. See Desired Features. It must, however, be possible to tailor additional syllables.

R5. Syllables as Primary Units. Standard Korean syllable-blocks sort as primary units, meaning that if two syllables are different, no further characters in the string make any difference (in ordering). Examples:

R6. Clusters as Primary Units. Within a standard Korean syllable-block, elements of a cluster sort as primary units, meaning that if two such elements are different, no further characters in the syllable make any difference (in ordering). Examples:

R6. Shorter Clusters Sort First. A shorter, but otherwise identical, cluster sorts before a longer one, no matter what is later in the string. Examples:

Implications

The current UCA/14651 architecture simply assigns a different primary weight to each L, V, T and other characters. It can also assign primary weights to sequences, called contractions.

Without any change in that architecture, the above requirements would have the following implications on the weights for those characters. However, these are only implications given the current architecture; under discussion is whether we should change the architecture.

Weighting vs. Non-Korean

  1. X < V. Because all syllables sort as primary units, L1V1X must sort before L1V1V. Thus all non-Jamo must be weighted before all V's.
  2. X < T. Because all syllables sort as primary units, L1V1X must sort before L1V1T. Thus all non-Jamo must be weighted before all T's.

There are no implications vis--vis weighting of L vs X. However, the CTT puts V and T before Katakana, Hiragana, and more importantly, CJK Ideographs. Without tailoring all V and T to be after all other characters that could occur right after a Hangul syllable, the current tables break the requirement that syllables sort as primary units. See Trailing Weights in the UCA for an example.

Weighting of Jamo

  1. L < T. Because all syllables sort as primary units, L1V1L must sort before L1V1T. Thus all L's must be weighted before all T's
  2. T < V. Because shorter clusters sort before longer ones, L1V1T must sort before L1V1V. Thus all T's must be weighted before all V's.
  3. Any supported LL combination must be treated as a contraction. Because shorter clusters sort before longer ones, L1V1x sorts before L1Ly. If L1L were not a contraction, then this would imply that V < L, which contradicts i-iv

Desired Features

The following are features that are desired, if they can be accommodated without impacting the requirements.

DF1. Theoretical Syllables. Sort all theoretical syllables (not just existing syllables) according to the requirements. As a practical matter, this would imply that contractions could not be used as in (v) above, since it would not be possible to list all possible contractions.

DF2. Further Decompositions. (open issue) Some of the encoded Hangul Jamo can be decomposed further than canonical decomposition does. For example:

U+11AA (ᆪ) HANGUL JONGSEONG KIYEOK-SIOS =>

U+11A8 (ᆨ) HANGUL JONGSEONG KIYEOK +
U+11BA (ᆺ) HANGUL JONGSEONG SIOS

It may be desirable to have these sort as the same (on at least a primary and secondary level, perhaps also on a tertiary level).

DF3. Compatibility & Half-Width. (open issue) Map compatibility and half-width Jamo more intelligently to standard Korean syllable-blocks. The current weight table for compatibility and half-width Jamo is stateless. It simply maps each such character to
the same weight as the conjoining Jamo would have. It may be desireable to perform some additional transformation, so that strings of compatibility or half-width Jamo sort the same as the equivalent sequences of Hangul syllables. If so, the details of that transformation need to be spelled out.

DF4. Default Sorting. The requirement R1 is that it be possible for a tailoring to meet the other requirements. It is also desirable that the standard CTT and architecture meet the other requirements, without tailoring being necessary.

Performance

We have to be careful of exactly what is involved in any change, especially involving a preprocessing step. Why is that? People around the world are extremely sensitive to the performance of collation. Unless "preprocessing" is designed correctly so that in practice it can actually be done incrementally, and without large changes in architecture it could slow down processing very significantly. And in practice, it would not be accepted by people in the market place.

The incremental nature is very important. Here is an example: in most production systems, there are two separate code paths: (1) sort key generation, and (2) direct string comparison.

  1. sort key: preprocess a string to produce a sequence of bytes. Later, such sequences of bytes can be compared using a binary compare.
  2. direct compare: process both strings incrementally; the comparison produces the same results as a binary compare of sort keys, but no sort key is produced.

The important features are that on average in real data a direct string comparison is something like 20 times faster than the production of a sort key (and may be even faster). This is because typical comparisons are actually decided in the first few characters, and the rest of the processing of the string is normally skipped. Unless a very large number of comparisons are made, the direct string comparison wins, and is commonly used. But incremental comparison is very performance sensitive; one can't even call a strlen() function on the text, since that screws up the performance.

The other feature is sort key length. If we end up with sort keys that are very significantly longer than what is done right now, it becomes a significant performance drain: in terms of the generation of the sort key, the extra storage it uses, the increase time to read sort keys in and out of memory, and so on. Since the same weight data is used in incremental processing, longer weights also mean slower incremental comparison.

We must be very careful of the implications for performance in everything we end up with for Hangul, since if it is too slow and only marginally better as far as most Koreans are concerned then it will end up being ignored (e.g. tailored away).