** **

**Comments on PDAM 1 on
ISO/IEC 14651:2001.**

**Document Type: Working Group
Document**

**Source: Kent Karlsson**

**Status: Expert Contribution**

**Date:
2001-10-03**

1. Changes to clause 3 should be listed in the amendment:

* refer to ISO/IEC 10646-1:2000

* refer to ISO/IEC 10646-2:2001

Have a 'note' saying that those two together have the same repertoire and code point

assignments as Unicode 3.1. References to older editions of 10646 and its amendments

should be removed.

2. Annex B: All of the tailoring examples should refer to the new table.

3. Annex A: The table should be extended to cover all of Unicode 3.1. If, during

the processing of this amendment to 14651, amendment 1 to 10646-1:2000 is

approved, consideration should be given to cover also those characters.

4. The ordering of runic letters should follow the FUÞARK based ordering.

This is the ordering in which runic letters are traditionally presented. ABC-ordering

of the runic letters should be left to tailorings for specific purposes. Even though

ABC-ordering of runic letters may have been used during the late period of runic letter

usage, no evidence of this has been presented by runic experts, and would even so

be an influence of the then dominating latin script, not the order in which runes where

originally thought of. Further, when using runic letters as counters (for calendaric

calculations, at least), the FUÞARK order was used, much as ABC order of the

latin script is used for item list numbering.

5. All Han based characters (CJK URO, CJK ext A, CJK ext B), should be ordered

together, in the order mentioned (currently ext A comes before the URO).

6. All <MIN> subweightings at level 3 that correspond to (pseudo-) combining

characters should be changed to <BLK> (possibly name-changed to <ACC>), in order

to make the weighting system more transparent to a human reader. (NOTE THAT

<MIN> should NOT be read as 'third level MINimum weight', but as 'third level weight

for MINuscule (or caseless) letter/digit'; <BLK> (=<ACC>) is the minimum third level

weight.

7. US NB suggests to introduce a <MAX> weight; but since <MIN> is not minimum, the

name <MAX> would be misleading, making the reader think that <MIN> is mimumum.

Instead of <MAX> the name <LIG> is suggested.

8. Insert as a new annex E "Order-preserving subkey reduction"; moving the

bibliography to annex F. The text is attached at the end of these comments, and can

be obtained in electronic form from the Swedish NB. The kind of subkey reduction

described in (new) annex E is inherent in the "position" directive, and this annex

motivates why the "position" directive is sound. Further, annex E describe "high-level"

optimisations, not just implementation level optimisations.

9. The clauses in 14651 should be rearranged as follows (i.e. insert this as an

instruction to the editor, in the amendment text; as a basis for the next edition of 14651):

1. **Scope**

2. **Conformance**

3. **Normative** **references**

4. **Definitions** **and** **symbols** [from
current 4 and 5]

5. **Common** **Template** **Table: formation
and interpretation**

[current 6.3, except 6.3.5 and 6.3.6]

6. **Declaration** **of** **a** **delta** [current 6.4]

7. **String comparison** [current 6.1 and 6.2]

8. **Conditions for collation equivalence**
[from current 6.3.5 and 6.3.6]

Annex A - **Common Template Table** [and its name,
current 6.5]

[and then the other annexes: B to F]

The subclause divisions are intended to be kept, even though the subclauses are not

listed above.

This is a more logical exposition order of the contents of 14651, without making any

normative change.

10. The note for I 3 should use <BASE> instead of <BLANK>.

11. It has been suggested that the ordering after this amendment should not

be changed, INCLUDING that added characters are collated (via the CTT) at the end,

leaving better ordering of them to tailorings. Sweden does NOT support this LATTER

part of stability, since the CTT should give a good default ordering for characters

that are not of immediate concern to tailor.

The template table of collation weights has four levels. When applied to a string, each level nominally produces a subkey that is about as long, in number of weights, as the number of characters in the string itself. There are, however, a number of ways to reduce the size of the subkeys without changing the ordering as determined by the nominal key. This international standard neither specifies normatively, and even less requires the use of, any subkey reduction technique. However, for conformity, any key size reduction method must preserve the order between strings as determined by the nominal key produced by the given tailoring of the template table.

In order to illustrate what can be done in terms of subkey reduction, we here present two example reduction methods. Good implementations of these methods produce the reduced key directly, without producing the nominal key first.

Applied correctly to each nominal key, these reduction methods keep the order between the strings, since each is a strictly monotonically increasing mapping, i.e.,

if *nominal_key1*
< *nominal_key2*, then

**reduce1**(*nominal_key1*) < **reduce1**(*nominal_key2*), and

**reduce2**(*nominal_key1*) < **reduce2**(*nominal_key2*).

Each subkey reduction method that results in a strictly monotonically increasing mapping can be applied at any level. Moreover, different reduction methods can be applied at different levels, as long as it is done consistently for all keys. E.g., example method 1 below can be used for levels 2 and 4, while using example method 2 below for level 3, while no reduction method need be applied to level 1.

These example methods are not normative, and can probably be improved upon to reduce the subkey sizes further (e.g., one may first remove trailing weights that are minimal at levels 2 and 3).

This method can be applied for a single selected weight at each level this method is used, preferably a weight that is very commonly occurring at that level.

This method uses a count value, which may be
stored in the subkey using different number of digits from what the weights
uses at that level. The transformed value (see below) of the count need be not
related to any of the weight values. In this illustration, we will use integer
values between 00 and FF (in hexadecimal), the latter is the *maxcnt** _{q}*. A practical implementation may of course use a wider range of
values.

The reduction method works as follows, described in principle, not in implementation terms: Each maximal subrun of the for that level selected weight to be reduced, including subruns of length 0, is replaced by a transformed count value, as follows:

· Consider each subkey to be terminated by a less than minimal (at that level), or zero, weight.

·
*swa** _{q}* is the selected
reduction weight at level

·
*wb* is the weight (including subkey terminator weight) that follows the
(empty or not) maximal subrun, of the weight *swa** _{q}*, currently processed.

·
*cnt
*is the length (≥ 0) of the subrun currently
processed.

·
*maxcnt** _{q}* is the value (≥
0) that is the largest value that can be stored as a count in the subkey, given
the number of digits used for the storage of a (transformed) count in a subkey
at that level.

·
*guardcnt** _{q}* is a value ≥ 0
and ≤

·
*lowcntsize** _{q}* =

·
*highcntsize** _{q}* =

·
If *swa** _{q}* >

(

·
If *swa** _{q}* <

(

Note that if the limits are set reasonably high,
the count/weight-pairs will be repeated zero times in practical cases. In the
other extreme, if *maxcnt** _{q}* is zero, the
subkey is not changed.

The reason this method works only for one weight per level is that subruns otherwise would cause ambiguous key reductions.

This reduction method can be applied to level 2, where <BASE> can be expected to be a very common weight. It can also be used for level 3, where <MIN> (minuscule, i.e. lowercase, or caseless) can be expected to be very common. This reduction method can also be applied to level 4, if one uses a tailoring that gives one very common weight on level 4. Note that if the selected weight is not very common in a string, the resulting key by this method may be longer than the nominal key, since empty subruns of the selected weight must be replaced by another subrun that is non-empty.

Below is an example reduction using this method on
level 2 (selecting the weight <BASE>), on level 3 (selecting the weight
<MIN>), and on level 4 (selecting a maximal weight called <PLAIN>;
for a tailoring of the template table). For readability, we will in this
example write <a>, <b>, etc. for the weights of the letters, <-> for <BASE>, <l> for
<MIN>, <u> for
<CAP>, <*> for
<PLAIN>, <A> for
<ACUTE>, <H> for
the level 4 weight of a hyphen, and <P> for the level 4 weight of an apostrophe. The example character string is “Vice-Président’s”. The nominal
key, according to the template table, can be expressed as a number in *R*, expressed as a hexadecimal fractional
number; <v> etc. really stands for digit sequences; spaces are used for
alignment for clarity, they are in no way part the actual key; the 0000 at the
end of the subkeys is the subkey termination weight):

0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000

<-><-><-><-> <-><-><-><A><-><-><-><-><-><-> <-> 0000

<u><l><l><l> <u><l><l> <l><l><l><l><l><l> <l> 0000

<*><*><*><*><H><*><*><*> <*><*><*><*><*><*><P><*> 0000

Do the reduction as described above (note that here: <-> (and <l>) is smaller than any other level 2 (level 3) weight, but greater than 0 (the subkey termination weight), and <*> is greater than any other level 4 weight):

0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000

F8 <A> 07 0000

FF <u> FC <u> 0B 0000

04 <H> 09 <P> 01 0000

Note that “count” values are at the beginning and end of the reduced subkeys, as well as between each non-selected weight. This key is significantly shorter than the nominal key, in this example as well as for most (not all) other strings that one can expect to normally occur.

This method can be applied for a set of weights at each level, preferably the ones that are commonly used at that level.

This method also uses a count value, but the transformed weight value must have the same number of digits as all other weights at that level uses. The transformed values (see below) for the count must have values that are in the neighborhood of the weight in the subrun that is replaced. This neighborhood of a weight must not overlap with any other weights or neighborhoods of weights.

The reduction method works as follows, described in principle, not in implementation terms: Each non-empty maximal subrun of a weight selected for reduction is replaced as follows:

· Consider each subkey to be terminated by a less than minimal (at that level), or zero, weight.

·
*wa* is the weight in the subrun.

·
*wb* is the weight (including subkey terminator) that follows the
non-empty maximal subrun of *wa*
currently processed.

·
*cnt
*is the length (≥ 1) of the subrun currently
processed.

·
*minnbh** _{wa}* is the minimum
value that constitutes the neighborhood of

·
*maxnbh** _{wa}* is the maximum
value that constitutes the neighborhood of

·
*lownbhsize** _{wa}* =

·
*highnbhsize** _{wa}* =

·
If *wa*
> *wb*: Replace the subrun with the
weight *wa* repeated (*cnt* **div**
*lownbhsize** _{wa}*) times, and, if (

·
If *wa*
< *wb*: Replace the subrun with the
weight *wa* repeated (*cnt* **div**
*highnbhsize** _{wa}*) times, and, if
(

Note that if the neighborhood of *wa* consists of *wa* only, the subrun is not changed.

This reduction method can be applied to level 2, for <BASE>. It can also be used for level 3, e.g. for <MIN>, <CAP>, <HIRA>, and <KATA>. It can also be applied to level 4, for <PLAIN> (for a tailoring that uses <PLAIN> as a maximal weight at level 4). Note that even if the weight reduced is not common, the resulting subkey is never longer than the nominal subkey. A nontrivial neighborhood is needed around each of the selected weights for a shortening of the subkey to actually take place.

We do an example reduction using this method on level 2 (for <BASE>), on level 3 (for <MIN> and <CAP>), and on level 4 (for <PLAIN>). For this we make the following assumptions:

· <-> (level 2): weight: 0026; minimum – maximum of neighborhood: 0022 – 002A.

· <l> (level 3): weight: 0005; minimum – maximum of neighborhood: 0002 – 0007.

· <u> (level 3): weight: 0017; minimum – maximum of neighborhood: 0015 – 001A.

· <*> (level 4): weight: 0F80; minimum – maximum of neighborhood: 0F00 – 0FFE.

For the same example string as in the description of example reduction method 1, we still have the nominal key:

0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000

<-><-><-><-> <-><-><-><A><-><-><-><-><-><-> <-> 0000

<u><l><l><l> <u><l><l> <l><l><l><l><l><l> <l> 0000

<*><*><*><*><H><*><*><*> <*><*><*><*><*><*><P><*> 0000

Do the reduction as described above:

0.<v><i><c><e> <p><r><e> <s><i><d><e><n><t> <s> 0000

0024 <A>0028 0000

0015 0005 0015 0005 0005 0002 0000

0F03 <H>0F08 <P>0F00 0000

If the lower neighborhood of <l> had been a bit bigger, we could have used only one weight (or two), instead of three, for the sequence of nine <l> weights.

This key is also significantly shorter than the nominal key, and this method never lengthens the key, since every subrun replaced is replaced by one that is shorter or at most as long as the original subrun. This method can also be applied to several weights at each level, which example method 1 cannot. On the other hand, example method 2 is a bit more complex than example method 1, since it needs to keep track of non-overlapping neighborhoods around some of the weights.

In many cases, at level 2 and up, differing weights don’t convey a real difference for the ordering, since subkeys that are more significant may contain a necessary ordering difference already. This is especially common for the fourth level, where the Common Template Table has different weights for every collating element. Since there are usually other more significant differences, it is rarely this difference at the fourth level matters. Implementers are in such cases allowed to map several nominally different weights to the same actual weight, in such a way that the resulting ordering is unchanged, in order to enable better subkey reduction.