From greger@iuk Wed Dec 5 20:29:38 1990
Received: from [128.212.16.14] by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
    id AA10567; Wed, 5 Dec 90 20:29:38 +0100
Received: by ism.isc.com (Sendmail5.61/1.35)
    id AA07368; Wed, 5 Dec 90 11:33:02 -0800
Received: from friherr by iuk.isc.com (5.61/smail2.2/11-14-88)
    id AA13614; Wed, 5 Dec 90 18:39:25 GMT
Received: by (5.61/1.35/jcb-s)
    id AA00626; Wed, 5 Dec 90 18:52:52 GMT
Date: Wed, 5 Dec 90 18:52:52 GMT
Message-Id: <9012051852.AA00626@>
To: Mark_E_Davis.PINKTEAM%gateway.qm.apple.com@ism,
    unicode%noddy.eng.sun.com@ism,
    Internet_UniCore.PINKLINK%gateway.qm.apple.com@ism,
    i18n%dkuug.dk@ism
From: greger@ism.isc.com ("greger@ism.isc.com (Greger Leijonhufvud, ISC, High Wycombe, U.K.)")
Subject: Re(2): 10646 Advantages
X-Charset: ASCII
X-Char-Esc: 29

In reply to your message of Fri Nov 16 20:03:28 1990
-------
The following is a comment/follow-up to Dominic Dunlop's message
dated Nov 16 1990. I quote from Dominic's answer:

> The net effect of this is that the ISO POSIX working group (!) is
> currently running with the issue because it needs a solution: the
> UNIX shell and tools embody collation and related concepts
> (filename expansion and listing, the sort command, regular
> expressions), and a corresponding international standard must be
> internationally applicable. Work in progress suggests that, by
> making up to four passes backwards and forwards through text,
> assigning different weights (including ``ignore'', ``high'' and
> ``low'') to each encoded character encountered on each pass, you
> can achieve useful real-world collation. Although you probably
> can't do a telephone book sort even in New York, never mind Tokyo.
> Our work has been based primarily on encodings without the
> non-spacing diacritics (accents) of Unicode. If it turns out that
> we can't accommodate these, we'll think again: the ability to
> handle Unicode is at the very least an important proof of concept
> for us.
> (My feeling is that, compared to the handling of stateful
> encodings with locking shifts -- something else that we intend to
> accommodate -- non-spacing diacritics should be a piece of cake.)

The current collation scheme in POSIX.2 supports collation
specifications with n passes (typically n = 2 to 4); free assignment
of weights per pass, including IGNORE; multi-character collating
elements; 1-to-many mapping; a different evaluation order per pass
(forward/backward); and equivalence classes. It supports the proposed
Canadian standard collating sequence, which is the most complex I have
seen in official documents so far. Certainly, it cannot do advanced
telephone book collation (which is often based on phonetics), but it
will do a creditable job of dictionary ordering for Western languages.

The multi-character collating element feature is of particular
interest here. It allows a sequence of characters to be specified as
collating as a single entity. One example is the Spanish 'ch', which
collates as an entity between 'c' and 'd'. Another, and the pertinent
one here, arises with code sets that employ non-spacing characters
(which we tend to frown upon, for obvious reasons: they are difficult
to handle in programs, as the character width becomes variable). You
define every character that is created via a non-spacing character as
a multi-character collating element made up of the non-spacing
character and the 'base character' (e.g. 'A).

Greger Leijonhufvud
INTERACTIVE Systems
High Wycombe, UK
greger@{iuk,ism}.isc.com
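
To make the scheme described above concrete, here is a minimal sketch
in Python of multi-pass collation with multi-character collating
elements. It is an illustration only, not the POSIX.2 mechanism
itself: the weight table, the names (IGNORE, build_table,
split_elements, collation_key), the use of an ASCII apostrophe as a
stand-in for a non-spacing accent, and the choice of a backward second
pass are all assumptions made for the example.

```python
import string

IGNORE = None  # hypothetical marker: element contributes no weight on a pass


def build_table():
    """Build an illustrative table: collating element -> (pass-1, pass-2) weights."""
    order = []
    for ch in string.ascii_lowercase:
        order.append(ch)
        if ch == "c":
            order.append("ch")  # Spanish 'ch' collates between 'c' and 'd'
    table = {elem: (rank, 1) for rank, elem in enumerate(order, start=1)}
    # Accent + base letter is one collating element: same pass-1 weight as
    # the base letter, distinguished only on pass 2.
    for vowel in "aeiou":
        table["'" + vowel] = (table[vowel][0], 2)
    table["-"] = (IGNORE, IGNORE)  # hyphen is ignored on both passes
    return table


TABLE = build_table()
DIRECTIONS = ("forward", "backward")  # assumed evaluation order per pass


def split_elements(text, table=TABLE):
    """Split text into collating elements, preferring the longest match."""
    max_len = max(len(key) for key in table)
    elements = []
    i = 0
    while i < len(text):
        for n in range(max_len, 0, -1):
            candidate = text[i:i + n]
            if candidate in table:
                elements.append(candidate)
                i += len(candidate)
                break
        else:
            raise ValueError("no collating element for %r" % text[i])
    return elements


def collation_key(text, table=TABLE, directions=DIRECTIONS):
    """Return one tuple of weights per pass; tuples compare pass by pass."""
    elements = split_elements(text.lower(), table)
    key = []
    for index, direction in enumerate(directions):
        weights = [table[e][index] for e in elements
                   if table[e][index] is not IGNORE]
        if direction == "backward":
            weights.reverse()
        key.append(tuple(weights))
    return tuple(key)


print(sorted(["dama", "chico", "cosa"], key=collation_key))
print(sorted(["c'ot'e", "cot'e", "c'ote", "cote"], key=collation_key))
```

With this table, 'chico' sorts after 'cosa' because 'ch' is looked up
as one element, and the accented words tie on the first pass and are
separated only by the second pass, read from the end of the word.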