Excerpt from the Framework Model on Internationalisation (reference model or
framework model of internationalisation) - Contribution by Alain LaBonté,
completed for Fortran by Miles Ellis (University of Oxford)

________________________________________________________________________________
Annex 1 - From a requirement to its implementation - Compare, Sort, Search

A.1.0 The case of the Canadian requirement for ordering English and French

    One of the most important specifications of cultural elements is the
specification of ordering characteristics for text data strings. The
first normative specification of comprehensive, fully predictable and
culturally valid ordering is the Canadian Standard CSA Z243.4.1-1992,
adopted as a preliminary standard (to be confirmed as a national standard
of Canada in the beginning of 1994) after 6 years of work, with input from
7 different countries (Canada, France, USA, Belgium, The Netherlands,
Germany, Switzerland) to fine-tune the Canadian proposal, which grew out
of a Quebec government proposal dated 1986.

    The Canadian Standard describes collating weights usable at once for
dictionary ordering of English, French, German, Dutch, Portuguese and
Italian without excluding other languages that can be handled with slight
modifications.

    This technique assigns four levels of weights that can be used to
fine-tune the ordering function and provide absolute predictability of the
results while remaining culturally acceptable to a majority of users of
those languages. It is based mainly on the ordering rules used in the major
dictionaries of the French, English and German languages: the primary rules
learned at school by all young children and, unlike other more sophisticated
classification techniques, understood by ordinary people and not only by
scholars.

A.1.1 Example of cultural requirement.

    Those rules are essentially the following:

    1. The 3 languages agree on a single alphabetic order, from A to Z,
       in which diacritical marks are normally not considered for the
       purpose of ordering, unless there is a tie due to
       quasi-homography, i.e. words that look the same once diacritics
       are removed.

    2. For ligatures, expansion is done as if they were written as 2
       separate letters (ae, oe, ss [and ij in Dutch for that matter]);

    3. English and German dictionaries state that unaccented words precede
       accented ones in case of quasi-homography; French dictionaries need
       more precise rules, as it is quite frequent to encounter lists of 3
       or 4 quasi-homographs made of the same series of letters accented
       differently; in case of quasi-homography, the rule in French is
       that "the last difference in the word determines the order" (which
       means scanning the words to be compared from the end backwards
       until a difference in accentuation is encountered). So as not to
       make French a special case, many sources recommended using the
       French rule to resolve such ties for all languages, as it is
       generally recognized that this brings no extra overhead (a stack
       is as easy to use as a list) and that the result is culturally
       equivalent for the other languages.

    4. In case of homography on both alphabetic characters and diacritics,
       case becomes significant to determine a difference. English and
       German dictionaries agree that small letters should precede capital
       letters. French dictionaries do not make a difference, as they
       generally use only capital letters for their general language words
       (including accented capital letters, contrary to a widespread
       belief that accented capital letters are not used in French!).
       French encyclopedias and proper name dictionaries tend to put
       capital letters first, but as there are numerous exceptions and
       dictionaries are silent on this subject, it has been decided to use
       the English and German rule in the Canadian Standard, which
       harmonizes the rules without really making them culturally
       incorrect for French (note that Danish dictionaries specifically
       state that capitals precede small letters).

    5. Characters that are not part of the alphabet (spaces, hyphens,
       apostrophes, asterisks and so on, orthographic or not) are not
       significant for dictionary ordering. So that ordering remains
       predictable, the Canadian Standard specifies an order for these,
       but it applies only when the other 3 levels of significance of
       text data (alphabetic data, diacritics and case) are absolutely
       identical. The normative benchmark contains a list of English
       words that could be ordered in a more refined way, but no native
       English speaker objected, so the result is assumed to be
       culturally acceptable: "coop, co-op, COOP, CO-OP" constitutes a
       list sorted correctly according to the Canadian Standard. A
       refinement could have made "coop, COOP, co-op, CO-OP" the
       preferred order, but at this point this seems to be a matter of
       preference and of very fine tuning that could require extra
       overhead in some environments (5 levels instead of 4), although
       it would seem more consistent for very specific orthographic
       characters like hyphens, apostrophes and spaces, and this for all
       languages involved (these characters would have to be processed
       after the diacritics have been considered but before case, the
       other specials being processed after case).
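
    The effect of rule 3 can be illustrated with a minimal Python sketch.
It is not taken from the Canadian Standard: the function names are
hypothetical, and Unicode decomposition (standard unicodedata module) is
used as a stand-in for the standard's accent weights. It compares only the
first two levels and shows how scanning accents from the end of the word
changes the order of quasi-homographs.

```python
import unicodedata

def base_letters(word):
    """Level 1: the letters with diacritics and case removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def accent_marks(word):
    """Level 2: one tuple of combining marks per base letter."""
    marks = []
    for c in unicodedata.normalize("NFD", word):
        if unicodedata.combining(c):
            if marks:                      # ignore a stray leading mark
                marks[-1] += (c,)
        else:
            marks.append(())               # () means "no accent"
    return marks

def french_key(word):
    # Accents are compared from the end of the word (rule 3).
    return (base_letters(word), accent_marks(word)[::-1])

def forward_key(word):
    # Same levels, but accents compared from the start of the word.
    return (base_letters(word), accent_marks(word))

# c[o>]t[e'], cot[e'], c[o>]te, cote written with Unicode escapes
WORDS = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]

print(sorted(WORDS, key=french_key))    # cote, c[o>]te, cot[e'], c[o>]t[e']
print(sorted(WORDS, key=forward_key))   # cote, cot[e'], c[o>]te, c[o>]t[e']
```

    With the backward scan, the last accent difference decides, so
"c[o>]te" precedes "cot[e']"; a forward scan reverses these two words.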


    For example, the following records are ordered correctly per the
Canadian Standard specification; each record is annotated with the way it
differs from the record before it.

    COTE
    C[o>]te        last difference on 2nd letter
    cot[e']        last difference on 4th letter
    c*o*t*[e']     equal except for specials
    C[o>]t[e']     last difference on 2nd character
    coter          last difference on last characters
    Coter          equal except for case

    where [o>] represents the SMALL LATIN LETTER O WITH CIRCUMFLEX ACCENT
      and [e'] represents the SMALL LATIN LETTER E WITH ACUTE ACCENT.




A.1.2 Example of a specification technique

    The Canadian Standard describes this behavior using words (in English
and in French), diagrams, and tables. To simplify the understanding of such
tables, let's consider here the following tables of relative numbers:

      ALPHA       ACCENTS     CASE         SPECIALS
      1st level   2nd level   3rd level    4th level
INPT  token       token       token        token
CHAR  (serial)    (stacked)   (serial)     (serial)

 c    6            3           1           N/A
 C    6            3           2           N/A
 e    7            3           1           N/A
[e']  7            4           1           N/A
 E    7            3           2           N/A
 o    8            3           1           N/A
[o>]  8            5           1           N/A
 O    8            3           2           N/A
 r    9            3           1           N/A
 t    10           3           1           N/A
 T    10           3           2           N/A
 *    N/A          N/A         N/A          1

The Canadian Standard then suggests a conformance algorithm that
establishes a series of subkeys, composed numerically as follows for our
examples (the numbers are only indicative here; they show a relation only
among the subset of characters chosen for the purposes of the example). To
avoid composing a fourth subkey with placeholders for all non-special
characters, comparison is done on the positions of the special characters
and, in case of equality, on the weight assigned to the special character
itself. For that algorithm to work when all 4 subkeys are concatenated for
a multilevel one-pass sort, a special logical zero delimiter (which could
be 1 for C if all other relative numbers are offset by 1!) is coded between
the 3rd and 4th subkeys. If certain conditions are not met in the careful
choice of relative numbers, such a logical zero delimiter would be
advisable between each pair of subkeys.

Original string     Subkey 1     Subkey 2     Subkey 3   Logical   Subkey 4
                                                         delim.

    COTE            6,8,10,7     3,3,3,3      2,2,2,2    0
    C[o>]te         6,8,10,7     3,3,5,3      2,1,1,1    0
    cot[e']         6,8,10,7     4,3,3,3      1,1,1,1    0
    c*o*t*[e']      6,8,10,7     4,3,3,3      1,1,1,1    0      2,1,4,1,6,1
    C[o>]t[e']      6,8,10,7     4,3,5,3      2,1,1,1    0
    coter           6,8,10,7,9   3,3,3,3,3    1,1,1,1,1  0
    Coter           6,8,10,7,9   3,3,3,3,3    2,1,1,1,1  0
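
The composition of these subkeys can be sketched in Python as follows. This
is a minimal illustration under stated assumptions, not the conformance
algorithm of the Canadian Standard: the weights are the indicative relative
numbers of the table above, the names WEIGHTS, subkeys and sort_key are
hypothetical, and only the characters of the example are covered.

```python
# Relative weights from the table above: (alpha, accent, case).
WEIGHTS = {
    "c": (6, 3, 1), "C": (6, 3, 2),
    "e": (7, 3, 1), "\u00e9": (7, 4, 1), "E": (7, 3, 2),   # \u00e9 = [e']
    "o": (8, 3, 1), "\u00f4": (8, 5, 1), "O": (8, 3, 2),   # \u00f4 = [o>]
    "r": (9, 3, 1),
    "t": (10, 3, 1), "T": (10, 3, 2),
}
SPECIALS = {"*": 1}   # 4th-level weights
DELIMITER = 0         # logical zero between subkeys 3 and 4

def subkeys(text):
    """Return the 4 subkeys of `text` as lists of relative numbers."""
    alpha, accents, case, specials = [], [], [], []
    for position, ch in enumerate(text, start=1):
        if ch in SPECIALS:
            # Position of the special, then its weight.
            specials += [position, SPECIALS[ch]]
        else:
            a, d, c = WEIGHTS[ch]
            alpha.append(a)
            accents.append(d)
            case.append(c)
    accents.reverse()   # "stacked": accents are read from the end
    return alpha, accents, case, specials

def sort_key(text):
    """Concatenate the subkeys into one key for a one-pass sort."""
    alpha, accents, case, specials = subkeys(text)
    return alpha + accents + case + [DELIMITER] + specials

EXPECTED = ["COTE", "C\u00f4te", "cot\u00e9", "c*o*t*\u00e9",
            "C\u00f4t\u00e9", "coter", "Coter"]
print(sorted(reversed(EXPECTED), key=sort_key) == EXPECTED)   # True
```

The single concatenated key works here only because every accent weight is
smaller than every alphabetic weight and every case weight is smaller than
every accent weight; otherwise delimiters between all subkeys would be
needed, as noted above.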

The Canadian Standard presents a reduction technique (non-normative, but
originating from the Quebec government, which implemented it [designer:
Alain LaBonté]; it is now in the public domain) that shortens these subkeys
without affecting the comparison process when keys are to be stored by an
application for later comparison by a dumb process (such as a hardware
device able to search on binary sequences, or an old unmodifiable "indexed
sequential" access method of any kind that orders keys numerically). The net
effect is that for most French words and more than 99% of English words, no
storage is required for the second subkey, as no accent is present, and in
certain conditions storage is greatly reduced for the third subkey. The same
subkeys, reduced and concatenated to give a single key that is directly
comparable numerically in one pass, would be:

    COTE            6,8,10,7,                 2,2,2,2,   0
    C[o>]te         6,8,10,7,    3,3,5,       2,         0
    cot[e']         6,8,10,7,    4,                      0
    c*o*t*[e']      6,8,10,7,    4,                      0,     2,1,4,1,6,1
    C[o>]t[e']      6,8,10,7,    4,3,5,       2,         0
    coter           6,8,10,7,9,                          0
    Coter           6,8,10,7,9,               2,         0

One should not implement this reduction technique without carefully looking
at the Canadian Standard and its references for caveats in designing other
tables. Only in certain conditions can such an optimizing technique be
used. However, even without reduction, the principle of forming a single key
(out of a multilevel specification) to be passed to old applications that
"know" how to sort numerical data strings is entirely valid, and
economically very important: it supports past applications, which can be
"internationalized" with few modifications, if any.
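
As an illustration only, the following Python sketch reproduces the reduced
keys of the example from the unreduced ones. It encodes one reading of the
two tables above (trailing "no accent" weights, 3, are dropped from subkey
2 and trailing "small letter" weights, 1, from subkey 3); it is not the
technique as specified in the Canadian Standard, and the caveats just
mentioned are not handled. All names are hypothetical.

```python
NONE, SMALL, DELIMITER = 3, 1, 0

# Unreduced subkeys 1 to 4, copied from the first table.
FULL = {
    "COTE":       ([6, 8, 10, 7], [3, 3, 3, 3], [2, 2, 2, 2], []),
    "C[o>]te":    ([6, 8, 10, 7], [3, 3, 5, 3], [2, 1, 1, 1], []),
    "cot[e']":    ([6, 8, 10, 7], [4, 3, 3, 3], [1, 1, 1, 1], []),
    "c*o*t*[e']": ([6, 8, 10, 7], [4, 3, 3, 3], [1, 1, 1, 1],
                   [2, 1, 4, 1, 6, 1]),
    "C[o>]t[e']": ([6, 8, 10, 7], [4, 3, 5, 3], [2, 1, 1, 1], []),
    "coter":      ([6, 8, 10, 7, 9], [3, 3, 3, 3, 3], [1, 1, 1, 1, 1], []),
    "Coter":      ([6, 8, 10, 7, 9], [3, 3, 3, 3, 3], [2, 1, 1, 1, 1], []),
}

def strip_trailing(values, filler):
    """Drop the run of `filler` weights at the end of a subkey."""
    end = len(values)
    while end and values[end - 1] == filler:
        end -= 1
    return values[:end]

def full_key(k1, k2, k3, k4):
    return k1 + k2 + k3 + [DELIMITER] + k4

def reduced_key(k1, k2, k3, k4):
    return (k1 + strip_trailing(k2, NONE) + strip_trailing(k3, SMALL)
            + [DELIMITER] + k4)

by_full = sorted(FULL, key=lambda w: full_key(*FULL[w]))
by_reduced = sorted(FULL, key=lambda w: reduced_key(*FULL[w]))
print(by_full == by_reduced)              # True for these 7 records
print(reduced_key(*FULL["C[o>]te"]))      # [6, 8, 10, 7, 3, 3, 5, 2, 0]
```

That both orders agree for these seven records says nothing about other
tables or other data; it only shows that the reduced table above is
consistent with the unreduced one.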

A.2 Other specification techniques

    After the Canadians released their specifications, POSIX defined a
model that could handle them in a general way. It is only an abstract
specification technique, but it does the job adequately. A good
recommendation would be not to reinvent the wheel and to use it, as it is
the only international specification technique that exists so far for
describing collation tables. It does not use relative numbers but rather a
clever sequential ordering system that allows the description of multilevel
weights without specifying any numerical data. The order can be changed
just by inserting lines for specific additional characters or by swapping
lines.

    Moreover it is, like the Canadian Standard, a codeset-independent
specification technique which will not necessitate as many specifications as
there are equivalent character sets.

    Hence it is a very flexible technique that allows the handling of a
general specification. Refinements may be made in the future, but it
represents, like the Canadian Standard, the state of the art in this
domain, and it is expected that future work will build on this
specification technique. To the knowledge of different experts, it can
handle most of the languages and scripts of the world without major
difficulties. Extensions are conceivable to handle combining sequences, as
present in ISO 10646 level 3, for those languages that absolutely require
those combinations to be handled to give a culturally valid ordering.








    The way to specify the minimum table to describe the previous example
according to the POSIX ordering specifications would be (simplified):

...
collating-symbol <SMALL>
collating-symbol <CAPITAL>      Definitions of symbols that are not
collating-symbol <NONE>         known as characters but which are needed
collating-symbol <ACUTE>        to describe relative weights
collating-symbol <CIRCUMFLEX>
...
                                                        |This statement
order_start forward;backward;forward;forward,position   |describes the
                                                        |scanning direction
                                                        |for each level
                                                        |(even allows
                                                        | position tokens
                                                        | if desired)

<SMALL>                                     |will result in SMALL=1
<CAPITAL>                                   |will result in CAPITAL=2
<NONE>                                      |will result in NONE=3
<ACUTE>                                     |will result in ACUTE=4
<CIRCUMFLEX>                                |will result in CIRCUMFLEX=5
<c>   <c>;<NONE>;      <SMALL>              |... c=6 and reused for itself
<e>   <e>;<NONE>;      <SMALL>              |... e=7 and reused for itself
<o>   <o>;<NONE>;      <SMALL>              |... o=8 and reused for itself
<r>   <r>;<NONE>;      <SMALL>              |... r=9 and reused for itself
<t>   <t>;<NONE>;      <SMALL>              |... t=10 and reused for itself
<C>   <c>;<NONE>;      <CAPITAL>            | from now on,
<O>   <o>;<NONE>;      <CAPITAL>            | no new value that needs to
<T>   <t>;<NONE>;      <CAPITAL>            | be resolved; numeric weights
<E>   <e>;<NONE>;      <CAPITAL>            | are all already known
<e'>  <e>;<ACUTE>;     <SMALL>
<o/>> <o>;<CIRCUMFLEX>;<SMALL>
<*>   IGNORE;IGNORE;IGNORE;<SMALL>          |SMALL=1, why not?
                                            |IGNORE means no value assigned
                                            |at each level that specifies it
...
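
    How such a table yields the relative numbers of section A.1.2 can be
sketched in Python. This is only an illustration of the sequential
numbering described in the annotations above, not a parser for the POSIX
locale definition syntax: it ignores the order_start directives and the
annotations, and the function name resolve is hypothetical. The weight of
<*> is written here as the collating symbol <SMALL>.

```python
TABLE = """\
<SMALL>
<CAPITAL>
<NONE>
<ACUTE>
<CIRCUMFLEX>
<c>   <c>;<NONE>;      <SMALL>
<e>   <e>;<NONE>;      <SMALL>
<o>   <o>;<NONE>;      <SMALL>
<r>   <r>;<NONE>;      <SMALL>
<t>   <t>;<NONE>;      <SMALL>
<C>   <c>;<NONE>;      <CAPITAL>
<O>   <o>;<NONE>;      <CAPITAL>
<T>   <t>;<NONE>;      <CAPITAL>
<E>   <e>;<NONE>;      <CAPITAL>
<e'>  <e>;<ACUTE>;     <SMALL>
<o/>> <o>;<CIRCUMFLEX>;<SMALL>
<*>   IGNORE;IGNORE;IGNORE;<SMALL>
"""

def resolve(table):
    """Number the entries in sequence, then resolve each weight list."""
    rank = {}       # symbol -> sequential number (1, 2, 3, ...)
    entries = []    # (symbol, list of weight names)
    for line in table.splitlines():
        fields = line.split()
        if not fields:
            continue
        symbol = fields[0]
        rank[symbol] = len(rank) + 1
        if len(fields) > 1:
            entries.append((symbol, "".join(fields[1:]).split(";")))
    weights = {}
    for symbol, names in entries:
        # IGNORE means that no value is assigned at that level.
        weights[symbol] = tuple(
            None if name == "IGNORE" else rank[name] for name in names)
    return rank, weights

RANK, WEIGHTS = resolve(TABLE)
print(RANK["<t>"])        # 10
print(WEIGHTS["<C>"])     # (6, 3, 2)
print(WEIGHTS["<o/>>"])   # (8, 5, 1)
print(WEIGHTS["<*>"])     # (None, None, None, 1)
```

    Inserting or swapping lines in TABLE changes the resulting numbers
without any numerical value ever being written in the table itself.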

A.3 User group requirements and functionality

    Interestingly enough, SHARE Europe, having looked at the Canadian
specifications, described a series of programming requirements in a White
Paper published in 1990 in Geneva and titled "National Language
Architecture". Contrary to the POSIX standards (ISO/IEC 9945-1 and ISO/IEC
9945-2) which, surprisingly, do not define the functions that could be
associated with the specifications of ordering (the standards refer to the
C standard, which is explicit about that, but other languages may or may
not implement equivalents of the C functions, in addition to defining new
ones), SHARE Europe requires the support of a series of functions at the
operating system level to exploit the specification of ordering to its
full potential.

    It is to be noted that if an operating system does not provide those
functions, they can be implemented in a common set of library routines
available to different programming environments. That is exactly what
the Quebec government has done for its data centres and is about to
do for small machines (PCs, Macintoshes, minis, and so on) without
any modification to the compilers it uses. For economic reasons,
surprising as it may seem, COBOL has been used for that, in spite of the
recommendation to use a more portable programming language. For
environments other than mainframes, other decisions have been taken (C
language routines are being developed).

    Obviously, if some of these functions were implemented in programming
language syntax, development of applications would be easier for
programmers, possibly with some gains in performance. So far, though,
performance has not been significantly affected. Before deciding to
undertake such a project, it is reasonable to consider that it is more
productive to do things right, with potential productivity gains for the
end-users, than to produce lightning-fast garbage that results in numerous
operational mistakes by end-users who cannot retrieve the information they
are searching for with "traditional" methods. It took centuries, if not
millennia, to develop universally accepted traditions in ordering for each
given script, but only a few years of technology usage to scramble them and
create a so-called "collating tradition" that fools programmers themselves,
even if they often do not want to admit it.

A.3.1 SHARE Europe Requirements of functions

    The following requirements have been addressed by SHARE Europe in the
above-mentioned White Paper on National Language Architecture:

A.3.1.1 Extended key generation

    Given a string and identification of its coding, a function should
exist to return the 4 subkeys of the Canadian specification (note: this
could be generalized to N levels instead of 4, the function possibly
returning the number of levels together with a table of dimension N for
the N subkeys).

A.3.1.2 Original key regeneration

    Given the N subkeys generated by the previous function, regenerate the
original string in the coding specified (a coding which could be different
from the original one, but the regenerated character string would be
functionally equivalent to the original from the user's point of view).
This supposes, of course, that the system of tables used to generate the
extended subkeys is known to the underlying process. Since all the
information is contained in the extended subkeys (principle of absolute
predictability), this has been shown to be possible, and it was later
implemented by the Quebec government for the sole needs of the Canadian
specification.

A.3.1.3 Comparison operation

    Given 2 character strings on input and their coding, or the N subkeys,
return the following information:

    Case 1: The 2 strings are absolutely equal (ex. "ABC"="ABC");

    Case 2: The 2 strings are equivalent up to the level N of comparison;
            Case 2.1 Canadian spec (ex. "COTE"=="Cot[e']" at level 1 only);
            Case 2.2 Canadian spec (ex. "cote"=="Cote" up to level 2);
            Case 2.3 Canadian spec (ex. "c*o*t*e"=="cote" up to level 3)

    Case 3: String 1 comes before string 2 in order (ex. "Cote"<"cot[e']")
    Case 4: String 1 comes after string 2 in order (ex. "cot[e']">"COTE")

    Case 5: Fuzzy match: "Phydeault" ~ "FIDO" for French (snobbish dogs
            obviously write their names using the first spelling!)

    Notes: Case 2 is a rephrasing of the SHARE Europe requirement. The
        original requirement specifies on input what kind of equivalence is
        accepted (equivalence when the strings differ only because of
        specials, or because of case, or because of diacritics), and the
        answer indicates equality only for that case when absolute
        equality is not returned. This is better generalized to N levels
        with this respecification.

           Case 5 requires algorithmic fuzzy pattern matching functions
        that go beyond economical development in most environments,
        because they require expert system technology which "knows" the
        exact phonetic environment in which it is applied: for example,
        phonetic equivalents in cockney English for certain populations of
        London, which are not valid elsewhere, or foreign accent biases on
        the language, and so on. Such a function was commercially
        available at the time of development (for pedagogic applications
        teaching French to young children of different origins) for the
        Quebec accent applied to French and for the different foreign
        accents commonly encountered in Montreal [including what
        Quebecers call the "French accent"]. Buying it for the Quebec
        government would have more than tripled the cost of the basic
        functions mentioned above. In this particular case it has been
        decided not to implement this last requirement, which is useful
        for a police department but generally not for most commercial
        applications.
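
    Cases 1 to 4 can be sketched in Python as a single function working on
the N subkeys. This is one possible shape for the result, assumed for
illustration (the SHARE Europe paper is not quoted here): the function
returns the order of the two strings together with the number of leading
levels on which they are equivalent. Case 5 is not covered. The subkeys
use the indicative relative numbers of section A.1.2.

```python
def compare(keys_a, keys_b):
    """Compare two strings given their N subkeys.

    Returns (order, level): order is -1, 0 or 1 (string 1 before, equal
    to, or after string 2); level is the number of leading levels on
    which the strings are equivalent (N means absolutely equal).
    """
    for level, (a, b) in enumerate(zip(keys_a, keys_b)):
        if a != b:
            return (-1 if a < b else 1), level
    return 0, len(keys_a)

# Subkeys (alpha, accents stacked, case, specials) of the examples.
KEYS = {
    "COTE":    ([6, 8, 10, 7], [3, 3, 3, 3], [2, 2, 2, 2], []),
    "Cote":    ([6, 8, 10, 7], [3, 3, 3, 3], [2, 1, 1, 1], []),
    "cote":    ([6, 8, 10, 7], [3, 3, 3, 3], [1, 1, 1, 1], []),
    "c*o*t*e": ([6, 8, 10, 7], [3, 3, 3, 3], [1, 1, 1, 1],
                [2, 1, 4, 1, 6, 1]),
    "cot[e']": ([6, 8, 10, 7], [4, 3, 3, 3], [1, 1, 1, 1], []),
    "Cot[e']": ([6, 8, 10, 7], [4, 3, 3, 3], [2, 1, 1, 1], []),
}

print(compare(KEYS["cote"], KEYS["cote"]))        # (0, 4)   Case 1
print(compare(KEYS["COTE"], KEYS["Cot[e']"]))     # (-1, 1)  Case 2.1
print(compare(KEYS["cote"], KEYS["Cote"]))        # (-1, 2)  Case 2.2
print(compare(KEYS["c*o*t*e"], KEYS["cote"]))     # (1, 3)   Case 2.3
print(compare(KEYS["Cote"], KEYS["cot[e']"]))     # (-1, 1)  Case 3
print(compare(KEYS["cot[e']"], KEYS["COTE"]))     # (1, 1)   Case 4
```

    A caller that accepts equivalence up to level 2, for instance, treats
any result whose level is 2 or more as a match, whatever the order.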

A.3.1.4 Coding conversion

    Given a string and its coding, and the identification of a target
coding, return the equivalent string in the target coding.
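
    In an environment with built-in codec tables, this function is short.
The Python sketch below is only an example of the idea; the function name
convert_coding is hypothetical, and the two codings are chosen arbitrarily
among those Python provides.

```python
def convert_coding(data, source, target):
    """Return `data` (bytes coded in `source`) recoded in `target`."""
    return data.decode(source).encode(target)

# "cot[e']" coded in IBM code page 850, converted to ISO 8859-1.
print(convert_coding(b"cot\x82", "cp850", "latin-1"))   # b'cot\xe9'
```

    When a character has no equivalent in the target coding (for example
the ligature oe in ISO 8859-1), encode() raises UnicodeEncodeError unless a
replacement policy is passed to it; the choice of policy is left open here.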

A.3.1.5 Sort

    Given a list of strings, perform an internal sort using the comparison
operation described above to obtain consistent results. For an external
sort, the same function should be used to obtain the same consistency of
operation.

A.3.1.6 Merge

    Given two lists assumed to be sorted according to the previous
function, merge the two lists into one using the same comparison operation
described above.

A.3.1.7 Substring Search

    Given two strings, search for the occurrence of the second one in the
first one, with parameters indicating what kind of equivalence level is
acceptable; return the offset of the retrieved string and its length (which
can be different from the one searched if equivalences are encountered, as
for example if ligatures are equivalent to separate letters in a given
specification).
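
    A minimal Python sketch of such a search follows. It is restricted to
equivalence at level 1 (letters only, with ligatures expanded), assumes
precomposed input characters, and uses Unicode decomposition in place of
the tables of a real specification; the names fold and search are
hypothetical.

```python
import unicodedata

# Ligatures expanded to separate letters: ae, oe, ss.
LIGATURES = {"\u00e6": "ae", "\u0153": "oe", "\u00df": "ss"}

def fold(ch):
    """Level-1 form of one character: no case, no accent, no ligature."""
    ch = ch.lower()
    if ch in LIGATURES:
        return LIGATURES[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def search(text, pattern):
    """Return (offset, length) of the first match in `text`, or None.

    Offset and length count characters of `text`, so the length can
    differ from len(pattern) when ligatures are involved.
    """
    folded = [fold(ch) for ch in text]
    target = "".join(fold(ch) for ch in pattern)
    if not target:
        return None
    for start in range(len(text)):
        candidate = ""
        for end in range(start, len(text)):
            candidate += folded[end]
            if candidate == target:
                return start, end - start + 1
            if not target.startswith(candidate):
                break
    return None

print(search("un b\u0153uf", "OE"))       # (4, 1): 2 letters match 1 ligature
print(search("COEUR", "\u0153"))          # (1, 2): 1 ligature matches 2 letters
print(search("c\u00f4t\u00e9", "COTE"))   # (0, 4)
```

    A match is reported only when it begins and ends on character
boundaries of the text, so that a ligature is never split.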

A.3.1.8 Conversion to upper case unaccented data

    Given rich text data that mixes accented and unaccented, lower-case
and upper-case characters, return the "traditional" unaccented upper-case
equivalent (see section A.4 below on how the reciprocal function, which at
first glance seems unrealistic, has been implemented, even if it is not a
requirement so far in the international community).
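
    This conversion can be sketched in Python with the standard
unicodedata module. The function name is hypothetical, and characters that
have no canonical decomposition (ligatures, for instance) are simply
upper-cased, which a real implementation would have to refine.

```python
import unicodedata

def to_upper_unaccented(text):
    """Strip diacritics, then convert to upper case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.upper()

print(to_upper_unaccented("C\u00f4t\u00e9"))   # COTE
print(to_upper_unaccented("Mass\u00e9"))       # MASSE
print(to_upper_unaccented("Masse"))            # MASSE
```

    The last two lines show why the reciprocal function cannot be fully
automatic: two different names collapse to the same upper-case form.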

A.4 Complementary functions out of the scope of programming languages

    For the information of readers, implementing these functions from
scratch will be very useful in all new end-user environments (including
American sites, some of which are said to have also implemented the
Canadian specifications, as this was considered necessary to solve
problems of character data processing in a unilingual English
environment).

    However it may be interesting to know that old data bases in Quebec, as
in Europe in general, have long used data in unaccented capital letters
only, to avoid many of the problems solved by these specifications (not
all, though, as the presence of special characters, even if less visible,
remains a problem of varying seriousness). To implement these new
functions, mixing upper-case, lower-case, accented and unaccented data is
necessary for talking with external sites. The requirements of SHARE
Europe allow such mixing. Furthermore, the Quebec government also
implemented automatic functions to add accents and lower case to existing
person data and geographical name data that were unaccented before (with a
99.7% accuracy, the remaining cases requiring human intervention because
they cannot be resolved by automatic means: homographs like "Masse" and
"Mass[e']", 2 existing family names), to avoid having to retype that
information in huge data bases. These are problems for which no
requirement is likely to be addressed to designers of programming language
standards, but they are nevertheless real and solvable in existing
environments. This shows that it is also possible to deal with the past
without having to start from scratch, which would make any action
impossible forever.

A.5 Consequences of embedding these functions in languages

    The basic functions mentioned previously can be implemented using
present tools or with extensions of languages. The latter would be
highly desirable: the resulting programs can then be designed to be
portable across cultures, the behavior of the functions being provided
parametrically from outside the language, while the language fully
provides the functionality to interface directly with those external
specifications and meet performance goals.

A.6 Specific language implementation examples

A.6.1 Fortran Culturally Sensitive String Comparison

The following example indicates how a Fortran module might be used to
implement culturally sensitive string comparison, using the approach outlined
above.  This module assumes the existence of the two additional intrinsic
functions suggested in section 6.5.

MODULE Cultural_Strings
   IMPLICIT NONE
   PRIVATE

!  This module provides the necessary services to support the character
!  handling requirements in a particular model of internationalization
!  and localization.

!  As written here it operates automatically in the cultural environment
!  which is current when the program begins execution.  It could be extended
!  to allow for a user-specified cultural environment.

!  Establish current cultural environment and character kind
   INTEGER, PARAMETER :: environment=CULTURAL_ENVIRONMENT(),  &
                         ch_kind=REPERTOIRE_KIND()

!  Specify overloaded comparison operators; the specific functions are
!  module procedures defined after CONTAINS below
   INTERFACE OPERATOR ( < )
      MODULE PROCEDURE cultural_lt
   END INTERFACE

   INTERFACE OPERATOR ( <= )
      MODULE PROCEDURE cultural_le
   END INTERFACE
     .
     .
!  Specify those entities to be exported from the module
   PUBLIC :: environment,ch_kind,OPERATOR(<),OPERATOR(<=), ...

CONTAINS

   LOGICAL FUNCTION cultural_lt (s1,s2)
      CHARACTER(KIND=ch_kind,LEN=*), INTENT(IN) :: s1,s2
      INTEGER, DIMENSION(4) :: generate_keys,key1,key2
      LOGICAL, DIMENSION(5) :: compare_strings,comp

   !  The system function generate_keys takes a character string and its
   !  kind, and returns the four subkeys of the Canadian specification
   !  as a rank one integer array of dimension four.
      key1 = generate_keys(s1)
      key2 = generate_keys(s2)

   !  The system function compare_strings takes two arrays of integer subkeys
   !  and returns a rank one logical array of dimension 5.  Each element of
   !  the array value of the function specifies the truth or otherwise of the
   !  corresponding case in the specification of the comparison operation.
      comp = compare_strings(key1,key2)

   !  Return result of comparison as true if Case 3 is true (s1 before s2)
   !  and Cases 1 and 2 are false (s1 not equal and not equivalent to s2)
      cultural_lt = comp(3) .AND. .NOT.comp(1) .AND. .NOT.comp(2)

   END FUNCTION cultural_lt

   LOGICAL FUNCTION cultural_le (s1,s2)
      CHARACTER(KIND=ch_kind,LEN=*), INTENT(IN) :: s1,s2
      INTEGER, DIMENSION(4) :: generate_keys,key1,key2
      LOGICAL, DIMENSION(5) :: compare_strings,comp

      key1 = generate_keys(s1)
      key2 = generate_keys(s2)
      comp = compare_strings(key1,key2)

   !  Return result of comparison as true if any of Case 1 (s1 equals s2),
   !  Case 2 (s1 equivalent to s2) or Case 3 (s1 before s2) is true
      cultural_le = comp(1) .OR. comp(2) .OR. comp(3)

   END FUNCTION cultural_le
     .
     .
END MODULE Cultural_Strings

A program which wished to use this module to provide culturally correct
character handling could do so as follows:

PROGRAM Culturally_correct

!  Obtain access to all public elements in the module Cultural_Strings
!  (the USE statement must precede IMPLICIT NONE)
   USE Cultural_Strings
   IMPLICIT NONE

!  Declare two 50-character strings of the default type for the current
!  environment
   CHARACTER(KIND=ch_kind,LEN=50) :: string_1,string_2
     .
     .
!  Read data into these strings
   READ *,string_1,string_2

!  Print the two strings in their correct order, using the overloaded <=
!  operator to ensure culturally correct ordering, with the first input
!  coming first if they are equal, or at least equivalent
   IF (string_1 <= string_2) THEN
      PRINT *,string_1,string_2
   ELSE
      PRINT *,string_2,string_1
   ENDIF
     .
     .
END PROGRAM Culturally_correct

