Excerpt from the Framework Model on Internationalisation (modèle de référence ou modèle-cadre de l'internationalisation) - Contribution by Alain LaBonté, supplemented for Fortran by Miles Ellis (University of Oxford)
________________________________________________________________________________

Annex 1 - From a requirement to its implementation - Compare, Sort, Search

A.1.0 The case of the Canadian requirement for ordering English and French

One of the most important specifications of cultural elements is the specification of ordering characteristics for text data strings. The first normative, comprehensive, fully predictable and culturally valid specification of ordering was the Canadian Standard CSA Z243.4.1-1992. It was adopted as a preliminary standard (to be confirmed as a national standard of Canada at the beginning of 1994) after 6 years of work, with input from 7 countries (Canada, France, USA, Belgium, The Netherlands, Germany, Switzerland) to fine-tune the Canadian proposal, which itself grew out of a Quebec government proposal dated 1986.

The Canadian Standard describes collating weights usable at once for dictionary ordering of English, French, German, Dutch, Portuguese and Italian, without excluding other languages that can be handled with slight modifications. The technique assigns four levels of weights that fine-tune the ordering function and make the results absolutely predictable while remaining culturally acceptable to a majority of users of those languages. It is based mainly on the ordering rules used in the main dictionaries of the French, English and German languages: the primary rules learned at school by all young children which, unlike more sophisticated classification techniques, are understood by the general public and not only by scholars.

A.1.1 Example of cultural requirement

Those rules are essentially the following:

1. The 3 languages agree on a single alphabetic order, from A to Z, in which diacritical marks are normally ignored for ordering purposes unless there is a tie due to quasi-homography, i.e. words that look the same once diacritics are removed.

2. Ligatures are expanded as if they were written as 2 separate letters (ae, oe, ss [and ij in Dutch, for that matter]).

3. English and German dictionaries state that unaccented words precede accented ones in case of quasi-homography. French dictionaries need more precise rules, as lists of 3 or 4 quasi-homographs (identical letters, accented differently) are quite frequent. In French the rule for quasi-homographs is that "the last difference in the word determines the order", which means scanning the words to be compared from the end backwards until a difference in accentuation is encountered. So as not to make French a special case, many sources recommended using the French rule to break such ties in all languages: it is generally recognized that this brings no extra overhead (a stack is as easy to use as a list) and that it is culturally neutral for the other languages.

4. When words are identical in both alphabetic characters and diacritics, case becomes significant in determining a difference. English and German dictionaries agree that small letters should precede capital letters. French dictionaries make no distinction, as they generally use only capital letters for general-language words (including accented capital letters, contrary to a widespread belief that accented capitals are not used in French!).
French encyclopedias and proper-name dictionaries tend to put capital letters first, but as there are numerous exceptions and dictionaries are silent on the subject, it was decided to use the English and German rule in the Canadian Standard. This harmonizes the rules without really making them culturally incorrect for French (note that Danish dictionaries specifically state that capitals precede small letters).

5. Characters that are not part of the alphabet (spaces, hyphens, apostrophes, asterisks and so on, orthographic or not) are not significant for dictionary ordering. So that ordering remains predictable, the Canadian Standard specifies an order for these characters, but it applies only when the other 3 levels of significance of the text data (alphabetic data, diacritics and case) are absolutely identical.

This is how the Canadian Standard currently specifies it, even though the normative benchmark contains a list of English words that could be ordered in a more refined way. No native English speaker objected, so the result is assumed to be culturally acceptable: "coop, co-op, COOP, CO-OP" is a list sorted correctly according to the Canadian Standard. A refinement could have made "coop, COOP, co-op, CO-OP" the preferred order, but at this point this seems to be a matter of preference and of very fine tuning that could require extra overhead in some environments (5 levels instead of 4). It would nevertheless seem more consistent for very specific orthographic characters like hyphens, apostrophes and spaces, in all the languages involved: these characters would have to be processed after the diacritics but before case, the other specials being processed after case.

For example, the following records are ordered correctly per the Canadian Standard specification. Each annotation gives the relation between an entry and the one before it:

    COTE
    C[o>]te       last difference with COTE on 2nd letter
    cot[e']       last difference with C[o>]te on 4th letter
    c*o*t*[e']    equal to cot[e'] except for specials
    C[o>]t[e']    last difference with c*o*t*[e'] on 2nd character
    coter         last difference with C[o>]t[e'] on last characters
    Coter         equal to coter except for case

where [o>] represents the SMALL LATIN LETTER O WITH CIRCUMFLEX ACCENT and [e'] represents the SMALL LATIN LETTER E WITH ACUTE ACCENT.

A.1.2 Example of a specification technique

The Canadian Standard describes this behavior using words (in English and in French), diagrams, and tables. To simplify the understanding of such tables, consider the following table of relative numbers:

             ALPHA       ACCENTS     CASE        SPECIALS
             1st level   2nd level   3rd level   4th level
    INPUT    token       token       token       token
    CHAR     (serial)    (stacked)   (serial)    (serial)

    c         6           3           1          N/A
    C         6           3           2          N/A
    e         7           3           1          N/A
    [e']      7           4           1          N/A
    E         7           3           2          N/A
    o         8           3           1          N/A
    [o>]      8           5           1          N/A
    O         8           3           2          N/A
    r         9           3           1          N/A
    t        10           3           1          N/A
    T        10           3           2          N/A
    *        N/A         N/A         N/A          1

The Canadian Standard then suggests a conformance algorithm that builds a series of subkeys, composed numerically as follows for our examples (the numbers are only indicative and show a relation only among the subset of characters chosen for this example). "Stacked" means that the second-level tokens are emitted in reverse order, which implements the "last difference" rule. To avoid composing a fourth key with placeholders for all non-special characters, the fourth subkey records the position of each special character followed by its weight; comparison is therefore done first on the positions of the special characters and, in case of equality, on the weights assigned to the special characters themselves. For the algorithm to work when all 4 subkeys are concatenated for a multi-level one-pass sort, a special logical zero delimiter is coded between the 3rd and 4th subkeys (in the C language, where zero terminates a string, the delimiter could be 1 if all other relative numbers are offset by 1). If certain conditions are not met in the careful choice of relative numbers, such a logical zero delimiter would be advisable between each of the subkeys.

    Original      Subkey 1      Subkey 2     Subkey 3     Logical   Subkey 4
    string                                                delim.

    COTE          6,8,10,7      3,3,3,3      2,2,2,2      0
    C[o>]te       6,8,10,7      3,3,5,3      2,1,1,1      0
    cot[e']       6,8,10,7      4,3,3,3      1,1,1,1      0
    c*o*t*[e']    6,8,10,7      4,3,3,3      1,1,1,1      0         2,1,4,1,6,1
    C[o>]t[e']    6,8,10,7      4,3,5,3      2,1,1,1      0
    coter         6,8,10,7,9    3,3,3,3,3    1,1,1,1,1    0
    Coter         6,8,10,7,9    3,3,3,3,3    2,1,1,1,1    0

The Canadian Standard presents a reduction technique to shorten these subkeys without affecting the comparison process. The technique is non-normative; it originated from the Quebec government, which implemented it [designer: Alain LaBonté], and it is now in the public domain. It is useful when keys are to be stored by an application for later comparison by a dumb process, such as a hardware device able to search on binary sequences or an old unmodifiable "indexed sequential" access method of any kind that orders keys numerically. The net effect is that for most French words and more than 99% of English words no storage is required for the second subkey, as no accent is present, and under certain conditions storage is greatly reduced for the third subkey. The same subkeys, reduced and concatenated to give a single key that is directly comparable numerically in one pass, would be:

    COTE          6,8,10,7,    2,2,2,2,  0
    C[o>]te       6,8,10,7,    3,3,5,    2,  0
    cot[e']       6,8,10,7,    4,        0
    c*o*t*[e']    6,8,10,7,    4,        0,  2,1,4,1,6,1
    C[o>]t[e']    6,8,10,7,    4,3,5,    2,  0
    coter         6,8,10,7,9,  0
    Coter         6,8,10,7,9,  2,        0

One should not implement this reduction technique without carefully consulting the Canadian Standard and its references for caveats in designing other tables. Such an optimizing technique can be used only under certain conditions. However, even without reduction, the principle of forming a single key (out of a multilevel specification) to be passed to old applications that "know" how to sort numerical data strings is highly valid. It is also economically very important, because it allows past applications to be "internationalized" with few modifications, if any.
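The unreduced key construction of A.1.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the conformance algorithm of the Canadian Standard: the names WEIGHTS, SPECIALS, subkeys and flat_key are hypothetical, the weights are the indicative relative numbers of the table above, the accented letters are written as real characters instead of [e'] and [o>], and a zero delimiter is placed after every subkey (the conservative option mentioned in the text). The reduction technique is deliberately not attempted.

```python
# Indicative weights from the A.1.2 table: (alpha, accent, case).
WEIGHTS = {
    "c": (6, 3, 1), "C": (6, 3, 2),
    "e": (7, 3, 1), "é": (7, 4, 1), "E": (7, 3, 2),
    "o": (8, 3, 1), "ô": (8, 5, 1), "O": (8, 3, 2),
    "r": (9, 3, 1),
    "t": (10, 3, 1), "T": (10, 3, 2),
}
SPECIALS = {"*": 1}  # 4th-level weight of each special character


def subkeys(text):
    """Return the four subkeys (alpha, accents, case, specials) of text."""
    alpha, accents, case, specials = [], [], [], []
    for position, char in enumerate(text, start=1):
        if char in SPECIALS:
            # Record the position of the special, then its own weight.
            specials += [position, SPECIALS[char]]
        else:
            a, d, c = WEIGHTS[char]
            alpha.append(a)
            accents.append(d)
            case.append(c)
    accents.reverse()  # "stacked": the last difference decides
    return tuple(alpha), tuple(accents), tuple(case), tuple(specials)


def flat_key(text):
    """Concatenate the subkeys into one list, with 0 after each level."""
    flat = []
    for level in subkeys(text):
        flat.extend(level)
        flat.append(0)  # logical zero delimiter (all weights are >= 1)
    return flat


words = ["Coter", "Côté", "c*o*t*é", "COTE", "coter", "coté", "Côte"]
ordered = sorted(words, key=subkeys)
print(ordered)
# ['COTE', 'Côte', 'coté', 'c*o*t*é', 'Côté', 'coter', 'Coter']
```

Sorting on flat_key gives the same order as sorting on the tuple of subkeys, which is the property that lets a process knowing only numeric comparison reproduce the multilevel ordering.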
A.2 Other specification techniques

After the Canadians released their specifications, POSIX defined a model that could handle them in a general way. It is only an abstract specification technique, but it does the job adequately. A good recommendation is not to reinvent the wheel and to use it, as it is so far the only international specification technique for describing collation tables. It does not use relative numbers but rather a clever sequential ordering system that allows multilevel weights to be described without specifying any numerical data. The order can be changed just by inserting lines for specific additional characters or by swapping lines. Moreover it is, like the Canadian Standard, a codeset-independent specification technique, so it does not require as many specifications as there are equivalent character sets. Hence it is a very flexible technique that allows the handling of a general specification. Refinements may be made in the future, but like the Canadian Standard it represents the state of the art in this domain, and future work is expected to build on it. To the knowledge of various experts, it can handle most of the languages and scripts of the world without major difficulties. Extensions are conceivable to handle combining sequences, as present in ISO 10646 level 3, for those languages that absolutely require such combinations in order to give a culturally valid ordering.

The minimum table describing the previous example according to the POSIX ordering specifications would be (simplified):

    ...
    collating-symbol <SMALL>
    collating-symbol <CAPITAL>      Definitions of symbols that are not
    collating-symbol <NONE>         known as characters but which are needed
    collating-symbol <ACUTE>        to describe relative weights
    collating-symbol <CIRCUMFLEX>
    ...
    order_start forward;backward;forward;forward,position
                                    |This statement describes the
                                    |scanning direction for each level
                                    |(even allows position tokens
                                    |if desired)
    <SMALL>                         |will result in SMALL=1
    <CAPITAL>                       |will result in CAPITAL=2
    <NONE>                          |will result in NONE=3
    <ACUTE>                         |will result in ACUTE=4
    <CIRCUMFLEX>                    |will result in CIRCUMFLEX=5
    <c>   <c>;<NONE>;<SMALL>        |... c=6 and reused for itself
    <e>   <e>;<NONE>;<SMALL>        |... e=7 and reused for itself
    <o>   <o>;<NONE>;<SMALL>        |... o=8 and reused for itself
    <r>   <r>;<NONE>;<SMALL>        |... r=9 and reused for itself
    <t>   <t>;<NONE>;<SMALL>        |... t=10 and reused for itself
    <C>   <c>;<NONE>;<CAPITAL>      | from now on,
    <E>   <e>;<NONE>;<CAPITAL>      | no new value that needs to
    <O>   <o>;<NONE>;<CAPITAL>      | be resolved; numeric weights
    <T>   <t>;<NONE>;<CAPITAL>      | are all already known
    <e'>  <e>;<ACUTE>;<SMALL>
    <o>>  <o>;<CIRCUMFLEX>;<SMALL>
    <*>   IGNORE;IGNORE;IGNORE;<SMALL>
                                    |SMALL=1, why not?
                                    |IGNORE means no value assigned
                                    |at each level that specifies it
    ...

A.3 User group requirements and functionality

Interestingly enough, SHARE Europe, having examined the Canadian specifications, described a series of programming requirements in a White Paper published in 1990 in Geneva and titled "National Language Architecture". The POSIX standards (ISO/IEC 9945-1 and ISO/IEC 9945-2), surprisingly, do not define the functions that could be associated with the specifications of ordering: they refer to the C standard, which is explicit about them, but other languages may or may not implement equivalents of the C functions, in addition to defining new ones. SHARE Europe, by contrast, requires the support of a series of functions at the operating system level in order to exploit the specification of ordering to its full potential. Note that if an operating system does not provide those functions, they can be implemented in a common set of library routines available to different programming environments. That is exactly what the Quebec government has done for its data centres and is about to do for small machines (PCs, Macintoshes, minis, and so on), without any modification to the compilers it uses.
For economic reasons, surprising as it may seem, COBOL was used for that, in spite of the recommendation to use a more portable programming language. For environments other than the mainframes, other decisions have been taken (C language routines are being developed). Obviously, if some of these functions were implemented in programming language syntax, application development would be easier for programmers, possibly with some gains in performance. So far performance has not been significantly affected, though. Before deciding on such a project, it is reasonable to consider that it is more productive to do things right, with potential productivity gains for the end-users, than to produce lightning-fast garbage that leads to numerous operational mistakes by end-users who cannot retrieve the information they are searching for with "traditional" methods. [It took centuries, if not millennia, to develop universally accepted traditions in ordering for each given script, but only a few years of technology usage to scramble them and create a so-called "collating tradition" that fools programmers themselves, even if they often do not want to admit it.]

A.3.1 SHARE Europe Requirements of functions

The following requirements have been addressed by SHARE Europe in the above-mentioned White Paper on National Language Architecture:

A.3.1.1 Extended key generation

Given a string and the identification of its coding, a function should exist to return the 4 subkeys of the Canadian specification (note: this could be generalized to N levels instead of 4, possibly returning the number of levels and a table of dimension N for the N subkeys).

A.3.1.2 Original key regeneration

Given the N keys generated by the previous function, regenerate the original string in the coding specified (this coding could be different from the original one, but the regenerated character string would be functionally equivalent to the original from the user's point of view).
This supposes, of course, that the system of tables used to generate the extended subkeys is known to the underlying process. Since all the information is contained in the extended subkeys (principle of absolute predictability), this has been shown to be possible; it was later implemented by the Quebec government, solely for the needs of the Canadian specification.

A.3.1.3 Comparison operation

Given 2 character strings on input and their coding, or the N subkeys, return the following information:

    Case 1:   The 2 strings are absolutely equal (ex. "ABC"="ABC");
    Case 2:   The 2 strings are equivalent up to level N of comparison;
    Case 2.1  Canadian spec (ex. "COTE"=="Cot[e']" at level 1 only);
    Case 2.2  Canadian spec (ex. "cote"=="Cote" up to level 2);
    Case 2.3  Canadian spec (ex. "c*o*t*e"=="cote" up to level 3);
    Case 3:   String 1 comes before string 2 in order (ex. "Cote"<"cot[e']");
    Case 4:   String 1 comes after string 2 in order (ex. "cot[e']">"COTE");
    Case 5:   Fuzzy match: "Phydeault" ~ "FIDO" for French (snobbish dogs
              obviously write their names using the first spelling!)

Notes: Case 2 is a rephrasing of the SHARE Europe requirement. The original requirement specifies on input what kind of equivalence is accepted (strings differing only in specials, or only in case, or only in diacritics), and the answer indicates equality only for that kind of equivalence when absolute equality is not returned. The respecification given here generalizes this better to N levels.
Case 5 requires algorithmic fuzzy pattern-matching functions that are beyond economical development in most environments. They call for expert-system technology that "knows" the exact phonetic environment in which it is applied: for example, phonetic equivalents in Cockney English that are valid for certain populations of London but not elsewhere, or the biases that foreign accents introduce into the language. Such a function was commercially available at the time of development (for pedagogic applications teaching French to young children of different origins); it covered the Quebec accent in French and the foreign accents commonly encountered in Montreal [including what Quebecers call the "French accent"]. Buying it for the Quebec government would have more than tripled the cost of implementing the basic functions mentioned above. It was therefore decided not to implement this last requirement, which is useful for a police department but generally of little use for most commercial applications.

A.3.1.4 Coding conversion

Given a string and its coding, and the identification of a target coding, return an equivalent string in the target coding.

A.3.1.5 Sort

Given a list of strings, perform an internal sort using the comparison operation described above to obtain consistent results. For an external sort, the same function should be used to obtain the same consistency of operation.

A.3.1.6 Merge

Given two lists assumed to be sorted according to the previous function, merge the two lists into one using the same comparison operation described above.
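The comparison, sort and merge requirements (A.3.1.3, A.3.1.5 and A.3.1.6) can be illustrated with a short Python sketch built on the indicative weights of A.1.2. Everything named here (WEIGHTS, SPECIALS, subkeys, compare) is hypothetical and is not part of the SHARE Europe White Paper or of the Canadian Standard. The function compare returns a relation ("equal", "before" or "after") together with the number of leading levels on which the two strings are equivalent, which covers Cases 1 to 4; the fuzzy match of Case 5 is not attempted.

```python
import heapq

# Indicative weights from the A.1.2 table: (alpha, accent, case).
WEIGHTS = {
    "c": (6, 3, 1), "C": (6, 3, 2),
    "e": (7, 3, 1), "é": (7, 4, 1), "E": (7, 3, 2),
    "o": (8, 3, 1), "ô": (8, 5, 1), "O": (8, 3, 2),
    "r": (9, 3, 1),
    "t": (10, 3, 1), "T": (10, 3, 2),
}
SPECIALS = {"*": 1}


def subkeys(text):
    """Return the four subkeys (alpha, accents, case, specials) of text."""
    alpha, accents, case, specials = [], [], [], []
    for position, char in enumerate(text, start=1):
        if char in SPECIALS:
            specials += [position, SPECIALS[char]]
        else:
            a, d, c = WEIGHTS[char]
            alpha.append(a)
            accents.append(d)
            case.append(c)
    accents.reverse()  # accents are compared from the end of the word
    return tuple(alpha), tuple(accents), tuple(case), tuple(specials)


def compare(s1, s2):
    """Return (relation, number of leading levels that are equivalent)."""
    k1, k2 = subkeys(s1), subkeys(s2)
    matched = 0
    for level1, level2 in zip(k1, k2):
        if level1 != level2:
            break
        matched += 1
    if matched == len(k1):
        return "equal", matched          # Case 1
    relation = "before" if k1 < k2 else "after"
    return relation, matched             # Cases 2 to 4


print(compare("COTE", "Coté"))     # equivalent at level 1 only
print(compare("cote", "Cote"))     # equivalent up to level 2
print(compare("c*o*t*e", "cote"))  # equivalent up to level 3

# Sort and merge reuse the same key, so results stay consistent.
list_a = sorted(["coter", "COTE", "coté"], key=subkeys)
list_b = sorted(["Coter", "Côte", "Côté"], key=subkeys)
merged = list(heapq.merge(list_a, list_b, key=subkeys))
print(merged)
```

Returning the matched level alongside the relation lets one call serve both an ordering test and an equivalence test at any of the N levels, in the spirit of the respecified Case 2.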
A.3.1.7 Substring Search

Given two strings, search for an occurrence of the second one in the first one, with parameters indicating what level of equivalence is acceptable. Return the offset of the retrieved string and its length, which can differ from the length of the string searched for when equivalences are encountered (for example, if ligatures are equivalent to separate letters in a given specification).

A.3.1.8 Conversion to upper case unaccented data

Given rich text data including accented and unaccented, lower-case and upper-case data, return the "traditional" unaccented upper-case equivalent (see section A.4 below on how the reciprocal function, unrealistic at first glance, has been implemented, even though it is not so far a requirement in the international community).

A.4 Complementary functions out of the scope of programming languages

For the information of readers, implementing these functions from scratch will be very useful in all new end-user environments (including American sites, some of which are said to have also implemented the Canadian specifications because they were considered necessary to solve problems of character data processing in a unilingual English environment). However, it may be interesting to know that old data bases in Quebec, as in Europe in general, have long used data consisting only of unaccented capital letters to avoid many of the problems solved by these specifications (not all of them, though: the presence of special characters, even if less visible, remains a problem of varying seriousness). Once these new functions are implemented, mixing upper-case, lower-case, accented and unaccented data becomes necessary for communicating with external sites. The requirements of SHARE Europe allow such mixing.
Furthermore, to avoid retyping information in huge data bases, the Quebec government also implemented automatic functions that add accents and lower case to existing person-name and geographical-name data that was previously unaccented. They achieve 99.7% accuracy; the remaining cases require human intervention because they cannot be resolved automatically (homographs like "Masse" and "Mass[e']", 2 existing family names). These are problems for which no requirement is likely to be addressed to the designers of programming language standards, but they are nevertheless real and solvable in existing environments. They show that it is possible to deal with the past without having to start from scratch, a necessity that would otherwise rule out any action indefinitely.

A.5 Consequences of embedding these functions in languages

The basic functions mentioned previously can be implemented using present tools or with extensions of languages. The latter would be highly desirable: the resulting programs could be designed to be portable across cultures, with the behavior of the functions supplied parametrically from outside the language, while the functionality needed to interface directly with those external specifications, and to meet performance goals, would be fully provided by the language.

A.6 Specific language implementation examples

A.6.1 Fortran Culturally Sensitive String Comparison

The following example indicates how a Fortran module might be used to implement culturally sensitive string comparison, using the approach outlined above. This module assumes the existence of the two additional intrinsic functions suggested in section 6.5.

    MODULE Cultural_Strings
      IMPLICIT NONE
      PRIVATE

    ! This module provides the necessary services to support the character
    ! handling requirements in a particular model of internationalization
    ! and localization.
    ! As written here it operates automatically in the cultural environment
    ! which is current when the program begins execution. It could be extended
    ! to allow for a user-specified cultural environment.

    ! Establish current cultural environment and character kind
      INTEGER, PARAMETER :: environment=CULTURAL_ENVIRONMENT(), &
                            ch_kind=REPERTOIRE_KIND()

    ! Specify overloaded comparison operators; the functions themselves
    ! are module procedures defined after CONTAINS
      INTERFACE OPERATOR ( < )
        MODULE PROCEDURE cultural_lt
      END INTERFACE
      INTERFACE OPERATOR ( <= )
        MODULE PROCEDURE cultural_le
      END INTERFACE
      .
      .
    ! Specify those entities to be exported from the module
      PUBLIC environment,ch_kind,OPERATOR(<),OPERATOR(<=), ...

    CONTAINS

      LOGICAL FUNCTION cultural_lt (s1,s2)
        CHARACTER(KIND=ch_kind,LEN=*), INTENT(IN) :: s1,s2
        INTEGER, DIMENSION(4) :: generate_keys,key1,key2
        LOGICAL, DIMENSION(5) :: compare_strings,comp

    ! The system function generate_keys takes a character string and its
    ! kind, and returns the four subkeys of the Canadian specification
    ! as a rank one integer array of dimension four.
        key1 = generate_keys(s1)
        key2 = generate_keys(s2)

    ! The system function compare_strings takes two arrays of integer subkeys
    ! and returns a rank one logical array of dimension 5. Each element of
    ! the array value of the function specifies the truth or otherwise of the
    ! corresponding case in the specification of the comparison operation.
        comp = compare_strings(key1,key2)

    ! Return result of comparison as true if Case 3 is true (s1 before s2)
    ! and Cases 1 and 2 are false (s1 not equal and not equivalent to s2)
        cultural_lt = comp(3) .AND. .NOT.comp(1) .AND. .NOT.comp(2)
      END FUNCTION cultural_lt

      LOGICAL FUNCTION cultural_le (s1,s2)
        CHARACTER(KIND=ch_kind,LEN=*), INTENT(IN) :: s1,s2
        INTEGER, DIMENSION(4) :: generate_keys,key1,key2
        LOGICAL, DIMENSION(5) :: compare_strings,comp

        key1 = generate_keys(s1)
        key2 = generate_keys(s2)
        comp = compare_strings(key1,key2)

    ! Return result of comparison as true if any of Case 1 (s1 equals s2),
    ! Case 2 (s1 equivalent to s2) or Case 3 (s1 before s2) is true
        cultural_le = comp(1) .OR. comp(2) .OR. comp(3)
      END FUNCTION cultural_le
      .
      .
    END MODULE Cultural_Strings

A program which wished to use this module to provide culturally correct character handling could do so as follows:

    PROGRAM Culturally_correct
    ! Obtain access to all public elements in the module Cultural_Strings
      USE Cultural_Strings
      IMPLICIT NONE

    ! Declare two 50-character strings of the default type for the current
    ! environment
      CHARACTER(KIND=ch_kind,LEN=50) :: string_1,string_2
      .
      .
    ! Read data into these strings
      READ *,string_1,string_2

    ! Print the two strings in their correct order, using the overloaded <=
    ! operator to ensure culturally correct ordering, with the first input
    ! coming first if they are equal, or at least equivalent
      IF (string_1 <= string_2) THEN
        PRINT *,string_1,string_2
      ELSE
        PRINT *,string_2,string_1
      END IF
      .
      .
    END PROGRAM Culturally_correct

END OF TEXT
-----------------------------------------------------------