Title: ISO/IEC WD4.3 14651 - International String Ordering - Method for Comparing Character Strings and Description of a Default Tailorable Ordering, for Character Strings Using the Repertoire (or Subrepertoires) of ISO/IEC 10646

[ISO/CEI DT4.3 14651 - Classement international de chaînes de caractères - Méthode de comparaison de chaînes de caractères et description d'un ordre implicite adaptable pour les chaînes de caractères utilisant le répertoire (ou des sous-répertoires) de l'ISO/CEI 10646]

Status: Working Draft 4.3 for comments by SC22/WG20 members before the April 1996 Kyoto meeting

Date: 1996-03-18

Project: 22.30.02.02

Editor: Alain LaBonté

Gouvernement du Québec

Secrétariat du Conseil du trésor

Service de la prospective et de la francisation

875, Grande-Allée Est, 4C

Québec, QC G1R 5R8

Canada

GUIDE SHARE Europe

SCHINDLER Information AG

CH-6030 Ebikon (Bern)

Switzerland

Email: alb@sct.gouv.qc.ca

FOREWORD

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialised system for world-wide standardisation. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees. These technical committees are established by the respective organisation to deal with particular fields of mutual interest. In liaison with ISO and IEC, other international organisations, governmental and non-governmental, also take part in the work.

In the field of information technology, ISO and IEC have established a joint technical committee known as ISO/IEC JTC1. Draft International Standards adopted by the joint technical committee are circulated to the national bodies for voting. Publication as an international standard requires approval by at least 75% of the national bodies that cast a vote.

The ISO/IEC 14651 International Standard has been prepared by the Joint Technical Committee ISO/IEC JTC1, Information Technology.

INTRODUCTION

A default international ordering mechanism does not provide a universal solution for all situations. The purpose of such a mechanism is to correct the errors of past approaches, which collated solely on binary coded character values. Those approaches never respected cultural conventions. English is the one exception, and a poor one: binary order matched expectations only when the data was restricted to upper-case letters, without punctuation, spacing or other characters.

This is one of the major flaws that affect portability between countries and between applications. (Traditionally, different programs make different ordering corrections.) Therefore, it has been considered feasible to design a Default Tailorable Ordering Mechanism (a method and a unique table). This mechanism will constitute an acceptable tool that will make sense for most users of the different scripts. Also, most simple applications will be able to use the mechanism without modification. These applications use ordering dependencies that are not dependent on any context.

Naturally, a modification mechanism is embedded in the model. The mechanism will accommodate particular languages with a minimum of changes. Let us look at the Latin script as an example. The Spanish and Scandinavian languages will have the order of a few letters changed compared to the order acceptable in most other European languages that use the Latin script. Also, a change in the order of a whole script relative to another one could be desired, for example, Thai before Latin, and so on.

Furthermore, there might be specific linguistic requirements that cannot be fulfilled without knowing the context. For example, Japanese names expressed in Kanji cannot be deduced solely in phonetic ordering. Instead, Japanese names need hidden multiple fields. Generally, in Japanese databases, a given Kanji proper name is associated with a hidden phonetic representation in a different field. This association allows correct ordering, otherwise a replication of items might be necessary for human searching of Kanji proper names in a list in the absence of other fields.

More generally, specific requirements exist for complex telephone-book type classification or for phonetic classification. This is particularly true in multi-lingual countries or organisations. As an example, the item "4" could sometimes be phonetically classified (transformed) in such lists to accomplish ordering. This classification requires that the item be reproduced several times. Each replicated item is hence transformed for phonetic ordering (for example, as "QUATRE", "FOUR" and "VIER" in French, English, and German respectively). In this way, a user can immediately retrieve the item "4" in a list under "Q", "F" and "V" depending on the individual user requirements.

To achieve these requirements, the comparison and ordering mechanism on which focus is directed here is included in a more general model. The general model is also described in this international standard. The general model allows multiple-field ordering and prehandling and posthandling classification phases. The ordering mechanism assumes this higher-level scheme.

Specifically, the prehandling and posthandling phases could be null processes. Also in the simplest applications, only one field will be ordered typically. In such cases, a straightforward order could be achieved and would be reasonably valid for the majority of users who do not require further specialised classification. The typical lexical dictionary order in a given natural language is an example of this type. It is assumed that lexical order is the minimal culturally acceptable order for a list so that the general public, and even specialists, can use it without error.

To simplify matters, the Default Tailorable Ordering Mechanism will describe a method to order text data independently of context. The method will be culturally acceptable to a majority of world-wide environments (with provisions to accommodate more local contexts).

It is obvious that ordering is not limited to a sorting program. Ordering requires that string comparison be consistently redefined with a new comparison engine. This engine will be used by processes which compare, sort, search, mix, and merge graphic character data. This engine will be described in this international standard.

The design of this international standard keeps in mind that old systems could also integrate culturally valid ordering with minimal changes. Therefore, the basic engine will not work directly on a text string of graphic characters. Instead, the first phase of the process reduces the text string to a single bit string that is suitable for direct and mechanical numeric comparisons.

Numeric data has two general kinds of representation. One type of representation is external and uses human readable graphic characters. The other type of representation is internal and is directly suitable for high-speed processing. For this reason, programming languages define data types for suitable processing of numbers (in general more than one type). In this way, programmers do not need to parse graphic characters before performing numeric processing. This parsing would be very prone to errors, add to programming complexity, and would not achieve general consistency among different applications.

Character comparisons are of a more complex nature. Therefore, having the programmers involved in parsing is not more desirable. Nevertheless, this was the prevailing situation before the present international standard was designed.

The consistent text data comparison engine described in this international standard works on an internal structure that is the result of parsing an original string for comparison. Parsing is done according to a formal description of cultural ordering conventions. The definition of such an engine makes it highly desirable that future versions of programming language standards define new data types. In each language, it is desirable that at least one data type manage graphic character string comparisons that are not limited to absolute equality. The programming language can define these data types as formal containers. These containers represent strings of text that can be processed internally, in a way that is very straightforward and independent of coded graphic characters.

In this way, the programmer is freed from parsing processes. Also, the probability of achieving application portability between different countries using different cultures would be increased because applications can be designed in a generic way.

Furthermore, the pre-digested structure materialising such a data type can be stored and reused in a given cultural environment, both to increase performance and to preserve past applications with minimal changes. Reusing the structure would require no further parsing by external, even ancient, hard-wired engines that have the capability to do straightforward binary comparisons (such as a hardware disk search engine, or an access method designed decades ago that developers do not want to redesign because of its high efficiency).

This feature is a non-negligible economic by-product of this international standard: once a string has been parsed for an environment, its processing does not require re-parsing. In fact, as for numbers, the standard graphic character representation need not be used until data is presented again to the user. This calls for reversibility of the process. The present standard makes that reversibility a possibility, in addition to guaranteeing the full predictability of the comparison operation. If two equivalent strings are not absolutely equal, then the tie must be broken. Consequently, a sort program, the simplest application, can always sort data in the same way.

Tutorial on problems solved by this standard

Why aren't existing standard codes, character by character comparisons and commercial sort programs appropriate for sorting and what must be done to solve the problem? For clarity, this discussion will start with the Latin script.

i. Sorting, in any language using the Latin script, including English, using standard ISO 646 coding, does not follow traditional dictionary sequence, which is the minimum the average user needs.

Ex.: Sorting the list "august", "August", "container", "coop", "co-op", "Vice-president", "Vice versa" gives the following order if ISO 646 coding is used and a simple sort following binary order is done:

August

Vice versa

Vice-president

august

co-op

container

coop

which is obviously wrong.
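The effect in example i can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes the example list contains the hyphenated forms "co-op" and "Vice-president", and it relies on the fact that Python compares strings code point by code point, which for these ASCII-only words matches a byte-wise comparison of ISO 646 codes.

```python
# Naive ordering by coded character values (binary order).
words = ["august", "August", "container", "coop", "co-op",
         "Vice-president", "Vice versa"]

# sorted() compares str values code point by code point.
# Upper case (0x41-0x5A) precedes lower case (0x61-0x7A), and the
# space (0x20) and hyphen (0x2D) precede every letter.
binary_sorted = sorted(words)

for word in binary_sorted:
    print(word)
```

All capitalised words come out before all lower-case words, and "co-op" lands before "container", which no dictionary would do.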

ii. Translating lower case to upper case and removing special characters gives a sorted list acceptable to users, but also unpredictable results.

Ex.: Sorting the list "August", "august", "co-op", "coop" gives the following order:

August

august

co-op

coop

Sorting the same list with a different initial order, say, "august", "August", "coop", "co-op", gives a different order with this method:

august

August

coop

co-op
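The unpredictability described in example ii can be sketched as follows. The helper `naive_key` is hypothetical and stands for the "fold case and drop special characters" approach; the word lists assume that one of the two "coop" entries is the hyphenated form "co-op". Because the key makes such words compare equal, a stable sort simply returns them in their input order.

```python
def naive_key(text):
    # Fold to upper case and drop everything that is not a letter.
    return "".join(ch for ch in text.upper() if ch.isalpha())

# Same four words, two different initial orders.
first = sorted(["August", "august", "co-op", "coop"], key=naive_key)
second = sorted(["august", "August", "coop", "co-op"], key=naive_key)

print(first)
print(second)
print(first == second)  # the two results differ although the data is the same
```

Nothing in the key distinguishes "August" from "august" or "co-op" from "coop", so the final order depends on how the data happened to arrive.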

iii. If accented characters are introduced using for example ISO 8859-1 code, the problems encountered in steps i and ii above are amplified but they share the same causes.

iv. If tables are reorganized to make all related characters contiguous, one might think it would permit a simplified single-character sort, but this does not work either. Take upper and lower case unaccented letters as an example. If code point 01 is assigned to «a», code point 02 assigned to «A», code point 03 to «b», code point 04 to «B» and so on, let's see what happens in a list sorted directly by these rearranged values:

Sorted list    Internal values

aaaa           01010101
abbb           01030303
Aaaa           02010101
Abbb           02030303

This is predictable also, but obviously wrong in any country from a cultural point of view.
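A short sketch of example iv, under the assumption that the rearranged table simply interleaves lower-case and upper-case letters (a=01, A=02, b=03, B=04, and so on). The names `CODE` and `interleaved_key` are illustrative only.

```python
import string

# Hypothetical rearranged code table: a=1, A=2, b=3, B=4, ...
CODE = {}
for index, letter in enumerate(string.ascii_lowercase):
    CODE[letter] = 2 * index + 1
    CODE[letter.upper()] = 2 * index + 2

def interleaved_key(text):
    # One rearranged code value per character, compared left to right.
    return [CODE[ch] for ch in text]

interleaved_sorted = sorted(["Abbb", "aaaa", "Aaaa", "abbb"], key=interleaved_key)
print(interleaved_sorted)
```

Because case is decided at the first character, "abbb" is placed before "Aaaa": a single-level, character-by-character comparison cannot postpone the case difference until the letters themselves have been compared.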

v. The only path of solution is to decompose the initial data in a way that will respect traditional lexical order, and at the same time ensure absolute predictability. For the Latin script, this necessitates at least four levels:

1. The first decomposition renders information to be sorted case insensitive and diacritical mark insensitive, and removes all special characters which have no pre-established order in any human culture:

An example using English:

"résumé" (an English word derived from French but with a very different meaning in French) becomes "resume", without any accent.

An example using French:

"Vice-légation" becomes "vicelegation", with no accent, no upper case and no dash.

An example using German:

"groß" becomes "gross", with the sharp s (ß) being converted to a double s to render it case insensitive.

In Spanish or Scandinavian languages, some extra letters are added to the 26 fixed letters of the English, French and German alphabet, which are not ordered according to the expectations of this group of languages. This calls for adaptability.

2. The second decomposition breaks ties on quasi-homographs, strings that differ only because they have different diacritical marks. In the English example above, "resume" and "résumé" are quasi-homographs. Traditional lexical order requires that "resume" always come before "résumé" (which sorting using only the first level would not guarantee). In this case, tradition does not say if "resumé" (another spelling) should come before "résumé", which would seem logical: English and German dictionaries only state that unaccented words precede the accented words.

Here another characteristic is introduced. In French, because many quasi-homograph groups contain more than two words, the main dictionaries follow this rule: accents are generally not taken into account for sorting, but when words tie as quasi-homographs, the last difference in the word determines the correct order between the two words, a priority order being assigned to each type of accent. For example, "coté" should be sorted after "côte" but before "côté". This is easy to implement: a number is assigned to each character of the original data to be sorted, representing either an accent or no accent at all, but these numbers are stacked instead of being appended to a linear list. In other words, the resulting string is built starting from the last character of the original data and proceeding backward.

Example: to obtain the following order respecting this rule: "cote", "côte", "coté", "côté", numbers could be assigned indicating respectively «****», «**c*», «a***», «a*c*», where "*" means no accent, "a" means acute accent, "c" circumflex accent. Here this scheme is sufficient to break the tie correctly at this second level.
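The backward accent rule can be sketched in Python as follows. This is a minimal illustration, not the table of this standard: the accent weights (none < acute < circumflex) and the helper names are assumptions, only the two accents of the example are handled, and Unicode decomposition from the `unicodedata` module is used as a convenient way to separate a letter from its accent.

```python
import unicodedata

# Assumed priority for the accents of the example: none < acute < circumflex.
ACCENT_WEIGHT = {"": 0, "\u0301": 1, "\u0302": 2}

def base_and_marks(ch):
    # Split one precomposed character into base letter and combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]

def level1(word):
    # First level: letters only, without case or accents.
    word = unicodedata.normalize("NFC", word)
    return "".join(base_and_marks(ch)[0] for ch in word.lower())

def level2_backward(word):
    # Second level: accent weights, built from the last character backward.
    word = unicodedata.normalize("NFC", word)
    return [ACCENT_WEIGHT[base_and_marks(ch)[1]] for ch in reversed(word)]

french_order = sorted(["côté", "coté", "côte", "cote"],
                      key=lambda w: (level1(w), level2_backward(w)))
print(french_order)
```

For "côté" the second-level key is [1, 0, 2, 0], which corresponds to the pattern «a*c*» of the example.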

3. The third decomposition breaks ties for quasi-homographs that differ only in their use of upper-case and lower-case characters. This time, the tradition is well established in English and German dictionaries, where lower case always precedes upper case in homographs, while the tradition is not well established in French dictionaries, which generally use only accented capital letters for common word entries. In known French dictionaries where upper and lower case letters are mixed, the capitals generally come first, but this is not an established and stated rule, because there are numerous exceptions. So for a default template it is advisable to use English and German traditions, if one wants to group the largest possible number of languages together. Note, incidentally, that in Denmark upper case comes before lower case, a different but well established rule. This is a second fact calling for adaptability in the model used in this standard.

Example: to have the following order: "august", "August", numbers could be assigned indicating respectively «llllll», «ulllll», where "l" means lower case and "u" upper case.
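As a small sketch of this third level, the pattern of "l" and "u" markers can be derived directly from the string. The function name `level3_case` is illustrative; the only property relied upon is that "l" compares lower than "u", so that lower case precedes upper case as in the default described above.

```python
def level3_case(word):
    # 'l' for lower case, 'u' for upper case; 'l' < 'u' in code order.
    return "".join("u" if ch.isupper() else "l" for ch in word)

# The first level (here simply the lower-cased word) ties; level 3 breaks the tie.
case_order = sorted(["August", "august"],
                    key=lambda w: (w.lower(), level3_case(w)))
print(case_order, level3_case("August"))
```

A Danish tailoring would only need the two markers to be ordered the other way round.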

4. The fourth decomposition breaks the final tie that does not correspond to any tradition, the tie due to quasi-homographs that differ only because they contain special characters. Breaking this tie is essential to ensure the absolute predictability of sorts and also to be able to sort strings composed only of special characters. Since the traces of special characters were removed from the original data to form the first three orders of decomposition, simply putting them in a row in the fourth order of decomposition would mean that their position would be lost. These positions are quite important to solve remaining ties, and in consequence the original positions of these special characters must be retained here: two quasi-homographs could each contain a common special character in different positions and thus be strictly different (ex.: "ab*cd" is still different from "a*bcd" although they share one and only one common special character).

Example: to have the following order: "coop", "co-op", "coop-", numbers could be assigned respectively according to the following pattern: «d», «d3» and «d5», where "d" is an always-present delimiter that separates this decomposition from the first three in case all four decompositions are to be concatenated to form a single sorting key based on numeric values (see discussion in the next paragraph). "3" means a dash in position 3 of the original string. "5" means a dash in position 5, and so on.
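The fourth level can be sketched as a list of (position, character) pairs. The sketch assumes that the three words of the example are "coop", "co-op" and "coop-", and it treats every non-alphanumeric character as a special character; both the helper names and that simplification are assumptions made for illustration.

```python
def letters_only(text):
    # What remains for the first three levels once special characters are dropped.
    return "".join(ch for ch in text if ch.isalnum())

def level4_positions(text):
    # 1-based position of each special character, paired with the character.
    return [(pos, ch) for pos, ch in enumerate(text, start=1) if not ch.isalnum()]

special_order = sorted(["coop-", "co-op", "coop"],
                       key=lambda w: (letters_only(w), level4_positions(w)))
print(special_order)
print(level4_positions("ab*cd"), level4_positions("a*bcd"))
```

An empty list sorts before any non-empty one, which plays the role of the bare delimiter «d»; position 3 then precedes position 5.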

These four decompositions can be structured using a four level key, concatenating the subkeys from the highest significance to the lowest. If coded assignment of numbers is done properly, instead of necessitating a cumbersome exception process for dealing with homographs, all decompositions may be made at once and resulting strings concatenated and passed through a standard sort program sorting in numeric order. To attain this result, it is sufficient that numbers chosen for the first decomposition code set be greater than numbers chosen for the second one, the second one's greater than the third one's, and that the delimiter chosen for the fourth decomposition be less than the lowest possible number coded elsewhere for the sort (delimiter called logical zero), in which case no restriction applies to the content of the fourth decomposition. An easier implementation might just choose to put the lowest value possible as a delimiter between each subkey, in which case no restriction ever applies.
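The concatenation described above can be sketched with the "easier implementation", in which the lowest possible value separates the subkeys. Everything numeric here is an assumption chosen for illustration (weights 1 to 26 for the basic Latin letters, three accent weights, two case weights, 0 as logical zero), the function name is hypothetical, and only the letters a to z are treated as letters; digits and all other characters fall into the fourth level.

```python
import unicodedata

DELIM = 0                                   # logical zero, below every weight used here
ACCENT = {"": 1, "\u0301": 2, "\u0302": 3}  # none < acute < circumflex (assumed)

def four_level_key(text):
    text = unicodedata.normalize("NFC", text)
    l1, l2, l3, l4 = [], [], [], []
    for pos, ch in enumerate(text, start=1):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        low = base.lower()
        if len(low) == 1 and "a" <= low <= "z":
            l1.append(ord(low) - ord("a") + 1)      # level 1: letter, 1..26
            l2.append(ACCENT.get(marks, 4))         # level 2: accent (4 = any other mark)
            l3.append(2 if base.isupper() else 1)   # level 3: lower before upper
        else:
            l4 += [pos, ord(ch)]                    # level 4: position, then weight
    l2.reverse()                                    # last accent difference decides first
    # One flat numeric key: subkeys from highest to lowest significance.
    return tuple(l1 + [DELIM] + l2 + [DELIM] + l3 + [DELIM] + l4)

words = ["Vice versa", "côté", "co-op", "August", "coté", "coop",
         "august", "côte", "Vice-president", "cote", "container"]
print(sorted(words, key=four_level_key))
```

Once such keys exist, any process able to compare sequences of numbers in order can sort the data; no special handling of homographs is needed at comparison time.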

This method has been fully described with tables for the first time in Règles du classement alphabétique en langue française et procédure informatisée pour le tri, Alain LaBonté, Ministère des Communications du Québec, 19 août 1988, ISBN 2550190467.

Reduction techniques have been designed to shorten space requirements considerably. Because implementations are not required to use specific numbers for weights, nor to apply reduction or compression, this issue is outside the scope of this standard; it is nevertheless worth noting that implementations can be optimized. These techniques have been improved over time and are highly feasible.

A public-domain reduction technique is described in detail (with ample examples) in Technique de réduction - Tris informatiques à quatre clés, Alain LaBonté, Ministère des Communications du Québec, June 1989 (ISBN 2550199650).

vi. For a certain number of languages, the default presented in this standard will need to be adapted, both in the table values for the four orders of keys and in the potential context analysis processing necessary to achieve culturally correct results for users of these languages. To illustrate this, examples of dictionary sequences are given here for two languages whose native order is not in the default table:

Traditional Spanish (note "ch" greater than "cu" and "ña" greater than "no"):

cuneo<cúneo<chapeo<nodo<ñaco

(Comparative French/English/German sort:

chapeo<cuneo<cúneo<ñaco<nodo)

Danish (note "a" less than "c", "cz" less than "cæ" and "cø", and "aa" equivalent to "å" greater than "z"):

Alzheimer<czar<cæsium<cølibat<Aalborg<Århus

(Comparative French/English/German sort:

Aalborg<Alzheimer<Århus<cæsium<cølibat<czar)
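To illustrate what such an adaptation involves, here is a minimal sketch of a traditional Spanish tailoring restricted to the two cases cited above: the digraph "ch" treated as a single text element after "c", and "ñ" placed after "n". The alphabet list, the rank table and the function name are assumptions for illustration; the sketch handles letters only and uses a simple accented/unaccented flag as a second level.

```python
import unicodedata

# Assumed tailored alphabet: "ch" after "c", "ñ" (U+00F1) after "n".
ALPHABET = list("abc") + ["ch"] + list("defghijklmn") + ["\u00f1"] + list("opqrstuvwxyz")
RANK = {element: rank for rank, element in enumerate(ALPHABET, start=1)}

def spanish_key(word):
    w = unicodedata.normalize("NFC", word).lower()
    level1, level2 = [], []
    i = 0
    while i < len(w):
        if w.startswith("ch", i):
            element, marks, step = "ch", "", 2       # digraph: one text element
        elif w[i] == "\u00f1":
            element, marks, step = w[i], "", 1       # a letter of its own, not an accented n
        else:
            decomposed = unicodedata.normalize("NFD", w[i])
            element, marks, step = decomposed[0], decomposed[1:], 1
        level1.append(RANK[element])
        level2.append(1 if marks else 0)             # accented after unaccented
        i += step
    return (level1, level2)

print(sorted(["ñaco", "nodo", "chapeo", "cúneo", "cuneo"], key=spanish_key))
```

Only the handful of entries that differ from the default are touched; the rest of the Latin script keeps its default order.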

vii. It is important that in all coding environments, and in all programming environments, the order be consistent so that sort programs can give reliable results reusable in programs; conversely, comparisons of two character strings where an order is expected should be in line with results given by sort programs. Hence it is advisable that all processes which expect a given order use the same comparison engine. This standard builds on this requirement, which was not respected before.

Furthermore it should be possible to have access, externally, to the ultimate binary strings on which the real comparison is made. This will allow old processes which cannot be changed easily, but which are able to sort raw binary data, to sort in a way consistent with new processes. This standard allows this.


1 Scope

This international standard defines:

- a method for doing deterministic and internationalized character string comparisons. The method is applicable on strings that exploit the full repertoire of ISO/IEC 10646 (independently of coding) or subsets, so that these comparisons be applicable for subrepertoires such as those of ISO 8859 variants, and in a given set of languages for each script

- a specific default order description using the preceding specification for the ISO/IEC 10646 characters; this default is based, for each given script, on an order which is culturally acceptable to a maximum of users of that script.

It is to be considered normal practice that this default order be modified with minimal effort to suit the needs of a local environment. The main benefit, worldwide, is that for other scripts, no modification will be required and that the order will remain consistent and predictable from an international point of view.

Note: the remainder of this clause will be removed. It dates from the first working drafts, and SC22/WG20 no longer intends to keep the data specification in the present standard. As soon as CD 14652 is harmonized with the syntax used in this standard, this material will be replaced by a normative reference. Some raw, incomplete elements of data specification may therefore remain in the present working draft at this stage. These define:

- a data specification for describing ordering tables

- a tailoring specification to complete the data specification; This tailoring provision will allow modification of the default order data for a specific set of languages in each script in a reasonably compact way, without the burden of having to modify other scripts' definitions. In this way, the default order can be used as a template to define culture-specific orders that are similar to one another as much as possible.

2 Normative References

ISO/IEC WD 14652 LOCALE data specification (exact title to come later on: note: this specification is intended to be a normative, complete in itself, excerpt of current ISO/IEC 9945-2, with extensions to fit the present standard's requirements)

ISO/IEC 10646-1 Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane

3 Definitions

For the purposes of this International Standard, the following definitions apply:

API Application Program Interface, a standard application process described together with its input and output

canonical form the coding of a UCS character in 4 octet binary form according to ISO/IEC 10646-1

character string A data type defined by the concatenation of a series of characters in logical sequence

classification An operation done prior to comparison or ordering which, for the purposes of this International Standard, may produce several character strings, or a modified character string, out of an original one to be sorted, indexed or compared; it recasts the data into one or many explicit classes, that is, new explicit formats that may differ from the original character strings (ex.: given the number "5", classification as defined here may triplicate the original into the string "cinq" in French, the string "fünf" in German and the string "five" in English, the three strings then being ready for ordering)

collation ordering of elements

collating symbol a symbol used to specify weights assigned to a character in a symbolic fashion rather than absolutely

concatenation logical operation which consists in adding an element at the end of a string and considering the result as a new string, longer than the first one (it is like adding a new wagon to the tail of a train)

engine a set of APIs

field for the purposes of this International Standard, a single character string or any other data type which may be ordered alone or in conjunction with other fields of a record, each field of a record being compared to the same field of another record; in case of absolute equality of two equivalent fields, other fields of the records will have to be compared to eventually resolve the tie.

first order token an absolute number used as a comparison element, obtained out of tables for the first level that describes a character; note that some characters, such as ligatures, may lead to more than one token for a character at one given level

fourth order token an absolute number used as a comparison element, either obtained out of tables for the fourth level that describes a special character, or specifying the position of the special character represented in the original string; tokens of the fourth order level are always in pairs, the first token being a position, the second one being a weight for the character represented

graphic character a character, other than a control function, that has a visual representation normally handwritten, printed, or displayed

level degree of precision of a comparison; normally a weight is assigned to each character of a character string which must be compared to another one at a given level of precision; when comparison does not break ties at this level, then another weight is assigned to each character of the string at the next level of precision (see actual example in 5.2.1.1)

numeric relative value the relative value of a given weight, or token, compared to other ones, in its final numeric and processable form

ordering a process in which a set of fields composing a record are assigned a given order relative to any other set of fields composing other records of a file

ordering key a series of bits, the numerical value of which determines its order; to a character string may be allocated a series of tokens, which correspond, level by level, to weights assigned to characters

ordering subkey a sub-series of bits in an ordering key - an ordering subkey corresponds to the set of tokens corresponding to the weights of a character string assigned to a given level of precision

posthandling a process in which an ordering subkey is processed internally after the straightforward comparisons done according to the APIs defined in this standard

prehandling a process in which character strings are modified internally to lead to straightforward comparisons according to the APIs defined in this standard

record the exhaustive structured set of fields that form a monolithic block in a file, this set belonging together according to specific application-defined requirements

reference string in a comparison operation, the string which serves as a base reference for comparison; the string to which another string is compared

referenced string in a comparison operation, the string which is being compared to a reference string

second order token an absolute number used as a comparison element, obtained out of tables for the second level that describes a character; note that some characters, such as ligatures, may lead to more than one token for a character at one given level

string a series of individual elements which form a whole, when they are concatenated, i.e. linked together like wagons in a train; a character string is a series of characters, a bit string is a series of bits

symbolic relative value the relative value of a given weight, or token, compared to other ones, in its symbolic, human-readable text format

telephone-book-type classification a specific type of classification in which fields are rebuilt internally before a straightforward ordering can be done; this process may involve replication of fields in many different forms for the purposes of multiple indexing

text element a final form element of a script which is considered a whole entity for ordering purposes in a dictionary; hence in Indic scripts, up to five graphic characters can be ligatured, for example, to form a text element that has a precise dictionary entry; a complete combining sequence, conformant to ISO/IEC 10646, also leads to the presentation of a text element to a human reader.

third order token an absolute number used as a comparison element, obtained out of tables for the third level that describes a character; note that some characters, such as ligatures, may lead to more than one token for a character at one given level

4 Symbols and abbreviations

Identification of characters of the ISO/IEC 10646 repertoire will be by means of symbols of the form <U[xxxx]xxxx>. The occurrences of xxxx which follow the letter "U" represent the hexadecimal value of a coded character as defined in ISO/IEC 10646. This is a means to be code-independent (the same value being possibly used even if the coded character set in use in a given implementation is not ISO/IEC 10646). At the same time, this is a means to keep a straightforward link with the Universal Multiple-Octet Coded Character Set, which is assumed to contain all the coded graphic characters ever defined by ISO/IEC.

Whenever possible, in the ordering table, glyphs will be used in comments alongside character ordering definitions. This will give a more accurate understanding of the characters in question.

The letter U stands for UCS, which itself stands for Universal multiple-octet Coded Character set.
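As an illustration of this notation, the following sketch builds a symbol of the form <U[xxxx]xxxx> from a character. The function name `ucs_symbol` is hypothetical, and the choice of four hexadecimal digits for the Basic Multilingual Plane and eight digits otherwise is one reading of the optional bracketed part of the notation.

```python
def ucs_symbol(ch):
    # Hexadecimal value of the coded character in ISO/IEC 10646.
    code_point = ord(ch)
    if code_point <= 0xFFFF:
        return f"<U{code_point:04X}>"   # four digits within the BMP
    return f"<U{code_point:08X}>"       # eight digits otherwise

print(ucs_symbol("A"), ucs_symbol("\u00e9"), ucs_symbol("\u0e01"))
```

The symbol depends only on the identity of the character in the UCS repertoire, not on the coded character set used by a given implementation.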

The collating-symbol statements will include declarations of symbols used as intermediary values for:

- possible text elements that are composed of sequences of graphic characters. An example is tailoring the default to Danish. The digraph "aa" is composed of a sequence of two graphic characters which, in Danish, are considered as a single letter of the alphabet and require a single ordering definition.

- possible text elements that require an intermediate definition for other reasons

For easy cross-referencing of the various weights, numeric relative values (informative) will be shown in the table as comments. A system of short mnemonics intended to replace glyphs when it is not possible to transmit them will also be used in tables alongside glyphs representing characters, whenever possible.

5 Requirements

5.1 Prehandling phase (external to the comparison operation engine)

5.1.1 Prehandling of the symbolic table data

The symbolic tables, as provided in annex 1 or as modified in a tailoring process, shall be presented in a numeric form to the comparison operation engine described hereafter. The table normally handled by the comparison operation engine logically consists of a matrix of n lines by m columns, n being the number of characters in the UCS and m being the number of levels provided in the symbolic data, each element of the matrix being a numerical weight. The usage optimization of such tables and the exact internal format of these tables is beyond the scope of this standard and is application-dependent.
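The conversion from symbolic to numeric form can be sketched as follows. The symbolic table, the symbol names and the declaration lists are entirely hypothetical (the actual syntax is defined elsewhere and is not reproduced here); the sketch only shows the idea of one row per character and one numerical weight per level, with numbers derived from the relative order of the symbols.

```python
# Hypothetical symbolic table: one row per character, one symbol per level.
SYMBOLIC = {
    "a": ("<a>", "<NONE>", "<LOWER>"),
    "A": ("<a>", "<NONE>", "<UPPER>"),
    "\u00e1": ("<a>", "<ACUTE>", "<LOWER>"),
    "b": ("<b>", "<NONE>", "<LOWER>"),
}

# Assumed: the order in which symbols are declared fixes their relative value.
DECLARED = [["<a>", "<b>"], ["<NONE>", "<ACUTE>"], ["<LOWER>", "<UPPER>"]]

def to_numeric(symbolic, declared):
    # One rank dictionary per level: symbol -> numeric weight starting at 1.
    ranks = [{sym: n for n, sym in enumerate(level, start=1)} for level in declared]
    # Matrix of n characters by m levels, indexed by UCS code value.
    return {ord(ch): tuple(ranks[lvl][sym] for lvl, sym in enumerate(row))
            for ch, row in symbolic.items()}

numeric_table = to_numeric(SYMBOLIC, DECLARED)
print(numeric_table)
```

The actual numbers are irrelevant as long as their relative order reflects the symbolic table, which is why the internal format can be left to the application.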

5.1.2 Prehandling of character strings provided to the comparison operation engine

It may be necessary to transform a field before the actual ordering process can begin. This process is called prehandling. The implementor is responsible for ensuring that prehandling has been done prior to the ordering process. For examples of how applications can take advantage of prehandling, see Annex B. This is a global operation that may involve exploding records before ordering them. Therefore, the prehandling phase, unlike its posthandling counterpart, shall be done on a whole set of input records before any comparison is made. Thus, prehandling is not part of the comparison operation engine. The comparison operation engine will not contain any default code related to prehandling. However prehandling functionality shall be provided to the user by the application developer for allowing the use of this international standard in higher layers of the application.

The prehandling phase shall, as a minimum, transform the actual coded characters used on input into their UCS canonical form for processing by the comparison operation. UCS combining characters shall not be used whenever this is possible (a fully composed character shall be preferred to a series of characters having the same presentation form). Any combining characters will be processed as special characters according to the present standard.
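The preference for fully composed characters can be illustrated with the following Python sketch. It relies on the normalization form NFC of the Unicode Standard, which is not referenced by this draft; using it here is an assumption made for illustration, as an approximation of "a fully composed character shall be preferred". The function name is hypothetical.

```python
import unicodedata


def prehandle(text: str) -> str:
    """Replace base-plus-combining sequences by composed characters where one exists.

    Assumption: Unicode NFC is used as a stand-in for the canonical form
    required by the prehandling phase.
    """
    return unicodedata.normalize("NFC", text)


# "e" followed by COMBINING ACUTE ACCENT becomes the single character U+00E9.
composed = prehandle("cote\u0301")
```

Combining characters for which no composed form exists are left in place by this sketch and would then be treated as special characters.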

5.2 Comparison operation engine

5.2.1 Multi-field key comparison

This operation shall be done through three APIs, which are described below.

Note: In the following descriptions, numbers (integers) are used. These numbers are not mandatory values to be used in actual implementations. The developer may choose the values that fit the application best.

The API names proposed in this standard shall be used for binding these APIs with actual implementations.

5.2.1.1 API 1 - Comparison done directly on character strings (COMPCAR)

This API shall operate as follows:

0. Name of this API for binding purposes: COMPCAR

1. Parameters

Set of parameters a (input parameters):

string 1: referenced coded character string

string 2: reference coded character string

Parameter b (input parameter):

level of precision:

a value between 0 and n; this value determines after which level the binary comparison should stop if equivalence (approximate equality) is going to be detected, in case the two parameters of set a are not absolutely equal. In all cases of inequality, the operation is performed on all available levels to determine if the referenced character string 1 comes before or after the reference character string 2.

If the specified level of precision is zero or greater than the last available level, each of the missing levels is implicitly considered to bear the smallest value possible for the latter levels for comparison purposes. The meaning of values less than zero for this parameter is reserved in this standard for future use.

As an example, consider the words "alpha" and "ALPHA". These words are equivalent at level 1 (alphabetic) and level 2 (diacritical marks). However, the words are different at level 3, where case is taken into account. If comparisons are requested up to level 2, then approximate equality will result. If level 3 or greater is required, then the two character strings will be considered different and not equivalent.
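The "alpha"/"ALPHA" example can be reproduced with a toy model in Python. The weights below are invented for the illustration (letter identity at level 1, no diacritics at level 2, a case flag at level 3) and do not come from the default table; the function names are hypothetical.

```python
def toy_key(word):
    """Return (level 1, level 2, level 3) subkeys for an unaccented ASCII word."""
    level1 = tuple(ord(ch.lower()) for ch in word)            # letter identity
    level2 = tuple(0 for _ in word)                           # no diacritics here
    level3 = tuple(1 if ch.isupper() else 0 for ch in word)   # case flag
    return (level1, level2, level3)


def equivalent_up_to(word1, word2, precision):
    """True when both words have identical subkeys for levels 1..precision."""
    if not 1 <= precision <= 3:
        raise ValueError("this sketch only models levels 1 to 3")
    return toy_key(word1)[:precision] == toy_key(word2)[:precision]
```

With these toy weights, equivalent_up_to("alpha", "ALPHA", 2) is true, while the same call with a precision of 3 is false.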

Parameter c (output parameter):

A number is returned whose sign determines an order even in case of equality. It is to be noted that the values mentioned here are to be considered as logical values. Implementations are free to use other values. However functionality shall be the same as described here.

The sign has the following meaning:

negative sign: referenced character string comes before reference.

positive sign: referenced character string comes after reference.

Note: In the case of absolute equality, a negative sign is returned by convention. In the case of equivalence, either a negative or a positive sign is possible, because the character strings are unequal in addition to being equivalent.

The absolute value of the number determines the possible following cases:

case 1: absolute equality (even if case equivalence has been detected; this goes beyond equivalence)

case 2: equivalence (at the precision level required by parameter b)

case 3: values compared significantly unequal or unequivalent

Set of parameters d (output parameters):

Two bit strings are returned. Each of these bit strings shall be coded in such a way that, in the hardware and software environment where it is used, the bit strings can be compared in a straightforward binary fashion without further analysis, and that the order given by this comparison is equivalent to the order given by the sign of parameter c. The structure of the bit strings chosen by the implementation shall also allow an external process to delimit the different levels, and eventually to reverse the operation, so that the original character strings can be recomposed from the table. It is the responsibility of the implementer to choose the appropriate method to meet this requirement. It is to be noted that such a reversibility process will require careful preservation of script properties as returned in parameter e.

The bit strings are:

bit string 1: processable binary string corresponding to the character string 1 of the set of parameters a.

bit string 2: processable binary string corresponding to the character string 2 of the set of parameters a.

Set of parameters e (output parameters):

Two vectors of script id tokens are returned. Each vector indicates the numeric identifier of the script of each character in an input character string; each token is applicable in the forward direction of an input character string to each character of this character string. These vectors serve only in cases where a higher layer of application wants to reconstitute the original character strings from the processable bit strings. This requires reverse engineering of the bit strings in relation with the reverse engineering of the table. This process requires the script properties attached to each script identifier for each character.

The two vectors are:

vector 1: vector of identifier of the script for each character of character string 1 of parameter a

vector 2: vector of identifier of the script for each character of character string 2 of parameter a

Process of API 1 (COMPCAR)

This API shall be processed to give results equivalent to the following:

1. Convert character string 1 and character string 2 through API 3 (CARABIN) into bit string 1 and bit string 2. Return the result of the conversion in the set of parameters d. API 3 (CARABIN) will also return, for each character string, a vector of script identifiers for each character of the input character string. These will be returned as vector 1 and vector 2 of the set of parameters e.

2. Operate API 2 (COMPBIN) to get comparison result in parameter c.

5.2.1.2 API 2 - Comparison done on predigested processable bit strings (COMPBIN)

This API shall operate as follows:

0. Name of this API for binding purposes: COMPBIN

1. Parameters

Set of parameters a (input parameters):

bit string 1: referenced predigested bit string

bit string 2: reference predigested bit string

Parameter b (input parameter):

level of precision:

a value between 1 and n; this value determines after which level the binary comparison should stop if equivalence (approximate equality) is going to be detected, in case the two parameters of set a are not absolutely equal. In all cases of inequality, the operation is performed on all available levels to determine if the referenced bit string 1 comes before or after the reference bit string 2. If the specified level of precision is greater than the last available level, each of the missing levels is implicitly considered to bear the smallest value possible for the latter levels for comparison purposes.

See discussion in description of parameter b of API 1.

Parameter c (output parameter):

A number is returned whose sign determines an order even in case of equality. It is to be noted that the values mentioned here are to be considered as logical values. Implementations are free to use other values. However, functionality shall be the same as described here.

The sign has the following meaning:

negative sign: referenced bit string 1 comes before reference bit string 2

positive sign: referenced bit string 1 comes after reference bit string 2

Note: In the case of absolute equality, a negative sign is returned by convention. In the case of equivalence, either a negative or a positive sign is possible, because the bit strings are unequal in addition to being equivalent.

The absolute value of the number determines the possible following cases:

case 1: absolute equality (even if case equivalence has been detected; this goes beyond equivalence)

case 2: equivalence (at the precision level required by parameter b)

case 3: values compared significantly unequal or unequivalent

Process of API 2 (COMPBIN)

This API shall be processed to give results equivalent to the following:

1. Compare bit string 1 to bit string 2 numerically.

2. In case of absolute equality, return case 1 with a negative value.

3. In case of inequality retain which string comes before the other one.

4. Repeat the comparison, ignoring the levels that are not significant according to parameter b.

5. In case of equality, return case 2 after adjusting the sign according to the order retained in step 3. This indicates an equivalence according to programmed specifications.

6. In case of inequality, return case 3 after adjusting the sign according to the order retained in step 3.

Note: there are more efficient ways to accomplish what precedes. The steps indicated here are logical steps to help clarify functionality. No specific process is mandated by this standard. What shall be respected is that results shall be the same as what is indicated above.
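The logical steps of COMPBIN can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a predigested key is modelled as a tuple of subkeys (one tuple of integers per level) rather than as an actual bit string, Python tuple comparison stands in for the binary comparison, and the returned values -1, 2, 3 and so on are the logical values of parameter c. The function name is hypothetical.

```python
def compbin(key1, key2, precision):
    """Return a signed case number: 1 absolute equality, 2 equivalence, 3 inequality.

    key1 and key2 are tuples of subkeys, one subkey (tuple of ints) per level.
    """
    # Steps 1 and 2: compare on all levels; absolute equality is negative by convention.
    if key1 == key2:
        return -1
    # Step 3: retain which key comes first on all available levels.
    sign = -1 if key1 < key2 else 1
    # Step 4: compare again, keeping only the levels that are significant.
    if key1[:precision] == key2[:precision]:
        return sign * 2      # step 5: equivalence
    return sign * 3          # step 6: significant inequality


# Toy keys (assumed weights): identical at levels 1 and 2, different at level 3.
lower = ((1, 2), (0, 0), (0, 0))
upper = ((1, 2), (0, 0), (1, 1))
result = compbin(lower, upper, 2)
```

In this sketch the sign returned for cases 2 and 3 always comes from the comparison on all levels, as required by steps 3, 5 and 6.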

Binding considerations

This API can be used to perform a function suited to the exact specifications of the C language functions strcmp() or strncmp() which are less general in scope. In the same way, it can be used directly or interfaced to perform similar functions for the specific requirements of other programming languages.

5.2.1.3 API 3 - Conversion of a character string to a comparable bit string (CARABIN)

This API shall operate as follows:

0. Name of this API for binding purposes: CARABIN

1. Parameters

Parameter a (input parameter):

a coded character string

Parameter b (output parameter):

a structured binary string

This bit string shall be coded in a way that is equivalent to formation of the multilevel binary keys described in 5.3. Such a bit string is the result of the digestion of the input character string through the transformation table described in normative annex 1, with the provision that this table can have been tailored according to 5.5.

The structure of the bit strings chosen by the implementation shall also allow an external process to delimit the different levels, and eventually to reverse the operation, so that original strings can be recomposed from LOCALEs. It is the responsibility of the implementer to choose the appropriate method to meet this requirement. It is to be noted that such a reversibility process will require careful preservation of script properties as returned in parameter c.

Parameter c (output parameter):

A vector of script id tokens is returned. This vector indicates the numeric identifier of the script of each character in the input string; each token is applicable in the forward direction of the input string to each character of this string. This vector serves only in cases where a higher layer of application wants to reconstitute the original string from the processable bit string returned. This requires reverse engineering of the bit string in relation with the reverse engineering of the ordering table. This process requires information about the script properties attached to each script identifier for each character. The vector makes it possible to retrieve that information later.

The vector returned can be described shortly as follows:

vector 1: vector of identifier of the script for each character of the string of parameter a

It is the responsibility of the implementer to code that vector in a way appropriate for the application.

Process of API 3

This API shall be processed to give results equivalent to the following:

1. Digest input character string into a binary character string respecting the requirements of clause 5.3

2. Return the binary string obtained as parameter b.

3. For each character of the input character string obtain the corresponding script identifier from the table, as described in clause 5.3 (in particular 5.3.2.1). Return a vector of identifiers, which are built always in the logical forward direction relative to the input character string.
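The three steps above can be sketched in Python as follows. The table content, the script identifiers and the function name are hypothetical; the sketch handles forward scanning only and three levels, and it leaves out the special level and posthandling described in 5.3. The key is modelled as a tuple of subkeys rather than as a real bit string.

```python
# Hypothetical table: character -> (script identifier, (level 1, level 2, level 3)).
TOY_TABLE = {
    "a": (2, (10, 1, 1)),
    "A": (2, (10, 1, 2)),
    "b": (2, (11, 1, 1)),
    "\u03b1": (3, (20, 1, 1)),  # GREEK SMALL LETTER ALPHA
}


def carabin(text, table=TOY_TABLE, levels=3):
    """Return (key, script_vector) for text, scanning in the logical forward direction."""
    subkeys = [[] for _ in range(levels)]
    script_vector = []
    for ch in text:
        script_id, weights = table[ch]
        script_vector.append(script_id)        # step 3: one identifier per character
        for level in range(levels):
            subkeys[level].append(weights[level])
    key = tuple(tuple(subkey) for subkey in subkeys)   # steps 1 and 2
    return key, script_vector
```

For the input "aA", this sketch returns the key ((10, 10), (1, 1), (1, 2)) together with the script vector [2, 2].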

Binding considerations

This API can be used to perform a function suited to the exact specifications of the C language function strxfrm(). In the same way, it can be used directly or interfaced to perform the same function for the specific requirements of other programming languages.
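As an illustration of such a binding in another programming language, Python exposes the C library function through locale.strxfrm(). The sketch below uses the "C" locale only because it is always available; this is an assumption made to keep the example portable, and it yields plain coded-value order rather than the ordering of this standard.

```python
import locale

# The "C" locale is available everywhere; a real application would select a
# locale whose LC_COLLATE data reflects the desired (possibly tailored) table.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["beta", "Alpha", "alpha"]

# locale.strxfrm() plays the role of CARABIN: it turns a string into a key
# that can then be compared directly, here by Python's ordinary sort.
ordered = sorted(words, key=locale.strxfrm)
```

Sorting on transformed keys in this way avoids repeating the conversion for every pairwise comparison.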

5.3 Multilevel key building

5.3.1 Preliminary considerations

5.3.1.0 Assumptions

The user is responsible for tailoring the ordering table to the application's requirements. If there is no tailoring done, then the default table shall be used. The default table is acceptable for one or more natural languages of many of the writing systems dealt with by the ISO/IEC 10646 repertoire. Adaptations may be necessary for specific languages using one or many of these scripts.

See section 5.8 for the tailoring mechanism whose results are used by the comparison operation engine.

The character transformation table can be considered as a matrix of n lines, n being the number of characters in the repertoire. In each line, 4 levels are described by default. This default can be extended in the tailoring phase by the end-user. Any conforming implementation shall have provisions for handling a depth of at least 7 levels. In case of tailoring, the user shall take care that levels are adjusted so that script <SPECIAL>, whose ordering is done at the last level in the default, is normally processed separately. This will avoid collisions with any extra levels added by tailoring. It is highly recommended that only four levels be used in tailoring, the fourth one being the level reserved for special characters. This is the only way this standard can guarantee that nothing will be broken; otherwise thorough and skillful thinking by the implementer will be required, the minimum being that special characters have to be processed at the last level.

5.3.1.1 Table sections and processing properties

The table is separated into sections, one section for each script. Each section is assigned a sequential number corresponding to its order of appearance. The header of each section is named for clarity. The header describes transformation properties for each level of the script. These properties are tailored for the peculiarities of the script relative to the ordering process.

One of the tailoring possibilities is to change the relative order of a whole script relative to other scripts. Separation of the table into named sections will simplify that requirement, as well as serving to describe script properties.

The scanning direction (forward or backward) used to process the string at each level is a property of each script. These properties can be changed according to the language. Clause 5.5 describes tailoring.

One of the properties is also the possibility to assign a comparison on the numerical value representing the position of each character of two strings, before comparing weights assigned to the characters.

Note: The scanning direction (forward or backward) is not normally related to the natural writing direction of a script. The scanning direction applies only to the order processing in relation with the logical sequence of the coded character string.

According to ISO/IEC 10646, for scripts written right to left, such as Arabic, the lowest positions in the logical sequence of characters correspond to the rightmost characters of a string (from the point of view of their natural sequence). Conversely, for the Latin script, written left to right, the lowest positions in the logical sequence of characters correspond to the leftmost characters of the string (from the point of view of their natural presentation sequence).

Therefore, scanning forward starts with the lowest positions in the logical sequence, while scanning backward starts from the highest positions.

To be more precise, in ISO/IEC 10646, Arabic is artificially separated into two scripts: the logical, intrinsic Arabic, coded independently of shapes, and the presentation forms. Both allow Arabic to be coded completely, but intrinsic Arabic is normally preferred for better processing, while the presentation forms are preferred by some presentation-oriented applications.

Intrinsic Arabic is coded in the logical order, while presentation forms are coded in presentation order. The first of these two scripts is described in the default under the header <ARABINT>, standing for the normal coding, called intrinsic Arabic. The second one is described under the header <ARABFOR>, standing for Arabic forms. Scanning properties of these two artificial sections differ, the first one being scanned forward, the second one being scanned backward, for the first three default levels.

5.3.2 Key composition

A series of m subkeys is formed out of a character string composing a comparison field; m is the maximum number of levels described in either the default ordering table or the tailored ordering table. The following paragraphs describe these formations. In the default table, m is equal to 4.

5.3.2.1 Formation of properties vector

For each character string, a corresponding vector is built (another bit string) which is not used in the comparison process and which describes to which script each character of the input character string belongs. This data will be used subsequently to determine how each token of each subkey is formed.

During forward scanning of each character of the input character string, a token is concatenated to the script identifier vector, which is initially empty. The token corresponds to the value assigned to the script to which the character definition of the character in process belongs. The value of the script is the logical number assigned implicitly to the script name header of the table section in which is located the character definition. If, due to tailoring, the character definition is moved before or after another character definition, it becomes part of the script whose name header comes before the new character definition.

5.3.2.2 Formation of subkey level 1 through m minus 1 (level i; m=4 in the default)

For i varying from 1 to m minus 1 (from 1 to 3 if the default is used), form subkey level i in the following way:

During forward scanning of each character of the input character string, a token is obtained. The token corresponds to the transformation value of that character at level i.

Note: In the default definition, characters of script <SPECIAL> are ignored from level 1 through 3. The definition of these characters can be tailored to make any of them part of another script. The script <SPECIAL> is the first script to be defined in the default table. It contains special characters that are not, stricto sensu, a specific part of any natural language script - for example, "dingbats" of ISO/IEC 10646, or punctuation for most scripts.

The scanning properties for the level i being processed require careful monitoring. When there is a change in scanning direction at level i and the new direction is backward, stacking of the tokens will be done at the position where the change of direction has occurred. Therefore, when such a condition occurs, the application shall retain the current position in the output subkey i as position p (push position).

According to scanning direction assigned to the level i of the script whose identification corresponds to the character being processed, the obtained token is either added (concatenated) at the end of subkey i (which behaves like a list), or pushed at position p of subkey i (which then behaves like a stack). Subkey i is initially empty.

This is the equivalent of backward or forward scanning of the input string for that level. This property of scanning direction is given for each level of each script and is a script property. Each script header gives, for each level, the scanning direction property of the script.
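The list and stack behaviour described above can be sketched in Python as follows. The function name and the input format (one token and one scanning direction per character, in logical order) are assumptions made for the illustration; the tokens are arbitrary integers.

```python
def form_subkey(tokens_with_direction):
    """Build subkey i from (token, direction) pairs given in logical order.

    direction is "forward" or "backward": the level-i scanning property of
    the script to which each character belongs.
    """
    subkey = []                # subkey i is initially empty
    push_position = 0          # position p
    previous = "forward"
    for token, direction in tokens_with_direction:
        if direction == "backward":
            if previous != "backward":
                push_position = len(subkey)      # retain position p at the change
            subkey.insert(push_position, token)  # stack behaviour: push at p
        else:
            subkey.append(token)                 # list behaviour: concatenate
        previous = direction
    return tuple(subkey)


# Toy level 2 tokens (1 = accented, 0 = unaccented) for two four-letter words,
# the second letter being accented in the first word, the last letter in the second.
first = form_subkey([(0, "backward"), (1, "backward"), (0, "backward"), (0, "backward")])
second = form_subkey([(0, "backward"), (0, "backward"), (0, "backward"), (1, "backward")])
```

With backward scanning, the word whose accent is nearest the end of the string receives the greater subkey, so that in this toy example first compares lower than second; forward scanning of the same tokens would give the opposite order.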

Normally, in alphabetic scripts (and in the default), levels represent the following decomposition for each character:

level 1: base level of each script. This level corresponds to the basic letters of the alphabet for that script, if the script is alphabetic, and to each character of the script if the script is ideographic or syllabic;

level 2: the level corresponding to diacritical marks affecting each basic character of the script. For some scripts, diacritics are always considered an integral part of the basic letters of the alphabet, and are not considered at this second level, but rather at the first. For example, N TILDE in Spanish is considered a basic letter of the Latin script. Therefore, tailoring for Spanish will change the definition of N TILDE from "the weight of an N in the first level and a tilde weight in the second level" to "the weight of an N TILDE (placed after N and before O) in the first level, and indication of the absence of extra diacritics in the second level"

level 3: the level corresponding to case or to variant character shape that affects each basic character of the script
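The N TILDE example given for level 2 can be sketched in Python as follows. All numeric weights are invented for the illustration (letters are spaced by 10 at level 1 so that a tailored letter can be placed between two others); only the value 27 echoes the informative number shown for <TIL> in annex 1. The sketch handles lower case Latin letters only and relies on the Unicode decomposition of the character, which is an assumption outside this draft.

```python
import unicodedata

TILDE = "\u0303"     # COMBINING TILDE
N_TILDE = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE


def default_weights(ch):
    """Default-like weights: base letter at level 1, tilde mark at level 2."""
    decomposed = unicodedata.normalize("NFD", ch)
    level1 = (ord(decomposed[0]) - ord("a") + 1) * 10
    level2 = 27 if TILDE in decomposed else 0
    return (level1, level2)


def spanish_weights(ch):
    """Tailored weights: N TILDE is a basic letter placed after n and before o."""
    if ch == N_TILDE:
        return (145, 0)      # between n (140) and o (150), no extra diacritic
    return default_weights(ch)


def sort_key(word, weights):
    """Level 1 subkey followed by level 2 subkey, both scanned forward."""
    pairs = [weights(ch) for ch in word]
    return (tuple(p[0] for p in pairs), tuple(p[1] for p in pairs))


words = ["ca\u00f1a", "canto"]
default_order = sorted(words, key=lambda w: sort_key(w, default_weights))
spanish_order = sorted(words, key=lambda w: sort_key(w, spanish_weights))
```

With these toy weights, the word containing N TILDE sorts before "canto" in the default-like order and after it in the tailored order.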

5.3.2.3 Formation of subkey level m (m=4 in the default table)

During forward scanning of each character of the input character string, a pair of tokens is concatenated to subkey level m. The first token of the pair corresponds to the logical position in the original character string of the character being processed. The second token in the pair corresponds to the value assigned to that character at level m of the table. When the character is not assigned at level m in the table, it is ignored for the formation of subkey level m and no pair is concatenated. The pair of tokens is concatenated at the end of subkey level m. Subkey level m is initially empty.

This level represents the level common to all scripts. In this standard, this level is considered as the first script (under the header <SPECIAL>). The property of this level is positional in an absolute way. This means that the numerical value of the position in the original string has precedence over the weight assigned to the special character which occupies this position. This means that subkey level m is composed of a pair of values for each such character (the character string being always scanned forward in the logical string sequence). The first value of the pair corresponds to the sequential position of the character in the input string. The second value of the pair corresponds to the weight assigned to the character according to level m in script <SPECIAL>.

In the table, this behaviour is described using the parameter couple "forward, position". To be conformant to this international standard, the parameter couple "backward, position" shall never be specified for level m. These two parameters shall be considered mutually exclusive.

In the default table, the first script (whose header is named <SPECIAL>) exclusively includes characters that are not considered part of the set of basic characters of any script - for example, special characters such as SPACE, HYPHEN, and "dingbats" of ISO/IEC 10646.

In the default table, definitions of these characters for levels 1 to 3 are such that they are ignored at these levels and values are exclusively assigned to level m (m being equal to 4 in the default).
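The "forward, position" behaviour of subkey level m can be sketched in Python as follows. The weights assigned to SPACE and HYPHEN are invented, positions are counted from 1 by assumption, and the function name is hypothetical.

```python
# Hypothetical level m weights for two characters of script <SPECIAL>.
SPECIAL_WEIGHTS = {" ": 1, "-": 2}


def form_subkey_level_m(text, special_weights=SPECIAL_WEIGHTS):
    """Concatenate a (position, weight) pair for each special character of text."""
    subkey = []                                     # initially empty
    for position, ch in enumerate(text, start=1):   # forward scan, logical order
        if ch in special_weights:                   # other characters are ignored
            subkey.extend((position, special_weights[ch]))
    return tuple(subkey)


hyphenated = form_subkey_level_m("co-op")    # one pair: position 3, weight 2
plain = form_subkey_level_m("coop")          # no special character: empty subkey
```

Because the position comes first in each pair, a special character occurring earlier in the string outranks a lower-weighted one occurring later: the subkey of "c-oop" compares lower than that of "co op" in this sketch.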

5.3.2.4 Formation of subkey level 5

This extra clause has been removed from the previous draft. It was intended for processing combining characters dynamically. There are more static solutions possible which will require tailoring if ever SC22 wants to go beyond level 1 conformance of ISO/IEC 10646.

5.3.2.5 Posthandling

The posthandling phase is part of the formation of a binary comparison key. Once the binary key has been formed out of the data specified in the table, the posthandling phase shall be invoked (see discussion about the potential purposes of such a phase in annex B). The result of the posthandling phase shall be returned as subkey level m-1.

5.4 Table formation

Tables 1 through 4 are formed out of the LC_COLLATE specification data described in the following paragraphs. Each text element definition of the default contains 4 explicit values. Each value corresponds to an internally-used token.

5.5 Default table

Normative Annex 1 gives the international default ordering table used as a template for tailoring localized applications working on the full repertoire of ISO/IEC 10646 (the Universal Multiple-Octet Coded Character Set).

6. Conformance

An application conforming to this international standard shall respect all the requirements of clause 5 of this document.

7. Provisional text on data specification (incomplete, to be removed)

Reminder: Excerpt from the scope clause (to explain that 7.1, 7.2 and Annex H will be removed)

Note : [7.1, 7.2 and Annex H] will be removed: [...] it is no longer the intention of SC22/WG20 to have data specification remain in the present standard; as soon as CD 14652 will be harmonized with the syntax used in this standard, this will go away and be replaced by a normative reference. So it is possible that some raw elements of data specifications be left, incomplete, at this stage, in the present working draft, which define:

- a data specification for describing ordering tables

- a tailoring specification to complete the data specification; This tailoring provision will allow modification of the default order data for a specific set of languages in each script in a reasonably compact way, without the burden of having to modify other scripts' definitions. In this way, the default order can be used as a template to define culture-specific orders that are similar to one another as much as possible.

7.1 Data specification

The following is a recapitulation of the POSIX syntax and its enhancements. Lines preceded by * correspond exactly to present POSIX syntax; others are enhancements necessary for a more flexible, complete and tailorable default.

* LC_COLLATE

* collating-symbol <arbitrary> [from <character sequence>]

* order_start [forward|backward|position][;[...]]

to be replaced at the beginning of each script by:

<ScriptName> order [forward|backward|forward,position][;[...]]

* <U[xxxx]xxxx> [[collating-symbol | <U[xxxx]xxxx> | IGNORE] [;[...]]

This statement is a character definition

redefine <U[xxxx]xxxx> [[before|after] <U[xxxx]xxxx>]

This latter statement shall precede a new character definition

* order_end

* END LC_COLLATE

To specify a range of characters in UCS order, everywhere where one symbol representing a character can be used in an expression, the following ellipsis form is used:

<U[xxxx]xxxx>......<U[xxxx]xxxx>

This statement defines a whole range of characters

move <script-name> [before|after] <script-name>|<U[xxxx]xxxx>

create <script-name> order [forward|backward|position][;[...]]

[after|before] [<script-name>|<U[xxxx]xxxx>
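The range form <U[xxxx]xxxx>......<U[xxxx]xxxx> described above can be expanded into individual symbols as in the following Python sketch. The function name is hypothetical, and accepting any run of three or more dots as the ellipsis is an assumption of the sketch rather than a rule stated in this draft.

```python
import re

_HEX = r"([0-9A-Fa-f]{4}|[0-9A-Fa-f]{8})"
# Two UCS symbols separated by an ellipsis (three or more dots, by assumption).
_RANGE = re.compile(r"<U" + _HEX + r">\.{3,}<U" + _HEX + r">")


def expand_range(expression):
    """Return the list of symbols covered by a range expression, in UCS order."""
    match = _RANGE.fullmatch(expression)
    if match is None:
        raise ValueError(f"not a range expression: {expression!r}")
    first, last = int(match.group(1), 16), int(match.group(2), 16)
    if first > last:
        raise ValueError("range bounds are not in UCS order")
    symbols = []
    for value in range(first, last + 1):
        width = 4 if value <= 0xFFFF else 8    # short form when it fits
        symbols.append(f"<U{value:0{width}X}>")
    return symbols
```

For example, expand_range("<U0041>......<U0044>") lists the four symbols from <U0041> to <U0044>.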

7.2 Tailoring Mechanism

Essentially, this section describes how the previous new statements are handled to form an updated table. The tailoring described in this standard consists only of a table updating mechanism (which results in a new table replacing the default).

Note and questions from editor: input is expected here from Keld specification standard (see also current annex H which describes standard syntax used in 5.4 below). Should we separate: 5.4 which is the vanilla flavour of POSIX plus the script header addition and 5.5 specialized in tailoring aspects (while pointing at informative annexes [like current annex H] for explanations included in other standards)?

Normative annexes

Note: In this draft, annexes identified with a digit are intended to be normative. Annexes identified with a letter are intended to be informative.

Annex 1 (normative) International Default Table

LC_COLLATE

# Déclaration des symboles internes / Declaration of internal symbols

#

# SYMB N° Expl.

#

collating-symbol <RES1>

#

# <ARABINT>/<ARABFOR>

#

# collating-symbol <ANO> # 2 normal --> voir/see <MIN>

collating-symbol <AIS> # 3 isol.

collating-symbol <AFI> # 4 final

collating-symbol <AII> # 5 initial

collating-symbol <AME> # 6 medial/m<e'>dian

#

collating-symbol <MIN> # 7 minuscule/minuscule (bas de casse/lower case)

collating-symbol <IMI> # 8 inférieur min./subscript min. (indice/index)

collating-symbol <EMI> # 9 supér. min./superscript min. (exposant/exponent)

collating-symbol <CAP> # 10 capitale/capital (haut de casse/upper case)

collating-symbol <AMI> # 8 minuscule grecque/Greek lower case

collating-symbol <ICA> # 11 inférieur en capitale/subscript capital

collating-symbol <ECA> # 12 supérieur en capitale/superscript capital

#

# <ARABINT>/<ARABFOR>

#

collating-symbol <AMA> # 13 accent madda

collating-symbol <AHA> # 14 accent hamza

collating-symbol <AHW> # 14-1 accent hamza/waw

collating-symbol <AHS> # 14-2 accent hamza under / hamza souscrit

collating-symbol <AYE> # 14-3 accent under yeh / accent souscrit du ya'

collating-symbol <YBA> # 14-4 accent hamza/yeh barree

#

collating-symbol <BAS> # 15 de base/basic (non accentué/nonaccented)

#

collating-symbol <PCL> # 16 particulier/peculiar

collating-symbol <LIG> # 17 ligature/ligature

collating-symbol <ACA> # 18 accent aigu/acute accent

collating-symbol <GRA> # 20 accent grave/grave accent

collating-symbol <BRE> # 21 brève/breve

collating-symbol <CIR> # 22 accent circonflexe/circumflex accent

collating-symbol <CAR> # 23 caron/caron

collating-symbol <RNE> # 24 rond supérieur/ring above

collating-symbol <REU> # 25 tréma/diaeresis (ou/or umlaut)

collating-symbol <DAC> # 26 double ac. aigu/double acute ac.

collating-symbol <TIL> # 27 tilde/tilde

collating-symbol <PCT> # 28 point/dot

collating-symbol <OBL> # 29 barre oblique/oblique

collating-symbol <CDI> # 30 cédille/cedilla

collating-symbol <OGO> # 31 ogonek/ogonek

collating-symbol <MAC> # 32 macron/macron

#

collating-symbol <0>

collating-symbol <1>

collating-symbol <2>

collating-symbol <3>

collating-symbol <4>

collating-symbol <5>

collating-symbol <6>

collating-symbol <7>

collating-symbol <8>

collating-symbol <9>

#

collating-symbol <a>

collating-symbol <b>

collating-symbol <c>

collating-symbol <d>

collating-symbol <e>

collating-symbol <f>

collating-symbol <g>

collating-symbol <h>

collating-symbol <i>

collating-symbol <j>

collating-symbol <k>

collating-symbol <l>

collating-symbol <m>

collating-symbol <n>

collating-symbol <o>

collating-symbol <p>

collating-symbol <q>

collating-symbol <r>

collating-symbol <s>

collating-symbol <t>

collating-symbol <u>

collating-symbol <v>

collating-symbol <w>

collating-symbol <x>

collating-symbol <y>

collating-symbol <z>

#

# <ARABINT>/<ARABFOR>

#

collating-symbol <hamza>

collating-symbol <alef>

collating-symbol <beh>

collating-symbol <peh>

collating-symbol <teh_marbuta>

collating-symbol <teh>

collating-symbol <tteh>

collating-symbol <theh>

collating-symbol <jeem>

collating-symbol <tcheh>

collating-symbol <hah>

collating-symbol <khah>

collating-symbol <dal>

collating-symbol <ddal>

collating-symbol <thal>

collating-symbol <reh>

collating-symbol <rreh>

collating-symbol <zain>

collating-symbol <jeh>

collating-symbol <seen>

collating-symbol <sheen>

collating-symbol <sad>

collating-symbol <dad>

collating-symbol <tah>

collating-symbol <zah>

collating-symbol <ain>

collating-symbol <ghain>

collating-symbol <feh>

collating-symbol <qaf>

collating-symbol <kaf>

collating-symbol <keheh>

collating-symbol <gaf>

collating-symbol <lam>

collating-symbol <meem>

collating-symbol <noon>

collating-symbol <noon_ghunna>

collating-symbol <heh>

collating-symbol <heh_yeh>

collating-symbol <waw>

collating-symbol <alef_maksura>

collating-symbol <yeh_barree>

#

# <HEBREU>

#

collating-symbol <alef>

collating-symbol <bet>

collating-symbol <gimel>

collating-symbol <dalet>

collating-symbol <he>

collating-symbol <vav>

collating-symbol <zayin>

collating-symbol <het>

collating-symbol <tet>

collating-symbol <yod>

collating-symbol <kaf_fin>

collating-symbol <kaf>

collating-symbol <lamed>

collating-symbol <mem_fin>

collating-symbol <mem>

collating-symbol <nun_fin>

collating-symbol <nun>

collating-symbol <samekh>

collating-symbol <ayin>

collating-symbol <pe_fin>

collating-symbol <pe>

collating-symbol <tsad_fin>

collating-symbol <tsadi>

collating-symbol <qof>

collating-symbol <resh>

collating-symbol <shin>

collating-symbol <tav>

# Ordre des symboles internes / Order of internal symbols

#

# SYMB. N°

#

<RES1>

<MIN> # forme de base (bas de casse, arabe intrinsèque,

# hébreu intrinsèque, etc.)

# basic form (lower case, intrinsic Arabic,

# intrinsic Hebrew and so on) # 7

#

# <ARABINT>/<ARABFOR>

#

#<ANO> # voir <MIN>

<AIS> # isol. # 3

<AFI> # final # 4

<AII> # initial # 5

<AME> # medial/médian # 6

#

<IMI> # 7

<EMI> # 8

<CAP> # 9

<ICA> # 10

<ECA> # 11

<AMI> #alternate lower case/ # 12

# #minuscules spéciales après majuscules

# <ARABINT>/<ARABFOR>

#

<AMA> # accent madda #13

<AHA> # accent hamza #14

<AHW> # accent hamza/waw #14 1

<AHS> # accent hamza under / hamza souscrit #14 2

<AYE> # accent under yeh / accent souscrit du ya' #14 3

<YBA> # accent hamza/yeh barree #14 4

#

<BAS> # 15

#

<PCL> # 16

<LIG> # 17

<ACA> # 18

<GRA> # 19

<BRE> # 20

<CIR> # 21

<CAR> # 22

<RNE> # 23

<REU> # 24

<DAC> # 25

<TIL> # 26

<PCT> # 27

<OBL> # 28

<CDI> # 29

<OGO> # 30

<MAC> # 31

#

<0> # 48

<1> # 49

<2> # 50

<3> # 51

<4> # 52

<5> # 53

<6> # 54

<7> # 55

<8> # 56

<9> # 57

#

<a> # 97

<b> # 98

<c> # 99

<d> # 100

<e> # 101

<f> # 102

<g> # 103

<h> # 104

<i> # 105

<j> # 106

<k> # 107

<l> # 108

<m> # 109

<n> # 110

<o> # 111

<p> # 112

<q> # 113

<r> # 114

<s> # 115

<t> # 116

<u> # 117

<v> # 118

<w> # 119

<x> # 120

<y> # 121

<z> # 122

<th> # 122b

#

# <ARABINT>/<ARABFOR>

#

<hamza>

<alef>

<beh>

<peh>

<teh_marbuta>

<teh>

<tteh>

<theh>

<jeem>

<tcheh>

<hah>

<khah>

<dal>

<ddal>

<thal>

<reh>

<rreh>

<zain>

<jeh>

<seen>

<sheen>

<sad>

<dad>

<tah>

<zah>

<ain>

<ghain>

<feh>

<qaf>

<kaf>

<keheh>

<gaf>

<lam>

<meem>

<noon>

<noon_ghunna>

<heh>

<heh_yeh>

<waw>

<alef_maksura>

<yeh_barree>

#

# <HEBREU>

#

<alef>

<bet>

<gimel>

<dalet>

<he>

<vav>

<zayin>

<het>

<tet>

<yod>

<kaf_fin>

<kaf>

<lamed>

<mem_fin>

<mem>

<nun_fin>

<nun>

<samekh>

<ayin>

<pe_fin>

<pe>

<tsad_fin>

<tsadi>

<qof>

<resh>

<shin>

<tav>

<SPECIAL> order_start forward;backward;forward;forward,position

#

# Tout caractère non précisément défini sera considéré comme caractère spécial

# et considéré uniquement au dernier niveau.

#

# Any character not precisely specified will be considered as a special

# character and considered only at the last level.

#

<U0000>......<UFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<UFFFF>

#

# SYMB. N° GLY

#

<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _

<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>

<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)

<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>

<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -

<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,

<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;

<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :

<U0021> IGNORE;IGNORE;IGNORE;<U0021> # 41 !

<U00A1> IGNORE;IGNORE;IGNORE;<U00A1> # 42 ¡

<U003F> IGNORE;IGNORE;IGNORE;<U003F> # 43 ?

<U00BF> IGNORE;IGNORE;IGNORE;<U00BF> # 44 ¿

<U002F> IGNORE;IGNORE;IGNORE;<U002F> # 45 /

<U0338> IGNORE;IGNORE;IGNORE;<U0338> # 46 <"/>

<U002E> IGNORE;IGNORE;IGNORE;<U002E> # 47 .

<U00B7> IGNORE;IGNORE;IGNORE;<U00B7> # 58 ·

<U00B8> IGNORE;IGNORE;IGNORE;<U00B8> # 59 ¸

<U0328> IGNORE;IGNORE;IGNORE;<U0328> # 60 <";>

<U0027> IGNORE;IGNORE;IGNORE;<U0027> # 61 '

<U2018> IGNORE;IGNORE;IGNORE;<U2018> # 62 <'6>

<U2019> IGNORE;IGNORE;IGNORE;<U2019> # 63 <'9>

<U0022> IGNORE;IGNORE;IGNORE;<U0022> # 64 "

<U201C> IGNORE;IGNORE;IGNORE;<U201C> # 65 <"6>

<U201D> IGNORE;IGNORE;IGNORE;<U201D> # 66 <"9>

<U00AB> IGNORE;IGNORE;IGNORE;<U00AB> # 67 «

<U00BB> IGNORE;IGNORE;IGNORE;<U00BB> # 68 »

<U0028> IGNORE;IGNORE;IGNORE;<U0028> # 69 (

<U207D> IGNORE;IGNORE;IGNORE;<U207D> # 70 <(S>

<U0029> IGNORE;IGNORE;IGNORE;<U0029> # 71 )

<U207E> IGNORE;IGNORE;IGNORE;<U207E> # 72 <)S>

<U005B> IGNORE;IGNORE;IGNORE;<U005B> # 73 [

<U005D> IGNORE;IGNORE;IGNORE;<U005D> # 74 ]

<U007B> IGNORE;IGNORE;IGNORE;<U007B> # 75 {

<U007D> IGNORE;IGNORE;IGNORE;<U007D> # 76 }

<U00A7> IGNORE;IGNORE;IGNORE;<U00A7> # 77 §

<U00B6> IGNORE;IGNORE;IGNORE;<U00B6> # 78 ¶

<U00A9> IGNORE;IGNORE;IGNORE;<U00A9> # 79 ©

<U00AE> IGNORE;IGNORE;IGNORE;<U00AE> # 80 ®

<U2122> IGNORE;IGNORE;IGNORE;<U2122> # 81 <TM>

<U0040> IGNORE;IGNORE;IGNORE;<U0040> # 82 @

<U00A4> IGNORE;IGNORE;IGNORE;<U00A4> # 83 ¤

<U00A2> IGNORE;IGNORE;IGNORE;<U00A2> # 84 ¢

<U0024> IGNORE;IGNORE;IGNORE;<U0024> # 85 $

<U00A3> IGNORE;IGNORE;IGNORE;<U00A3> # 86 £

<U00A5> IGNORE;IGNORE;IGNORE;<U00A5> # 87 ¥

<U002A> IGNORE;IGNORE;IGNORE;<U002A> # 88 *

<U005C> IGNORE;IGNORE;IGNORE;<U005C> # 89 \

<U0026> IGNORE;IGNORE;IGNORE;<U0026> # 90 &

<U0023> IGNORE;IGNORE;IGNORE;<U0023> # 91 #

<U0025> IGNORE;IGNORE;IGNORE;<U0025> # 92 %

<U207B> IGNORE;IGNORE;IGNORE;<U207B> # 93 <-S>

<U002B> IGNORE;IGNORE;IGNORE;<U002B> # 94 +

<U207A> IGNORE;IGNORE;IGNORE;<U207A> # 95 <+S>

<U00B1> IGNORE;IGNORE;IGNORE;<U00B1> # 96 ±

<U00B4> IGNORE;IGNORE;IGNORE;<U00B4> # 123 ´

<U0060> IGNORE;IGNORE;IGNORE;<U0060> # 124 `

<U0306> IGNORE;IGNORE;IGNORE;<U0306> # 125 <"(>

<U005E> IGNORE;IGNORE;IGNORE;<U005E> # 126 ^

<U030C> IGNORE;IGNORE;IGNORE;<U030C> # 127 <"<>

<U030A> IGNORE;IGNORE;IGNORE;<U030A> # 128 <"0>

<U00A8> IGNORE;IGNORE;IGNORE;<U00A8> # 129 ¨

<U030B> IGNORE;IGNORE;IGNORE;<U030B> # 130 <"">

<U007E> IGNORE;IGNORE;IGNORE;<U007E> # 131 ~

<U0307> IGNORE;IGNORE;IGNORE;<U0307> # 132 <".>

<U00F7> IGNORE;IGNORE;IGNORE;<U00F7> # 133 ÷

<U00D7> IGNORE;IGNORE;IGNORE;<U00D7> # 134 ×

<U2260> IGNORE;IGNORE;IGNORE;<U2260> # 135 <!=>

<U003C> IGNORE;IGNORE;IGNORE;<U003C> # 136 <

<U2264> IGNORE;IGNORE;IGNORE;<U2264> # 137 <=<>

<U003D> IGNORE;IGNORE;IGNORE;<U003D> # 138 =

<U2265> IGNORE;IGNORE;IGNORE;<U2265> # 139 </>=>

<U003E> IGNORE;IGNORE;IGNORE;<U003E> # 140 >

<U00AC> IGNORE;IGNORE;IGNORE;<U00AC> # 141 ¬

<U007C> IGNORE;IGNORE;IGNORE;<U007C> # 142 |

<U00A6> IGNORE;IGNORE;IGNORE;<U00A6> # 143 ¦

<U00B0> IGNORE;IGNORE;IGNORE;<U00B0> # 144 °

<U00B5> IGNORE;IGNORE;IGNORE;<U00B5> # 145 µ

<U2126> IGNORE;IGNORE;IGNORE;<U2126> # 146 <Om>

<U220E> IGNORE;IGNORE;IGNORE;<U220E> # 147 <FP>

<U250C> IGNORE;IGNORE;IGNORE;<U250C> # 148 <_V/>>

<U252C> IGNORE;IGNORE;IGNORE;<U252C> # 149 <_V>

<U2510> IGNORE;IGNORE;IGNORE;<U2510> # 150 <_V<>

<U251C> IGNORE;IGNORE;IGNORE;<U251C> # 151 <_!/>>

<U253C> IGNORE;IGNORE;IGNORE;<U253C> # 152 <_!>

<U2524> IGNORE;IGNORE;IGNORE;<U2524> # 153 <_!<>

<U2514> IGNORE;IGNORE;IGNORE;<U2514> # 154 <_A/>>

<U2534> IGNORE;IGNORE;IGNORE;<U2534> # 155 <_A>

<U2518> IGNORE;IGNORE;IGNORE;<U2518> # 156 <_A<>

<U2502> IGNORE;IGNORE;IGNORE;<U2502> # 157 <_!>

<U2500> IGNORE;IGNORE;IGNORE;<U2500> # 158 <_>

#

<U2501> IGNORE;IGNORE;IGNORE;<U2501> # 159 <_=>

<U2190> IGNORE;IGNORE;IGNORE;<U2190> # 160 <<>

<U2192> IGNORE;IGNORE;IGNORE;<U2192> # 161 </>>

<U20D1> IGNORE;IGNORE;IGNORE;<U20D1> # 162 <"7>

<U2191> IGNORE;IGNORE;IGNORE;<U2191> # 163 <!>

<U2193> IGNORE;IGNORE;IGNORE;<U2193> # 164 <v>

<U266A> IGNORE;IGNORE;IGNORE;<U266A> # 165 <_d!>

<U2571> IGNORE;IGNORE;IGNORE;<U2571> # 166 <_/>//>

<U2572> IGNORE;IGNORE;IGNORE;<U2572> # 167 <_<\>

<U25E2> IGNORE;IGNORE;IGNORE;<U25E2> # 168 <_./>//>

<U25E3> IGNORE;IGNORE;IGNORE;<U25E3> # 169 <_.<\>

#

# <ARABINT>/<ARABFOR>

#

<U060C> IGNORE;IGNORE;IGNORE;<U060C>

<U061B> IGNORE;IGNORE;IGNORE;<U061B>

<U061F> IGNORE;IGNORE;IGNORE;<U061F>

<U0640> IGNORE;IGNORE;IGNORE;<U0640>

<U066A> IGNORE;IGNORE;IGNORE;<U066A>

<U066B> IGNORE;IGNORE;IGNORE;<U066B>

<U066C> IGNORE;IGNORE;IGNORE;<U066C>

<U066D> IGNORE;IGNORE;IGNORE;<U066D>

<U064B> IGNORE;IGNORE;IGNORE;<U064B> #<fathatan_no>

<UFE70> IGNORE;IGNORE;IGNORE;<UFE70> #<fathatan_is>

<UFE71> IGNORE;IGNORE;IGNORE;<UFE71> #<fathatan_me>

<U064C> IGNORE;IGNORE;IGNORE;<U064C> #<dammatan_no>

<UFE72> IGNORE;IGNORE;IGNORE;<UFE72> #<dammatan_is>

<U064D> IGNORE;IGNORE;IGNORE;<U064D> #<kasratan_no>

<UFE74> IGNORE;IGNORE;IGNORE;<UFE74> #<kasratan_is>

<U064E> IGNORE;IGNORE;IGNORE;<U064E> #<fatha_no>

<UFE76> IGNORE;IGNORE;IGNORE;<UFE76> #<fatha_is>

<UFE77> IGNORE;IGNORE;IGNORE;<UFE77> #<fatha_me>

<U064F> IGNORE;IGNORE;IGNORE;<U064F> #<damma_no>

<UFE78> IGNORE;IGNORE;IGNORE;<UFE78> #<damma_is>

<UFE79> IGNORE;IGNORE;IGNORE;<UFE79> #<damma_me>

<U0650> IGNORE;IGNORE;IGNORE;<U0650> #<kasra_no>

<UFE7A> IGNORE;IGNORE;IGNORE;<UFE7A> #<kasra_is>

<UFE7B> IGNORE;IGNORE;IGNORE;<UFE7B> #<kasra_me>

<U0651> IGNORE;IGNORE;IGNORE;<U0651> #<shadda_no>

<UFE7C> IGNORE;IGNORE;IGNORE;<UFE7C> #<shadda_is>

<UFE7D> IGNORE;IGNORE;IGNORE;<UFE7D> #<shadda_me>

<U0652> IGNORE;IGNORE;IGNORE;<U0652> #<sukun_no>

<UFE7E> IGNORE;IGNORE;IGNORE;<UFE7E> #<sukun_is>

<UFE7F> IGNORE;IGNORE;IGNORE;<UFE7F> #<sukun_me>

#

# <HEBREU>

#

<U05B0> IGNORE;IGNORE;IGNORE;<U05B0> #point_sheva

<U05B1> IGNORE;IGNORE;IGNORE;<U05B1> #point_hataf_segol

<U05B2> IGNORE;IGNORE;IGNORE;<U05B2> #point_hataf_patah

<U05B3> IGNORE;IGNORE;IGNORE;<U05B3> #point_hataf_qamats

<U05B4> IGNORE;IGNORE;IGNORE;<U05B4> #point_hiriq

<U05B5> IGNORE;IGNORE;IGNORE;<U05B5> #point_tsere

<U05B6> IGNORE;IGNORE;IGNORE;<U05B6> #point_segol

<U05B7> IGNORE;IGNORE;IGNORE;<U05B7> #point_patah

<U05B8> IGNORE;IGNORE;IGNORE;<U05B8> #point_qamats

<U05B9> IGNORE;IGNORE;IGNORE;<U05B9> #point_holam

<U05BB> IGNORE;IGNORE;IGNORE;<U05BB> #point_qubuts

<U05BC> IGNORE;IGNORE;IGNORE;<U05BC> #point_dagesh

<U05BD> IGNORE;IGNORE;IGNORE;<U05BD> #point_meteg

<U05BE> IGNORE;IGNORE;IGNORE;<U05BE> #maqaf

<U05BF> IGNORE;IGNORE;IGNORE;<U05BF> #point_rafe

<U05C0> IGNORE;IGNORE;IGNORE;<U05C0> #paseq

<U05C1> IGNORE;IGNORE;IGNORE;<U05C1> #point_shin_dot

<U05C2> IGNORE;IGNORE;IGNORE;<U05C2> #point_sin_dot

<U05C3> IGNORE;IGNORE;IGNORE;<U05C3> #sof pasuq

<LATIN> order_start forward;backward;forward;forward,position

#

<U00A0> <U0020>;<BAS>;<MIN>;IGNORE # 170 <NBSP>

#

<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0

<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1

<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2

<U0033> <3>;<BAS>;<MIN>;IGNORE # 174 3

<U0034> <4>;<BAS>;<MIN>;IGNORE # 175 4

<U0035> <5>;<BAS>;<MIN>;IGNORE # 176 5

<U0036> <6>;<BAS>;<MIN>;IGNORE # 177 6

<U0037> <7>;<BAS>;<MIN>;IGNORE # 178 7

<U0038> <8>;<BAS>;<MIN>;IGNORE # 179 8

<U0039> <9>;<BAS>;<MIN>;IGNORE # 180 9

#

<U215B> <0>;<GRA>;<MIN>;IGNORE # 181 <18>

<U00BC> <0>;<BRE>;<MIN>;IGNORE # 182 ¼

<U215C> <0>;<CIR>;<MIN>;IGNORE # 183 <38>

<U215D> <0>;<RNE>;<MIN>;IGNORE # 184 <58>

<U215E> <0>;<DAC>;<MIN>;IGNORE # 185 <78>

<U00BD> <0>;<CAR>;<MIN>;IGNORE # 186 ½

<U00BE> <0>;<REU>;<MIN>;IGNORE # 187 ¾

<U2070> <0>;<BAS>;<EMI>;IGNORE # 188 <0S>

<U00B9> <1>;<BAS>;<EMI>;IGNORE # 189 ¹

<U00B2> <2>;<BAS>;<EMI>;IGNORE # 190 ²

<U00B3> <3>;<BAS>;<EMI>;IGNORE # 191 ³

<U2074> <4>;<BAS>;<EMI>;IGNORE # 192 <4S>

<U2075> <5>;<BAS>;<EMI>;IGNORE # 193 <5S>

<U2076> <6>;<BAS>;<EMI>;IGNORE # 194 <6S>

<U2077> <7>;<BAS>;<EMI>;IGNORE # 195 <7S>

<U2078> <8>;<BAS>;<EMI>;IGNORE # 196 <8S>

<U2079> <9>;<BAS>;<EMI>;IGNORE # 197 <9S>

#

<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a

<U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª

<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á

<U00E0> <a>;<GRA>;<MIN>;IGNORE # 201 à

<U00E2> <a>;<CIR>;<MIN>;IGNORE # 202 â

<U00E3> <a>;<TIL>;<MIN>;IGNORE # 203 ã

<U00E4> <a>;<REU>;<MIN>;IGNORE # 204 ä

<U00E5> <a>;<RNE>;<MIN>;IGNORE # 205 å

<U0103> <a>;<BRE>;<MIN>;IGNORE # 206 <a(>

<U0105> <a>;<OGO>;<MIN>;IGNORE # 207 <a;>

<U0101> <a>;<MAC>;<MIN>;IGNORE # 208 <a>

<U00E6> <a><e>;<LIG><LIG>;<MIN><MIN>;IGNORE # 209 æ

<U0062> <b>;<BAS>;<MIN>;IGNORE # 210 b

<U0063> <c>;<BAS>;<MIN>;IGNORE # 211 c

<U00E7> <c>;<CDI>;<MIN>;IGNORE # 212 ç

<U0107> <c>;<ACA>;<MIN>;IGNORE # 213 <c'>

<U0109> <c>;<CIR>;<MIN>;IGNORE # 214 <c/>>

<U010D> <c>;<CAR>;<MIN>;IGNORE # 215 <c<>

<U010B> <c>;<PCT>;<MIN>;IGNORE # 216 <c.>

<U0064> <d>;<BAS>;<MIN>;IGNORE # 217 d

<U00F0> <d>;<PCL>;<MIN>;IGNORE # 218 ð

<U010F> <d>;<CAR>;<MIN>;IGNORE # 219 <d<>

<U0111> <d>;<OBL>;<MIN>;IGNORE # 220 <d//>

<U0065> <e>;<BAS>;<MIN>;IGNORE # 221 e

<U00E9> <e>;<ACA>;<MIN>;IGNORE # 222 é

<U00E8> <e>;<GRA>;<MIN>;IGNORE # 223 è

<U00EA> <e>;<CIR>;<MIN>;IGNORE # 224 ê

<U00EB> <e>;<REU>;<MIN>;IGNORE # 225 ë

<U011B> <e>;<CAR>;<MIN>;IGNORE # 226 <e<>

<U0117> <e>;<PCT>;<MIN>;IGNORE # 227 <e.>

<U0119> <e>;<OGO>;<MIN>;IGNORE # 228 <e;>

<U0113> <e>;<MAC>;<MIN>;IGNORE # 229 <e>

<U0066> <f>;<BAS>;<MIN>;IGNORE # 230 f

<U0067> <g>;<BAS>;<MIN>;IGNORE # 231 g

<U011F> <g>;<BRE>;<MIN>;IGNORE # 232 <g(>

<U011D> <g>;<CIR>;<MIN>;IGNORE # 233 <g/>>

<U0121> <g>;<PCT>;<MIN>;IGNORE # 234 <g.>

<U0123> <g>;<CDI>;<MIN>;IGNORE # 235 <g,>

<U0068> <h>;<BAS>;<MIN>;IGNORE # 236 h

<U0125> <h>;<CIR>;<MIN>;IGNORE # 237 <h/>>

<U0127> <h>;<OBL>;<MIN>;IGNORE # 238 <h//>

<U0069> <i>;<BAS>;<MIN>;IGNORE # 239 i

<U00ED> <i>;<ACA>;<MIN>;IGNORE # 240 í

<U00EC> <i>;<GRA>;<MIN>;IGNORE # 241 ì

<U00EE> <i>;<CIR>;<MIN>;IGNORE # 242 î

<U00EF> <i>;<REU>;<MIN>;IGNORE # 243 ï

<U0131> <i>;<PCL>;<MIN>;IGNORE # 244 <i.>

<U0129> <i>;<TIL>;<MIN>;IGNORE # 245 <i?>

<U012F> <i>;<OGO>;<MIN>;IGNORE # 246 <i;>

<U012B> <i>;<MAC>;<MIN>;IGNORE # 247 <i>

<U0133> <i><j>;<LIG><LIG>;<MIN><MIN>;IGNORE # 248 <ij>

<U006A> <j>;<BAS>;<MIN>;IGNORE # 249 j

<U0135> <j>;<CIR>;<MIN>;IGNORE # 250 <j/>>

<U006B> <k>;<BAS>;<MIN>;IGNORE # 251 k

<U0138> <k>;<PCL>;<MIN>;IGNORE # 252 <kk>

<U0137> <k>;<CDI>;<MIN>;IGNORE # 253 <k,>

<U006C> <l>;<BAS>;<MIN>;IGNORE # 254 l

<U013A> <l>;<ACA>;<MIN>;IGNORE # 255 <l'>

<U013E> <l>;<CAR>;<MIN>;IGNORE # 256 <l<>

<U0142> <l>;<OBL>;<MIN>;IGNORE # 257 <l//>

<U013C> <l>;<CDI>;<MIN>;IGNORE # 258 <l,>

<U0140> <l>;<PCT>;<MIN>;IGNORE # 259 <l.>

<U006D> <m>;<BAS>;<MIN>;IGNORE # 260 m

<U006E> <n>;<BAS>;<MIN>;IGNORE # 261 n

<U00F1> <n>;<TIL>;<MIN>;IGNORE # 262 ñ

<U0149> <n>;<PCL>;<MIN>;IGNORE # 263 <'n>

<U0144> <n>;<ACA>;<MIN>;IGNORE # 264 <n'>

<U0148> <n>;<CAR>;<MIN>;IGNORE # 265 <n<>

<U0146> <n>;<CDI>;<MIN>;IGNORE # 266 <n,>

<U014B> <n><g>;<LIG><LIG>;<MIN><MIN>;IGNORE # 267 <ng>

<U006F> <o>;<BAS>;<MIN>;IGNORE # 268 o

<U00BA> <o>;<PCL>;<EMI>;IGNORE # 269 º

<U00F3> <o>;<ACA>;<MIN>;IGNORE # 270 ó

<U00F2> <o>;<GRA>;<MIN>;IGNORE # 271 ò

<U00F4> <o>;<CIR>;<MIN>;IGNORE # 272 ô

<U00F5> <o>;<TIL>;<MIN>;IGNORE # 273 õ

<U00F6> <o>;<REU>;<MIN>;IGNORE # 274 ö

<U00F8> <o>;<OBL>;<MIN>;IGNORE # 275 ø

<U0151> <o>;<DAC>;<MIN>;IGNORE # 276 <o">

<U014D> <o>;<MAC>;<MIN>;IGNORE # 277 <o>

<U0153> <o><e>;<LIG><LIG>;<MIN><MIN>;IGNORE # 278 <oe>

<U0070> <p>;<BAS>;<MIN>;IGNORE # 279 p

<U0071> <q>;<BAS>;<MIN>;IGNORE # 280 q

<U0072> <r>;<BAS>;<MIN>;IGNORE # 281 r

<U0155> <r>;<ACA>;<MIN>;IGNORE # 282 <r'>

<U0159> <r>;<CAR>;<MIN>;IGNORE # 283 <r<>

<U0157> <r>;<CDI>;<MIN>;IGNORE # 284 <r,>

<U0073> <s>;<BAS>;<MIN>;IGNORE # 285 s

<U015B> <s>;<ACA>;<MIN>;IGNORE # 286 <s'>

<U015D> <s>;<CIR>;<MIN>;IGNORE # 287 <s/>>

<U0161> <s>;<CAR>;<MIN>;IGNORE # 288 <s<>

<U015F> <s>;<CDI>;<MIN>;IGNORE # 289 <s,>

<U00DF> <s><s>;<LIG><LIG>;<MIN><MIN>;IGNORE # 290 ß

<U0074> <t>;<BAS>;<MIN>;IGNORE # 291 t

<U0165> <t>;<CAR>;<MIN>;IGNORE # 292 <t<>

<U0167> <t>;<OBL>;<MIN>;IGNORE # 293 <t//>

<U0163> <t>;<CDI>;<MIN>;IGNORE # 294 <t,>

<U0075> <u>;<BAS>;<MIN>;IGNORE # 296 u

<U00FA> <u>;<ACA>;<MIN>;IGNORE # 297 ú

<U00F9> <u>;<GRA>;<MIN>;IGNORE # 298 ù

<U00FB> <u>;<CIR>;<MIN>;IGNORE # 299 û

<U00FC> <u>;<REU>;<MIN>;IGNORE # 300 ü

<U016D> <u>;<BRE>;<MIN>;IGNORE # 301 <u(>

<U016F> <u>;<RNE>;<MIN>;IGNORE # 302 <u0>

<U0171> <u>;<DAC>;<MIN>;IGNORE # 303 <u">

<U0169> <u>;<TIL>;<MIN>;IGNORE # 304 <u?>

<U0173> <u>;<OGO>;<MIN>;IGNORE # 305 <u;>

<U016B> <u>;<MAC>;<MIN>;IGNORE # 306 <u>

<U0076> <v>;<BAS>;<MIN>;IGNORE # 307 v

<U0077> <w>;<BAS>;<MIN>;IGNORE # 308 w

<U0175> <w>;<CIR>;<MIN>;IGNORE # 309 <w/>>

<U0078> <x>;<BAS>;<MIN>;IGNORE # 310 x

<U0079> <y>;<BAS>;<MIN>;IGNORE # 311 y

<U00FD> <y>;<ACA>;<MIN>;IGNORE # 312 ý

<U00FF> <y>;<REU>;<MIN>;IGNORE # 313 ÿ

<U0177> <y>;<CIR>;<MIN>;IGNORE # 314 <y/>>

<U007A> <z>;<BAS>;<MIN>;IGNORE # 315 z

<U017A> <z>;<ACA>;<MIN>;IGNORE # 316 <z'>

<U017E> <z>;<CAR>;<MIN>;IGNORE # 317 <z<>

<U017C> <z>;<PCT>;<MIN>;IGNORE # 318 <z.>

<U00FE> <th>;<BAS>;<MIN>;IGNORE # 318b þ

#

<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A

<U00C1> <a>;<ACA>;<CAP>;IGNORE # 320 Á

<U00C0> <a>;<GRA>;<CAP>;IGNORE # 321 À

<U00C2> <a>;<CIR>;<CAP>;IGNORE # 322 Â

<U00C3> <a>;<TIL>;<CAP>;IGNORE # 323 Ã

<U00C4> <a>;<REU>;<CAP>;IGNORE # 324 Ä

<U00C5> <a>;<RNE>;<CAP>;IGNORE # 325 Å

<U0102> <a>;<BRE>;<CAP>;IGNORE # 326 <A(>

<U0104> <a>;<OGO>;<CAP>;IGNORE # 327 <A;>

<U0100> <a>;<MAC>;<CAP>;IGNORE # 328 <A>

<U00C6> <a><e>;<LIG><LIG>;<CAP><CAP>;IGNORE # 329 Æ

<U0042> <b>;<BAS>;<CAP>;IGNORE # 330 B

<U0043> <c>;<BAS>;<CAP>;IGNORE # 331 C

<U00C7> <c>;<CDI>;<CAP>;IGNORE # 332 Ç

<U0106> <c>;<ACA>;<CAP>;IGNORE # 333 <C'>

<U0108> <c>;<CIR>;<CAP>;IGNORE # 334 <C/>>

<U010C> <c>;<CAR>;<CAP>;IGNORE # 335 <C<>

<U010A> <c>;<PCT>;<CAP>;IGNORE # 336 <C.>

<U0044> <d>;<BAS>;<CAP>;IGNORE # 337 D

<U00D0> <d>;<PCL>;<CAP>;IGNORE # 338 Ð

<U010E> <d>;<CAR>;<CAP>;IGNORE # 339 <D<>

<U0110> <d>;<OBL>;<CAP>;IGNORE # 340 <D//>

<U0045> <e>;<BAS>;<CAP>;IGNORE # 341 E

<U00C9> <e>;<ACA>;<CAP>;IGNORE # 342 É

<U00C8> <e>;<GRA>;<CAP>;IGNORE # 343 È

<U00CA> <e>;<CIR>;<CAP>;IGNORE # 344 Ê

<U00CB> <e>;<REU>;<CAP>;IGNORE # 345 Ë

<U011A> <e>;<CAR>;<CAP>;IGNORE # 346 <E<>

<U0116> <e>;<PCT>;<CAP>;IGNORE # 347 <E.>

<U0118> <e>;<OGO>;<CAP>;IGNORE # 348 <E;>

<U0112> <e>;<MAC>;<CAP>;IGNORE # 349 <E>

<U0046> <f>;<BAS>;<CAP>;IGNORE # 350 F

<U0047> <g>;<BAS>;<CAP>;IGNORE # 351 G

<U011E> <g>;<BRE>;<CAP>;IGNORE # 352 <G(>

<U011C> <g>;<CIR>;<CAP>;IGNORE # 353 <G/>>

<U0120> <g>;<PCT>;<CAP>;IGNORE # 354 <G.>

<U0122> <g>;<CDI>;<CAP>;IGNORE # 355 <G,>

<U0048> <h>;<BAS>;<CAP>;IGNORE # 356 H

<U0124> <h>;<CIR>;<CAP>;IGNORE # 357 <H/>>

<U0126> <h>;<OBL>;<CAP>;IGNORE # 358 <H//>

<U0049> <i>;<BAS>;<CAP>;IGNORE # 359 I

<U00CD> <i>;<ACA>;<CAP>;IGNORE # 360 Í

<U00CC> <i>;<GRA>;<CAP>;IGNORE # 361 Ì

<U00CE> <i>;<CIR>;<CAP>;IGNORE # 362 Î

<U00CF> <i>;<REU>;<CAP>;IGNORE # 363 Ï

<U0130> <i>;<PCL>;<CAP>;IGNORE # 364 <I.>

<U0128> <i>;<TIL>;<CAP>;IGNORE # 365 <I?>

<U012E> <i>;<OGO>;<CAP>;IGNORE # 366 <I;>

<U012A> <i>;<MAC>;<CAP>;IGNORE # 367 <I>

<U0132> <i><j>;<LIG><LIG>;<CAP><CAP>;IGNORE # 368 <IJ>

<U004A> <j>;<BAS>;<CAP>;IGNORE # 369 J

<U0134> <j>;<CIR>;<CAP>;IGNORE # 370 <J/>>

<U004B> <k>;<BAS>;<CAP>;IGNORE # 371 K

<U0136> <k>;<CDI>;<CAP>;IGNORE # 372 <K,>

<U004C> <l>;<BAS>;<CAP>;IGNORE # 373 L

<U0139> <l>;<ACA>;<CAP>;IGNORE # 374 <L'>

<U013D> <l>;<CAR>;<CAP>;IGNORE # 375 <L<>

<U0141> <l>;<OBL>;<CAP>;IGNORE # 376 <L//>

<U013B> <l>;<CDI>;<CAP>;IGNORE # 377 <L,>

<U013F> <l>;<PCT>;<CAP>;IGNORE # 378 <L.>

<U004D> <m>;<BAS>;<CAP>;IGNORE # 379 M

<U004E> <n>;<BAS>;<CAP>;IGNORE # 380 N

<U00D1> <n>;<TIL>;<CAP>;IGNORE # 381 Ñ

<U0143> <n>;<ACA>;<CAP>;IGNORE # 382 <N'>

<U0147> <n>;<CAR>;<CAP>;IGNORE # 383 <N<>

<U0145> <n>;<CDI>;<CAP>;IGNORE # 384 <N,>

<U014A> <n><g>;<LIG><LIG>;<CAP><CAP>;IGNORE # 385 <NG>

<U004F> <o>;<BAS>;<CAP>;IGNORE # 386 O

<U00D3> <o>;<ACA>;<CAP>;IGNORE # 387 Ó

<U00D2> <o>;<GRA>;<CAP>;IGNORE # 388 Ò

<U00D4> <o>;<CIR>;<CAP>;IGNORE # 389 Ô

<U00D5> <o>;<TIL>;<CAP>;IGNORE # 390 Õ

<U00D6> <o>;<REU>;<CAP>;IGNORE # 391 Ö

<U00D8> <o>;<OBL>;<CAP>;IGNORE # 392 Ø

<U0150> <o>;<DAC>;<CAP>;IGNORE # 393 <O">

<U014C> <o>;<MAC>;<CAP>;IGNORE # 394 <O>

<U0152> <o><e>;<LIG><LIG>;<CAP><CAP>;IGNORE # 395 <OE>

<U0050> <p>;<BAS>;<CAP>;IGNORE # 396 P

<U0051> <q>;<BAS>;<CAP>;IGNORE # 397 Q

<U0052> <r>;<BAS>;<CAP>;IGNORE # 398 R

<U0154> <r>;<ACA>;<CAP>;IGNORE # 399 <R'>

<U0158> <r>;<CAR>;<CAP>;IGNORE # 400 <R<>

<U0156> <r>;<CDI>;<CAP>;IGNORE # 401 <R,>

<U0053> <s>;<BAS>;<CAP>;IGNORE # 402 S

<U015A> <s>;<ACA>;<CAP>;IGNORE # 403 <S'>

<U015C> <s>;<CIR>;<CAP>;IGNORE # 404 <S/>>

<U0160> <s>;<CAR>;<CAP>;IGNORE # 405 <S<>

<U015E> <s>;<CDI>;<CAP>;IGNORE # 406 <S,>

<U0054> <t>;<BAS>;<CAP>;IGNORE # 407 T

<U0164> <t>;<CAR>;<CAP>;IGNORE # 408 <T<>

<U0166> <t>;<OBL>;<CAP>;IGNORE # 409 <T//>

<U0162> <t>;<CDI>;<CAP>;IGNORE # 410 <T,>

<U0055> <u>;<BAS>;<CAP>;IGNORE # 412 U

<U00DA> <u>;<ACA>;<CAP>;IGNORE # 413 Ú

<U00D9> <u>;<GRA>;<CAP>;IGNORE # 414 Ù

<U00DB> <u>;<CIR>;<CAP>;IGNORE # 415 Û

<U00DC> <u>;<REU>;<CAP>;IGNORE # 416 Ü

<U016C> <u>;<BRE>;<CAP>;IGNORE # 417 <U(>

<U016E> <u>;<RNE>;<CAP>;IGNORE # 418 <U0>

<U0170> <u>;<DAC>;<CAP>;IGNORE # 419 <U">

<U0168> <u>;<TIL>;<CAP>;IGNORE # 420 <U?>

<U0172> <u>;<OGO>;<CAP>;IGNORE # 421 <U;>

<U016A> <u>;<MAC>;<CAP>;IGNORE # 422 <U>

<U0056> <v>;<BAS>;<CAP>;IGNORE # 423 V

<U0057> <w>;<BAS>;<CAP>;IGNORE # 424 W

<U0174> <w>;<CIR>;<CAP>;IGNORE # 425 <W/>>

<U0058> <x>;<BAS>;<CAP>;IGNORE # 426 X

<U0059> <y>;<BAS>;<CAP>;IGNORE # 427 Y

<U00DD> <y>;<ACA>;<CAP>;IGNORE # 428 Ý

<U0176> <y>;<CIR>;<CAP>;IGNORE # 429 <Y/>>

<U0178> <y>;<REU>;<CAP>;IGNORE # 430 <Y:>

<U005A> <z>;<BAS>;<CAP>;IGNORE # 431 Z

<U0179> <z>;<ACA>;<CAP>;IGNORE # 432 <Z'>

<U017D> <z>;<CAR>;<CAP>;IGNORE # 433 <Z<>

<U017B> <z>;<PCT>;<CAP>;IGNORE # 434 <Z.>

<U00DE> <th>;<BAS>;<CAP>;IGNORE # 411 Þ

<ARABINT> order_start forward;forward;forward;forward,position

<U0660> <0>;<BAS>;<MIN>;IGNORE

<U06F0> <0>;<PCL>;<MIN>;IGNORE

<U0661> <1>;<BAS>;<MIN>;IGNORE

<U06F1> <1>;<PCL>;<MIN>;IGNORE

<U0662> <2>;<BAS>;<MIN>;IGNORE

<U06F2> <2>;<PCL>;<MIN>;IGNORE

<U0663> <3>;<BAS>;<MIN>;IGNORE

<U06F3> <3>;<PCL>;<MIN>;IGNORE

<U0664> <4>;<BAS>;<MIN>;IGNORE

<U06F4> <4>;<PCL>;<MIN>;IGNORE

<U0665> <5>;<BAS>;<MIN>;IGNORE

<U06F5> <5>;<PCL>;<MIN>;IGNORE

<U0666> <6>;<BAS>;<MIN>;IGNORE

<U06F6> <6>;<PCL>;<MIN>;IGNORE

<U0667> <7>;<BAS>;<MIN>;IGNORE

<U06F7> <7>;<PCL>;<MIN>;IGNORE

<U0668> <8>;<BAS>;<MIN>;IGNORE

<U06F8> <8>;<PCL>;<MIN>;IGNORE

<U0669> <9>;<BAS>;<MIN>;IGNORE

<U06F9> <9>;<PCL>;<MIN>;IGNORE

<U0621> <hamza>;<BAS>;<MIN>;IGNORE

<U0622> <alef>;<AMA>;<MIN>;IGNORE

<U0623> <alef>;<AHA>;<MIN>;IGNORE

<U0625> <alef>;<AHS>;<MIN>;IGNORE

<U0627> <alef>;<BAS>;<MIN>;IGNORE

<U0628> <beh>;<BAS>;<MIN>;IGNORE

<U067E> <peh>;<BAS>;<MIN>;IGNORE

<U0629> <teh_marbuta>;<BAS>;<MIN>;IGNORE

<U062A> <teh>;<BAS>;<MIN>;IGNORE

<U0679> <tteh>;<BAS>;<MIN>;IGNORE

<U062B> <theh>;<BAS>;<MIN>;IGNORE

<U062C> <jeem>;<BAS>;<MIN>;IGNORE

<U0686> <tcheh>;<BAS>;<MIN>;IGNORE

<U062D> <hah>;<BAS>;<MIN>;IGNORE

<U062E> <khah>;<BAS>;<MIN>;IGNORE

<U062F> <dal>;<BAS>;<MIN>;IGNORE

<U0688> <ddal>;<BAS>;<MIN>;IGNORE

<U0630> <thal>;<BAS>;<MIN>;IGNORE

<U0631> <reh>;<BAS>;<MIN>;IGNORE

<U0691> <rreh>;<BAS>;<MIN>;IGNORE

<U0632> <zain>;<BAS>;<MIN>;IGNORE

<U0698> <jeh>;<BAS>;<MIN>;IGNORE

<U0633> <seen>;<BAS>;<MIN>;IGNORE

<U0634> <sheen>;<BAS>;<MIN>;IGNORE

<U0635> <sad>;<BAS>;<MIN>;IGNORE

<U0636> <dad>;<BAS>;<MIN>;IGNORE

<U0637> <tah>;<BAS>;<MIN>;IGNORE

<U0638> <zah>;<BAS>;<MIN>;IGNORE

<U0639> <ain>;<BAS>;<MIN>;IGNORE

<U063A> <ghain>;<BAS>;<MIN>;IGNORE

<U0641> <feh>;<BAS>;<MIN>;IGNORE

<U0642> <qaf>;<BAS>;<MIN>;IGNORE

<U0643> <kaf>;<BAS>;<MIN>;IGNORE

<U06A9> <keheh>;<BAS>;<MIN>;IGNORE

<U06AF> <gaf>;<BAS>;<MIN>;IGNORE

<U0644> <lam>;<BAS>;<MIN>;IGNORE

<U0645> <meem>;<BAS>;<MIN>;IGNORE

<U0646> <noon>;<BAS>;<MIN>;IGNORE

<U06BA> <noon_ghunna>;<BAS>;<MIN>;IGNORE

<U0647> <heh>;<BAS>;<MIN>;IGNORE

<U06C0> <heh_yeh>;<BAS>;<MIN>;IGNORE

<U0624> <waw>;<AHW>;<MIN>;IGNORE

<U0648> <waw>;<BAS>;<MIN>;IGNORE

<U0649> <alef_maksura>;<BAS>;<MIN>;IGNORE

<U0626> <alef_maksura><hamza>;<BAS><BAS>;<MIN><MIN>;IGNORE

<U064A> <alef_maksura>;<AYE>;<MIN>;IGNORE

<U06D3> <yeh_barree>;<YBA>;<MIN>;IGNORE

<U06D2> <yeh_barree>;<BAS>;<MIN>;IGNORE

<ARABFOR> order_start backward;backward;backward;forward,position

<UFE80> <hamza>;<BAS>;<AIS>;IGNORE

<UFE81> <alef>;<AMA>;<AIS>;IGNORE

<UFE82> <alef>;<AMA>;<AFI>;IGNORE

<UFE83> <alef>;<AHA>;<AIS>;IGNORE

<UFE84> <alef>;<AHA>;<AFI>;IGNORE

<UFE87> <alef>;<AHS>;<AIS>;IGNORE

<UFE88> <alef>;<AHS>;<AFI>;IGNORE

<UFE8D> <alef>;<BAS>;<AIS>;IGNORE

<UFE8E> <alef>;<BAS>;<AFI>;IGNORE

<UFE8F> <beh>;<BAS>;<AIS>;IGNORE

<UFE90> <beh>;<BAS>;<AFI>;IGNORE

<UFE91> <beh>;<BAS>;<AII>;IGNORE

<UFE92> <beh>;<BAS>;<AME>;IGNORE

<UFB56> <peh>;<BAS>;<AIS>;IGNORE

<UFB57> <peh>;<BAS>;<AFI>;IGNORE

<UFB58> <peh>;<BAS>;<AII>;IGNORE

<UFB59> <peh>;<BAS>;<AME>;IGNORE

<UFE93> <teh_marbuta>;<BAS>;<AIS>;IGNORE

<UFE94> <teh_marbuta>;<BAS>;<AFI>;IGNORE

<UFE95> <teh>;<BAS>;<AIS>;IGNORE

<UFE96> <teh>;<BAS>;<AFI>;IGNORE

<UFE97> <teh>;<BAS>;<AII>;IGNORE

<UFE98> <teh>;<BAS>;<AME>;IGNORE

<UFB66> <tteh>;<BAS>;<AIS>;IGNORE

<UFB67> <tteh>;<BAS>;<AFI>;IGNORE

<UFB68> <tteh>;<BAS>;<AII>;IGNORE

<UFB69> <tteh>;<BAS>;<AME>;IGNORE

<UFE99> <theh>;<BAS>;<AIS>;IGNORE

<UFE9A> <theh>;<BAS>;<AFI>;IGNORE

<UFE9B> <theh>;<BAS>;<AII>;IGNORE

<UFE9C> <theh>;<BAS>;<AME>;IGNORE

<UFE9D> <jeem>;<BAS>;<AIS>;IGNORE

<UFE9E> <jeem>;<BAS>;<AFI>;IGNORE

<UFE9F> <jeem>;<BAS>;<AII>;IGNORE

<UFEA0> <jeem>;<BAS>;<AME>;IGNORE

<UFB7A> <tcheh>;<BAS>;<AIS>;IGNORE

<UFB7B> <tcheh>;<BAS>;<AFI>;IGNORE

<UFB7C> <tcheh>;<BAS>;<AII>;IGNORE

<UFB7D> <tcheh>;<BAS>;<AME>;IGNORE

<UFEA1> <hah>;<BAS>;<AIS>;IGNORE

<UFEA2> <hah>;<BAS>;<AFI>;IGNORE

<UFEA3> <hah>;<BAS>;<AII>;IGNORE

<UFEA4> <hah>;<BAS>;<AME>;IGNORE

<UFEA5> <khah>;<BAS>;<AIS>;IGNORE

<UFEA6> <khah>;<BAS>;<AFI>;IGNORE

<UFEA7> <khah>;<BAS>;<AII>;IGNORE

<UFEA8> <khah>;<BAS>;<AME>;IGNORE

<UFEA9> <dal>;<BAS>;<AIS>;IGNORE

<UFEAA> <dal>;<BAS>;<AFI>;IGNORE

<UFB88> <ddal>;<BAS>;<AIS>;IGNORE

<UFB89> <ddal>;<BAS>;<AFI>;IGNORE

<UFEAB> <thal>;<BAS>;<AIS>;IGNORE

<UFEAC> <thal>;<BAS>;<AFI>;IGNORE

<UFEAD> <reh>;<BAS>;<AIS>;IGNORE

<UFEAE> <reh>;<BAS>;<AFI>;IGNORE

<UFB8C> <rreh>;<BAS>;<AIS>;IGNORE

<UFB8D> <rreh>;<BAS>;<AFI>;IGNORE

<UFEAF> <zain>;<BAS>;<AIS>;IGNORE

<UFEB0> <zain>;<BAS>;<AFI>;IGNORE

<UFB8A> <jeh>;<BAS>;<AIS>;IGNORE

<UFB8B> <jeh>;<BAS>;<AFI>;IGNORE

<UFEB1> <seen>;<BAS>;<AIS>;IGNORE

<UFEB2> <seen>;<BAS>;<AFI>;IGNORE

<UFEB3> <seen>;<BAS>;<AII>;IGNORE

<UFEB4> <seen>;<BAS>;<AME>;IGNORE

<UFEB5> <sheen>;<BAS>;<AIS>;IGNORE

<UFEB6> <sheen>;<BAS>;<AFI>;IGNORE

<UFEB7> <sheen>;<BAS>;<AII>;IGNORE

<UFEB8> <sheen>;<BAS>;<AME>;IGNORE

<UFEB9> <sad>;<BAS>;<AIS>;IGNORE

<UFEBA> <sad>;<BAS>;<AFI>;IGNORE

<UFEBB> <sad>;<BAS>;<AII>;IGNORE

<UFEBC> <sad>;<BAS>;<AME>;IGNORE

<UFEBD> <dad>;<BAS>;<AIS>;IGNORE

<UFEBE> <dad>;<BAS>;<AFI>;IGNORE

<UFEBF> <dad>;<BAS>;<AII>;IGNORE

<UFEC0> <dad>;<BAS>;<AME>;IGNORE

<UFEC1> <tah>;<BAS>;<AIS>;IGNORE

<UFEC2> <tah>;<BAS>;<AFI>;IGNORE

<UFEC3> <tah>;<BAS>;<AII>;IGNORE

<UFEC4> <tah>;<BAS>;<AME>;IGNORE

<UFEC5> <zah>;<BAS>;<AIS>;IGNORE

<UFEC6> <zah>;<BAS>;<AFI>;IGNORE

<UFEC7> <zah>;<BAS>;<AII>;IGNORE

<UFEC8> <zah>;<BAS>;<AME>;IGNORE

<UFEC9> <ain>;<BAS>;<AIS>;IGNORE

<UFECA> <ain>;<BAS>;<AFI>;IGNORE

<UFECB> <ain>;<BAS>;<AII>;IGNORE

<UFECC> <ain>;<BAS>;<AME>;IGNORE

<UFECD> <ghain>;<BAS>;<AIS>;IGNORE

<UFECE> <ghain>;<BAS>;<AFI>;IGNORE

<UFECF> <ghain>;<BAS>;<AII>;IGNORE

<UFED0> <ghain>;<BAS>;<AME>;IGNORE

<UFED1> <feh>;<BAS>;<AIS>;IGNORE

<UFED2> <feh>;<BAS>;<AFI>;IGNORE

<UFED3> <feh>;<BAS>;<AII>;IGNORE

<UFED4> <feh>;<BAS>;<AME>;IGNORE

<UFED5> <qaf>;<BAS>;<AIS>;IGNORE

<UFED6> <qaf>;<BAS>;<AFI>;IGNORE

<UFED7> <qaf>;<BAS>;<AII>;IGNORE

<UFED8> <qaf>;<BAS>;<AME>;IGNORE

<UFED9> <kaf>;<BAS>;<AIS>;IGNORE

<UFEDA> <kaf>;<BAS>;<AFI>;IGNORE

<UFEDB> <kaf>;<BAS>;<AII>;IGNORE

<UFEDC> <kaf>;<BAS>;<AME>;IGNORE

<UFB8E> <keheh>;<BAS>;<AIS>;IGNORE

<UFB8F> <keheh>;<BAS>;<AFI>;IGNORE

<UFB90> <keheh>;<BAS>;<AII>;IGNORE

<UFB91> <keheh>;<BAS>;<AME>;IGNORE

<UFB92> <gaf>;<BAS>;<AIS>;IGNORE

<UFB93> <gaf>;<BAS>;<AFI>;IGNORE

<UFB94> <gaf>;<BAS>;<AII>;IGNORE

<UFB95> <gaf>;<BAS>;<AME>;IGNORE

<UFEDD> <lam>;<BAS>;<AIS>;IGNORE

<UFEDE> <lam>;<BAS>;<AFI>;IGNORE

<UFEDF> <lam>;<BAS>;<AII>;IGNORE

<UFEE0> <lam>;<BAS>;<AME>;IGNORE

<UFEE1> <meem>;<BAS>;<AIS>;IGNORE

<UFEE2> <meem>;<BAS>;<AFI>;IGNORE

<UFEE3> <meem>;<BAS>;<AII>;IGNORE

<UFEE4> <meem>;<BAS>;<AME>;IGNORE

<UFEE5> <noon>;<BAS>;<AIS>;IGNORE

<UFEE6> <noon>;<BAS>;<AFI>;IGNORE

<UFEE7> <noon>;<BAS>;<AII>;IGNORE

<UFEE8> <noon>;<BAS>;<AME>;IGNORE

<UFB9E> <noon_ghunna>;<BAS>;<AIS>;IGNORE

<UFB9F> <noon_ghunna>;<BAS>;<AFI>;IGNORE

<UFEE9> <heh>;<BAS>;<AIS>;IGNORE

<UFEEA> <heh>;<BAS>;<AFI>;IGNORE

<UFEEB> <heh>;<BAS>;<AII>;IGNORE

<UFEEC> <heh>;<BAS>;<AME>;IGNORE

<UFBA4> <heh_yeh>;<BAS>;<AIS>;IGNORE

<UFBA5> <heh_yeh>;<BAS>;<AFI>;IGNORE

<UFE85> <waw>;<AHW>;<AIS>;IGNORE

<UFE86> <waw>;<AHW>;<AFI>;IGNORE

<UFEED> <waw>;<BAS>;<AIS>;IGNORE

<UFEEE> <waw>;<BAS>;<AFI>;IGNORE

<UFEEF> <alef_maksura>;<BAS>;<AIS>;IGNORE

<UFEF0> <alef_maksura>;<BAS>;<AFI>;IGNORE

<UFE89> <alef_maksura><hamza>;<BAS><BAS>;<AIS><AIS>;IGNORE

<UFE8A> <alef_maksura><hamza>;<BAS><BAS>;<AFI><AIS>;IGNORE

<UFE8B> <alef_maksura><hamza>;<BAS><BAS>;<AII><AIS>;IGNORE

<UFE8C> <alef_maksura><hamza>;<BAS><BAS>;<AME><AIS>;IGNORE

<UFEF1> <alef_maksura>;<AYE>;<AIS>;IGNORE

<UFEF2> <alef_maksura>;<AYE>;<AFI>;IGNORE

<UFEF3> <alef_maksura>;<AYE>;<AII>;IGNORE

<UFEF4> <alef_maksura>;<AYE>;<AME>;IGNORE

<UFBB0> <yeh_barree>;<YBA>;<AIS>;IGNORE

<UFBB1> <yeh_barree>;<YBA>;<AFI>;IGNORE

<UFBAE> <yeh_barree>;<BAS>;<AIS>;IGNORE

<UFBAF> <yeh_barree>;<BAS>;<AFI>;IGNORE

<UFEF5> <lam><alef>;<BAS><AMA>;<AIS><AFI>;IGNORE

<UFEF6> <lam><alef>;<BAS><AMA>;<AFI><AFI>;IGNORE

<UFEF7> <lam><alef>;<BAS><AHA>;<AIS><AFI>;IGNORE

<UFEF8> <lam><alef>;<BAS><AHA>;<AFI><AFI>;IGNORE

<UFEF9> <lam><alef>;<BAS><AHS>;<AIS><AFI>;IGNORE

<UFEFA> <lam><alef>;<BAS><AHS>;<AFI><AFI>;IGNORE

<UFEFB> <lam><alef>;<BAS><BAS>;<AIS><AFI>;IGNORE

<UFEFC> <lam><alef>;<BAS><BAS>;<AFI><AFI>;IGNORE

<HEBREU> order_start forward;forward;forward;forward,position

<U05D0> <alef>;<BAS>;IGNORE;IGNORE

<U05D1> <bet>;<BAS>;IGNORE;IGNORE

<U05D2> <gimel>;<BAS>;IGNORE;IGNORE

<U05D3> <dalet>;<BAS>;IGNORE;IGNORE

<U05D4> <he>;<BAS>;IGNORE;IGNORE

<U05D5> <vav>;<BAS>;IGNORE;IGNORE

<U05D6> <zayin>;<BAS>;IGNORE;IGNORE

<U05D7> <het>;<BAS>;IGNORE;IGNORE

<U05D8> <tet>;<BAS>;IGNORE;IGNORE

<U05D9> <yod>;<BAS>;IGNORE;IGNORE

<U05DA> <kaf_fin>;<BAS>;IGNORE;IGNORE

<U05DB> <kaf>;<BAS>;IGNORE;IGNORE

<U05DC> <lamed>;<BAS>;IGNORE;IGNORE

<U05DD> <mem_fin>;<BAS>;IGNORE;IGNORE

<U05DE> <mem>;<BAS>;IGNORE;IGNORE

<U05DF> <nun_fin>;<BAS>;IGNORE;IGNORE

<U05E0> <nun>;<BAS>;IGNORE;IGNORE

<U05E1> <samekh>;<BAS>;IGNORE;IGNORE

<U05E2> <ayin>;<BAS>;IGNORE;IGNORE

<U05E3> <pe_fin>;<BAS>;IGNORE;IGNORE

<U05E4> <pe>;<BAS>;IGNORE;IGNORE

<U05E5> <tsad_fin>;<BAS>;IGNORE;IGNORE

<U05E6> <tsadi>;<BAS>;IGNORE;IGNORE

<U05E7> <qof>;<BAS>;IGNORE;IGNORE

<U05E8> <resh>;<BAS>;IGNORE;IGNORE

<U05E9> <shin>;<BAS>;IGNORE;IGNORE

<U05EA> <tav>;<BAS>;IGNORE;IGNORE

<HAN> order_start forward;forward;forward;forward,position

<U4E00>......<U9FA5> <U4E00>......<U9FA5>;IGNORE;IGNORE;IGNORE

#

order_end

#

END LC_COLLATE
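The effect of the level directions declared in the table can be illustrated with a minimal sketch. It is not part of the table and is not the normative comparison method. It assumes a hypothetical Python rendering of a few <LATIN> entries (the names SYMBOL_ORDER, WEIGHTS and sort_key are illustrative only), with level 1 scanned forward, level 2 backward and level 3 forward; level 4 is omitted because it is IGNORE for these letters.

```python
# Hypothetical excerpt: internal symbols in the relative order given by the table.
SYMBOL_ORDER = ["MIN", "CAP", "BAS", "ACA", "GRA", "CIR", "c", "e", "o", "t"]
RANK = {name: position for position, name in enumerate(SYMBOL_ORDER)}

# character -> (level 1, level 2, level 3), copied from the <LATIN> entries.
WEIGHTS = {
    "c": ("c", "BAS", "MIN"),
    "e": ("e", "BAS", "MIN"),
    "\u00e9": ("e", "ACA", "MIN"),  # e with acute accent
    "o": ("o", "BAS", "MIN"),
    "\u00f4": ("o", "CIR", "MIN"),  # o with circumflex accent
    "t": ("t", "BAS", "MIN"),
    "C": ("c", "BAS", "CAP"),
}


def sort_key(text):
    """Build a three-level key: forward; backward; forward."""
    rows = [WEIGHTS[ch] for ch in text]
    level1 = [RANK[row[0]] for row in rows]
    level2 = [RANK[row[1]] for row in reversed(rows)]  # scanned backward
    level3 = [RANK[row[2]] for row in rows]
    return (level1, level2, level3)


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['cote', 'côte', 'coté', 'côté']
```

Because level 2 is scanned backward, the accent nearest the end of the word is the most significant, which places "côte" before "coté". Case is only consulted at level 3, so "cote" precedes "Cote".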

Annex 2 (normative) Benchmark

1 List with the required result for the default ordering

2 List with the required result after an example of tailoring

Informative annexes

Note: In this draft, annexes identified with a digit are intended to be normative. Annexes identified with a letter are intended to be informative.

Annex A (informative) - Criteria used initially to prepare the standard

Note: these criteria have been subject to change. They represented an optimum. Compromises had to be done according to diverse circumstances later on.

1. The mechanism must provide a deterministic way to collate graphic character strings. Thus, if two strings of graphic characters are different when directly compared in binary, the order assigned by the mechanism should always be the same and the strings will be considered different even if they are externally considered equivalent by humans.

2. For each script, if this is possible, the order assigned will be culturally acceptable to a majority of users of that script.

3. The repertoire of characters supported should be at least the one defined by Level three implementation (the richest possible) of ISO/IEC 10646.

4. The ordering table will be defined keeping in mind the following points concerning internal string transformation number assignments:

- the assignments are processed as efficiently as possible if they are stored in a permanent way, and

- the assignments allow direct and correct one-pass binary comparisons between two resultant number sequences.

The table is defined this way because it is always possible to define an order between two strings by whatever complex method is used. However, real systems must have a minimum level of performance. Once assignment is made on original strings, the result must be storable without modification. Also, the result must be directly reusable for comparison purposes, without having to redo the conversion process each time. This will also enable existing systems to make comparisons with minimum changes and sometimes without having to change programs.

5. There must be a mechanism to use the table as a template, primarily to optimise the process for the user's language. In the template, the order of a series of characters may be modified by simple a posteriori declaration, without having to specify the whole table again.

6. Given the reusable comparison keys obtained (see 4), it must be possible to reconstitute the original as is without the need to preserve it. This means that the reversibility of the process must be available to applications if required.
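
Criterion 4 can be illustrated by a minimal sketch, given here for information only. It assumes a hypothetical two-level toy table (TOY_WEIGHTS) and a hypothetical helper (make_key); neither is part of the default table of this standard. Keys are built once, stored, and then compared as plain byte strings, with a low-value delimiter between the two subkeys.

```python
# Toy illustration of criterion 4: keys are computed once, stored, and
# later compared as plain bytes. TOY_WEIGHTS is a hypothetical table
# (letter weight, case weight), not the default table of this standard.
TOY_WEIGHTS = {
    "a": (1, 1), "A": (1, 2),
    "b": (2, 1), "B": (2, 2),
    "c": (3, 1), "C": (3, 2),
}


def make_key(text: str) -> bytes:
    """Concatenate first-level and second-level weights of a string,
    separated by a low-value delimiter (0) that is below every weight."""
    level1 = bytes(TOY_WEIGHTS[ch][0] for ch in text)
    level2 = bytes(TOY_WEIGHTS[ch][1] for ch in text)
    return level1 + b"\x00" + level2


words = ["Cab", "ab", "cab", "Ab", "b"]
stored = {w: make_key(w) for w in words}          # done once, then kept
ordered = sorted(words, key=stored.__getitem__)   # one-pass bytes comparison
```

With this toy table, the stored keys order the list as "ab", "Ab", "b", "cab", "Cab" without the original strings being transformed again.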

Note that this list of requirements can already be satisfied by Canadian Standard CAN/CSA Z243.4.1 for West European languages, except that this standard is monoscript and does not support composite sequences as defined in ISO/IEC 10646. However, preliminary studies suggest that it is possible to extend the Canadian method to take into account both the multi-script requirement and the presence of composite sequences.

ISO/IEC 9945-2 (POSIX-2) allows the Canadian standard CAN/CSA Z243.4.1-1992 to be described. However, it could require modifications of the model to handle both the multi-script requirements and the need for composite sequences if an infinite repertoire is necessary for a given environment.

The application of this standard will not require full POSIX-2 conformance, but will be as compatible as possible with the POSIX LOCALE LC_COLLATE specification model. Otherwise, this standard will build on this specification model in attempting to make as few modifications as possible (particularly structural modifications).

Annex B (informative)

Description of the prehandling phase

Prehandling is essentially for modification and/or duplication of original records to render their fields context-independent prior to the comparison phase. Examples are:

- duplicating a string such as "41" for phonetic ordering into 3 strings for trilingual phonetic ordering usage (French, English and German):

QUARANTE-ET-UN

FORTY-ONE

EINUNDVIERZIG

- removing or rotating characters that are a nuisance for special requirements of ordering; for example, in France, removing "de" in "de Gaulle" but not "De" in "De Gaulle", depending on whether or not the particle is of noble origin, to give:

Gaulle (de)

De Gaulle

- transform incomplete data into full form; for example, transform "Mc Arthur" to give "Mac Arthur"

- transform numbers so that the result will be ordered in numerical order and not positionally or according to phonetics, for example:

Given the strings "100" and "15",

- either separate each of these numbers in different fields from the rest of text and convert them entirely in standard numeric (binary) data to be ordered numerically and not textually, or

- pad/align numbers to make sure the one-phase default ordering mechanism will process them correctly:

"015"

"100"

- transform Roman numerals into Arabic numbers after having determined the context (perhaps with the help of human interactive intervention or an expert system), as in the following French example:

CHAPITRE DIX might mean CHAPTER 010 or CHAPTER 509 ("dix" is the French word for 10, and DIX is also the Roman numeral for 509). Resolving this with total certainty generally requires context.
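
The pad/align case above can be sketched as follows, for information only. The function name pad_numbers and its width parameter are hypothetical; a real prehandling phase would choose the width from the data, and this sketch handles ASCII digits only.

```python
import re


def pad_numbers(text: str, width: int = 3) -> str:
    """Left-pad every run of ASCII digits with zeros to a fixed width,
    so that a purely textual comparison yields numerical order."""
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), text)


records = ["100", "15"]
prehandled = sorted(pad_numbers(r) for r in records)
```

Here the prehandled list is "015", "100", as in the example above. The same function applied to "CHAPTER 10" gives "CHAPTER 010".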

Description of the Posthandling Phase

Posthandling is essentially for modifying resulting keys, or appending the original string to keys, so that the results of comparisons can determine differences in the case of homography, particularly when a prehandling phase has been applied. For example, there could be equivalencies if numerical values (for example, "010" and "10") have been standardized in the prehandling phase. The default ordering mechanism has no knowledge that the original strings are different in such cases, but the predictability requirement still exists.

In particular, where different coding methods have been used in the original strings to be ordered in the same process, the posthandling phase can determine internal differences which would appear exactly the same on paper for end-users (for example, an ISO 2022 input stream intermixing ISO/IEC 6937 and ISO/IEC 8859).

The Default-Tailorable Ordering Mechanism does not cover the prehandling and posthandling phases. However, the mechanism does describe these phases. The presence of the phases is mandatory even if empty processes must be defined. These empty processes can be replaced if the need occurs.
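
As an informal sketch of a posthandling phase (an assumption for illustration, not a prescribed process), the original string can be appended to the key produced from the prehandled string. The names prehandle and key_with_posthandling are hypothetical.

```python
def prehandle(text: str, width: int = 3) -> str:
    # Hypothetical prehandling step: pad a purely numeric field with zeros.
    return text.zfill(width) if text.isdigit() else text


def key_with_posthandling(original: str) -> tuple:
    # Posthandling sketch: append the original string to the key so that
    # strings made identical by prehandling still get a predictable order.
    return (prehandle(original), original)


items = ["10", "2", "010"]
ordered = sorted(items, key=key_with_posthandling)
```

"010" and "10" receive the same prehandled value, and the appended original string resolves the tie, so the result here is "2", "010", "10" whatever the input order.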

Annex C Sources for methods and data gathering

CAN/CSA Z243.4.1 Canadian ordering standard

CAN/CSA Z243.230 Canadian minimum software localization parameters

IBM NLTC Volume 2 reference manual

IBM Egypt and Egypt Standards

Stefan Fuchs and Israel Standards

CEN TC304 Multilingual sorting standard project

LOCALES provisionally registered in X/Open or in SC22/WG15 (DKUUG.DK Internet site)

Règles du classement alphabétique en langue française et procédure informatisée pour le tri, Alain LaBonté, Ministère des Communications du Québec, 1988 -- ISBN 2-550-19046-7

Technique de réduction - Tris informatiques à quatre clés, Alain LaBonté, Ministère des Communications du Québec, 1989 -- ISBN 2-550-19965-0

Fonctions de systèmes - Soutien des langues nationales, Alain LaBonté, Ministère des Communications du Québec, 1988

National Language Architecture, Klaus Daube, SHARE EUROPE White Paper, 1990

Annex D (informative) Preliminary principles of table assignments

The principles of numeric table assignments are the following:

a) All characters are assigned a value corresponding to the identification of the script. Each script header is given a name mainly for the purposes of tailoring. However, conceptually, a number corresponding to the identification of the script can be assigned to this name, which then serves as a variable. This script identification data is informative only and does not serve in the comparison process. However, the identification data may be necessary for determining the scanning direction of diacritics for that script. This data must sometimes be retained alongside the ordering strings to meet the reversibility requirements above (capacity to reconstitute the original strings given the different subkeys that are a result of the multilevel transformation).

b) Each letter is assigned a basic normalised letter value (or a pair or a triad for ligatures). The assignment is made as first level (ideographic characters are assigned their standardised CJK order, corresponding to the order they have in ISO/IEC 10646). The assignment is in the order of the alphabet to which they belong - for example, LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT is assigned a numerical value corresponding to the same value attributed to LATIN SMALL LETTER E. Such a definition is valid for most Latin-script-based languages. Vietnamese would require a different definition, E CIRCUMFLEX being a base letter in this language.

c) Each letter is assigned an n-plet of values (or 2 n-plets or 3 n-plets for ligatures) as 2nd level, which corresponds to the maximum realistic number of combining characters encountered in all world scripts for a given basic letter to which it applies. When there is only one diacritic, the second and third elements of the triplet are place holders. When there is no diacritic, three place holders are provided in each triplet, and so on. For each diacritic of a triplet, a flag is put in the next-to-last level to indicate an integrated diacritic (as opposed to a combining character). Note that for level 1 conformance to ISO 10646 (or if composite sequences are all predefined by "collating-symbol" statements), the n-plet of values for each character can be made equal to a single token because no analysis of combining diacritics will ever be necessary (and the next-to-last level, reserved for future use, will be empty).

Ideographs are assigned no value for this level according to ISO/IEC 10646 level 1 of conformance. This is because ideographs will be compared against completely different values simultaneously at the first level, and thus there will be no collision in comparison operations at this level. (Ideographs are not assigned equivalencies at the first level). Levels 2 or 3 of conformance could be processed with the same model as the one for letters, for theoretical combinations.

d) Each letter is assigned a value (or a pair or a triad for ligatures) as 3rd level, corresponding to the form of the letter (for example, upper or lower case for Latin, or free-standing, initial, medial, or ending form for Arabic). Ideographs are assigned no value for this level.

e) This paragraph was removed from the previous version.

f) Each special character (a character not specifically belonging to a specific script, such as COPYRIGHT SIGN, or COMMA) is assigned a value as 4th level value. This is a world-wide common numerical value that is preceded with the position it occupies in the original string to be processed. Currently, no other level value is assigned in the default table.

g) This paragraph was removed from the previous draft.

Given such table assignments, a table of scanning directions will be provided for each script and for each of the levels. Note that scanning direction is not linked to the natural script direction, since the characters are already linearly coded according to their script direction (logical direction). This is linked to the direction in which each level is processed for ordering. For example, in French, diacritical marks are scanned backward in case of first level homography: accents are not considered for ordering in French except for specifying the order of quasi-homographs. In this case, the last difference in the words determines the order, thus explaining the retrograde scanning (an example of an ordered list is: "cote", "côte", "coté", "côté"). When string direction is retrograde for a character in a given level, the value assigned to this level is placed in front of the resulting key instead of at the end for this level.
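
The retrograde scanning of accents described above can be sketched as follows, for information only. The sketch assumes a two-level toy key in which the code point of the combining mark is used as a stand-in accent weight; the function name french_key is hypothetical and the weights are not those of the default table.

```python
import unicodedata


def french_key(word: str) -> tuple:
    """Two-level toy key: base letters scanned forward, accents backward."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and base:
            # Toy accent weight: code point of the combining mark.
            accents[-1] = ord(ch)
        else:
            base.append(ch)
            accents.append(0)      # 0 stands for "no accent"
    # Second level is reversed so that the LAST accent difference decides.
    return ("".join(base), tuple(reversed(accents)))


words = ["côté", "coté", "cote", "côte"]
ordered = sorted(words, key=french_key)
```

With these toy weights the result is "cote", "côte", "coté", "côté", the list given above. Scanning the same accent weights forward would instead place "coté" before "côte".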

Given that each subkey is established at all levels, and provided that a low-value delimiter is placed between each subkey, all subkeys can be concatenated at once and used for subsequent comparisons. (If values are carefully chosen for table-building, no low-value delimiter is necessary). Given that all the information is present, the original string provided can be reconstituted from the subkeys.

Reduction techniques exist to minimise the storage requirements of this method without affecting the comparison process, if keys are to be preserved for maximum performance (see the sources listed in Annex C).

Annex E (informative) - Principles of the comparison engine

The basic philosophy behind the culturally-correct character string comparison engine is the following:

1. No comparison mechanism is culturally correct when it assumes that the order is based on numerical internal values of raw character strings, and with any standard character set coding scheme.

2. If two strings are different, there must be a fully predictable order assigned to each one relative to the other.

3. Ordering rules are language-related in a given script.

4. Whatever the language, the ordering rules are based on lexical order at the lowest level. Higher level classification (done in a prehandling phase) produces character strings whose ordering is to be made as for any other lexical entry.

5. Each rule tentatively determines an order between two different character strings by operating a single binary comparison on binary strings that represent the result of a straightforward and context-independent transformation of the characters of each string. (Transformations typically involve ignoring, or giving a specific or generic weight to each character, or retaining the position of a character as a weight while assigning it a second weight depending on the character itself. Such transformations may be done by scanning the string forward or backward in the logical string sequence, except for the positional case which only implies the logical positions of a string).

6. Transformations can typically produce equivalencies for two different character strings transformed into two identical binary strings. Thus, when such cases are encountered, other sequential series of transformation are necessary until, at a final level, all ties are solved (at the last level, binary strings are necessarily different if two original character strings to be compared are different). If the only goal of a comparison is to determine equivalence up to a certain level of precision, then character transformation is required up to a certain level only.

7. The default table will define as many levels as necessary to produce a fully predictable order for two different character strings. This involves up to five comparison levels if characters of ISO/IEC 10646 level 1 are used, and up to six comparison levels if characters of ISO/IEC 10646 level 3 are used. An extra level (used for data management and not of particular significance for comparisons) is also defined (see 9 below).

8. A whole character string is transformed as many times as necessary into up to six different levels. Thus, it must be possible to deduce the original character string from all the different binary transformations concatenated into one binary string (reversibility property of the transformation process).

9. Different scripts may have different properties as to the way each level is processed. Thus, to ensure the operation will be reversed, an extra level transformation table is necessary to identify the script to which each character belongs.
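
Principles 5 and 6 can be sketched as follows, for information only. The sketch assumes a hypothetical three-level toy table (letter, accent, case) and forward scanning at every level; TOY_TABLE, level_key and compare are illustrative names, not part of this standard.

```python
# Hypothetical three-level weights per character: (letter, accent, case).
E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
TOY_TABLE = {
    "a": (1, 0, 0), "A": (1, 0, 1),
    "e": (2, 0, 0), "E": (2, 0, 1),
    E_ACUTE: (2, 1, 0),
    "t": (3, 0, 0), "T": (3, 0, 1),
}


def level_key(text, level):
    """Weights of one level for a whole string (forward scan only)."""
    return [TOY_TABLE[ch][level] for ch in text]


def compare(a, b, max_level=3):
    """Return (sign, deciding_level); (0, None) means the strings are
    equivalent up to max_level."""
    for level in range(max_level):
        key_a, key_b = level_key(a, level), level_key(b, level)
        if key_a != key_b:
            return (-1 if key_a < key_b else 1), level + 1
    return 0, None
```

For example, compare("Eta", "eta") is decided only at the third level, while compare("Eta", "eta", max_level=2) reports equivalence, which corresponds to a comparison limited to a certain level of precision.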

Annex F. Revised (if necessary) SC22/WG20 N 174 From a requirement to its implementation Compare, Sort, Search

Removed from the previous version

Annex G. Discussion on the number of levels for each script and their harmonization

Text will be added if necessary

Annex H. Example of national classification standards and how they can be harmonized to the international standard

AFNOR Z.44-001

ANSI/NISO Z39.75-199X (project at time of editing WD3)

DIN 5007

Annex I. Standard LOCALE parameters definitions - unextended

Text obtained from: ISO/IEC 9945-2

Locale:

A locale is the definition of the subset of a user's environment

that depends on language and cultural conventions. It is made up

from one or more categories. Each category is identified by its name

and controls specific aspects of the behavior of components of the

system. Category names correspond to the following environment

variable names:

LC_CTYPE Character classification and case conversion.

LC_COLLATE Collation order.

LC_TIME Date and time formats.

LC_NUMERIC Numeric, nonmonetary formatting.

LC_MONETARY Monetary formatting.

LC_MESSAGES Formats of informative and diagnostic messages and

interactive responses.

Category Specifications:

LC_CTYPE

The LC_CTYPE category shall define character classification, case

conversion, and other character attributes. In addition, a series

of characters can be represented by three adjacent periods representing

an ellipsis symbol ("..."). The ellipsis specification shall be

interpreted as meaning that all values between the values preceding

and following it represent valid characters. The following keywords

shall be recognized:

copy Specify the name of an existing locale to be used

for the definition of the category. If this keyword is

specified, no other keyword shall be specified.

upper Define characters to be classified as uppercase letters.

No character specified for the keywords cntrl, digit, punct,

or space shall be specified.

lower Define characters to be classified as lowercase letters.

No character specified for the keywords cntrl, digit, punct,

or space shall be specified.

alpha Define characters to be classified as letters.

No character specified for the keywords cntrl, digit, punct,

or space shall be specified. Characters classified as

either upper or lower are automatically included in this

class.

digit Define characters to be classified as numeric digits. Only

the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 shall be

specified, and in contiguous ascending sequence by

numerical value.

space Define characters to be classified as whitespace

characters. No character specified for the keywords upper,

lower, alpha, digit, graph, or xdigit shall be specified.

The characters <space>, <form-feed>, <newline>,

<carriage-return>, <tab>, and <vertical-tab>, and any

characters included in the class blank, are automatically

included in this class.

cntrl Define characters to be classified as control characters.

No character specified for the keywords upper, lower, alpha,

digit, punct, graph, print, or xdigit shall be specified.

punct Define characters to be classified as punctuation

characters.

No character specified for the keywords upper, lower, alpha,

digit, cntrl, xdigit, or as the <space> character

shall be specified.

graph Define characters to be classified as printable characters,

not including the <space> character. Characters

specified for the keywords upper, lower, alpha, digit,

xdigit, and punct are automatically included in this class.

No character specified for the keyword cntrl shall be

specified.

print Define characters to be classified as printable characters,

including the <space> character. Characters specified

for the keywords upper, lower, alpha, digit, xdigit, punct,

and the <space> character are automatically included in this

class. No character specified for the keyword cntrl shall

be specified.

xdigit Define the characters to be classified as hexadecimal

digits. Only the characters defined for the class digit

shall be specified, in contiguous ascending sequence by

numerical value, followed by one or more sets of six

characters representing the hexadecimal digits 10 through

15, with each set in ascending order.

blank Define characters to be classified as <blank> characters.

The characters <space> and <tab> are automatically included

in this class.

toupper Define the mapping of lowercase letters to uppercase letters.

The operand shall consist of character pairs, separated by

semicolons. The characters in each character pair shall be

separated by a comma and the pair enclosed by parentheses.

The first character in each pair shall be the lowercase

letter, the second the corresponding uppercase letter. Only

characters specified for the keywords lower and upper shall

be specified.

tolower Define the mapping of uppercase letters to lowercase letters.

The operand shall consist of character pairs, separated by

semicolons. The characters in each character pair shall be

separated by a comma and the pair enclosed by parentheses.

The first character in each pair shall be the uppercase

letter, the second the corresponding lowercase letter. Only

characters specified for the keywords lower and upper shall

be specified. If the tolower keyword is omitted from the

locale definition, the mapping shall be the reverse

mapping of the one specified for toupper.
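
The default for an omitted tolower keyword can be sketched as follows, for information only. The parser below is a deliberate simplification and an assumption of this sketch: it accepts only symbolic names such as <a>, and does not handle escaped characters or operands that themselves contain a comma or a semicolon. The name parse_pairs is hypothetical.

```python
def parse_pairs(operand: str) -> dict:
    """Parse an operand such as "(<a>,<A>);(<b>,<B>)" into a mapping
    from the first member of each pair to the second."""
    mapping = {}
    for pair in operand.split(";"):
        first, second = pair.strip().strip("()").split(",")
        mapping[first.strip()] = second.strip()
    return mapping


toupper = parse_pairs("(<a>,<A>);(<b>,<B>);(<c>,<C>)")
# tolower omitted from the locale definition: reverse the toupper mapping.
tolower = {upper: lower for lower, upper in toupper.items()}
```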

LC_COLLATE

A collation sequence definition shall define the relative order

between collating elements (characters and multicharacter collating

elements) in the locale. This order is expressed in terms of collation

values; i.e., by assigning each element one or more collation values

(also known as collation weights). This does not imply that

implementations shall assign such values, but that ordering of

strings using the resultant collation definition in the locale shall

behave as if such assignment is done and used in the collation process.

The collation sequence definition shall be used by regular expressions,

pattern matching, and sorting. The following capabilities are

provided:

(1) Multicharacter collating elements. Specification of multicharacter

collating elements (i.e., sequences of two or more characters to

be collated as an entity).

(2) User-defined ordering of collating elements. Each collating element shall be assigned a collation value defining its order in the character (or basic) collating sequence. This ordering is used by regular expressions and pattern matching and, unless collation weights are explicitly specified, also as the collation weight to be used in sorting.

(3) Multiple weights and equivalence classes. Collating elements can be assigned one or more (up to the limit {COLL_WEIGHTS_MAX}) collating weights for use in sorting. The first weight is hereafter referred to as the primary weight.

(4) One-to-many mapping. A single character is mapped into a string of collating elements.

(5) Equivalence class definition. Two or more collating elements have the same collation value (primary weight).

(6) Order by weights. When two strings are compared to determine their relative order, the two strings are first broken up into a series of collating elements, and each successive pair of elements are compared according to the relative primary weights for the elements. If equal, and more than one weight has been assigned, then the pairs of collating elements are recompared according to the relative subsequent weights, until either a pair of collating elements compare unequal or the weights are exhausted.
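
Capabilities (1) and (6) can be sketched as follows, for information only. The sketch assumes a hypothetical primary-weight table in which "ch" is a multicharacter collating element placed after "c"; the longest-match splitting rule is an assumption of the sketch, and the names TOY_WEIGHTS, collating_elements and primary_key are illustrative.

```python
# Hypothetical primary weights; "ch" is one collating element after "c".
TOY_WEIGHTS = {"a": 1, "b": 2, "c": 3, "ch": 4, "d": 5, "h": 6, "o": 7}


def collating_elements(text):
    """Split text into collating elements, preferring a two-character
    element when one is defined at the current position."""
    elements, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TOY_WEIGHTS:
            elements.append(pair)
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements


def primary_key(text):
    """Primary weights of the successive collating elements."""
    return [TOY_WEIGHTS[e] for e in collating_elements(text)]


ordered = sorted(["cha", "cab", "co", "da"], key=primary_key)
```

With this toy table, "cha" is split into "ch" and "a" and therefore sorts after "co": the result is "cab", "co", "cha", "da".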

The following keywords shall be recognized in a collation sequence

definition. They are described in detail in the following

subclauses.

copy Specify the name of an existing locale to be used

for the definition of the category. If this keyword

is specified, no other keyword shall be specified.

collating-element Define a collating-element symbol representing

a multicharacter collating element. This keyword

is optional.

collating-symbol Define a collating symbol for use in collation order

statements. This keyword is optional.

order_start Define collation rules. This statement is followed

by one or more collation order statements,

assigning character collation values and

collation weights to collating elements.

order_end Specify the end of the collation order statements.

collating-element Keyword

In addition to the collating elements in the character set, the

collating-element keyword shall be used to define multicharacter

collating elements.

collating-symbol Keyword

This keyword shall be used to define symbols for use in collation

sequence statements; i.e., between the order_start and the

order_end keywords.

The collating-symbol keyword defines a symbolic name that can be

associated with a relative position in the character order sequence.

While such a symbolic name does not represent any collating element, it

can be used as a weight.

order_start Keyword

The order_start keyword shall precede collation order entries. It

defines the number of weights for this collation sequence definition

and other collation rules.

The operands to the order_start keyword are optional. If present, the

operands define rules to be applied when strings are compared. The

number of operands define how many weights each element is assigned;

if no operands are present, one forward operand is assumed. If present,

the first operand defines rules to be applied when comparing strings

using the first (primary) weight; the second when comparing strings

using the second weight, and so on. Operands shall be separated by

semicolons (;). Each operand shall consist of one or more collation

directives, separated by commas (,). If the number of operands

exceeds the {COLL_WEIGHTS_MAX} limit, the utility shall issue a

warning message. The following directives shall be supported:

forward Specifies that comparison operations for the weight level

shall proceed from start of string towards the end of the

string.

backward Specifies that comparison operations for the weight level

shall proceed from end of string towards the beginning of

string.

position Specifies that comparison operations for the weight

level will consider the relative position of non-IGNOREd

elements in the strings. The string containing a

non-IGNOREd element after the fewest IGNOREd collating

elements from the start of the compare shall collate first.

If both strings contain a non-IGNOREd character in the same

relative position, the collating values assigned to the

elements shall determine the ordering. In case of

equality, subsequent non-IGNOREd characters shall be

considered in the same manner.

The directives forward and backward are mutually exclusive.
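
The operand syntax of order_start described above can be sketched as follows, for information only. The function name parse_order_start is hypothetical; the sketch only checks the directive names and the mutual exclusion of forward and backward, and does not model the {COLL_WEIGHTS_MAX} warning.

```python
def parse_order_start(operands: str = "") -> list:
    """Return one set of directives per weight level."""
    if not operands.strip():
        return [{"forward"}]          # default when no operand is given
    levels = []
    for operand in operands.split(";"):
        directives = {d.strip() for d in operand.split(",")}
        unknown = directives - {"forward", "backward", "position"}
        if unknown:
            raise ValueError(f"unknown directive(s): {sorted(unknown)}")
        if {"forward", "backward"} <= directives:
            raise ValueError("forward and backward are mutually exclusive")
        levels.append(directives)
    return levels


levels = parse_order_start("forward;forward;forward;forward,position")
```

The operand string used here is the one that appears on the order_start lines of the default table; it yields four weight levels, the last of which combines forward and position.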

Other sections' descriptions are irrelevant for this standard. Titles of other sections are given here as an indication.

LC_TIME

LC_NUMERIC

LC_MONETARY

LC_MESSAGES

Hebrew characters not yet published in ISO/IEC 10646-1

<U0591> IGNORE;IGNORE;IGNORE;<U0591> #accent_etnahta

<U0592> IGNORE;IGNORE;IGNORE;<U0592> #accent_segol

<U0593> IGNORE;IGNORE;IGNORE;<U0593> #accent_shalshelet

<U0594> IGNORE;IGNORE;IGNORE;<U0594> #accent_zaqef_qatan

<U0595> IGNORE;IGNORE;IGNORE;<U0595> #accent_zaqef_gadol

<U0596> IGNORE;IGNORE;IGNORE;<U0596> #accent_tipeha

<U0597> IGNORE;IGNORE;IGNORE;<U0597> #accent_revia

<U0598> IGNORE;IGNORE;IGNORE;<U0598> #accent_zarqa

<U0599> IGNORE;IGNORE;IGNORE;<U0599> #accent_pashta

<U059A> IGNORE;IGNORE;IGNORE;<U059A> #accent_yetiv

<U059B> IGNORE;IGNORE;IGNORE;<U059B> #accent_tevir

<U059C> IGNORE;IGNORE;IGNORE;<U059C> #accent_geresh

<U059D> IGNORE;IGNORE;IGNORE;<U059D> #accent_geresh_muqdam

<U059E> IGNORE;IGNORE;IGNORE;<U059E> #accent_gershayim

<U059F> IGNORE;IGNORE;IGNORE;<U059F> #accent_qarney_para

<U05A0> IGNORE;IGNORE;IGNORE;<U05A0> #accent_telisha_gedola

<U05A1> IGNORE;IGNORE;IGNORE;<U05A1> #accent_pazer

<U05A3> IGNORE;IGNORE;IGNORE;<U05A3> #accent_munah

<U05A4> IGNORE;IGNORE;IGNORE;<U05A4> #accent_mahapakh

<U05A5> IGNORE;IGNORE;IGNORE;<U05A5> #accent_merkha

<U05A6> IGNORE;IGNORE;IGNORE;<U05A6> #accent_merkha_kefula

<U05A7> IGNORE;IGNORE;IGNORE;<U05A7> #accent_darga

<U05A8> IGNORE;IGNORE;IGNORE;<U05A8> #accent_qadma

<U05A9> IGNORE;IGNORE;IGNORE;<U05A9> #accent_telisha_qetana

<U05AA> IGNORE;IGNORE;IGNORE;<U05AA> #accent_yerah_ben_yomo

<U05AB> IGNORE;IGNORE;IGNORE;<U05AB> #accent_ole

<U05AC> IGNORE;IGNORE;IGNORE;<U05AC> #accent_iluy

<U05AD> IGNORE;IGNORE;IGNORE;<U05AD> #accent_dehi

<U05AE> IGNORE;IGNORE;IGNORE;<U05AE> #accent_zinor

<U05AF> IGNORE;IGNORE;IGNORE;<U05AF> #mark_masora_circle

<U05C4> IGNORE;IGNORE;IGNORE;<U05C4> #mark_upper_dot