SC22/WG20 N1014

 

From kuester@saphor.de Mon Feb 10 18:18:07 2003

From: Marc Wilhelm Kuester <kuester@saphor.de>

Subject: Sorting text

 

--------------------------------------------------------------

 

A few ideas on sorting

 

 Marc Wilhelm Küster

 

 A few loose terms:

 

  sorting: a well-defined arrangement of two sortables using the

  algorithm that is specified later on

 

  sortable: a sequence of one or more orderables. The type of a

  sortable is defined by the sequence of types of the orderables

 

  orderable: a unit which has a well-defined ordering with regards to

  other units of the same type

 

  ordering: the arrangement of two units of the same type into a

  well-defined sequence

 

  Note: A special case of ordering is string ordering

 

  Note: Two units can be considered to be of the same type if their

  respective values have a comparison function that puts them into a

  well-defined greater-than relationship.

 

  Note: In reality, sortables will be read from some datasource. This

  datasource can be a structured file (e.g. an XML-based file format),

  a relational database, an object-oriented database or something else

  that can deliver structured data.

 

 Sorting vs. ordering

 

  In usual English parlance sorting and ordering are roughly

  equivalent terms. For the purpose of this text they have different

  meanings, however: Ordering is the arrangement of basic units such

  as strings or numbers into a well-defined order whereas sorting is

  the arrangement of sequences of orderable units into a well-defined

  order.

 

  An example for this distinction is the ordering of simple strings

  such as abacus and abc in contrast to the sorting of phone book

  entries where each entry consists of (e.g.) a family name, one or

  more first names, a street name, a house number and finally a

  telephone number.

 

  The rules for sorting are culturally sensitive as well as

  potentially subject to personal preferences. It is likewise

  dependent on the type of sortable at hand and the field of

  application.

 

  A document on this issue must define a sample syntax for a

  unambiguous definition of the sorting rules within this framework.

 

 The multikey algorithm

 

  Any sorting problem can be solved by a the following algorithm,

  usually known as the multikey algorithm. Two sortables can be

  compared by:

 

   1. Taking the first orderable from both sortables

 

   1a. If using "word-by-word ordering" for strings, split the

   orderable along the desired split criterion (usually whitespace)

   and follow this algorithm for each of the resulting parts

 

   2. Apply any required preprocessing to each of the two orderables

 

   3. Comparing the two resulting units. If they have a unique

      ordering, than this is the sequence of the two sortables

 

  Repeat this algorithm for all orderables until a unique sequence

  could be found. If no sequence can be found, the sortables are

  considered equivalent.

 

 Preprocessing

 

  Preprocessing of orderables is for some types of orderables a

  necessity in most fields of application.

 

  For strings, preprocessing can be seen conceptionally as the

  application of UNIX-style regular expressions to a string (though in

  reality, it would rarely be defined in this manner).

 

  Preprocessing is culturally sensitive. Rules for preprocessing are

  traditionally defined in national sorting standards such as NF Z

  44-001 for France, DIN 5007-2 for Germany and (to a lesser extend)

  ANSI/NISO Z39.75 for the US.

 

  A sample syntax must allow for the definition of (potentially

  complex) preprocessing rules.

 

 Ordering

 

   Each of the orderables needs a well-defined comparison method. For

   some orderables such as natural numbers the ordering will rarely be

   disputed and could hardly be considered to be culturally

   sensitive. For other units such as strings the comparison method is

   highly dependend on the culture in question (cf. ISO/IEC

   14651). For yet other types such as complex numbers there is no

   single accepted comparison method. Yet other types may have

   monetary or other unit signs attached to them which will directly

   influence the comparison (e.g. 1 cm compared to 1 inch or USD 1

   compared to EUR 1).

 

   A sample syntax must allow for the definition of ordering rules for

   different types of orderables and potentially for different rules

   for orderables of the same type when they appear in different

   positions of the same sortable.

 

 Syntax

 

  The TR must provide a sample syntax for the customized specification

  of an internationalized sorting algorithm much in the same vein that

  ISO/IEC 14651 provides a sample syntax for defining culturally

  adapted ordering specifications.

 

  This sample syntax will be XML based (and thus also be ISO 8879

  conformant). It should be able to drive an application.

 

 Preparation of a quality document

 

  This paper sketches the basics for an internationalized sorting

  algorithm. The details for the configuration and implementation of

  such an algorithm are still open. In order to get a quality

  document, an open-source reference sample application should be

  developed in sync with the progress of this document.

 

 Sample application

 

  The reference application should use object-oriented techniques

  consistently to minimize the porting effort between OO-languages. C#

  (ISO/IEC 23270) is suggested as a suitable language for the

  reference implementation.

 

------------------------------------------------------------------------------

*************************

Marc Wilhelm Küster

Saphor GmbH

 

Fronländer 22

D-72072 Tübingen

 

Tel.: (+49) / (0)7472 / 949 100

Fax: (+49) / (0)7472 / 949 114

 

E-Mail: kuester@saphor.net

Web: http://www.saphor.net