From ALB@immedia.ca Sun Jun 18 14:46:00 1994
Received: from Clouso.CRIM.CA by dkuug.dk with SMTP id AA08415
  (5.65c8/IDA-1.4.4j for ); Fri, 17 Jun 1994 21:41:22 +0200
Received: from immedia.ca by clouso.crim.ca (4.1/SMI-4.1)
  id AA00935; Fri, 17 Jun 94 15:41:07 EDT
Return-Path:
Received: by immedia.ca (3.2/2.D) id AA30009; 17 Jun 94 19:47:15 +1900
Date: 17 Jun 94 19:46:00 +1900
From: ALB@immedia.ca
Message-Id: <199406171947.AA30009@immedia.ca>
To: bealle@torolab6.vnet.ibm.com, cpwg-mail@revcan.ca, paref@vm1.ulaval.ca,
  umavs@torolab6.vnet.ibm.com
Cc: i18n@dkuug.dk, sc22wg20@dkuug.dk, tc304@dkuug.dk
Subject: Re: Full-text searching: don't keep it simple and stupid!
X-Charset: ASCII
X-Char-Esc: 29

----------

I could not agree more with the contribution from Olle Jarnefors (of
Sweden) appended below. In fact even "inflections" are included in the
SHARE Europe requirement, in what we call "fuzzy matching". However, the
latter is more difficult to implement without expert system technology,
and it is not possible with POSIX so far (perhaps we could say it is
included in the general model of International ordering, as there is a
mandatory preprocessing phase, even if its content is not specified).

For English speakers who may not see all the implications of inflections,
let me cite French: if you search for "oeil" (the singular form of the
word meaning "eye"), you want to retrieve "yeux" too (the plural form of
the same word), and if you search for "beau" (the masculine singular form
of the word meaning "pretty"), you also want to retrieve "beaux"
(masculine plural), "belle" (feminine singular) and "belles" (feminine
plural). That is what Olle is talking about. In English this would mean
also retrieving "women" if you search for "woman", which is simpler but
is also an inflection.
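To make the inflection examples above concrete, here is a minimal sketch
in Python of a search that retrieves every inflected form of the query
word. The table LEMMAS and the functions lemma() and find_inflected() are
hypothetical names invented for this illustration; the table holds only
the words cited above, whereas a real system would need a full
morphological dictionary for each language.

```python
# Hypothetical, hand-written table mapping inflected forms to a base form.
# It contains only the examples from the message; it is not a real lexicon.
LEMMAS = {
    "oeil": "oeil", "yeux": "oeil",
    "beau": "beau", "beaux": "beau", "belle": "beau", "belles": "beau",
    "woman": "woman", "women": "woman",
}


def lemma(word):
    """Return the base form of a word, or the lowercased word if unknown."""
    w = word.lower()
    return LEMMAS.get(w, w)


def find_inflected(query, text):
    """Return the words of `text` that share a base form with `query`."""
    target = lemma(query)
    # Crude tokenization: split on whitespace, strip common punctuation.
    words = [w.strip('.,;:!?"()') for w in text.split()]
    return [w for w in words if w and lemma(w) == target]


print(find_inflected("oeil", "Il a de beaux yeux et un oeil vif."))
# prints ['yeux', 'oeil']
```

A table lookup is the simplest possible preprocessing step; it says
nothing about how such a table would be built or tailored per language.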
Of course, in the International ordering model, the level of precision up
to which you accept equivalences (leaving inflections aside, which are
handled only in preprocessing) is functionally specified:

  1st level: base letters
  2nd level: diacritics [unless the diacritics are included in the base
             letters]
  3rd level: case
  4th level [or 5th; see the standard for a possible intermediate level
             for Arabic]: specials [such as hyphens]

This is also what I have called in the past "significance levels", by
analogy with floating-point numbers, which basically have 3 levels of
significance (sign, order of magnitude and mantissa), or levels of
decomposition for computing and comparison purposes. The decomposition of
letters is determined by a LOCALE, not by the engine itself. And the big
international default LOCALE is tailorable.

Alain LaBonté
Gouvernement du Québec
Secrétariat du Conseil du trésor

Original message:
==============================================================================
To:       RNET ( BEALLE@TOROLAB6.VNET.IBM.COM, CPWG-MAIL@REVCAN.CA,
          PAREF@VM1.ULAVAL.CA,), RNET ( UMAVS@TOROLAB6.VNET.IBM.COM), ALB
CC:       RNET ( I18N@DKUUG.DK, SC22WG20@DKUUG.DK, TC304@DKUUG.DK,),
          RNET ( OLLE JARNEFORS )
From:     RNET (ojarnef@admin.kth.se)
Subject:  Re: Full-text searching: don't keep it simple and stupid!
Date:     Fri 17 Jun 94
Time:     19:14 UT
Type:     Mail
Delivery: Regular
==============================================================================

Alain LaBonté raises the important question about how searching should be
adapted to the needs of the language of the searched text/data. The
following is what is said about this in the Nordic report on cultural
requirements [1]:

2.4.4 Searching
----------------

Because the Nordic languages are more complicated than English, as far as
inflection and formation of compound words are concerned, more
sophisticated search functions are desirable.

To be most useful, interactive searching for strings or words in a text
should be available in three modes:

1. Search for exactly given words
2. Search for all words consisting of a given string and possibly an
   inflectional ending
3. Search for the given string, irrespective of its surroundings.

Orthogonal to this, searching should be performed on three levels, defined
by the treatment of individual letters:

a) Exact search
b) Search disregarding the case of letters
c) Search also disregarding the distinction between letter variants which
   in sorting are treated as the same basic letter.

For effective searching in text, it is also important that hyphenated
words are dehyphenated, since hyphenation is more frequent in the Nordic
languages than e.g. English, due to the many long compound words.

What is said here about Danish, Faroese, Finnish, Icelandic, Norwegian,
and Swedish is also true for German, Dutch and other languages, but of
course such properties as

-- which basic letters are distinguished
-- which other letters are treated as variants of a basic letter
-- which capital letter or capital string is equivalent to a small letter
-- which characters are allowed in a word
-- which inflectional endings are possible

vary between languages.

[1] INSTA: Nordic Cultural Requirements on Information Technology.
    Technical Report STRI TS3, 1992. Distributed by: Icelandic Council
    for Standardization, Keldnaholti, IS-112 Reykjavik, Iceland.
    Phone: +354-1-877 000. Fax: +354-1-877 409

--
Olle Jarnefors, Royal Institute of Technology, Stockholm
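
The levels of precision discussed in both messages (base letters,
diacritics, case, specials) can be illustrated with a small Python sketch
that splits a word into one key per level and compares two words on the
first few levels only. The names SPECIALS, level_keys() and equivalent()
are invented for this example. As a loudly stated simplification, Unicode
canonical decomposition stands in for the LOCALE table that would really
decide how letters decompose, so letters such as "ø" or "æ", which have no
Unicode decomposition, are treated as base letters here.

```python
import unicodedata

# Assumption for this sketch: these characters count as "specials".
SPECIALS = "-'"


def level_keys(word):
    """Split a word into (base letters, diacritics, case, specials)."""
    base, marks, case, specials = [], [], [], []
    for pos, ch in enumerate(word):
        if ch in SPECIALS:
            specials.append((pos, ch))   # level 4: position and character
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        letter = decomposed[0]
        base.append(letter.lower())      # level 1: base letter
        marks.append(decomposed[1:])     # level 2: combining marks, if any
        case.append(letter.isupper())    # level 3: upper or lower case
    return ("".join(base), tuple(marks), tuple(case), tuple(specials))


def equivalent(a, b, precision):
    """True if a and b agree on the first `precision` levels (1 to 4)."""
    return level_keys(a)[:precision] == level_keys(b)[:precision]


print(equivalent("côté", "cote", 1))   # prints True
print(equivalent("côté", "cote", 2))   # prints False
print(equivalent("Côté", "côté", 2))   # prints True
print(equivalent("co-op", "coop", 3))  # prints True
```

Under these assumptions, precision 4 roughly corresponds to search level
a) of the Nordic report, precision 2 to level b), and precision 1 to level
c), except that precisions below 4 also disregard the specials.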