Report of meeting on the Matching and Browsing project

Author: John Clews

Marc Küster (Project Manager/Editor) chaired the meeting, with John Clews (assistant editor/reviewer) acting as secretary. Marc Küster described the project aims and provided a section-by-section overview of the initial "pre-draft", with additional contributions from John Clews. He then asked for comments, which led to a very productive discussion, reported below.

Hans van der Laan, one of the two assistant editors, had had to resign from paid participation in the project due to illness, but still planned to provide a lower level of review, as he had been involved in reviewing some related developments within the IRT group in the Netherlands. The IRT group had also been evaluating the length and depth of indexing of test web sites by various search engines.

Most of the following were present at the Matching meeting, as well as at other meetings of CEN/TC304:

Marc Küster (DE); John Clews (GB); Paul Dettmer (DE); Chris Makemson (GB); Keld Simonsen (NO); Johan van Wingen (NL); Yuri Demchenko (NL); V.S. Umamaheswaran (FI); Tadaeus Holka (FR); Bernard Chauvois (FR); Yves Neuville (FR); Grigas Gintautas (LT); Monica Stahl (SE); Wolf Arfvidson (SE); Thorgeir Sigurdsson (IS); Erkki Kolehmainen (FI); Elzbieta Broma-Wrzesie (PL); Mike Ksar (Liaison from ISO/IEC JTC1/SC2/WG2).

1. Aims

Browsing and matching in information retrieval systems help to develop the Information Society in Europe, particularly on the Internet. However, too often, especially with web search engines, Europeans have to adapt themselves to the conventions of British or American English, rather than systems being designed to cope with the wide-ranging requirements that arise from the many languages and other cultural conventions in use in Europe. This scoping report is an attempt to address these needs on a European basis.

A statement of European needs in this area could enable Europeans to approach relevant information providers and service providers with the information necessary to enable them to adapt better to the needs of the European marketplace.

There are also other information retrieval (IR) domains where alternative approaches are of use. For example, IR in libraries has always had to deal with multilingual issues, and the experience of librarians and library users across Europe in the design and use of OPACs (Online Public Access Catalogues) may also be of use, particularly as libraries have dealt with very large volumes of information, analogous to the situation with the WWW.

2. Overview of pre-draft

The Executive summary gives an overview and examples of things that information systems in Europe should be able to deal with, but which are currently not provided for.

The Scope states that "the objective is to investigate the European needs and problems with searching and browsing, in relation to character sets, transliteration, matching and ordering rules and other cultural specific elements. The needs for a European set of requirements ... will be investigated."

The following sections described pattern matching as a basic technique used in IR, and looked at relevant projects and descriptions cited in the pre-draft, particularly work by the Unicode Consortium (Normalisation report) and W3C (String matching and indexing). John Clews also suggested adding a useful ISO/IEC JTC1/SC22/WG20 document by Ken Whistler summarising the concepts involved.

An examination of search engines shows that they sometimes cannot cope with what are basic letters in some languages, as well as sometimes having limitations in dealing with accented characters, and non-Latin characters, such as Greek and Cyrillic, also used in various European languages. These limitations relate to input, indexing, and accessing information: relevant information can easily be lost.
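The kind of limitation described above can be illustrated with a minimal sketch, which is not part of the pre-draft. It assumes a naive approach in which text is decomposed with Unicode normalisation (form NFKD), combining accents are dropped, and case is folded before substring matching. The function names fold_for_matching and matches are hypothetical.

```python
import unicodedata


def fold_for_matching(text: str) -> str:
    """Return a key for naive accent- and case-insensitive matching."""
    # NFKD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that upper and lower case compare equal.
    return stripped.casefold()


def matches(query: str, document: str) -> bool:
    """Naive substring match on the folded forms (hypothetical helper)."""
    return fold_for_matching(query) in fold_for_matching(document)


# Precomposed and decomposed spellings of the same name give the same key.
print(fold_for_matching("K\u00fcster") == fold_for_matching("Ku\u0308ster"))
# Letters with no decomposition, such as "ø", pass through unchanged,
# so a query typed as "Sorensen" would not find "Sørensen".
print(matches("Sorensen", "Keld Sørensen"))
```

Such folding is only a starting point. In languages where an accented letter is a distinct basic letter, stripping the accent can merge words that users expect to be kept apart, which is exactly the kind of requirement the scoping report is meant to capture.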

Linguistically, phonetically, and semantically aware matching is important: thesauri, Soundex and more recent phonetic analyses, and semantic analyses of full text, can be important here, but there may be particular limitations on existing tools in meeting European needs, which should also be investigated.
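As an illustration of why existing phonetic tools may fall short, the following is a minimal sketch of the classic American Soundex algorithm, which was designed for English surnames written in unaccented Latin letters. The handling of non-ASCII letters (they are simply ignored) is an assumption made here to show the limitation. It is not taken from the pre-draft.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    # Assumption: anything outside a-z is dropped, as in many naive versions.
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code to ""
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both R163
print(soundex("Lodz"), soundex("Łódź"))      # L320 and D000
```

The two spellings of the same Polish city name receive unrelated codes, because the letters outside the basic Latin alphabet are discarded before coding. A tool meeting European needs would have to treat such letters as first-class input.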

The Browsing section outlines the usefulness of library-related IR techniques, including the long tradition of abstracting and indexing tools in this field, where European organisations have been particularly active. TERENA's current REIS project, relating such techniques to the WWW, was of particular relevance.

The pre-draft concludes that the ideal would be a browsable subject index culled from the WWW itself. This might be part of European requirements.

A market survey would be useful to assess how well any existing providers and services met emerging European requirements.

It is also planned to provide a table of relevant projects in this area, and this table could also provide details of relevant products and standards.

3. Project deliverable and future actions

The meeting secretary will write up a meeting report (target within one week from 21 October 1999) to include comments.

These comments will be incorporated into a revision of this draft, which will then be placed on the STRI web site (target within two weeks from 21 October 1999).

There will then be a six-week comment period by email on this draft, and a further revision will be prepared, as a result of further comments.

It is anticipated that one more iteration of this last step, or even two more iterations, may be possible, before the next CEN/TC304 meeting in France in May 2000.

On the basis of final comments, the PT would therefore present a final report to the May 2000 meeting of CEN/TC304, and the PT would then be dissolved.

4. Discussion on the draft

There seemed to be general satisfaction with this pre-draft. Thorgeir Sigurdsson noted that the project list table was currently blank (see also the end of section 2). Marc Küster noted that the projects referred to in the text would be added to that table, along with others arising from comments. It may be useful to subdivide the list by function in due course, according to the concepts listed in the scope of the project, particularly (but not exclusively) those which related to standards.

5. The wider impact of this project

Marc Küster also suggested that in parallel with producing a specification of European requirements, CEN/TC304 should identify other organisations and/or consortia that CEN/TC304 hoped to influence, in order to persuade them to implement European requirements in this area. Thorgeir Sigurdsson suggested contacting other pan-European institutions and international institutions. Mike Ksar noted W3C in particular, and suggested effective liaison between W3C and CEN/TC304. Other relevant organisations might be IETF and Infoterm, as was noted by Yuri Demchenko and Thorgeir Sigurdsson, respectively.

In general discussion it was pointed out that dealing with legacy data is a problem: mappings to ISO/IEC 10646-1 exist, but somebody needs to organise actual conversions, and possibly find appropriate websites or ISPs to host the relevant large amounts of data involved.
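The distinction between having a mapping and carrying out a conversion can be shown with a minimal sketch, assuming Python's built-in codecs as the source of the mapping tables. The helper name to_ucs and the sample bytes are hypothetical. The point is that the same legacy bytes yield different text depending on which source character set is declared, so each data set has to be identified before it can be converted.

```python
def to_ucs(raw: bytes, source_charset: str) -> str:
    """Decode legacy bytes into Unicode (UCS) text using a named mapping."""
    return raw.decode(source_charset)


# Four bytes of legacy data whose character set is not recorded in the data.
raw = bytes([0xA3, 0xF3, 0x64, 0xBC])

# Interpreted as ISO 8859-2 (Latin-2) the bytes spell a Polish city name.
as_latin2 = to_ucs(raw, "iso8859_2")
# Interpreted as ISO 8859-1 (Latin-1) the same bytes give unrelated symbols.
as_latin1 = to_ucs(raw, "latin_1")

print(as_latin2)
print(as_latin1)

# Once decoded correctly, the text can be stored in a UCS encoding form.
utf8_bytes = as_latin2.encode("utf-8")
print(utf8_bytes)
```

In practice the difficult part is the organisational work noted in the discussion: establishing which character set each collection actually uses, and arranging for the converted data to be hosted.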

Perhaps the European Commission might see advantages in funding an initial pilot project for UCS conversion, to ensure maximum access to information within Europe.

Yuri Demchenko pointed out that TERENA is a pan-European institution active in some similar areas, particularly through its (X.500/LDAP) project on Multilinguality for cross-border services in Europe, which involved building large scale directories of web and indexing services, and an exchange of indexes.

A multilingual framework for the exchange of information between indexing services was of vital strategic importance to Europe, given the reliance of European research and development on such information sources, both in industrial and academic environments.

He also noted the Dublin Core Working Group meeting next week in Frankfurt, which he would be attending. It was important for them to address multilingual issues in this, as Dublin Core metadata was beginning to play an increasingly important role in large scale indexing, storage and retrieval in both library and web-based systems.

Mike Ksar stated that incorporating internationalisation within the design from the outset was vital, but that it was not done often enough, even within companies with a global marketplace. Sections such as marketing and design did not always have sufficient contact to influence the product.

It would be useful to try to develop a methodology and process so that companies starting new design projects were motivated to build internationalisation into the design from the outset, with appropriate information to hand, rather than just adding it on as a later complication.

He wondered whether the European Commission might fund the pilot development of such a mechanism if it would benefit European users of IT systems. Yuri Demchenko said that he had checked through a number of multilingual projects in the IST programme. For instance, in the "REGNARD" project on Digital Library Directories and Telematics for Research projects, adopted by the Commission, software internationalisation seemed to have been removed at evaluation, due to lack of expertise among the evaluators. Proper software internationalisation seemed to be lacking from many multilingual projects, and he thought that there was insufficient awareness of internationalisation issues among evaluators in IST projects.

In passing, Marc Küster also noted his own involvement in Tuebingen University's OPAC pilot project to ensure that internationalisation was "designed in" at the outset, and not just added on later.

John Clews noted the upcoming IUC (International Unicode Conference) in Amsterdam in Spring 2000. It may be useful to find contacts to influence there.

Marc Küster had already discussed possible attendance at that conference on behalf of the Computing Centre of the University of Tuebingen.

Erkki Kolehmainen also noted that CEN/ISSS had just held the kick-off meeting of the CEN/ISSS Metadata-Dublin Core Workshop (ISSS/WS MMI-DC). It may be worth PT members registering for this Workshop.

6. Summing up

Marc Küster and John Clews noted that these had been valuable comments, which would be incorporated in the next draft in line with the timetable noted in section 3 above, as progress towards the final deliverable.

John Clews
21 October 1999