SC22/WG20 N842

From: Ordering@sesame.demon.co.uk
Sent: Friday, May 11, 2001 6:40 AM

Language codes - information from ISO/TC37/SC2

1. Update since the WG20 meeting, November 2000

Since the meeting of ISO/TC37 (Terminology) in London in August 2000, and since ISO/IEC JTC1/SC22/WG20's last meeting, Gerhard Budin (Austria) has taken over as Chair of ISO/TC37/SC2 from Aat Vervoorn (Netherlands).

The 2-letter language code standard ISO 639-1: Codes for representation of names of languages, has still not been published. It is designed to replace ISO 639: Codes for representation of names of languages, by removing some errors and adding codes for more languages.

The 3-letter language code standard ISO 639-2: Codes for representation of names of languages, remains in force. ISO 639 (and its imminent replacement, ISO 639-1) remains a subset of ISO 639-2.

In the area of Internet specifications, RFC 1766 (Language tags) has been superseded by RFC 3066 (Language tags). RFC 3066 sets a precedence order under which 2-letter codes are used, where they exist, in preference to 3-letter codes. It is also planned to freeze ISO 639-1 so that no new 2-letter codes are added, to avoid duplicate codes being in use. Together, this provides an unambiguous coding mechanism for over 400 language codes.

2. Additional practices to consider

However, there are areas where confusion is possible, and care should be taken where non-normative use may be encountered:

   1. for the same repertoire of language codes, variant 3-letter codes are sometimes used (a) in libraries and (b) in some linguistic organizations which use SIL codes for the same values;

   2. there is frequently a user demand for additional codes beyond the 400-plus codes in ISO 639-2. In particular, SIL provides 3-letter codes for around 7,000 languages.
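The RFC 3066 precedence order described in section 1 can be sketched roughly as follows. This is only an illustration: the table holds a tiny hand-picked subset of codes rather than the full ISO 639-1 or ISO 639-2 lists, and the names TWO_LETTER_FOR and preferred_code are hypothetical, chosen for this example.

```python
# Illustrative sketch only: a tiny hand-picked subset of codes, not the full
# ISO 639-1 / ISO 639-2 tables. Table and function names are hypothetical.

# Maps an ISO 639-2 3-letter code to its 2-letter code,
# or to None when no 2-letter code exists.
TWO_LETTER_FOR = {
    "eng": "en",   # English
    "fra": "fr",   # French
    "kur": "ku",   # Kurdish
    "haw": None,   # Hawaiian: 3-letter code only
    "sgn": None,   # Sign languages: 3-letter code only
}


def preferred_code(three_letter: str) -> str:
    """Return the 2-letter code where one exists, else the 3-letter code."""
    key = three_letter.lower()
    if key not in TWO_LETTER_FOR:
        raise KeyError(f"unknown language code: {three_letter!r}")
    # Precedence rule: the 2-letter code wins whenever it is defined.
    return TWO_LETTER_FOR[key] or key


if __name__ == "__main__":
    for code in ("eng", "kur", "haw", "sgn"):
        print(code, "->", preferred_code(code))
```

A code outside the table raises KeyError rather than being passed through, so that an unregistered code is not silently accepted.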
Just as RFC 3066 provides that (a) 2-letter codes from ISO 639 are used where they exist, and (b) 3-letter codes from ISO 639-2 are used where 2-letter ISO 639 codes do not exist, so (c) some users also tend to use 3-letter SIL codes where there are no codes for languages in ISO 639, ISO 639-1 or ISO 639-2. This use is unregulated, and while there are generally no collisions in use, collisions remain possible.

The UK national member body intends to provide comprehensive information on point (c) above, aligning it with information from the Linguasphere Register, which documents (but does not provide codes for) around 70,000 languages. This would be a national member body contribution to the upcoming ISO/TC37/SC2 meeting in Toronto, August 2001, to assist in the development of the now approved New Work Item ISO 639-3 "Coding systems."

3. Future plans by ISO/TC37/SC2/WG1

ISO/TC37/SC2/WG1 N69 "Coding systems" (2001-01-31) by Haavard Hjulstad (convenor of ISO/TC37/SC2/WG1) describes this NWI, which has now been approved by ISO CS. Currently, three (closely interlinked) projects are planned.

   1. Development and maintenance of a database of language coding, (extracts of) which should be freely available on the web.

   2. Adding to this those languages that are currently not included in ISO 639-1 or ISO 639-2, without assigning standardized identifiers.

   3. Development of an International Standard for coding mechanisms for language variation, including variation through time, geographically determined dialectal variation, writing system, etc.

Comment on 1 and 2: the UK is concerned that insufficient information is proposed. Currently, ISO 639-1 contains 180 codes, and ISO 639-2 contains 438 entries. As at 2001-01-31, the database contains 493 entries. This compares with SIL (7,000 codes) and the Linguasphere Register (around 70,000 codes). Subsetting information from either or both of these sources would be a better basis.
Comment on 3: this aims to regulate the possible language combinations where ISO 639 codes can be combined with codes from other sources, e.g. from ISO 3166: Codes for representation of names of countries, from the draft standard ISO 15924: Codes for representation of names of scripts, and potentially from other standards too, to provide codes such as:

   "en US" = "English in the USA"
   "en CA" = "English in Canada"
   "en US-CA" = "English in the state of California"
   "ku Cyrl" = "Kurdish in Cyrillic script"
   "ku RU Cyrl" = "Kurdish in Russia in Cyrillic script"

The paper also suggests that standardized mechanisms should be developed to specify, e.g. "English in North America" or "English in southern California", and possibly to identify dialects, together with a mechanism for linking the ISO 639-2 code "sgn" = "Sign languages" with other elements in order to specify particular sign languages. The possibility of adding codes for groups of languages would also be investigated: currently this is a partial but not systematic part of ISO 639-2.

NB: in discussion, Canadian and US members of SC22/WG20 expressed considerable opposition to item 3 above, particularly as N69 announces the intention that this be the only part that is an International Standard. Their opposition was on grounds of conformance, as the scope of ISO 639 is very general in practice and affects other areas besides terminology.

John Clews agreed to pass information about North American contacts in ISO/TC37/SC2/WG1 to the US delegates to WG20, so that their concerns could be expressed.

Best regards

John Clews

--
John Clews, SESAME Computer Projects, 8 Avenue Rd, Harrogate, HG2 7PG
tel: +44 1423 888 432; fax: +44 1423 889061;
Email: Ordering@sesame.demon.co.uk

Committee Chair of ISO/TC46/SC2: Conversion of Written Languages;
Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Committee Member of ISO/TC37: Terminology