From keld@dkuug.dk Sun Feb 24 18:58:03 1991
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8) id AA13206; Sun, 24 Feb 91 18:58:03 +0100
Date: Sun, 24 Feb 91 18:58:03 +0100
From: Keld J|rn Simonsen
Message-Id: <9102241758.AA13206@dkuug.dk>
To: i18n@dkuug.dk, iso10646@jhuvm.bitnet
Subject: Re: (wg14 44) AT&T Bell Labs wishes for shorthand character names
Cc: npn@sirius.att.com, wg14@dkuug.dk
X-Charset: ASCII
X-Char-Esc: 29

Nils-Peter Nelson writes:

> I've been working with Brian Kernighan on an ISO 8859-1 version
> of troff. Brian has already modified the code to accept the
> 8 bit input, and my group is currently working on the ditroff-
> to-PostScript conversion for the additional characters.
> As a favor to ASCII people we want to preserve the troff convention
> of providing ASCII digraphs for the new characters; however, we
> now see that the troff conventions differ from the commonly used
> VT200 terminal digraphs commonly used world-wide. As an example,
> the British Pound sign is \(ps in troff, but is typed as
> L - on most ISO 8859-1 terminals. My way of handling
> this is to change troff to use the VT200 digraphs in the future.

Well, then which is the more common notation: troff or VT200?
I would guess (at least among troff users) that the troff convention
is the more widely used.

> I've spoken to Dennis Ritchie several times because he faces
> similar problems with C-- even if variable names are ASCII he
> wants to be able to handle ISO 8859 strings.

Yes, that would be handy. ISO WG14 is looking at the problem.
I am a member of this WG and I have an action item of providing
a quite general international C locale, and we are also discussing
general strings with access to the symbolic character names.
This discussion is also going on in the POSIX i18n forum.
I think WG14 would be happy to invite Dennis Ritchie to participate
in this discussion.

> What would avoid all the sturm-and-drang is if the ISO committees,
> as they agree on things like 8859-1,2,3,...
> and 10646 would
> provide *as part of the standard* the "shorthand" notation in
> the next "lower" character set. As an example, in 8859-1,
> character 10/03 might be defined by:
> 10/03 L- POUND SIGN
> (The digraph in this case must be two ISO 646 characters.)
> The standard would also recommend that, where 8 bit keyboards
> were not available, the sequence L - would be
> equivalent to 10/03. could be undefined-- on keyboards
> it's a key, in troff it would be \(, in C it might be something
> else.
> ISO 10646 would require a tetragraph, but again, it should be one
> recommended by the standards committee.

Yes, this has been proposed within ISO. Actually it was SC22 who
proposed this to SC2 - to provide unique naming and short identifiers
to all characters provided by SC2. The SC22 requirements were stated
in the paper SC22 N622R. SC2 responded by assigning unique (long
descriptive) names for characters in the new universal character code
ISO (DIS) 10646. But they did not want to provide short identifiers.

As SC22 needed this for various purposes, like the C and POSIX
standards, a NWI has been proposed and accepted by SC22. This NWI
covers internationalization (i18n) and includes character set work on
identifying and conversion within character sets. The text is not
fully clear on the shorthand requirement, but I think it is
sufficiently clear that this work is included. The NWI is assigned to
the new SC22 WG20 on internationalization. They have not met yet.
The convenor is Dick Weaver of IBM; he is on (at least) the
i18n@dkuug.dk list and thus gets these messages.

> Since you announced yourself to be the "contact person for
> the general public" I'm asking you to bring this to the
(Keld: "you" here means "Thom Plum".)
> attention of the various committees. If the standardization
> is not offered by ISO, we run the risk of different conventions
> in troff, TeX, C, MS-DOS, etc.

Yes, I share your concerns.
And they are shared by the ISO POSIX WG15 RIN rapporteur group on i18n
(of which I am a member). We surely would like to avoid this mess.

I can give you an overview of the work I know is going on in the
field of naming characters.

1. ISO 10646 is defining long descriptive names to be used in all ISO
SC2 character work. There are rules for the syntax of the names,
including that the letters in the names must be capital. I have a
SC2/WG3 paper N65 (April 1989) on these rules, but that might not be
the most recent. An example is:

    CYRILLIC CAPITAL LETTER ER

Some of the namings are available freely and electronically from
dkuug.dk:i18n/ISO_10646 - this also contains shorthands provided by
Danish Standards as described below.

2. ISO 6937-2 & ISO 10367 have a short naming of the Latin characters
and also some special characters. These shorthands are four-character,
with the first two being capital letters and the last two being
digits. An example:

    LA12 (which identifies some Latin A with an accent - I cannot
    remember which)

The naming is available freely and electronically in Johan van
Wingen's work, as noted below.

3. SGML - ISO 8879:1986, the Standard Generalized Markup Language, is
(one of) ISO's answers to troff, TeX etc. There are quite a few
shorthands there - I think they are mostly made up from upper and
lower case letters. An example is: which means LATIN CAPITAL LETTER A
WITH RING (10646 name). I do not know if these specs are available
electronically.

4. POSIX has a standard naming for the ASCII characters which are used
in the POSIX locale. They may differ a little from the 10646 names,
but not much, and then they are in lowercase. An example:
The naming is available as part of the Danish POSIX locale as noted
below.

5. Danish Standards (the Danish ISO member body) has produced an
elaborate "Example Danish National Locale" for POSIX, included in the
POSIX.2 draft 10 (published a bit later than the rest of draft 10)
and also in the next draft.
I have been very active in producing this specification. There are
shorthands for a considerable part of ISO 10646, covering many
alphabetic and ideographic characters, some 25000 characters in all
(1300 non-ideographic). IMHO it is the most elaborate work available
today on shorthands. Mostly the shorthands are two-character from the
invariant ISO 646 set (ASCII minus 12 characters), but longer names
are also permitted and used for ideographic characters. An example:

    R= for CYRILLIC CAPITAL LETTER ER

It is freely and electronically available in dkuug.dk:i18n/ISO_10646
and dkuug.dk:pub/ch.shar* . The work is used as a basis for work on
POSIX locale specification, for the ISO C international locale, for
OSI work and for other communication work (Internet).

6. OSI ISOCHARSTRING - SC21 decided at their meeting in Berlin in
Oct 1990 to make a new ASN.1 string specification, the ISOCHARSTRING.
There the long descriptive names of ISO 10646 are used, stripping
spaces in the name and converting all letters except the initial of
each word into lowercase. An example is:

    CyrillicCapitalLetterEr

7. Johan van Wingen from Nederlands Normaliserings Institut (the Dutch
ISO member body) has a convention for character naming, which is
two-character and drawn from ASCII (I think). It is used in his survey
of which languages require which characters, and also how these
characters are collated in each of these languages. The papers are
available electronically - one source is the iso10646 archive at
jhuvm.bitnet.

8. Troff conventions: The original Ossanna specifications had quite a
few shorthands for non-ASCII characters. Some other conventions
building on the Ossanna specs have been developed. I was the coauthor
of one (together with Ed Keizer and Jaap Akkerhuis) which was
discussed on the net recently. This article is available freely and
electronically from dkuug.dk. An example is:

    \(*a for GREEK SMALL LETTER ALPHA

9.
VT200 has the compose character function, which consists of a special
compose character and then two characters from the normal ASCII
keyboard.

10. TeX and other formatting packages. I do not know too much about
these, but TeX does have shorthands. TeX is available freely and
electronically from various sources (I am not sure where). WordPerfect
also has a shorthand, but that is just numbers. Other word processing
packages surely also have their conventions.

11. C - The ISO WG14 C committee is working on an addendum to the ISO C
standard ISO/IEC 9899:1990 (technically equivalent to the ANSI
standard). I have an action item on producing a proposal for an
international C locale, building on the Danish POSIX work.

12. Alain LaBonte' of the Canadian Standards Association is working on
a shorthand, especially for Chinese characters, as far as I
understand. I have not seen this work, though.

13. X windows is not naming characters, but has something that comes
close. In the X Input Methods for Japanese, a way of generating Kana
and Kanji characters from ASCII is provided. You may consider this as
shorthands. There may be other Input Methods defined for other
character sets. Also, X has written a specification for i18n POSIX
locales with the requirement that shorthand names (symbolic character
names) should be given in a restricted ASCII, namely the invariant
part of ISO 646, which contains 12 fewer characters. The X fonts and
names are freely available from MIT, also electronically
(expo.lcs.mit.edu - as far as I remember). It is HUGE.

14. IBM has a naming of letters which is much like the ISO 6937-2
naming, but is 8 characters. They use it for specification of all
their character sets. One document where it is used is SE09-8002-01
on Natural Language Support.

Keld Simonsen
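
To illustrate the ISOCHARSTRING rule in item 6 (strip the spaces from the ISO 10646 long name and lowercase every letter except the initial of each word), here is a minimal Python sketch. The function name `isocharstring_name` and the table `SHORTHAND_EXAMPLES` are hypothetical and exist only for this illustration; the table holds nothing beyond the two shorthand examples quoted in items 5 and 8. Long names containing hyphens or digits are not considered, since the message gives no rule for them.

```python
# Hypothetical helper: the rule comes from item 6, the name is an assumption.
def isocharstring_name(long_name: str) -> str:
    """Join the words of a long name, keeping only each initial in upper case."""
    return "".join(
        word[:1].upper() + word[1:].lower() for word in long_name.split()
    )


# Shorthand examples quoted in this message, keyed by (convention, shorthand).
# The convention labels are assumptions made for this sketch.
SHORTHAND_EXAMPLES = {
    ("danish-posix", "R="): "CYRILLIC CAPITAL LETTER ER",
    ("troff", "\\(*a"): "GREEK SMALL LETTER ALPHA",
}


if __name__ == "__main__":
    for (convention, shorthand), long_name in SHORTHAND_EXAMPLES.items():
        print(convention, shorthand, long_name, isocharstring_name(long_name))
```

Keying the table on both the convention and the shorthand reflects the concern raised in this thread: the same short string may mean different things in troff, VT200 compose sequences and the Danish locale, so a shorthand alone is not a unique identifier.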