From enag@ifi.uio.no Sat Sep 5 01:51:05 1992
Received: from ifi.uio.no by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
        id AA28536; Sat, 5 Sep 92 01:51:05 +0200
Received: from gyda.ifi.uio.no by ifi.uio.no with SMTP id ; Sat, 5 Sep 1992 01:51:11 +0200
Received: by gyda.ifi.uio.no ; Sat, 5 Sep 1992 01:51:10 +0200
From: Erik Naggum
Message-Id: <23331.05@erik.naggum.no>
Date: 05 Sep 1992 01:51:09 +0200 (19920904235109)
To: Multi-byte Code Issues
Cc: i18n@dkuug.dk
In-Reply-To: <"alf.uib.no.695:04.08.92.14.51.45"@uib.no> (Fri, 4 Sep 1992 10:46:53 -0500)
Subject: Re: request for feedback on character set identification proposal
X-Charset: ASCII
X-Char-Esc: 29

Jurgen, and others,

This is an important issue, and many parties are involved in finding
solutions to it.  So, naturally, am I, since I work with SGML, and SGML
suffers from a lack of solution to these problems as much as any.

My approach has been to use the character names defined in ISO 10646,
and to use SGML's character set declarations as the basis for my work.
I adhere to ISO 2022, and have under publication a list of private use
codes to identify IBM character sets (ASCII-based and EBCDIC-based),
Macintosh character sets, and other proprietary, non-registered sets.
E.g. the Macintosh character set would be assigned ESC 2/5 2/15 3/3, as
a coding system different from that of ISO 2022, with no standard
return.  Within the coding system and applications adhering to my
private use codes, this is a unique identifier.

It's highly regrettable that IBM has neglected to register their
character sets with ISO, since they do have a comprehensive Character
Data Representation Architecture (CDRA) which is quite intelligently
designed.  It is, however, quite recently developed, and I hear rumors
that IBM plans to submit it to ISO as a contribution.  IBM's scheme
includes support for ISO character sets, as well as multi-byte coding
schemes, and is largely, but not completely, a duplication of ISO 2022.
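The column/row notation used for escape sequences such as
ESC 2/5 2/15 3/3 maps directly onto byte values: column times 16 plus
row.  The following is a minimal sketch of that arithmetic only; the
function names are hypothetical and are not taken from any code
mentioned in this message.

```python
def colrow_to_byte(token: str) -> int:
    """Convert one 'column/row' token, e.g. '2/13', to its byte value."""
    column_text, row_text = token.split("/")
    column, row = int(column_text), int(row_text)
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must both be in 0..15: {token!r}")
    return column * 16 + row


def escape_sequence(notation: str) -> bytes:
    """Convert notation such as 'ESC 2/5 2/15 3/3' to the bytes it denotes."""
    out = bytearray()
    for token in notation.split():
        if token == "ESC":
            out.append(0x1B)  # ESC is position 1/11
        else:
            out.append(colrow_to_byte(token))
    return bytes(out)


if __name__ == "__main__":
    # The private-use identifier proposed above for the Macintosh set.
    print(escape_sequence("ESC 2/5 2/15 3/3").hex(" "))
```

The same helper applies to any designator or announcer written in
column/row form.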
A character set definition follows the character set declaration in
SGML, and consists of a base set identified by its ISO 2022 designator,
and a described character set.  I have elected to use the empty
character set according to ISO 2022, and have used literals for the
ISO 10646 name of the character in the described character set.  E.g.
the right half of the ISO 8859-1 character set looks like this:

        -- "ISO Registration Number 100//CHARSET Latin 1//ESC 2/13 4/1" --
        -- "ISO 8859-1:1987//CHARSET Latin 1//ESC 2/13 4/1" --
BASESET "ISO 2022//CHARSET Empty G1 Set//ESC 2/13 7/14"
DESCSET  32 -- 0020 -- 1 "NO-BREAK SPACE"
         33 -- 0021 -- 1 "INVERTED EXCLAMATION MARK"
         34 -- 0022 -- 1 "CENT SIGN"
         35 -- 0023 -- 1 "POUND SIGN"
         36 -- 0024 -- 1 "CURRENCY SIGN"
          .       .     .  .
          .       .     .  .
          .       .     .  .
        123 -- 007B -- 1 "LATIN SMALL LETTER U WITH CIRCUMFLEX"
        124 -- 007C -- 1 "LATIN SMALL LETTER U WITH DIAERESIS"
        125 -- 007D -- 1 "LATIN SMALL LETTER Y WITH ACUTE"
        126 -- 007E -- 1 "LATIN SMALL LETTER THORN"
        127 -- 007F -- 1 "LATIN SMALL LETTER Y WITH DIAERESIS"

(Comments are delimited by "--" in SGML.)

External to these character set definitions are the ISO 2022 announcers
which declare where the virtual character sets designated by escape
sequences such as ESC 2/13 4/1 are placed in the code space.  E.g. the
complete specification of an 8-bit environment using ISO 8859-1 would
be as follows:

        ESC 2/0 4/3     8-bit environment, no shift functions
        ESC 2/0 4/6     A C1 set shall be used with escape seq
        ESC 2/1 4/0     Control functions from ISO 646:1991
        ESC 2/2 7/14    Empty C1 control set (for completeness)
        ESC 2/8 4/2     ISO 646:1991 IRV
        ESC 2/13 4/1    ISO 8859-1:1987 right half

IBM has a similar approach to identifying character sets, including a
long list of identifying values.  They also have a short form
identifier, the CCSID, which I have used in my private use codes for
certain IBM character sets.
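To illustrate how readily a DESCSET of the form shown above can be
read mechanically, here is a small sketch that turns it into a table
from code position to character name.  It is an illustration under
stated assumptions, not a full SGML declaration parser: it handles
only quoted literals (not base set numbers or UNUSED), it assumes "--"
never occurs inside a literal, and the name parse_descset is made up
for this example.

```python
import re

# One DESCSET entry once comments are gone: position, count, "literal".
ENTRY = re.compile(r'(\d+)\s+(\d+)\s+"([^"]*)"')
# SGML comments are delimited by "--" on both sides.
COMMENT = re.compile(r"--.*?--", re.DOTALL)


def parse_descset(text: str) -> dict:
    """Return {code position: character name} for the DESCSET part of text."""
    without_comments = COMMENT.sub(" ", text)
    # Everything before the DESCSET keyword (e.g. BASESET) is ignored here.
    _, _, described = without_comments.partition("DESCSET")
    table = {}
    for match in ENTRY.finditer(described):
        start, count, name = int(match[1]), int(match[2]), match[3]
        # Assumption: a count above 1 gives every position the same literal.
        for offset in range(count):
            table[start + offset] = name
    return table


SAMPLE = '''
        -- "ISO 8859-1:1987//CHARSET Latin 1//ESC 2/13 4/1" --
BASESET "ISO 2022//CHARSET Empty G1 Set//ESC 2/13 7/14"
DESCSET  32 -- 0020 -- 1 "NO-BREAK SPACE"
         33 -- 0021 -- 1 "INVERTED EXCLAMATION MARK"
         34 -- 0022 -- 1 "CENT SIGN"
'''

if __name__ == "__main__":
    for position, name in sorted(parse_descset(SAMPLE).items()):
        print(position, name)
```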
The above character set declaration in SGML terms enables conversion
between any two character sets by name, and also allows named
"character entities" to be referenced where a character is used in the
source document but does not exist in the target character set.

I would like to thank Keld for doing the first 15% of the work in my
scheme.  I probably would not have started this project if his base had
not been provided.  However, I would also like to state publicly that
the sloppiness in his work has made it an irrevocable requirement for
users of his work to check every single character in every single
character set, or duplicate the entire work from scratch.
Consequently, I have added all missing characters in ISO 10646 to my
base name list, have checked all the character sets against the
ISO 2375 registry, and added character sets he apparently found
uninteresting, and am in the process of completing this work for all
IBM supported code pages and character repertoires.

In Keld's RFC 1345, I have found more than 3000 (yes, three thousand)
errors in the character tables, including EBCDIC control characters
(which he naively thinks are identical to ISO 646 control characters in
the same positions!), missing characters from Asian character sets,
left-shifting whole rows, and an EBCDIC code page with 17 rows (or 272
characters!).  These are easier to correct than the other randomly
occurring errors, but the attention to detail one would expect from
such a work is completely missing.

The reason for the lack of accuracy is partly explained by the coding
scheme chosen.  It is completely impossible for a human reader to debug
one of these charmaps and to verify their correctness.  Lack of
debuggability and attendant level of crypticity in the charmaps
prompted me to find another solution.  I have used the full ISO 10646
name, and since these are clearly delimited and are part of an easily
parsable sequence of tokens, they can very easily be checked for
correctness.
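A sketch of the kind of mechanical check that full names make
possible: given (position, name) pairs, report positions outside the
code space, positions defined twice, and names that are not known
character names.  Python's unicodedata module is used here only as a
stand-in for a base name list; it reflects current Unicode names,
which is an assumption and not the list described in this message.
The function name check_entries is likewise hypothetical.

```python
import unicodedata


def check_entries(entries, code_space=256):
    """Return a list of problems found in a sequence of (position, name)."""
    problems = []
    seen = set()
    for position, name in entries:
        if not 0 <= position < code_space:
            problems.append(
                f"position {position} is outside the "
                f"{code_space}-position code space"
            )
        if position in seen:
            problems.append(f"position {position} is defined twice")
        seen.add(position)
        try:
            unicodedata.lookup(name)  # raises KeyError for unknown names
        except KeyError:
            problems.append(f"position {position}: unknown name {name!r}")
    return problems


if __name__ == "__main__":
    entries = [
        (126, "LATIN SMALL LETTER THORN"),
        (127, "LATIN SMALL LETTER Y WITH DIAERESIS"),
        (127, "LATIN SMALL LETTER Y WITH DIARESIS"),  # duplicate, misspelt
        (271, "CENT SIGN"),  # a 17th row does not fit in 256 positions
    ]
    for problem in check_entries(entries):
        print(problem)
```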
It's also near trivial to write a new character set declaration since
the names are relatively easy to deal with, compared with Keld's
cryptic two-to-five-character-long codes.

My scheme is intended to solve this problem in the SGML context, and I
have had no intention of massively marketing it on the scale which Keld
has been marketing his sloppy work, but it would of course please me if
it can be found useful outside of the SGML context, and I would be
especially happy if SGML's "formal public identifiers" could be used.
ISO 9070 has the details on these, and ISO 8879 has the complete
specification of the character set declaration.  I have built on the
internal character set declaration to define public text character set
declarations.  I'm writing a contribution to ISO/IEC JTC1/SC18/WG8 on
this, and can cc this list if there is interest.

This has all been unpublished volunteer work so far, but I'm looking
for companies and organizations who could be willing to fund
publication and further work on this as a research project, as well as
"ex post facto funding" for the work already done.  The results will,
of course, be useful only if they are publicly available, so this can't
be sold to anyone as a product.

However, I have also written a large body of C code to handle the
conversion between any two character sets based on these character set
declarations, and to read an (or any, actually) ISO 2022 conformant
data stream and return ISO 10646 characters to the application, as well
as to accept ISO 10646 characters from an application and write an
ISO 2022 conformant data stream under a few user-specified constraints.
This should probably also be made available on the same condition as
the scheme and the character set declarations.  Please contact me
directly for further information.

Best regards,
--
Erik Naggum             |  ISO  8879 SGML     |  +47 295 0313
                        |  ISO 10744 HyTime   |
                        |  ISO 10646 UCS      |  Memento, terrigena.
                        |  ISO  9899 C        |  Memento, vita brevis.
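The conversion by name described in this message, with a fallback to
named character entities, can be sketched as follows.  This is not
the C code referred to above, only an illustration of the idea; the
two tiny tables, the entity name, and the function name convert are
assumptions made for the example.

```python
def convert(data: bytes, source: dict, target: dict, entities: dict) -> list:
    """Convert data from one set to another by joining on character names.

    source and target map code positions to character names.  The result
    is a list holding a target code position (int) for each character the
    target set contains, and an entity reference (str) for each one it
    lacks.
    """
    target_by_name = {name: position for position, name in target.items()}
    result = []
    for byte in data:
        name = source[byte]  # KeyError if the source set leaves it undefined
        if name in target_by_name:
            result.append(target_by_name[name])
        elif name in entities:
            result.append("&" + entities[name] + ";")
        else:
            raise LookupError(f"no target code or entity for {name!r}")
    return result


# Illustrative fragments only, not complete character set declarations.
SOURCE = {
    0x41: "LATIN CAPITAL LETTER A",
    0xA3: "POUND SIGN",
    0xFE: "LATIN SMALL LETTER THORN",
}
TARGET = {
    0x41: "LATIN CAPITAL LETTER A",
    0x23: "POUND SIGN",  # as in a national variant that puts it at 2/3
}
ENTITIES = {"LATIN SMALL LETTER THORN": "thorn"}

if __name__ == "__main__":
    print(convert(bytes([0x41, 0xA3, 0xFE]), SOURCE, TARGET, ENTITIES))
```

Because the join is on names, neither table needs to know anything
about the other's code positions.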