From enag@ifi.uio.no Tue Sep 8 18:40:57 1992
Received: from ifi.uio.no by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
        id AA12622; Tue, 8 Sep 92 18:40:57 +0200
Received: from gyda.ifi.uio.no by ifi.uio.no with SMTP id ;
        Tue, 8 Sep 1992 18:41:04 +0200
Received: by gyda.ifi.uio.no ; Tue, 8 Sep 1992 18:41:03 +0200
From: Erik Naggum
Message-Id: <23335.07@erik.naggum.no>
Date: 08 Sep 1992 18:41:02 +0200 (19920908164102)
To: Multi-byte Code Issues
Cc: i18n@dkuug.dk
In-Reply-To: <"alf.uib.no.849:08.08.92.15.26.08"@uib.no> (Tue, 8 Sep 1992 10:54:39 EDT)
Subject: Re: request for feedback on character set identification proposal
X-Charset: ASCII
X-Char-Esc: 29

| >It's highly regrettable that IBM has neglected to register their
| >character sets with ISO, since they do have a comprehensive Character
| >Data Representation Architecture (CDRA) which is quite intelligently
| >designed.
| >                                                        Erik Naggum
|
| Perhaps we are lucky that we have fewer characters sets that we must
| deal with. :-)

The number of extant character sets is independent of whether they're
registered with ISO or not.  What registration would help with is
access to their definition.  IBM has made this a royal pain in the butt
by using a different character identification scheme.

| I am not an SGML expert so what follow may not make complete sense if
| the following assumption is not true.
|
| Assumption: SGML requires that it be encoded in ISO 2022.

This assumption is wrong.  SGML makes no presumption about the
character set(s) in which a conforming document is encoded.  Rather, a
description of the encoding is found at the head of the document, in an
"SGML declaration".  ISO 10646 can be used just as any other character
set.

| Erik's solution to the codeset problem for SGML is to register more
| codesets.

No.  You've misunderstood.
My solution is to map each character to a descriptive name, using a
small character repertoire (whose encoding is system-dependent), so
that you can identify a character in a character set by its name, and
convert it to another encoding by name lookup.

I assign ISO 2022 designating escape sequences to character sets to
have a name for them.  I could call them by a serial number, code page
number, the date and time I finish describing it, the number of cans of
Coke I've drunk up to that point in time, or whatever.  ISO 2022 is
just a very good naming scheme which SGML already supports in its
character set identifiers.

| I would prefer a solution that registered only one more codeset, ISO
| 10646 level 3.  Even better would be to change SGML to allow
| (require) ISO 10646.

We can't "require" character sets as long as people can't be punished
by law for not using what we require.  What I do, and what SGML is so
good at, is _describing_ what people are already doing in a way which
makes each document self-contained.

SGML already allows ISO 10646, insofar as it allows any other character
set.  What I'm doing is designing the infrastructure so an SGML parser
doesn't need to know what the character set is.  ISO 10646 comes in
very handy, here.

| ISO 10646 is the most complete registry of names we have and it will
| become more complete as rarer scripts are added.

Exactly!  I'm very, very grateful for this, and I'm exploiting it all I
can.

| I would suggest that we should revisit all the existing codesets and
| respecify them as their mapping to ISO 10646.  We should strongly
| resist the creation of any new codesets.

ISO 2375 maps a character in a character set to a registered name in
the table for each character set.  I've proceeded to unify those names
in accordance with the names in ISO 10646.  Thus, there is _already_ a
mapping between a character and ISO 10646, but _not_ by character
number (code point, whatever).
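
A minimal sketch of the name-lookup conversion described above, not the
actual implementation.  It assumes each character set description is a
table from character number to character name.  The two tables are tiny
hand-picked fragments of ISO 8859-1 and IBM code page 437, the names
follow current ISO 10646 naming, and every identifier (LATIN1_FRAGMENT,
CP437_FRAGMENT, build_converter, convert) is hypothetical.

```python
# Toy character set descriptions: character number -> character name.
# Only a few entries per set are listed (assumption: illustrative fragments).
LATIN1_FRAGMENT = {
    0x41: "LATIN CAPITAL LETTER A",
    0xC5: "LATIN CAPITAL LETTER A WITH RING ABOVE",
    0xE5: "LATIN SMALL LETTER A WITH RING ABOVE",
    0xF8: "LATIN SMALL LETTER O WITH STROKE",
}

CP437_FRAGMENT = {
    0x41: "LATIN CAPITAL LETTER A",
    0x8F: "LATIN CAPITAL LETTER A WITH RING ABOVE",
    0x86: "LATIN SMALL LETTER A WITH RING ABOVE",
}


def build_converter(source, target):
    """Derive a number-to-number table at run time by joining on name."""
    name_to_target = {name: code for code, name in target.items()}
    return {
        code: name_to_target[name]
        for code, name in source.items()
        if name in name_to_target  # characters the target lacks are left out
    }


def convert(data, table):
    """Convert a byte string, refusing characters with no named counterpart."""
    out = bytearray()
    for byte in data:
        if byte not in table:
            raise ValueError(f"no counterpart for character number {byte:#04x}")
        out.append(table[byte])
    return bytes(out)


table = build_converter(LATIN1_FRAGMENT, CP437_FRAGMENT)
converted = convert(bytes([0x41, 0xE5]), table)
```

Neither description mentions the other's character numbers.  The table
is computed from the names alone, and a character the target lacks
(0xF8 here) is reported by name lookup instead of being silently
mismapped.
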
Tables of bare character numbers are impossible to get right (as
witness Keld's cryptic code tables), but we can do a lot of meaningful
checking and validation if we go by name.

As long as we keep describing character sets in terms of names
according to ISO 10646, there's no harm done in creating more of them.
A good character set manager would figure out what the character number
mapping should be at run-time, by reading the source and target
character set descriptions, and an application can see ISO 10646
character numbers.  This is what I'm working on.

| SGML documents specify the mapping from codesets to "glyphs in fonts".

Not at all.  I don't know where you got this idea, but it's utterly
false.

| Thus Illuminated would be just another SGML attribute like Italic.

"Italic" is _not_ an "SGML attribute".  Where have you picked up this
nonsense?  SGML doesn't know about fonts; it's a language to represent
the structure of information such that an SGML application can map
between an element type and some set of formatting rules and states.
The current font is typically such a state, but it's not known to the
SGML parser.

I'll be happy to discuss SGML, but this list is inappropriate for the
topic.  See SGML-L or USENET comp.text.sgml.

Best regards,
--
Erik Naggum   |  ISO  8879 SGML    |  +47 295 0313
              |  ISO 10744 HyTime  |
              |  ISO 10646 UCS     |  Memento, terrigena.
              |  ISO  9899 C       |  Memento, vita brevis.