From enag@ifi.uio.no Tue Sep  8 18:40:57 1992
Received: from ifi.uio.no by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA12622; Tue, 8 Sep 92 18:40:57 +0200
Received: from gyda.ifi.uio.no by ifi.uio.no with SMTP 
	id <AAifi.uio.no22339>; Tue, 8 Sep 1992 18:41:04 +0200
Received: by gyda.ifi.uio.no ; Tue, 8 Sep 1992 18:41:03 +0200
From: Erik Naggum <enag@ifi.uio.no>
Message-Id: <23335.07@erik.naggum.no>
Date: 08 Sep 1992 18:41:02 +0200 (19920908164102)
To: Multi-byte Code Issues <ISO10646@JHUVM.bitnet>
Cc: i18n@dkuug.dk
In-Reply-To: <"alf.uib.no.849:08.08.92.15.26.08"@uib.no> (Tue, 8 Sep 1992 10:54:39 EDT)
Subject: Re: request for feedback on character set identification proposal
X-Charset: ASCII
X-Char-Esc: 29

|   >It's highly regrettable that IBM has neglected to register their
|   >character sets with ISO, since they do have a comprehensive Character
|   >Data Representation Architecture (CDRA) which is quite intelligently
|   >designed.
|   > Erik Naggum
|   
|   Perhaps we are lucky that we have fewer character sets that we must
|   deal with.  :-)

The number of extant character sets is independent of whether they're
registered with ISO or not.  What registration would help with is access
to their definition.  IBM has made this a royal pain in the butt by
using a different character identification scheme.

|   I am not an SGML expert, so what follows may not make complete sense if
|   the following assumption is not true.
|   
|   Assumption: SGML requires that it be encoded in ISO 2022.

This assumption is wrong.  SGML makes no presumption about the character
set(s) in which a conforming document is encoded.  Rather, a description
of the encoding is found at the head of the document, in an "SGML
declaration".  ISO 10646 can be used just as any other character set.

|   Erik's solution to the codeset problem for SGML is to register more
|   codesets.

No.  You've misunderstood.  My solution is to map each character to a
descriptive name, using a small character repertoire (whose encoding is
system-dependent), so that you can identify a character in a character
set by its name, and convert it to another encoding by name lookup.  I
assign ISO 2022 designating escape sequences to character sets to have a
name for them.  I could call them by a serial number, code page number,
the date and time I finish describing it, the number of cans of Coke
I've drunk up to that point in time, or whatever.  ISO 2022 is just a
very good naming scheme which SGML already supports in its character set
identifiers.
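
To make the name lookup concrete, here is a minimal sketch in Python
(chosen purely for illustration).  The two tables are hypothetical,
hand-made excerpts of ISO 8859-1 and IBM code page 437, not registered
descriptions, and convert_by_name is an invented helper name.

```python
# Hypothetical excerpts of two character set descriptions.  Each maps a
# code position to a descriptive character name (ISO 10646 style).
LATIN1_EXCERPT = {
    0x41: "LATIN CAPITAL LETTER A",
    0xE5: "LATIN SMALL LETTER A WITH RING ABOVE",
    0xE6: "LATIN SMALL LETTER AE",
    0xF8: "LATIN SMALL LETTER O WITH STROKE",
}
CP437_EXCERPT = {
    0x41: "LATIN CAPITAL LETTER A",
    0x86: "LATIN SMALL LETTER A WITH RING ABOVE",
    0x91: "LATIN SMALL LETTER AE",
}


def convert_by_name(data, source, target):
    """Re-encode bytes from `source` to `target` by character name."""
    # Invert the target description: name -> code position.
    by_name = {name: code for code, name in target.items()}
    out = bytearray()
    for code in data:
        name = source.get(code)
        if name is None:
            raise LookupError(f"code {code:#04x} is not described in the source set")
        if name not in by_name:
            raise LookupError(f"{name} has no code in the target set")
        out.append(by_name[name])
    return bytes(out)


converted = convert_by_name(b"\xe6A\xe5", LATIN1_EXCERPT, CP437_EXCERPT)
print(converted)  # b'\x91A\x86'
```

A character the target set does not describe is reported by name, which
says far more than a bare code number would.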

|   I would prefer a solution that registered only one more codeset, ISO
|   10646 level 3.  Even better would be to change SGML to allow
|   (require) ISO 10646.

We can't "require" character sets as long as people can't be punished by
law for not using what we require.  What I do, and what SGML is so good
at, is _describing_ what people are already doing in a way which makes
each document self-contained.  SGML already allows ISO 10646, insofar as
it allows any other character set.  What I'm doing is designing the
infrastructure so an SGML parser doesn't need to know what the character
set is.  ISO 10646 comes in very handy, here.

|   ISO 10646 is the most complete registry of names we have and it will
|   become more complete as rarer scripts are added.

Exactly!  I'm very, very grateful for this, and I'm exploiting it all I
can.

|   I would suggest that we should revisit all the existing codesets and
|   respecify them as their mapping to ISO 10646.  We should strongly
|   resist the creation of any new codesets.

Each ISO 2375 registration maps every character in a character set to
a registered name in the table for that set.  I've proceeded to unify
those names with the names in ISO 10646.  Thus, there is _already_ a
mapping between a character and ISO 10646, but _not_ by character number
(code point, whatever).  Such random number tables are impossible to get
right (as witness Keld's cryptic code tables), but we can do a lot of
meaningful checking and validation if we go by name.
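
As an illustration of the kind of checking that names make possible,
here is a small Python sketch.  It assumes that the names in Python's
unicodedata module can stand in for the ISO 10646 names; the validate
function and the sample table are hypothetical.

```python
import unicodedata


def validate(table):
    """Return a list of problems found in a {code: name} description."""
    problems = []
    seen = {}  # name -> first code position that used it
    for code, name in sorted(table.items()):
        # A misspelled or invented name is caught at once.
        try:
            unicodedata.lookup(name)
        except KeyError:
            problems.append(f"{code:#04x}: unknown name {name!r}")
        # The same name at two code positions is almost certainly an error.
        if name in seen:
            problems.append(f"{code:#04x}: {name!r} already at {seen[name]:#04x}")
        else:
            seen[name] = code
    return problems


# Hypothetical description containing one misspelling and one duplicate.
SUSPECT = {
    0x86: "LATIN SMALL LETTER A WITH RING ABOVE",
    0x91: "LATIN SMALL LETER AE",
    0x92: "LATIN SMALL LETTER A WITH RING ABOVE",
}

for problem in validate(SUSPECT):
    print(problem)
```

Neither mistake could be detected in a table of bare code numbers,
since any number looks as plausible as any other.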

As long as we keep describing character sets in terms of names according
to ISO 10646, there's no harm done in creating more of them.  A good
character set manager would figure out what the character number mapping
should be at run-time, by reading the source and target character set
descriptions, and an application can see ISO 10646 character numbers.
This is what I'm working on.
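
A rough sketch of such a manager, in Python, might look as follows.
The description format (one "hex-code name" pair per line), the sample
texts, and the function names are all assumptions made for the sake of
the example, and Python's unicodedata module stands in for the ISO
10646 name table.

```python
import unicodedata

# Hypothetical description texts, as they might be read from files.
LATIN1_TEXT = """
41 LATIN CAPITAL LETTER A
E5 LATIN SMALL LETTER A WITH RING ABOVE
E6 LATIN SMALL LETTER AE
F8 LATIN SMALL LETTER O WITH STROKE
"""
CP437_TEXT = """
41 LATIN CAPITAL LETTER A
86 LATIN SMALL LETTER A WITH RING ABOVE
91 LATIN SMALL LETTER AE
"""


def parse_description(text):
    """Parse '<hex code> <character name>' lines into {code: name}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, name = line.split(None, 1)
        table[int(code, 16)] = name
    return table


def build_mapping(source, target):
    """Derive the source-code -> target-code table at run time, by name."""
    by_name = {name: code for code, name in target.items()}
    return {code: by_name[name]
            for code, name in source.items() if name in by_name}


def to_ucs(table):
    """Map each code position to a character number, looked up by name."""
    return {code: ord(unicodedata.lookup(name))
            for code, name in table.items()}


latin1 = parse_description(LATIN1_TEXT)
cp437 = parse_description(CP437_TEXT)
mapping = build_mapping(latin1, cp437)
ucs = to_ucs(cp437)
```

No number-to-number table is written by hand anywhere; both mapping and
ucs are computed from the two descriptions when they are read.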

|   SGML documents specify the mapping from codesets to "glyphs in fonts".

Not at all.  I don't know where you got this idea, but it's utterly false.

|   Thus Illuminated would be just another SGML attribute like Italic.

"Italic" is _not_ an "SGML attribute".  Where have you picked up this
nonsense?  SGML doesn't know about fonts; it's a language to represent
the structure of information such that an SGML application can map
between an element type and some set of formatting rules and states.
The current font is typically such a state, but it's not known to the
SGML parser.

I'll be happy to discuss SGML, but this list is inappropriate for the
topic.  See SGML-L or USENET comp.text.sgml.

Best regards,
</Erik>
--
Erik Naggum             |  ISO  8879 SGML     |      +47 295 0313
                        |  ISO 10744 HyTime   |  
<erik@naggum.no>        |  ISO 10646 UCS      |      Memento, terrigena.
<enag@ifi.uio.no>       |  ISO  9899 C        |      Memento, vita brevis.