From enag@ifi.uio.no Sat Sep  5 01:51:05 1992
Received: from ifi.uio.no by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA28536; Sat, 5 Sep 92 01:51:05 +0200
Received: from gyda.ifi.uio.no by ifi.uio.no with SMTP 
	id <AAifi.uio.no21774>; Sat, 5 Sep 1992 01:51:11 +0200
Received: by gyda.ifi.uio.no ; Sat, 5 Sep 1992 01:51:10 +0200
From: Erik Naggum <enag@ifi.uio.no>
Message-Id: <23331.05@erik.naggum.no>
Date: 05 Sep 1992 01:51:09 +0200 (19920904235109)
To: Multi-byte Code Issues <ISO10646@JHUVM.bitnet>
Cc: i18n@dkuug.dk
In-Reply-To: <"alf.uib.no.695:04.08.92.14.51.45"@uib.no> (Fri, 4 Sep 1992 10:46:53 -0500)
Subject: Re: request for feedback on character set identification proposal
X-Charset: ASCII
X-Char-Esc: 29

Jurgen, and others,

This is an important issue, and many parties are involved in finding
solutions to it.  So, naturally, am I, since I work with SGML, and
SGML suffers from the lack of a solution to these problems as much as any.

My approach has been to use the character names defined in ISO 10646,
and to use SGML's character set declarations as the basis for my work.
I adhere to ISO 2022, and have under publication a list of private use
codes to identify IBM character sets (ASCII-based and EBCDIC-based),
Macintosh character sets, and other proprietary, non-registered sets.

E.g. the Macintosh character set would be assigned ESC 2/5 2/15 3/3, as
a coding system different from that of ISO 2022, with no standard
return.  Within the coding system and applications adhering to my
private use codes, this is a unique identifier.
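
For readers less used to the column/row notation of ISO 2022, here is a
minimal sketch in Python of how a sequence such as ESC 2/5 2/15 3/3 maps
to bytes.  The function name notation_to_bytes is made up for this
illustration and is not part of my code; the only rule it relies on is
that position c/r denotes the byte value 16*c + r.

```python
ESC = 0x1B  # the ESCAPE control character, position 1/11


def notation_to_bytes(sequence: str) -> bytes:
    """Convert column/row notation such as 'ESC 2/5 2/15 3/3' to bytes."""
    out = bytearray()
    for token in sequence.split():
        if token == "ESC":
            out.append(ESC)
            continue
        column, row = (int(part) for part in token.split("/"))
        if not (0 <= column <= 15 and 0 <= row <= 15):
            raise ValueError(f"position out of range: {token}")
        # Position c/r is byte value 16*c + r.
        out.append(column * 16 + row)
    return bytes(out)


macintosh = notation_to_bytes("ESC 2/5 2/15 3/3")
print(macintosh.hex(" "))  # 1b 25 2f 33
```

The same helper can be applied to any designator or announcer written in
this notation, e.g. ESC 2/13 4/1.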

It's highly regrettable that IBM has neglected to register their
character sets with ISO, since they do have a comprehensive Character
Data Representation Architecture (CDRA) which is quite intelligently
designed.  It is, however, quite recently developed, and I hear rumors
that IBM plan to submit it to ISO as a contribution.  IBM's scheme
includes support for ISO character sets, as well as multi-byte coding
schemes, and is largely, but not completely, a duplication of ISO 2022.

A character set definition follows the character set declaration in
SGML, and consists of a base set identified by its ISO 2022 designator,
and a described character set.  I have elected to use the empty
character set according to ISO 2022, and have used literals for the ISO
10646 name of the character in the described character set.

E.g. the right half of the ISO 8859-1 character set looks like this:

-- "ISO Registration Number 100//CHARSET Latin 1//ESC 2/13 4/1" --
-- "ISO 8859-1:1987//CHARSET Latin 1//ESC 2/13 4/1" --
BASESET
"ISO 2022//CHARSET Empty G1 Set//ESC 2/13 7/14"
DESCSET
   32 -- 0020 --     1 "NO-BREAK SPACE"
   33 -- 0021 --     1 "INVERTED EXCLAMATION MARK"
   34 -- 0022 --     1 "CENT SIGN"
   35 -- 0023 --     1 "POUND SIGN"
   36 -- 0024 --     1 "CURRENCY SIGN"
    .       .        . .
    .       .        . .
    .       .        . .
  123 -- 007B --     1 "LATIN SMALL LETTER U WITH CIRCUMFLEX"
  124 -- 007C --     1 "LATIN SMALL LETTER U WITH DIAERESIS"
  125 -- 007D --     1 "LATIN SMALL LETTER Y WITH ACUTE"
  126 -- 007E --     1 "LATIN SMALL LETTER THORN"
  127 -- 007F --     1 "LATIN SMALL LETTER Y WITH DIAERESIS"

(Comments are delimited by "--" in SGML.)
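
To show how easily such lines can be read by a program, here is a rough
sketch in Python, not my actual C code.  The names ENTRY and
parse_descset are invented for the example.  It assumes one entry per
line, at most one "--" comment between the two numbers, and a count of 1
as in the sample above.  It also uses Python's unicodedata module as a
stand-in for the ISO 10646 name list, which is an assumption: the names
in that module may not agree with ISO 10646 in every case.

```python
import re
import unicodedata

# position, optional SGML comment, count, quoted character name
ENTRY = re.compile(r'^\s*(\d+)\s*(?:--.*?--\s*)?(\d+)\s+"([^"]+)"\s*$')


def parse_descset(lines):
    """Return {position within the set: character name}."""
    table = {}
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            continue  # BASESET/DESCSET keywords, elisions, blank lines
        position, count, name = match.groups()
        if int(count) != 1:
            # The sample only uses a count of 1; ranges are not handled.
            raise ValueError(f"unsupported count in: {line!r}")
        table[int(position)] = name
    return table


sample = [
    'DESCSET',
    '   32 -- 0020 --     1 "NO-BREAK SPACE"',
    '   33 -- 0021 --     1 "INVERTED EXCLAMATION MARK"',
    '    .       .        . .',
    '  127 -- 007F --     1 "LATIN SMALL LETTER Y WITH DIAERESIS"',
]
table = parse_descset(sample)
for position, name in table.items():
    byte = 0x80 | position  # G1 invoked in the right half of an 8-bit code
    code_point = ord(unicodedata.lookup(name))
    print(f"{byte:#04x}  U+{code_point:04X}  {name}")
```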

External to these character set definitions are the ISO 2022 announcers
which declare where the virtual character sets designated by escape
sequences such as ESC 2/13 4/1 are placed in the code space.

E.g. the complete specification of an 8-bit environment using ISO 8859-1
would be as follows:

ESC 2/0 4/3		8-bit environment, no shift functions
ESC 2/0 4/6		A C1 set shall be used with escape seq
ESC 2/1 4/0		Control functions from ISO 646:1991
ESC 2/2 7/14		Empty C1 control set (for completeness)
ESC 2/8 4/2		ISO 646:1991 IRV
ESC 2/13 4/1		ISO 8859-1:1987 right half

IBM has a similar approach to identifying character sets including a
long list of identifying values.  They also have a short form
identifier, the CCSID, which I have used in my private use codes for
certain IBM character sets.

The above character set declaration in SGML terms enables conversion
between any two character sets by name, and also allows named "character
entities" to be referenced where a character is used in the source
document but does not exist in the target character set.
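
A minimal sketch of conversion by name, in Python, may make this
concrete.  The table fragments, the function name convert, and the
ENTITIES mapping are all hypothetical and reduced to four characters;
the entity name "eacute" is borrowed from the ISO 8879 Latin 1 entity
set for illustration.  The sketch also assumes that the target code is
ASCII-based, so that the entity reference itself can be written in
ASCII.

```python
BASE = {
    0x61: "LATIN SMALL LETTER A",
    0x63: "LATIN SMALL LETTER C",
    0x66: "LATIN SMALL LETTER F",
}
# Hypothetical fragments of three character set tables.
LATIN1 = {**BASE, 0xE9: "LATIN SMALL LETTER E WITH ACUTE"}
CP437 = {**BASE, 0x82: "LATIN SMALL LETTER E WITH ACUTE"}
ASCII = dict(BASE)

# Hypothetical mapping from character name to entity name.
ENTITIES = {"LATIN SMALL LETTER E WITH ACUTE": "eacute"}


def convert(data: bytes, source: dict, target: dict, entities: dict) -> bytes:
    """Convert by character name; fall back to an entity reference."""
    by_name = {name: code for code, name in target.items()}
    out = bytearray()
    for code in data:
        name = source[code]
        if name in by_name:
            out.append(by_name[name])
        else:
            # Character missing from the target set: reference it by name.
            out += b"&" + entities[name].encode("ascii") + b";"
    return bytes(out)


print(convert(b"caf\xe9", LATIN1, CP437, ENTITIES))  # b'caf\x82'
print(convert(b"caf\xe9", LATIN1, ASCII, ENTITIES))  # b'caf&eacute;'
```

Neither table needs to know anything about the other; the character
name is the only thing they share.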

I would like to thank Keld for doing the first 15% of the work in my
scheme.  I probably would not have started this project if his base had
not been provided.  However, I would also like to state publicly that
the sloppiness in his work has made it an irrevocable requirement for
users of his work to check every single character in every single
character set, or duplicate the entire work from scratch.  Consequently,
I have added all missing characters in ISO 10646 to my base name list,
have checked all the character sets against the ISO 2375 registry, and
added character sets he apparently found uninteresting, and am in the
process of completing this work for all IBM supported code pages and
character repertoires.  In Keld's RFC 1345, I have found more than 3000
(yes, three thousand) errors in the character tables, including EBCDIC
control characters (which he naively thinks are identical to ISO 646
control characters in the same positions!), missing characters from
Asian character sets, left-shifting whole rows, and an EBCDIC code page
with 17 rows (or 272 characters!).  These are easier to correct than the
other randomly occurring errors, but the attention to detail one would
expect from such a work is completely missing.

The reason for the lack of accuracy is partly explained by the coding
scheme chosen.  It is completely impossible for a human reader to debug
one of these charmaps and to verify its correctness.  The lack of
debuggability and the attendant crypticity of the charmaps prompted me to
find another solution.  I have used the full ISO 10646 name, and since
these are clearly delimited and are part of an easily parsable sequence
of tokens, they can very easily be checked for correctness.  It's also
near trivial to write a new character set declaration since the names
are relatively easy to deal with, compared with Keld's cryptic
two-to-five-character-long codes.
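
As an illustration of the kind of mechanical check that full names make
possible, here is a small sketch in Python.  It is not my checking
procedure; the function check_table and the sample table are invented
for the example, and Python's unicodedata module again stands in for the
base name list, on the assumption that its names agree with ISO 10646.

```python
import unicodedata


def check_table(table: dict, size: int = 256) -> list:
    """Return a list of human-readable problems found in a table."""
    problems = []
    seen = {}
    for position, name in sorted(table.items()):
        if not 0 <= position < size:
            problems.append(f"{position}: outside a {size}-position code")
        try:
            unicodedata.lookup(name)  # raises KeyError for unknown names
        except KeyError:
            problems.append(f"{position}: unknown name {name!r}")
        if name in seen:
            problems.append(f"{position}: {name!r} already at {seen[name]}")
        else:
            seen[name] = position
    return problems


# Hypothetical table with a misspelled name, a duplicate, and a
# position beyond the end of an 8-bit code.
suspect = {
    0x41: "LATIN CAPITAL LETTER A",
    0x42: "LATIN CAPITAL LETER B",
    0x43: "LATIN CAPITAL LETTER A",
    271: "DIGIT ZERO",
}
for problem in check_table(suspect):
    print(problem)
```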

My scheme is intended to solve this problem in the SGML context, and I
have had no intention of massively marketing it on the scale which Keld
has been marketing his sloppy work, but it would of course please me if
it can be found useful outside of the SGML context, and I would be
especially happy if SGML's "formal public identifiers" could be used.
ISO 9070 has the details on these, and ISO 8879 has the complete
specification of the character set declaration.  I have built on the
internal character set declaration to define public text character set
declarations.  I'm writing a contribution to ISO/IEC JTC1/SC18/WG8 on
this, and can cc this list if there is interest.

This has all been unpublished volunteer work so far, but I'm looking for
companies and organizations that might be willing to fund publication and
further work on this as a research project, as well as "ex post facto
funding" for the work already done.  The results will, of course, be
useful only if they are publicly available, so this can't be sold to
anyone as a product.  However, I have also written a large body of C
code to handle the conversion between any two character sets based on these
character set declarations, and to read an (or any, actually) ISO 2022
conformant data stream and return ISO 10646 characters to the
application, as well as to accept ISO 10646 characters from an
application and write an ISO 2022 conformant data stream under a few
user-specified constraints.  This should probably also be made available
on the same condition as the scheme and the character set declarations.

Please contact me directly for further information.

Best regards,
</Erik>
--
Erik Naggum             |  ISO  8879 SGML     |      +47 295 0313
                        |  ISO 10744 HyTime   |  
<erik@naggum.no>        |  ISO 10646 UCS      |      Memento, terrigena.
<enag@ifi.uio.no>       |  ISO  9899 C        |      Memento, vita brevis.