From: Kenneth Whistler [kenw@sybase.com]
Sent: Thursday, August 08, 2002 4:19 PM
To: frank@farance.com
Cc: kenw@sybase.com; keld@dkuug.dk; tplum@plumhall.com; jb@benito.com;
Winkler, Arnold F; nwallace@us.ibm.com; John.Hill@eng.sun.com;
rex@RexJaeschke.com; nobuyoshi.mori@sap.com;
Don.Schricker@microfocus.com; willemw@ace.nl; asmusf@ix.netcom.com;
mark.davis@jtcsv.com
Subject: Re: Agenda for Character set ad-hoc - 26th August

Frank responded:

> My concern is that SC22 have a voice in these "identifier characters".

In principle, that is fine with me. SC22 committees *should* participate
in this discussion.
 
>  I want to make sure that the people choosing the identifier characters 
> have an understanding and involvement in programming language standards. 

Yes -- an admirable goal.

But it should be countered with its equal and opposite: that people
developing the programming language standards should have an
understanding and involvement in the character encoding standards.
 
> WG20 seems to be the place where this happens. 

No, it is not.

One might like to *believe* that is where that happens, but no
significant interactions or developments on this topic have been
taking place in that forum.

The significant interactions, in my experience, have been between
the developers of Java, ECMAScript, C#, and XML, and the architects
of character property definitions in the Unicode Technical Committee.

> Sure it is possible to have some Unicode people in individual SC22 WGs,
>  but I'd rather see a consistent SC22 perspective on this, i.e., 
> representation of SC22's issues (not necessarily Unicode's table).  

Then I would think a reasonable thing to do would be to summarize
SC22's issues and bring them into the Unicode Technical Committee for
discussion. That's what W3C does when there are mutual concerns
about issues such as identifiers.

The way you put this smacks of standards turf defense and wishful
thinking. It might seem tidier for this concern of SC22 programming
language committees to be "controlled" within an SC22 committee --
namely WG20. But the issue is not controlled by SC22 -- there are
other players out there creating realities, and it behooves the ISO
programming language standardizers to interact with them, so that
their programming language concerns are represented in those
discussions.

> I don't believe that Unicode will be able to address the programming 
> language standardization aspect of these characters (similar to the 
> concerns of XML, ASN.1, SQL, etc.).

I don't think anyone in the UTC expects that to be the forum that
would establish the *particular* identifier syntax rules that
apply in C or C++ or COBOL (or XML, ASN.1, SQL, or anything else). That
is a concern of each relevant standardizer. What the UTC establishes
are consistent, extensible rules for Identifier_Start and Identifier_Extend
properties for all Unicode characters. Those rules can be adapted
and customized, as required for particular formal syntaxes. But
what the formal language committees should not be doing is pawing
through 94,000+ Unicode characters trying to establish all their
properties and sorting them into categories for Identifier_Start
and Identifier_Extend classes. *That* is the expertise of the UTC.
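
To make that division of labor concrete, here is a minimal Python
sketch. It is not the UTC's actual data or algorithm: it merely
approximates the two classes from General_Category values, using the
standard unicodedata module, and lets a language layer its own
extensions on top. The category sets, the function name is_identifier,
and the parameters extra_start and extra_extend are all assumptions
made for illustration.

```python
import unicodedata

# Assumption: approximate the generic classes from General_Category values.
# The real property data is maintained by the UTC and is more detailed.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}
CONTINUE_CATEGORIES = START_CATEGORIES | EXTEND_CATEGORIES


def is_identifier(text, extra_start="", extra_extend=""):
    """Check text against the generic rules plus per-language extras.

    extra_start and extra_extend are hypothetical hooks through which a
    particular language adds characters such as "_" or "-".
    """
    if not text:
        return False
    first, rest = text[0], text[1:]
    # The first character must be a "start" character or a language extra.
    if not (unicodedata.category(first) in START_CATEGORIES
            or first in extra_start):
        return False
    # Every later character may be a start or extend character, or an extra.
    return all(
        unicodedata.category(ch) in CONTINUE_CATEGORIES
        or ch in extra_start
        or ch in extra_extend
        for ch in rest
    )
```

Under these assumptions, is_identifier("_tmp") is rejected by default
and accepted once a language passes extra_start="_"; likewise "a-b" is
accepted only with extra_extend="-". The generic table does the bulk
of the work, and each language adjusts only its own edges.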

> It seems fine to me for Unicode to submit a contribution and for SC22 
> to have some review (SC32 would like review, too), but to merely point 
> to a Unicode table without SC22's review does not serve the purpose of 
> SC22's programming languages.

No, you point to the Unicode table(s), and then review the particular
extensions, limitations, or customizations (treatment of "_", "-", "@",
other syntax characters, whatever) that apply to your particular
standards. And you consider case sensitivity. And you consider
interoperability issues for identifiers which may be used across
formal languages. And you consider stability issues for identifiers
across versions of Unicode and of your language standards.

But don't expect the Unicode Consortium to submit all 21+ primary
data files of the Unicode Character Database (including the 26 megabyte
Unihan.txt) to SC22 for review in that context. The review and updating
of those data files takes place in the *UTC* context, by the open
process that the UTC has established for timely, coordinated releases
of those data files -- now needed and required by many, many implementations
of the Unicode Standard. SC22 participants are welcome to participate
in that review and development, along with everybody else -- just don't
expect to turn it into a process that *lives* in SC22 committees
and works by ISO balloting.

> 
> Regarding the maintainability of this, there's always the possibility 
> of using a registry (a tried and true method) that would satisfy many 
> concerns.  Do you still have objections to use of a registry?

Absolutely. A registry is an absolutely crazy way to try to maintain
a large database of inter-related properties for 94,000+ characters.

The Unicode Technical Committee has a tried and true method for doing
this that is now in practical use by hundreds of major implementations.

Regards,

--Ken