N836: No Extended Characters in External Identifiers

Edits

1. Section 6.1.2 "Identifiers", p 34 par 2: Remove footnote 20 attached to "Annex H", and remove the footnote itself.

2. Section 6.1.2 "Identifiers", p 34 par 6 (Implementation Limits): Change the second sentence from: "The implementation may further restrict the significance of an external name (an identifier that has external linkage) to 31 characters." to: "The implementation may further restrict the significance of an external name (an identifier that has external linkage) to 31 characters and restrict the characters used in such a name to members of the basic character set."

3. Annex K "Implementation-defined behavior", p 549, K.3.3 "Identifiers": Add a third bullet:
"-- Whether extended characters are permitted in an identifier with external linkage (6.1.2)."

Rationale

The C90 standard permitted implementations to restrict identifiers with external linkage to be monocase with only six characters of significance, reflecting the lower bounds of software tools in then-current use. Recognizing this as outdated, and in practice a burden for maximally portable programs, the committee drafted C9X to require case distinction and at least 31 characters of significance. These limits reflect the lower bounds of software environment tools in current industry practice, and they are adequate in practice (except for C++, discussed below).

The C9X change to allow the use of extended characters in identifiers is a significant step toward making the language more usable in non-Latin-alphabet cultures. However, it involves new technology that is not fully mature (witness the lively discussions on UCNs). While this is not a compelling reason not to go ahead with it in the language, it needs to be recognized that if extended characters appear in external identifiers, there is suddenly a requirement to inject this new technology into linkers, loaders, post-link optimizers, and any other programming environment tools that look at object code.

Footnote 20 offered a simple "kludge" to work around the problem. Following that kludge, each extended character in an external identifier would be represented by either six or ten basic characters in the object module. The allowed limit on identifier significance is expressed only in terms of characters, so as written the standard would require an implementation to support 31 extended characters, and using the kludge of footnote 20 this would be mapped to an identifier of either 186 (31 x 6) or 310 (31 x 10) basic characters within the object code. This is not usable on systems like OpenVMS that have a hard upper limit in the linker of 31 characters.
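The expansion described above can be illustrated with a minimal sketch. It is not the encoding from footnote 20: the `_u`/`_U` prefixes, the function names, and the assumption that basic characters are the ASCII range are all hypothetical, chosen only so that each extended character occupies six or ten basic characters as stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Basic characters needed for one source character.
   Assumption: code points below 0x80 are basic characters and pass through;
   anything else takes 6 characters (4-digit form) or 10 (8-digit form). */
size_t encoded_width(uint32_t cp)
{
    if (cp < 0x80u)
        return 1;
    return cp <= 0xFFFFu ? 6 : 10;
}

/* Encode n code points into out (capacity cap, including the terminator).
   The "_u"/"_U" prefixes are hypothetical stand-ins for whatever spelling
   an implementation would choose.
   Returns the encoded length, or 0 if out is too small (or n is 0). */
size_t encode_external_name(const uint32_t *cps, size_t n,
                            char *out, size_t cap)
{
    size_t len = 0;

    assert(cps != NULL && out != NULL);
    if (cap == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        size_t w = encoded_width(cps[i]);

        if (len + w + 1 > cap)      /* keep room for the terminator */
            return 0;
        if (w == 1)
            out[len] = (char)cps[i];
        else if (w == 6)
            snprintf(out + len, cap - len, "_u%04lX", (unsigned long)cps[i]);
        else
            snprintf(out + len, cap - len, "_U%08lX", (unsigned long)cps[i]);
        len += w;
    }
    out[len] = '\0';
    return len;
}
```

Under these assumptions, an identifier of 31 extended characters encodes to 186 or 310 basic characters, depending on which form each character needs. Either length is far beyond a 31-character linker limit.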

A kludge that could be used to address the problem would be to shorten such names using a checksum. This is the approach used to accommodate C++ mangled names on OpenVMS. But C++ is not C. C++ external names have traditionally involved an algorithmic mapping from the identifier spelling used in the source code to the spelling used in object code, and so a name-shortening algorithm is not really a new burden for C++ programmers. But it would be for C programmers.

Perhaps more importantly, what the users of extended characters really want is for them to be fully supported everywhere. There is no reason, other than time, why the linkers, loaders, and other object code tools could not directly support extended character identifiers, and do so much more efficiently and with better system integration (e.g. for display) than the kludge-upon-kludge approach. It takes substantially more time to work this kind of change throughout all the tools and infrastructure of a mature operating system than it does to do it within a limited set such as a compiler and debugger. If C9X requires support for this not-fully-mature feature within external names, vendors wishing to support the standard as soon as possible will be forced to use a kludge like this. They will then have to carry the kludge around forever (for binary compatibility), even after they have built fully integrated support into the rest of the system.
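For illustration only, a checksum-based shortening of the kind mentioned above might look like the following sketch. It is not the OpenVMS algorithm: the 23-character prefix, the 8 hexadecimal digit suffix, the use of the 32-bit FNV-1a hash, and the names `fnv1a32` and `shorten_external_name` are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINKER_NAME_MAX 31   /* the hard linker limit cited in the text */
#define CHECKSUM_DIGITS 8    /* hypothetical: 32-bit checksum in hex */

/* 32-bit FNV-1a hash, used here only as a stand-in checksum. */
uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */

    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Copy name into out, shortening it if it exceeds LINKER_NAME_MAX.
   out must have room for LINKER_NAME_MAX + 1 bytes.
   Long names keep their first 23 characters and gain an 8-digit
   checksum of the full spelling. Returns the resulting length. */
size_t shorten_external_name(const char *name, char *out)
{
    size_t len, keep;

    assert(name != NULL && out != NULL);
    len = strlen(name);
    if (len <= LINKER_NAME_MAX) {
        memcpy(out, name, len + 1);    /* short enough: unchanged */
        return len;
    }
    keep = LINKER_NAME_MAX - CHECKSUM_DIGITS;
    memcpy(out, name, keep);
    snprintf(out + keep, CHECKSUM_DIGITS + 1, "%08lX",
             (unsigned long)fnv1a32(name));
    return LINKER_NAME_MAX;
}
```

The name seen by the linker, and by any tool that displays object code symbols, no longer matches the spelling in the source. Distinct long names can also collide in principle. Both effects are part of the new burden on C programmers described above.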

The C language has survived quite well even with its 6-character monocase limit. C9X has made a huge improvement there by requiring 31-character, case-distinct names. Let's let the kinks get worked out of extended-character identifiers within the confines of the compiler and debugger, without imposing such a sudden, huge capacity leap on other tools that only a horrible kludge can make it work in reasonable market time. Let customers who are not satisfied with the limitation demand support from their vendors, rather than requiring it in the standard on day one.