ISO/IEC JTC 1/SC22
Programming languages, their environments and system software interfaces
Secretariat: U.S.A. (ANSI)

ISO/IEC JTC 1/SC22 N3758

TITLE: SC 22/WG 9 Request for National Body Contributions on Implementation of Coded Character Sets in Ada
DATE ASSIGNED: 2004-07-08
SOURCE: SC 22/WG 9 Convenor (J. Moore)
BACKWARD POINTER:
DOCUMENT TYPE: Request for Comments
PROJECT NUMBER:
STATUS: This document is circulated to SC 22 National Bodies for a sixty-day comment period. National Bodies are requested to send any comments to the SC 22 Secretariat (sseitz@ansi.org) by 6 September 2004.
ACTION IDENTIFIER: COM
DUE DATE: 2004-09-06
DISTRIBUTION: TEXT
CROSS REFERENCE:
DISTRIBUTION FORM: Def

Address reply to:
ISO/IEC JTC 1/SC22 Secretariat
Sally Seitz
ANSI
25 West 43rd Street
New York, NY 10036
Telephone: (212) 642-4918
Fax: (212) 840-2298
Email: sseitz@ansi.org

_____________END OF COVER PAGE, BEGINNING OF DOCUMENT______________

8 July 2004

To: Secretariat of ISO/IEC JTC 1/SC 22
cc: Chairman of ISO/IEC JTC 1/SC 22
Subject: Implementation of Coded Character Sets in Ada

On behalf of SC 22/WG 9, I request that the appended document be circulated to SC 22 member bodies well in advance of the forthcoming plenary meeting. This document is provided in support of the following resolution which is proposed for consideration at the plenary meeting:

"Resolution 04-xx: Implementation of Coded Character Sets in the Ada Language
JTC 1/SC 22 agrees that the proposed implementation of coded character set support described in document 22N3758 agrees with the principles for coded character set support previously adopted by JTC 1/SC 22, notably resolution 02-24."

Warmest regards,
James W. Moore
Convener, ISO/IEC JTC 1/SC 22/WG 9

= = =

Document ISO/IEC JTC 1/SC 22 N 3758

Implementation of Coded Character Sets in Ada

Contributed by James W. Moore, Convener of ISO/IEC JTC 1/SC 22/WG 9, 4 July 2004 at the direction of SC 22/WG 9.
In September 2002, SC 22 adopted the following resolution at its plenary meeting:

"Resolution 02-24: Recommendation on Coded Character Sets Support
JTC 1/SC 22 believes that programming languages should offer the appropriate support for ISO/IEC 10646, and the Unicode character set where appropriate."

WG9 is currently preparing an amendment to the Ada Language Standard, ISO/IEC 8652:1995. The amendment will include changes in coded character support. Because the effect of changes in this area will pervade the entire standard and affect the treatment of other issues in the amendment, it is important to understand whether the proposed approach complies with the direction of SC22. Therefore, WG9 requests that the member bodies of SC22 review the approach described in this paper and provide comments to WG9. Furthermore, WG9 requests that the following resolution be approved at the forthcoming plenary meeting of SC22:

"Resolution 04-xx: Implementation of Coded Character Sets in the Ada Language
JTC 1/SC 22 agrees that the proposed Ada implementation of coded character set support described by SC 22/WG 9 in document 22 N 3758 agrees with the principles for coded character set support previously adopted by JTC 1/SC 22, notably resolution 02-24."

Approach Proposed by SC 22/WG 9

[Note: This description is written at the level of principles. For those familiar with Ada, the details of the proposed approach can be found in "AI-285", which can be obtained from the convener of WG9, James Moore, James.W.Moore@ieee.org.]

This proposal is based on ISO/IEC 10646:2003. While this proposal contains references to Unicode, the amendment text will be carefully phrased to avoid such mentions.

The essence of this proposal is to allow the source of the program to be written using 16-bit characters (from the BMP) or 32-bit characters. Also, it makes it possible to operate on 32-bit characters at run-time.
The main difficulty in supporting characters beyond Row 00 of the BMP in the program text is to define how identifiers and literals are built (which characters are letters, digits, etc.) and to define the lower/upper case equivalence rules. Fortunately, the implementation properties and algorithms referenced by 10646 will do most of the work for us.

Unicode defines a "character database" which describes all the properties of each character. The most important property for our purposes is the "General Category". General categories are disjoint. The following categories are of interest for describing Ada program text:

* Letter, Uppercase -- e.g., LATIN CAPITAL LETTER A
* Letter, Lowercase -- e.g., LATIN SMALL LETTER A
* Letter, Titlecase -- e.g., LATIN CAPITAL LETTER L WITH SMALL LETTER J
* Letter, Modifier -- e.g., MODIFIER LETTER APOSTROPHE
* Letter, Other -- e.g., HEBREW LETTER ALEF
* Mark, Non-Spacing -- e.g., COMBINING GRAVE ACCENT
* Mark, Spacing Combining -- e.g., MUSICAL SYMBOL COMBINING AUGMENTATION DOT
* Number, Decimal Digit -- e.g., DIGIT ZERO
* Number, Letter -- e.g., ROMAN NUMERAL TWO
* Other, Control -- e.g., NULL
* Other, Format -- e.g., ACTIVATE ARABIC FORM SHAPING
* Other, Private Use -- e.g.,
* Other, Surrogate -- e.g.,
* Punctuation, Connector -- e.g., LOW LINE
* Separator, Space -- e.g., SPACE
* Separator, Line -- e.g., LINE SEPARATOR
* Separator, Paragraph -- e.g., PARAGRAPH SEPARATOR

(See http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#General_Category_Values for details on the categorization.)

In the definition of the lexical elements of Ada, we define a non-terminal of the grammar for each of the above categories, e.g., letter_uppercase, letter_lowercase, etc. The characters in the category other_format are effectively ignored in most lexical elements, with the exception that they are illegal in string_literals and character_literals.

Throughout the syntax rules, we specify which characters are allowed for the lexical elements.
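The general categories listed above can be inspected with any library that exposes the Unicode character database. The following is a minimal sketch in Python (not Ada), using the standard `unicodedata` module. It assumes a Unicode version newer than 4.0, which should not matter for the sample characters chosen. The non-terminal names for categories that this paper does not spell out (for example `other_control` and `separator_space`) are assumptions that follow the naming pattern of `letter_uppercase`.

```python
import unicodedata

# Two-letter General Category values mapped to grammar non-terminal names.
# Names not spelled out in the paper are assumed by analogy.
NONTERMINAL = {
    "Lu": "letter_uppercase",
    "Ll": "letter_lowercase",
    "Lt": "letter_titlecase",
    "Lm": "letter_modifier",
    "Lo": "letter_other",
    "Mn": "mark_non_spacing",
    "Mc": "mark_spacing_combining",
    "Nd": "number_decimal_digit",
    "Nl": "number_letter",
    "Cc": "other_control",        # assumed name
    "Cf": "other_format",
    "Co": "other_private_use",    # assumed name
    "Cs": "other_surrogate",      # assumed name
    "Pc": "punctuation_connector",
    "Zs": "separator_space",      # assumed name
    "Zl": "separator_line",       # assumed name
    "Zp": "separator_paragraph",  # assumed name
}

# One sample code point per category, following the examples in the list.
# The private use and surrogate samples are illustrative choices.
SAMPLES = {
    0x0041: "Lu",   # LATIN CAPITAL LETTER A
    0x0061: "Ll",   # LATIN SMALL LETTER A
    0x01C8: "Lt",   # LATIN CAPITAL LETTER L WITH SMALL LETTER J
    0x02BC: "Lm",   # MODIFIER LETTER APOSTROPHE
    0x05D0: "Lo",   # HEBREW LETTER ALEF
    0x0300: "Mn",   # COMBINING GRAVE ACCENT
    0x1D16D: "Mc",  # MUSICAL SYMBOL COMBINING AUGMENTATION DOT
    0x0030: "Nd",   # DIGIT ZERO
    0x2161: "Nl",   # ROMAN NUMERAL TWO
    0x0000: "Cc",   # NULL
    0x206D: "Cf",   # ACTIVATE ARABIC FORM SHAPING
    0xE000: "Co",   # first private use code point in the BMP
    0xD800: "Cs",   # first surrogate code point
    0x005F: "Pc",   # LOW LINE
    0x0020: "Zs",   # SPACE
    0x2028: "Zl",   # LINE SEPARATOR
    0x2029: "Zp",   # PARAGRAPH SEPARATOR
}


def nonterminal_for(code_point: int) -> str:
    """Return the non-terminal name for the character's general category."""
    category = unicodedata.category(chr(code_point))
    return NONTERMINAL.get(category, "(not listed)")


if __name__ == "__main__":
    for cp in SAMPLES:
        print(f"U+{cp:04X}  {nonterminal_for(cp)}")
```

Because general categories are disjoint, each code point maps to exactly one non-terminal, which is what makes a table-driven lexer of this kind possible.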
For instance, the E in the exponent part of a numeric literal may not be a "GREEK CAPITAL LETTER EPSILON", even though a capital E and a capital epsilon look very much the same. Similar considerations apply to the extended digits, the point, etc.

Unicode proposes to define identifiers for programming languages as follows (see annex 7 of UAX #15 at http://www.unicode.org/reports/tr15/tr15-23.html#Programming_Language_Identifiers):

   identifier ::= identifier_start {identifier_start | identifier_extend}

   identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter

   identifier_extend ::= mark_non_spacing | mark_spacing_combining | number_decimal_digit | punctuation_connector | other_format

This definition was made with C in mind, and is not exactly appropriate for Ada, as it would allow consecutive underlines. Because the underline is the only character of Row 00 of the BMP which is a punctuation_connector, it seems sensible to remain close to the existing Ada syntax rules, and to use the following definitions:

   identifier_start ::= letter_uppercase | letter_lowercase | letter_titlecase | letter_modifier | letter_other | number_letter

   identifier_extend ::= identifier_letter | mark_non_spacing | mark_spacing_combining | number_decimal_digit | other_format

   identifier ::= identifier_start {[punctuation_connector] identifier_extend}

Unicode recommends that, before storing or comparing identifiers, the following transformations be applied:

* Characters in category other_format are filtered out.
* For languages which have case insensitive identifiers, Normalization Form KC is applied (see http://www.unicode.org/reports/tr15/tr15-23.html#Specification). This is to ensure that identifiers which look visually the same are considered as identical, even if they are composed of different characters.
* _Full_ case folding, as described in the table http://www.unicode.org/Public/4.0-Update/CaseFolding-4.0.0.txt, is used to find the uppercase version of each character.

We decided not to apply Normalization Form KC, as there seems to be insufficient experience with using normalization forms. Without normalization, texts that look alike don't have the same meaning; with normalization, widely available text tools like grep, awk, etc. don't work. We follow the lead of C# (ECMA-334) in specifying that a program which is not in Normalization Form KC has an implementation-defined effect. This ensures that a program text which is normalized is portable. It also allows an implementation to provide useful support for non-normalized texts if appropriate in a particular computing environment (in that case, the implementation must document how it handles such texts).

Unicode doesn't provide guidance for the composition of numeric literals, so we don't change them. The use of the digits at positions 16#30# to 16#39# is universal in information technology, and allowing digits from other cultures could cause confusion while bringing little benefit.

The definition and role of format_effectors is modified to include the characters at positions 16#85#, 16#2028# and 16#2029#. These characters may be used to terminate lines, as recommended by section 5.8 of Unicode 4.0 (see http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G10213).

Note that characters in category other_format are forbidden in character_literals and string_literals, because their sole purpose is to affect the presentation of characters. If a program needs to operate on these characters, it can do that by using Wide_Wide_Character'Val (...).

Private use characters are not considered to be graphic characters (even though for some applications they may actually turn out to be graphic).
The reason is that we wouldn't be able to define the case folding rules for these characters, so it seems better to disallow them, except in comments where they cannot do any harm.

In order to represent 32-bit characters at run-time, we add new declarations and new predefined packages for 32-bit characters. These packages are similar to their Wide_ equivalents, with Wide_Wide_ substituted for Wide_ everywhere. In addition, a declaration provides a character set containing every Wide_Wide_Character value in the BMP of ISO/IEC 10646:2003. We also add attributes permitting conversion between the images and values of Wide_Wide_Character.

Note that the dynamic semantics of a number of operations are defined in terms of "space" and "blank". A space is the character at position 16#20# and a blank is either a space or a horizontal tabulation. We are not changing the definition of space or blank, so characters like NO-BREAK SPACE or IDEOGRAPHIC SPACE are not considered to be space or blank in this context of Text_IO.

SC22/WG14 is considering the inclusion of support for Unicode 16- and 32-bit characters in C. Their proposal is presented in ISO/IEC TR 19769 (http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n1040.pdf). At the time of this writing, this technical report is in the FDIS ballot stage. In order to provide compatibility with the upcoming C standard, new types are added to Interfaces.C that correspond to C char16_t and char32_t. It is recognized that adding new declarations to predefined units can cause incompatibilities, but it is thought that the new identifiers are unlikely to conflict with existing code.
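The identifier rules proposed earlier in this paper can be sketched as follows. This is an illustration in Python (not Ada) under stated assumptions, not the amendment's wording. It assumes that identifier_letter in the definition of identifier_extend stands for the identifier_start characters. The function names are hypothetical. Python's `str.casefold` applies full case folding from a Unicode version newer than 4.0 and yields a folded form rather than an uppercase one, which is equivalent for comparison purposes.

```python
import unicodedata

# General categories allowed by the proposed grammar.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}   # identifier_start
EXTEND = START | {"Mn", "Mc", "Nd", "Cf"}      # identifier_extend (assumed reading)
CONNECTOR = "Pc"                                # punctuation_connector


def is_identifier(text: str) -> bool:
    """Check: identifier ::= identifier_start {[punctuation_connector] identifier_extend}."""
    if not text or unicodedata.category(text[0]) not in START:
        return False
    previous_was_connector = False
    for ch in text[1:]:
        category = unicodedata.category(ch)
        if category == CONNECTOR:
            if previous_was_connector:
                return False          # no consecutive connectors
            previous_was_connector = True
        elif category in EXTEND:
            previous_was_connector = False
        else:
            return False
    return not previous_was_connector  # a connector must be followed by a character


def comparison_key(identifier: str) -> str:
    """Drop other_format characters, then apply full case folding."""
    kept = "".join(ch for ch in identifier if unicodedata.category(ch) != "Cf")
    return kept.casefold()


def is_nfkc(text: str) -> bool:
    """True if the text is already in Normalization Form KC."""
    return unicodedata.normalize("NFKC", text) == text


if __name__ == "__main__":
    for candidate in ["Total_Count", "Total__Count", "_x", "x_", "\u0394x"]:
        print(repr(candidate), is_identifier(candidate))
```

Note that `comparison_key` deliberately does not normalize. That mirrors the decision described above: a program text that is not in Normalization Form KC has an implementation-defined effect, so a tool could use a check like `is_nfkc` to report such text instead of silently rewriting it.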