From: keld@dkuug.dk (Keld Jørn Simonsen)
Date: Fri, 9 Feb 1996 13:50:50 +0100
To: sc22wg11
Subject: WG20 N423: Guidelines on character data types

I would like WG11 to comment on the following paper, which is scheduled to be included in TR 10176. Willem, please give it a WG11 document number and give it time on the København agenda.

Keld

---

SC22/WG20 N423

Guidelines on Character Data Types in Programming Language Design

Character data type support should be provided at three levels: the abstract character level, the encoded character level, and the text level (combining sequences). The order of the three levels reflects the importance WG20 attaches to each in programming language design. A set of APIs needs to be defined to transform between the three levels.

Abstract Character Level

The programming language standard should facilitate a character data type design that is independent of the actual encoding of characters. This abstract character level should be the main form of the national character data type in a programming language, as it facilitates the portability of application programs across platforms. This level corresponds to the term "character" in SC2. The specification of the encoding should be hidden from and transparent to application programs; the encoding is thus implementation-defined. Each character is represented in exactly one integral unit, so indexing into a character array is permissible.
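
As a rough illustration of the abstract character level, the sketch below uses Python's built-in str type as a stand-in for an abstract character data type. This is an assumption made for illustration only; the paper does not name any language. The function name describe is hypothetical, and the codec "utf-16-be" is used to approximate UCS-2, which holds only for characters in the Basic Multilingual Plane.

```python
def describe(text: str) -> dict:
    """Report character-level and octet-level views of the same text."""
    return {
        # Abstract level: one character per index, whatever the encoding.
        "characters": len(text),
        "second_character": text[1],
        # Encoded level: the octet count depends on the coded character set.
        "octets_iso_8859_1": len(text.encode("iso-8859-1")),
        "octets_utf_8": len(text.encode("utf-8")),
        # Assumption: utf-16-be approximates UCS-2 for BMP characters.
        "octets_ucs_2": len(text.encode("utf-16-be")),
    }


# "\u00f8" is the precomposed letter o with stroke, as in "København".
info = describe("K\u00f8benhavn")
```

Indexing yields the same second character whichever encoding is later chosen, while the storage size varies from 9 to 18 octets.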

Encoded Character Level

The programming language standard may provide a data type to support the encoded character level, where the storage requirement of the encoding of the abstract characters is known. This level corresponds to the term "coded character" in SC2. This form can be used to meet explicit storage requirements and is useful when programming with multiple coded character sets. One multioctet data type is sufficient, where the storage requirement is not determined by the string termination delimiter defined by the programming language. The actual data storage can cater for the coded character sets supported by the implementation and for a set of ISO character sets, classified by the nature of the coded character set: single byte single octet (e.g. ISO 8859-1), single byte multi-octet (e.g. ISO 10646 UCS-2), multi-byte single octet (e.g. ISO 10646 UTF-8), and multi-byte multi-octet (e.g. ISO 10646 UTF-16). The encoding and data storage may be defined by POSIX charmap definitions or by other means. Only one set of multioctet APIs needs to be defined.

For example, a data definition could be written as the following pseudo statement:

    encoded "UCS-2" character_string = (encoded "UCS-2") "literal string"

Here "encoded" is a data type, and "UCS-2" is a reference to the actual data storage requirement, defined in the charmap or by the standards, to be found in the file system or similar places. A list of reserved keywords for the coded character sets based on ISO SC2/WG3 and ISO SC2/WG2 should be provided to ensure minimum portability among the ISO character sets. Once the data type with a referenced storage requirement (such as "encoded" and "UCS-2") is specified, the programming language standard should provide the necessary code conversion between the machine coded character set and the storage coded character set.
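
The pseudo statement above could be approximated along the following lines. This is a minimal sketch, not a proposed API: the class Encoded, its methods, and the CHARMAPS table are all hypothetical names introduced here. The table stands in for the charmap lookup the paper describes, and mapping "UCS-2" to Python's "utf-16-be" codec is an assumption that is valid only for characters in the Basic Multilingual Plane.

```python
from dataclasses import dataclass

# Hypothetical stand-in for charmap lookups: reserved coded character set
# names mapped to Python codec names. "UCS-2" -> "utf-16-be" is an
# approximation that holds only for BMP characters.
CHARMAPS = {
    "ISO 8859-1": "iso-8859-1",
    "UCS-2": "utf-16-be",
    "UTF-8": "utf-8",
}


@dataclass(frozen=True)
class Encoded:
    """Octets tagged with the coded character set that defines their storage."""
    charset: str
    octets: bytes

    @classmethod
    def from_text(cls, charset: str, text: str) -> "Encoded":
        # Abstract characters -> encoded characters.
        return cls(charset, text.encode(CHARMAPS[charset]))

    def to_text(self) -> str:
        # Encoded characters -> abstract characters.
        return self.octets.decode(CHARMAPS[self.charset])

    def convert(self, charset: str) -> "Encoded":
        # Code conversion between two coded character sets,
        # routed through the abstract character level.
        return Encoded.from_text(charset, self.to_text())


# Rough equivalent of:
#   encoded "UCS-2" character_string = (encoded "UCS-2") "literal string"
character_string = Encoded.from_text("UCS-2", "literal string")
as_latin1 = character_string.convert("ISO 8859-1")
```

The storage size is carried by the octet sequence itself (28 octets for the UCS-2 value, 14 after conversion), so it does not depend on any string termination delimiter.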

Text Level

The programming language standard may provide data type support for the text level, which corresponds to the term "combining sequences" in SC2. The behavior of this level is currently not well developed, and it is difficult to advise on the required functionality.

Literal

Programming language designs need to provide literals. For internationalization, literals should be designed at the abstract character level. For example, the literal "This is a literal" should be of the abstract data type, and the encoding of the literal string should be determined by the runtime codeset, not by the codeset in effect at compilation time. How such literal strings are transformed into the appropriate data type may depend on the actual language specification. For languages without such a specific requirement, the encoded representation can be specified explicitly. For example, in a pseudo programming language: (encoded "UCS-2" "This is a literal").
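
The rule that a literal takes its encoding from the runtime codeset might be sketched as follows. The names LITERAL, runtime_codeset and materialize are hypothetical, and Python's str again stands in for the abstract data type. The standard library call locale.getpreferredencoding is used as one plausible way to find the runtime codeset; the paper does not prescribe any particular mechanism.

```python
import locale
from typing import Optional

# The literal is held at the abstract character level; no encoding is
# fixed when the program text is compiled.
LITERAL = "This is a literal"


def runtime_codeset() -> str:
    """Return the codeset of the running process (an assumed mechanism)."""
    return locale.getpreferredencoding(False)


def materialize(literal: str, codeset: Optional[str] = None) -> bytes:
    """Encode an abstract literal, defaulting to the runtime codeset."""
    return literal.encode(codeset or runtime_codeset())


# Implicit form: the encoding is decided when the program runs.
runtime_octets = materialize(LITERAL)

# Explicit form, roughly (encoded "UCS-2" "This is a literal");
# utf-16-be approximates UCS-2 for BMP characters only.
ucs2_octets = materialize(LITERAL, "utf-16-be")
latin1_octets = materialize(LITERAL, "iso-8859-1")
```

The same literal yields 17 octets in ISO 8859-1 and 34 octets in the UCS-2 approximation, while the source text of the program stays unchanged.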