JTC1/SC22/WG14 N998

 

 

 

 

 

Extensions for the programming language C to support new character data types

 

Suggestion for the Pre-Draft Version

 

Contents

 

1           Introduction 

2           General

2.1        Scope 

2.2        References 

3          The new character data types 

4          Encoding

5          String and character literals

5.1       String and character literal notations

5.2       The string concatenation 

6          Library functions

6.1       The mbtoc16 function 

6.2       The c16tomb function 

6.3       The mbtoc32 function 

6.4       The c32tomb function 

7          ANNEX A  Unicode encoding forms: UTF-16, UTF-32 


 

1          Introduction

The C language has matured over the last decades, yet its definition of a character has remained unchanged. Various code pages and multibyte libraries have been introduced in the past; however, the character data type in the C language has remained 8-bit based. Today, the introduction and success of the Unicode standard, and its implementation in modern programming languages, create ever-increasing demands on the C language to support Unicode better. This paper addresses the introduction of new character data types in the C language in order to support future character encoding forms, including Unicode.

The Unicode standard supports 3 encoding forms:

·        UTF-8

·        UTF-16

·        UTF-32

Each encoding form has advantages and disadvantages, so the choice of encoding form should be left to the application. Currently, some C applications implement UTF-8 using the char type, UTF-16 using unsigned short or wchar_t, and UTF-32 using unsigned int or wchar_t. The current situation, however, faces the following major problems:

·        The size of wchar_t is implementation-defined. While Unicode offers the possibility of writing platform-independent applications, wchar_t offers neither platform portability for C applications nor a platform-independent data format.

·        There is no string literal for 16- or 32-bit based integer types, but the Unicode encoding forms require string literals.   

It is sensible to give all the Unicode encoding forms appropriate data type support. UTF-8 is normally treated as one of the character sets for char. This paper therefore suggests the introduction of 16-bit and 32-bit based character data types: char16_t and char32_t. The new data types guarantee program portability through a clearly defined character width. The encoding of the new data types should be as generic as possible in order to support not only Unicode but also future character encodings.

In general, it is desirable that C applications process strings rather than individual characters and character arrays. This paper does not specify the details of library functions for the new data types, except for one set of character conversion functions.

@

2          General

 

2.1        Scope

 

This Technical Report specifies two character data types as extensions to the programming language C, as specified by the International Standard ISO/IEC 9899:1999.

 

2.2        References

 

The following standards contain provisions which, through reference in this text, constitute provisions of this Technical Report. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on this Technical Report are encouraged to investigate the possibility of applying the most recent editions of the normative documents indicated below. For undated references, the latest edition of the normative document referred to applies. Members of IEC and ISO maintain registers of currently valid International Standards.

 

ISO/IEC 9899:1999, Information technology – Programming languages, their environments and system software interfaces – Programming Language C.

 

3          The new character data types

 

This Technical Report introduces two new data types, char16_t and char32_t:

 

typedef uint_least16_t char16_t;

typedef uint_least32_t char32_t;

 

These data types guarantee a minimum width of 16 and 32 bits, respectively, whereas the width of wchar_t is implementation-defined. Their values are unsigned, whereas char may take signed values.
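As a minimal sketch of what these typedefs promise, the following C fragment repeats the two declarations on top of <stdint.h> and exposes the stated properties as small checks. The helper names char16_bits, char32_bits and new_char_types_are_unsigned are illustrative only and are not part of this proposal; the width computation assumes that the types have no padding bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* The two typedefs proposed in this section. */
typedef uint_least16_t char16_t;
typedef uint_least32_t char32_t;

/* Hypothetical helpers: storage width of each type in bits.
   Assumption: no padding bits, so sizeof * CHAR_BIT equals the width. */
static size_t char16_bits(void) { return sizeof(char16_t) * CHAR_BIT; }
static size_t char32_bits(void) { return sizeof(char32_t) * CHAR_BIT; }

/* Returns 1 when both types are unsigned: converting -1 to an
   unsigned type yields its largest value, which is greater than 0. */
static int new_char_types_are_unsigned(void)
{
    return (char16_t)-1 > 0 && (char32_t)-1 > 0;
}
```

On an implementation without an exact 16-bit integer type, char16_bits() may report more than 16; portable code should therefore rely only on the lower bound.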

4          Encoding

C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.

There shall be a macro __STDC_UTF_16__  to indicate that char16_t uses UTF-16. This also allows the use of UTF-16 in char16_t even if wchar_t uses a non-Unicode encoding. In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\Unnnnnnnn and \unnnn) because for these the conversion to UTF-16 is defined unambiguously. 

The encoding of char32_t can be defined in the same manner using __STDC_UTF_32__. 

The encoding of the new data types and of their string literals is implementation-defined when the corresponding macro __STDC_UTF_nn__ is not defined.
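The role of the macro can be sketched as follows, assuming a compiler that already accepts the u"..." literal form and universal character names. The names sample, char16_is_utf16 and sample_units are hypothetical. When __STDC_UTF_16__ is defined, the character U+1F600, which lies outside the Basic Multilingual Plane, is expected to occupy two char16_t code units (a surrogate pair); otherwise the content of the array is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t char16_t;  /* as proposed in section 3 */

/* U+1F600 written as a universal character name. */
static const char16_t sample[] = u"\U0001F600";

/* Returns 1 if the implementation promises UTF-16 for char16_t, else 0. */
static int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Number of char16_t code units in sample, excluding the terminating zero. */
static size_t sample_units(void)
{
    return sizeof sample / sizeof sample[0] - 1;
}
```

A program that depends on UTF-16 content would test char16_is_utf16(), or the macro directly, before interpreting the code units.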

5          String and character literals

 

5.1        String and character literal notations

The notation for char16_t string literals is derived by analogy with the wide string literal notation L"s-char-sequence_opt":

u"s-char-sequence_opt"

Such a literal is used to initialize an array of char16_t. The corresponding character constants are written

 

u'c-char-sequence'

and have the type char16_t. Accordingly, the string literals and character constants for char32_t are

 

U"s-char-sequence_opt" and

 

U'c-char-sequence'.
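The notations above might be used as in the following sketch, which assumes a compiler that accepts the u and U prefixes. The array names and the length helpers c16len and c32len are hypothetical; this Technical Report does not define string-handling functions for the new types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t char16_t;  /* as proposed in section 3 */
typedef uint_least32_t char32_t;

/* Arrays initialized from the new string literal forms. */
static const char16_t greeting16[] = u"abc";
static const char32_t greeting32[] = U"abc";

/* Objects initialized from the new character constant forms. */
static const char16_t first16 = u'a';
static const char32_t first32 = U'a';

/* Hypothetical strlen analogue: counts char16_t units before the zero. */
static size_t c16len(const char16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical strlen analogue: counts char32_t units before the zero. */
static size_t c32len(const char32_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```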

 

5.2        The string concatenation

The new string literal forms (u"str" and U"str") should follow the same concatenation rules as the existing L"str" strings: adjacent literals of the same form are concatenated into a literal of that form, and if one of the adjacent literals is a "narrow" string, the result is widened to the representation of the other literal. Some examples:

 

            u"a" u"b"  →  u"ab"        U"a" U"b"  →  U"ab"        L"a" L"b"  →  L"ab"

            u"a" "b"   →  u"ab"        U"a" "b"   →  U"ab"        L"a" "b"   →  L"ab"

            "a" u"b"   →  u"ab"        "a" U"b"   →  U"ab"        "a" L"b"   →  L"ab"

 

Any other concatenations are implementation-defined (they might or might not be supported).
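The supported combinations can be checked with a sketch like the one below, again assuming a compiler that accepts the new prefixes. The helper names concat16_matches and concat32_matches are hypothetical. Each concatenated form is compared with the corresponding single literal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint_least16_t char16_t;  /* as proposed in section 3 */
typedef uint_least32_t char32_t;

static const char16_t whole16[] = u"ab";
static const char16_t same16[]  = u"a" u"b";  /* same prefix on both parts */
static const char16_t right16[] = u"a" "b";   /* narrow literal on the right */
static const char16_t left16[]  = "a" u"b";   /* narrow literal on the left */

static const char32_t whole32[] = U"ab";
static const char32_t same32[]  = U"a" U"b";
static const char32_t right32[] = U"a" "b";
static const char32_t left32[]  = "a" U"b";

/* Returns 1 if every char16_t concatenation equals u"ab". */
static int concat16_matches(void)
{
    return sizeof same16 == sizeof whole16
        && sizeof right16 == sizeof whole16
        && sizeof left16 == sizeof whole16
        && memcmp(same16, whole16, sizeof whole16) == 0
        && memcmp(right16, whole16, sizeof whole16) == 0
        && memcmp(left16, whole16, sizeof whole16) == 0;
}

/* Returns 1 if every char32_t concatenation equals U"ab". */
static int concat32_matches(void)
{
    return sizeof same32 == sizeof whole32
        && sizeof right32 == sizeof whole32
        && sizeof left32 == sizeof whole32
        && memcmp(same32, whole32, sizeof whole32) == 0
        && memcmp(right32, whole32, sizeof whole32) == 0
        && memcmp(left32, whole32, sizeof whole32) == 0;
}
```

Combinations such as u"a" U"b" are deliberately left out, since they fall under the implementation-defined case.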

6          Library functions

 

In general, it is desirable to free C applications from character-based operations and to encourage string-based operations. The details of the library for the new character data types should be left to future enhancements of the C standard. This Technical Report specifies only the four minimal character conversions among the three character data types char, char16_t and char32_t.

 

6.1        The mbtoc16 function

 

Synopsis

#include <wchar.h>

size_t mbtoc16(
    char16_t * restrict pwc,
    const char * restrict s,
    size_t n);

 

Description

If s is not a null pointer, the mbtoc16 function determines the number of bytes that are contained in the multibyte character pointed to by s. It then determines the code for the value of type char16_t that corresponds to that multibyte character. (The value of the code corresponding to the null character is zero.) If the multibyte character is valid and pwc is not a null pointer, the mbtoc16 function stores the code in the object pointed to by pwc. At most n bytes of the array pointed to by s will be examined.

The implementation shall behave as if no library function calls the mbtoc16 function.

 

Returns

 

If s is a null pointer, the mbtoc16 function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the mbtoc16 function either returns 0 (if s points to the null character), or returns the number of bytes that are contained in the converted multibyte character (if the next n or fewer bytes form a valid multibyte character), or returns -1 (if they do not form a valid multibyte character). In no case will the value returned be greater than n or the value of the MB_CUR_MAX macro.
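The behaviour described above can be illustrated with a sketch under several assumptions that this Technical Report does not make: the multibyte encoding is taken to be UTF-8 (so there is no shift state), char16_t is taken to hold UTF-16, and the error value -1 is returned as (size_t)-1 because the return type is size_t. The name mbtoc16_sketch is hypothetical. Characters outside the Basic Multilingual Plane are rejected here, since a single char16_t cannot hold them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t char16_t;  /* as proposed in section 3 */

/* Hypothetical stand-in for mbtoc16; assumes the multibyte encoding is UTF-8. */
static size_t mbtoc16_sketch(char16_t *restrict pwc,
                             const char *restrict s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    uint_least32_t cp;
    size_t len, i;

    if (s == NULL)
        return 0;                 /* UTF-8 has no state-dependent encoding */
    if (n == 0)
        return (size_t)-1;        /* no bytes may be examined */

    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else
        return (size_t)-1;        /* invalid lead byte, or a 4-byte form
                                     that would need two char16_t units */

    if (len > n)
        return (size_t)-1;        /* character is cut off by the limit n */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return (size_t)-1;    /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms and surrogate code points. */
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return (size_t)-1;

    if (pwc != NULL)
        *pwc = (char16_t)cp;
    return cp == 0 ? 0 : len;     /* 0 for the null character */
}
```

Callers compare the result with (size_t)-1 before using it as a byte count.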

 

 

6.2        The c16tomb function

 

Synopsis

#include <wchar.h>

size_t c16tomb(
    char *s,
    char16_t wc);

Description

 

The c16tomb function determines the number of bytes needed to represent the multibyte character corresponding to the wide character given by wc (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s (if s is not a null pointer). At most MB_CUR_MAX characters are stored. If wc is a null wide character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state, and the function is left in the initial conversion state.

The implementation shall behave as if no library function calls the c16tomb function.

 

Returns

 

If s is a null pointer, the c16tomb function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the c16tomb function returns -1 if the value of wc does not correspond to a valid multibyte character, or returns the number of bytes that are contained in the multibyte character corresponding to the value of wc.
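A possible shape of this conversion is sketched below under assumptions that go beyond the text: the multibyte encoding is UTF-8, wc holds a UTF-16 code unit no larger than 0xFFFF, and -1 is returned as (size_t)-1. The name c16tomb_sketch is hypothetical. A lone surrogate code unit does not correspond to a character on its own, so the sketch treats it as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least16_t char16_t;  /* as proposed in section 3 */

/* Hypothetical stand-in for c16tomb; assumes UTF-8 output and that
   wc is a UTF-16 code unit in the range 0..0xFFFF. */
static size_t c16tomb_sketch(char *s, char16_t wc)
{
    unsigned char *p = (unsigned char *)s;

    if (s == NULL)
        return 0;                      /* UTF-8 has no shift states */
    if (wc >= 0xD800 && wc <= 0xDFFF)
        return (size_t)-1;             /* lone surrogate: no valid character */

    if (wc < 0x80) {                   /* one byte, includes the null character */
        p[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {                  /* two bytes: 110xxxxx 10xxxxxx */
        p[0] = (unsigned char)(0xC0 | (wc >> 6));
        p[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    p[0] = (unsigned char)(0xE0 | (wc >> 12));
    p[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    p[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}
```

The destination buffer needs room for at most three bytes in this sketch; a real implementation would be bounded by MB_CUR_MAX as stated above.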

6.3        The mbtoc32 function

Synopsis

#include <wchar.h>

size_t mbtoc32(
    char32_t * restrict pwc,
    const char * restrict s,
    size_t n);

 

Description

If s is not a null pointer, the mbtoc32 function determines the number of bytes that are contained in the multibyte character pointed to by s. It then determines the code for the value of type char32_t that corresponds to that multibyte character. (The value of the code corresponding to the null character is zero.) If the multibyte character is valid and pwc is not a null pointer, the mbtoc32 function stores the code in the object pointed to by pwc. At most n bytes of the array pointed to by s will be examined.

The implementation shall behave as if no library function calls the mbtoc32 function.

 

Returns

 

If s is a null pointer, the mbtoc32 function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the mbtoc32 function either returns 0 (if s points to the null character), or returns the number of bytes that are contained in the converted multibyte character (if the next n or fewer bytes form a valid multibyte character), or returns -1 (if they do not form a valid multibyte character). In no case will the value returned be greater than n or the value of the MB_CUR_MAX macro.
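Because the interface converts one character per call, a whole string is converted by calling it in a loop and advancing by the returned byte count. The sketch below shows that pattern. It assumes a UTF-8 multibyte encoding, UTF-32 values in char32_t, and (size_t)-1 as the error value; the names mbtoc32_sketch and utf8_to_c32_string are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint_least32_t char32_t;  /* as proposed in section 3 */

/* Hypothetical stand-in for mbtoc32; assumes the multibyte encoding is UTF-8. */
static size_t mbtoc32_sketch(char32_t *restrict pwc,
                             const char *restrict s, size_t n)
{
    /* Smallest code point that legitimately needs len bytes. */
    static const char32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    char32_t cp;
    size_t len, i;

    if (s == NULL) return 0;            /* stateless encoding */
    if (n == 0) return (size_t)-1;

    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return (size_t)-1;             /* invalid lead byte */

    if (len > n) return (size_t)-1;     /* cut off by the limit n */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return (size_t)-1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return (size_t)-1;

    if (pwc != NULL) *pwc = cp;
    return cp == 0 ? 0 : len;
}

/* Converts a null-terminated string into at most cap char32_t values.
   Returns the number of values stored, or (size_t)-1 on invalid input. */
static size_t utf8_to_c32_string(char32_t *dst, size_t cap, const char *src)
{
    size_t left = strlen(src), count = 0;

    while (left > 0 && count < cap) {
        size_t used = mbtoc32_sketch(&dst[count], src, left);
        if (used == (size_t)-1)
            return (size_t)-1;
        src += used;                    /* advance past the converted bytes */
        left -= used;
        count++;
    }
    return count;
}
```

The loop does not store a terminating zero; a caller that needs one reserves an extra element and writes it after the call.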

 

6.4        The c32tomb function

 

Synopsis

#include <wchar.h>

size_t c32tomb(
    char *s,
    char32_t wc);

Description

 

The c32tomb function determines the number of bytes needed to represent the multibyte character corresponding to the wide character given by wc (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s (if s is not a null pointer). At most MB_CUR_MAX characters are stored. If wc is a null wide character, a null byte is stored, preceded by any shift sequence needed to restore the initial shift state, and the function is left in the initial conversion state.

The implementation shall behave as if no library function calls the c32tomb function.

 

Returns

 

If s is a null pointer, the c32tomb function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the c32tomb function returns -1 if the value of wc does not correspond to a valid multibyte character, or returns the number of bytes that are contained in the multibyte character corresponding to the value of wc.
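The following sketch shows one way such a conversion could look, assuming that the multibyte encoding is UTF-8, that wc holds a UTF-32 value, and that -1 is returned as (size_t)-1. The name c32tomb_sketch is hypothetical. Values above U+10FFFF and surrogate code points are treated as having no corresponding multibyte character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least32_t char32_t;  /* as proposed in section 3 */

/* Hypothetical stand-in for c32tomb; assumes UTF-8 output, UTF-32 input. */
static size_t c32tomb_sketch(char *s, char32_t wc)
{
    /* Lead-byte marker bits for sequences of 2, 3 and 4 bytes. */
    static const unsigned char lead[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    unsigned char *p = (unsigned char *)s;
    size_t len, i;

    if (s == NULL)
        return 0;                       /* UTF-8 has no shift states */
    if (wc > 0x10FFFF || (wc >= 0xD800 && wc <= 0xDFFF))
        return (size_t)-1;              /* not a valid character */

    len = wc < 0x80 ? 1 : wc < 0x800 ? 2 : wc < 0x10000 ? 3 : 4;
    if (len == 1) {                     /* includes the null character */
        p[0] = (unsigned char)wc;
        return 1;
    }

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = len - 1; i > 0; i--) {
        p[i] = (unsigned char)(0x80 | (wc & 0x3F));
        wc >>= 6;
    }
    p[0] = (unsigned char)(lead[len] | wc);  /* remaining high bits */
    return len;
}
```

In this sketch the destination buffer needs room for up to four bytes.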


 

7          ANNEX A Unicode encoding forms: UTF-16, UTF-32

 

See Section 2.3 "Encoding Forms" and Section 3.8 "Transformations" in The Unicode Standard, Version 3.0. Addison Wesley, 2000.

 

Online Edition

Section 2.3 Encoding Forms

http://www.unicode.org/uni2book/ch02.pdf

 

Section 3.8 Transformations

http://www.unicode.org/uni2book/ch02.pdf

 

Technical Report

http://www.unicode.org/reports/tr19/