Issue 0488: c16rtomb() on wide characters encoded as multiple char16_t

This issue has been automatically converted from the original issue lists and some formatting may not have been preserved.

Authors: WG14, Philipp Klaus Krause
Date: 2015-12-09
Reference document: N1991
Submitted against: C11 / C17
Status: Fixed
Fixed in: C23
Converted from: n2396.htm

Summary

Section 7.28.1 describes the function c16rtomb(). In particular, it states "When c16 is not a valid wide character, an encoding error occurs". "wide character" is defined in section 3.7.3 as "value representable by an object of type wchar_t, capable of representing any character in the current locale". This wording seems to imply that, e.g. for the common cases (e.g, an implementation that defines __STDC_UTF_16__ and a program that uses an UTF-8 locale), c16rtomb() will return -1 when it encounters a character that is encoded as multiple char16_t (for UTF-16 a wide character can be encoded as a surrogate pair consisting of two char16_t). In particular, c16rtomb() will not be able to process strings generated by mbrtoc16().

I would like to implement a standard-conforming c16rtomb() for SDCC, that allows conversion from all of UTF-16 (not just the basic multilingual plane) to UTF-8. It seems to me that this is currently not possible.

On the other hand, the description of mbrtoc16() described in section 7.28.1 states "If the function determines that the next multibyte character is complete and valid, it determines the values of the corresponding wide characters". So it considers it possible that a single multibyte character translates into multiple wide characters. So maybe the meaning of "wide character" in section 7.28.1 is different from definition of "wide character" in section 3.7.3.

In either case, the intended behaviour of c16rtomb() for characters encoded as multiple char16_t seems unclear. The issue has been discussed in the thread "A function to convert char16_t strings to char strings" in comp.std.c.

Suggested Change

I see two possible options:


Comment from WG14 on 2018-10-18:

Apr 2016 meeting

Committee Discussion

After discussion, the committee concluded that mbstate was already specified to handle this case, and as such the second interpretation is intended. The committee believes that there is an underspecification, and solicited a further paper from the author along the lines of the second option. Although not discussed a Suggested Technical Corrigendum can be found at N2040.

Oct 2016 meeting

Committee Discussion

The paper N2040. was discussed and found inadequate: it does not link the first call with the second as is intended by the standard.

Additional input was solicited and found in (SC22WG14.14481) DR488 Suggested Corrigendum and is repeated below:

In 7.28.1.2 paragraph 3, change:

If s is not a null pointer, the c16rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c16 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s.

to:

If s is not a null pointer, the c16rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given or completed by c16 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s, or stores nothing if c16 does not represent a complete character.

Apr 2017 meeting

Committee Discussion

The words discussed and reported in the last meeting were adopted.

Proposed Change

In 7.28.1.2 paragraph 3, change:

If s is not a null pointer, the c16rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c16 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s.

to:

If s is not a null pointer, the c16rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given or completed by c16 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s, or stores nothing if c16 does not represent a complete character.