N 1991: c16rtomb() on wide characters encoded as multiple char16_t


Submitter: Philipp Klaus Krause
Submission Date: 2015-12-09

Summary

Section 7.28.1 describes the function c16rtomb(). In particular, it states "When c16 is not a valid wide character, an encoding error occurs". "Wide character" is defined in section 3.7.3 as "value representable by an object of type wchar_t, capable of representing any character in the current locale". This wording seems to imply that in common cases (e.g., an implementation that defines __STDC_UTF_16__ and a program that uses a UTF-8 locale), c16rtomb() will return (size_t)(-1) when it encounters a character that is encoded as multiple char16_t (in UTF-16, a character can be encoded as a surrogate pair consisting of two char16_t, neither of which is a valid character on its own). In particular, c16rtomb() will not be able to process such strings when they are generated by mbrtoc16().
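To make the surrogate-pair case concrete, the following is a minimal sketch of how a code point above U+FFFF is split into two 16-bit units in UTF-16. The helper names to_surrogates() and is_surrogate() are hypothetical and used only for illustration; they are not part of the standard library, and uint16_t stands in for char16_t so that the sketch does not depend on <uchar.h>.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in U+10000..U+10FFFF into a
   UTF-16 surrogate pair. Returns 0 on success, -1 if cp is out of range. */
static int to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000u || cp > 0x10FFFFu)
        return -1;
    cp -= 0x10000u;                               /* 20 significant bits remain */
    *high = (uint16_t)(0xD800u + (cp >> 10));     /* upper 10 bits */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu));  /* lower 10 bits */
    return 0;
}

/* Hypothetical helper: nonzero if u lies in the surrogate range
   U+D800..U+DFFF, i.e. it does not denote a character by itself. */
static int is_surrogate(uint16_t u)
{
    return u >= 0xD800u && u <= 0xDFFFu;
}
```

For example, U+1F600 yields the pair 0xD83D, 0xDE00. Each unit on its own falls in the surrogate range, which is why a c16rtomb() that treats every single char16_t as a complete wide character would have to report an encoding error on the first unit.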

I would like to implement a standard-conforming c16rtomb() for SDCC that allows conversion from all of UTF-16 (not just the basic multilingual plane) to UTF-8. It seems to me that this is currently not possible.

On the other hand, the description of mbrtoc16() in section 7.28.1 states "If the function determines that the next multibyte character is complete and valid, it determines the values of the corresponding wide characters". So it considers it possible that a single multibyte character translates into multiple wide characters. So maybe the meaning of "wide character" in section 7.28.1 is different from the definition of "wide character" in section 3.7.3.
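The mbrtoc16() side of this can be observed with a small sketch that counts how many char16_t values are produced for one multibyte character. It relies only on the documented return values of mbrtoc16(), where (size_t)(-3) signals that a further char16_t from the previous multibyte character has been stored. The helper names and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions; which UTF-8 locales exist is system-dependent, and the result is only meaningful on an implementation that defines __STDC_UTF_16__.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: try to select some UTF-8 locale.
   The locale names are assumptions and may not exist on a given system.
   Returns 1 on success, 0 otherwise. */
static int try_utf8_locale(void)
{
    return setlocale(LC_ALL, "C.UTF-8") != NULL
        || setlocale(LC_ALL, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: number of char16_t values mbrtoc16() produces for
   the first multibyte character in s (at most n bytes), or -1 if that
   character is invalid or incomplete. */
static int units_for_first_char(const char *s, size_t n)
{
    mbstate_t st;
    char16_t c16;
    size_t r;
    int units;

    memset(&st, 0, sizeof st);            /* initial conversion state */
    r = mbrtoc16(&c16, s, n, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        return -1;                        /* encoding error or incomplete */
    units = 1;
    s += r;
    n -= r;
    /* (size_t)-3: another char16_t of the same character was stored
       without consuming any further input. */
    while (mbrtoc16(&c16, s, n, &st) == (size_t)-3)
        units++;
    return units;
}
```

On an implementation using UTF-16 for char16_t in a UTF-8 locale, calling units_for_first_char("\xF0\x9F\x98\x80", 4) (the UTF-8 encoding of U+1F600) would be expected to yield 2, i.e. one multibyte character corresponding to two char16_t values. Whether c16rtomb() must accept those two values again is exactly the open question.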

In either case, the intended behaviour of c16rtomb() for characters encoded as multiple char16_t seems unclear. The issue has been discussed in the thread "A function to convert char16_t strings to char strings" in comp.std.c.

Suggested Change

I see two possible options: