N 2282: Additional multibyte/wide string conversion functions

Submitter: Philipp Klaus Krause
Submission Date: 2018-06-14

Summary:

I suggest adding functions for converting between multibyte strings and char16_t / char32_t strings. These would be similar to the existing functions from 7.22.8 for conversion between char and wchar_t strings, and could be added to uchar.h.

This is an updated version of N2245 that takes into account comments from the discussion immediately following the Brno meeting.

Justification:

C has a fair number of functions for converting between char and wchar_t, though some of them are thread-unsafe (the 7.22.7 ones) or inefficient (7.29.6.3, 7.29.6.4). The 7.22.8 ones look OK.

On the other hand, for converting between char, char16_t and char32_t there are only the restartable functions from 7.28.1. They do not convert whole strings at a time, and they are inefficient: restartable functions can handle partial characters as input, which places a substantial burden on implementations and affects both speed and code size. E.g. in SDCC targeting STM8, mbrtowc() has about twice the code size (437 vs. 227 bytes) and three times the stack memory consumption (51 vs. 17 bytes) of mbtowc(). Since SDCC targets tiny 8-bit systems, this difference matters a lot. But even for big systems there is a problem: at the WG14 meeting in London, some other compiler developers told me that the performance of the restartable functions is unacceptable for some of their customers, and that they recommend use of the non-restartable functions instead. Also, the restartable functions are hard to use correctly due to their somewhat surprising behaviour of reading bytes past a 0 character (and thus beyond the end of 0-terminated strings).

Applications commonly use char multibyte strings for string processing. Adding functions to convert them to / from char16_t and char32_t strings would allow applications to do their string processing on char strings, and then easily convert to / from char16_t and char32_t where those are needed (e.g. for some API requiring UTF-16 on an implementation where __STDC_UTF_16__ is defined). The conversion between multibyte strings and UTF-16 is apparently so important to users that there are even encodings designed to deal with weird corner-cases (the WTF-8 and OMG-WTF-8 encodings that allow encoding invalid UTF-16 in a UTF-8-like way).

So converting strings between char, char16_t and char32_t is a useful addition, and using an interface similar to 7.22.8 seems like a good choice.

Compared to the existing functionality for conversion between multibyte characters and char16_t / char32_t, the new functions offer the following advantages:

  1. Better performance and code size by being non-restartable
  2. Convenience by allowing conversions of whole strings at a time and not reading beyond a terminating 0

Implementations of the mbstoc16s and c16stombs functions can be found in the library of the Small Device C Compiler.

Proposal:

7.28.2 Non-restartable multibyte/wide string conversion functions

7.28.2.1 The mbstoc16s function

Synopsis
	#include <uchar.h>
	size_t mbstoc16s(char16_t *restrict c16s, const char *restrict s, size_t n);

Description
	The mbstoc16s function converts a sequence of multibyte characters that begins in the initial
	shift state from the array pointed to by s into a sequence of corresponding wide characters and
	stores not more than n wide characters into the array pointed to by c16s.

	Each multibyte character is converted as if by a call to the mbrtoc16 function with a non-null ps and
	MB_LEN_MAX for n. However, no multibyte characters that follow a null character (which is converted into
	a null wide character) will be examined or converted.

Returns
	If a multibyte character is encountered that does not correspond to a valid sequence of wide
	characters, the mbstoc16s function returns (size_t)(-1). Otherwise, the mbstoc16s function returns
	the number of array elements modified, not including a terminating null wide character, if any.
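
The following is an informative sketch only, not part of the proposed wording. It shows one way the "as if" description above could be realized on top of the existing mbrtoc16 function. The name sketch_mbstoc16s is hypothetical, and treating a (size_t)(-2) result as an error is an assumption of this sketch; the SDCC implementation mentioned earlier may be written quite differently.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical name; a sketch of the proposed mbstoc16s semantics. */
size_t sketch_mbstoc16s(char16_t *restrict c16s, const char *restrict s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* begin in the initial shift state */

    while (count < n) {
        char16_t c;
        size_t r = mbrtoc16(&c, s, MB_LEN_MAX, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid (or, by assumption, truncated) input */

        c16s[count] = c;
        if (r == 0)
            break;                      /* null wide character stored, not counted */
        count++;
        if (r != (size_t)-3)
            s += r;                     /* (size_t)-3: second code unit, no input consumed */
    }
    return count;
}
```

Note that with this approach a limit n that falls in the middle of a surrogate pair leaves a lone high surrogate as the last stored element.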

7.28.2.2 The c16stombs function

Synopsis
	#include <uchar.h>
	size_t c16stombs(char *restrict s, const char16_t *restrict c16s, size_t n);

Description
	The c16stombs function converts a sequence of wide characters from the array pointed to by c16s into
	a sequence of corresponding multibyte characters that begins in the initial shift state, and stores
	these multibyte characters into the array pointed to by s, stopping if a multibyte character would
	exceed the limit of n total bytes or if a null character is stored. Each sequence of wide characters
	is converted as if by calls to the c16rtomb function with a non-null ps.

Returns
	If a sequence of wide characters is encountered that does not correspond to a valid multibyte
	character, the c16stombs function returns (size_t)(-1). Otherwise, the c16stombs function returns the
	number of bytes modified, not including a terminating null character, if any.
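
As an informative illustration (again not proposed wording), the behaviour described above could be sketched on top of c16rtomb as follows. The name sketch_c16stombs is hypothetical. The sketch assumes that each multibyte character is first produced in a local buffer of MB_LEN_MAX bytes, so that a character that would exceed the limit n is never partially stored.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical name; a sketch of the proposed c16stombs semantics. */
size_t sketch_c16stombs(char *restrict s, const char16_t *restrict c16s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* begin in the initial shift state */

    for (;; c16s++) {
        char buf[MB_LEN_MAX];
        size_t r = c16rtomb(buf, *c16s, &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* no valid multibyte character */
        if (r > n - count)
            break;                      /* character would exceed n total bytes */

        memcpy(s + count, buf, r);      /* r is 0 after the first half of a pair */
        if (*c16s == 0) {
            count += r - 1;             /* terminating null byte is not counted */
            break;
        }
        count += r;
    }
    return count;
}
```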

7.28.2.3 The mbstoc32s function

Synopsis
	#include <uchar.h>
	size_t mbstoc32s(char32_t *restrict c32s, const char *restrict s, size_t n);

Description
	The mbstoc32s function converts a sequence of multibyte characters that begins in the initial shift
	state from the array pointed to by s into a sequence of corresponding wide characters and stores not
	more than n wide characters into the array pointed to by c32s.

	Each multibyte character is converted as if by a call to the mbrtoc32 function with a non-null ps and
	MB_LEN_MAX for n. However, no multibyte characters that follow a null character (which is converted into
	a null wide character) will be examined or converted.

Returns
	If an invalid multibyte character is encountered, the mbstoc32s function returns (size_t)(-1).
	Otherwise, the mbstoc32s function returns the number of array elements modified, not including a
	terminating null wide character, if any.
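
An informative sketch of these semantics in terms of mbrtoc32 is given below; it is not proposed wording. The name sketch_mbstoc32s is hypothetical, and mapping a (size_t)(-2) result to an error return is an assumption of the sketch.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical name; a sketch of the proposed mbstoc32s semantics. */
size_t sketch_mbstoc32s(char32_t *restrict c32s, const char *restrict s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* begin in the initial shift state */

    while (count < n) {
        /* mbrtoc32 stores the converted character directly into the output. */
        size_t r = mbrtoc32(&c32s[count], s, MB_LEN_MAX, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid (or, by assumption, truncated) input */
        if (r == 0)
            break;                      /* null wide character stored, not counted */
        count++;
        if (r != (size_t)-3)
            s += r;                     /* advance past the consumed bytes */
    }
    return count;
}
```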

7.28.2.4 The c32stombs function

Synopsis
	#include <uchar.h>
	size_t c32stombs(char *restrict s, const char32_t *restrict c32s, size_t n);

Description
	The c32stombs function converts a sequence of wide characters from the array pointed to by c32s into
	a sequence of corresponding multibyte characters that begins in the initial shift state, and stores
	these multibyte characters into the array pointed to by s, stopping if a multibyte character would exceed
	the limit of n total bytes or if a null character is stored. Each wide character is converted as if by
	calls to the c32rtomb function with a non-null ps.

Returns
	If a wide character is encountered that does not correspond to a valid multibyte character, the c32stombs
	function returns (size_t)(-1). Otherwise, the c32stombs function returns the number of bytes modified,
	not including a terminating null character, if any.
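
For illustration only (not proposed wording), the description above could be sketched on top of c32rtomb like this. The name sketch_c32stombs is hypothetical, and the use of a local MB_LEN_MAX buffer to avoid storing a partial multibyte character is an assumption of the sketch.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical name; a sketch of the proposed c32stombs semantics. */
size_t sketch_c32stombs(char *restrict s, const char32_t *restrict c32s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* begin in the initial shift state */

    for (;; c32s++) {
        char buf[MB_LEN_MAX];
        size_t r = c32rtomb(buf, *c32s, &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* no valid multibyte character */
        if (r > n - count)
            break;                      /* character would exceed n total bytes */

        memcpy(s + count, buf, r);
        if (*c32s == 0) {
            count += r - 1;             /* terminating null byte is not counted */
            break;
        }
        count += r;
    }
    return count;
}
```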