ISO/ IEC JTC1/SC22/WG14 N1012

WG14 Document Number: N1012
Date: 07-May-2003

Defect Report

Issue: deficiency on multibyte conversions


Problem statement

Consider a typical use of the multibyte conversion function mbrtowc:

    enum { FINISHED, ERROR } convert (void)
    {
        mbstate_t s = { 0 };
        char c;
        wchar_t wc;

        for (;;)
        {
            c = get_a_byte ();
            switch (mbrtowc (&wc, &c, 1, &s))
            {
            case 1:          put_wide_char (wc);    break;
            case (size_t)-2: break;
            case 0:          put_wide_char (L'\0'); return FINISHED;
            case (size_t)-1: return ERROR;
            }
        }
    }

The multibyte conversion functions were originally written on the
assumption that wide characters are singletons. That is, while several
multibyte characters may map to one wide character, and may map to
different wide characters depending on the current state, each sequence
maps to only one wide character. As a result, functions such as mbrtowc
do not have the concept of returning more than one wide character; only
a single one can be returned per call.

This is fine for mappings like:

    ISO 8859-1   ->  UTF-16 or UTF-32 in Normalisation Form C
    UTF-8        ->  UTF-32 without change of normalisation

but not for others.

It is possible to play fast and loose with the meaning of "state" to
relax the requirement a little bit - if a sequence of three bytes maps
to two wide characters, the first call to mbrtowc can return 2 and the
second call 1, with the state object holding any necessary information.
The requirement then becomes: N bytes can result in M wide characters,
where N >= M and where the first K wide characters depend only on the
first N-M+K bytes.

An example of such a mapping is UTF-8 -> UTF-16 (shown in binary):

  0AAAAAAA                            -> 000000000AAAAAAA
  110AAAAA 10BBBBBB                   -> 00000AAAAABBBBBB
  1110AAAA 10BBBBBB 10CCCCCC          -> AAAABBBBBBCCCCCC
  11110AAA 10BBCCCC 10DDEEEE 10FFFFFF -> 110110XXXXCCCCDD
110111EEEEFFFFFF

In the last case the mbrtowc function can return -2, -2, 1, and 1 in
that order.

However, consider a very similar encoding to UTF-16 where the two wide
characters are in the opposite order. The first wide character (the one
beginning 110111) cannot be output until all 4 bytes have been seen, so
the first three calls to mbrtowc must return -2. The fourth call can
return the first wide character, but there is now no way to return the
second one. If the next UTF-8 sequence is 2 or 3 bytes long, it would
provide an opportunity, but if it is only 1 byte long or, even worse,
was the zero character, it wouldn't.

While the above is a hypothetical situation, a real conversion that has
this problem is converting ISO 8859-1 (or many similar encodings) to
Unicode in Normalisation Form D. In NFD all accented characters are
broken down into their components. So some example conversions are:

    0x20 -> 0x0020
    0x61 -> 0x0061
    0xBF -> 0x00BF
    0xC0 -> 0x0041 0x0300
    0xC1 -> 0x0041 0x0301
    0xC4 -> 0x0041 0x0308
    0xC8 -> 0x0045 0x0300
    0xE6 -> 0x00E6
    0xE7 -> 0x0063 0x0327

What is needed is for mbrtowc to have a way to say "I have an unfinished
wide character sequence, I do not need any more bytes for now". The
obvious way to represent this would be a returned value of 0.
Unfortunately this has already been given a different meaning ("end of
string reached") and changing it would be impractical. Therefore the
following text proposes the return value -3 for this case. This value
would only be generated for locales where this was an issue, so it will
not affect existing uses of the code. And applications that are not
modified to handle this code but are presented with it are likely to
treat it as an error.

Of the other functions in 7.24.6, mbrlen has the same problem. The
wcrtomb function also has to deal with this issue, but the wording
already allows it to be state-ful and return 0 to indicate that nothing
has been output at this stage. Neither the mbsrtowcs nor wcsrtombs
functions have an issue (though with the former it is possible that the
limit of len wide characters is reached in the middle of a
multi-wide-character sequence; the rest of the sequence will be retained
in the mbstate_t object until the next call).


Suggested Technical Corrigendum

Append to 7.24.6.3.2#4:

    (size_t)(-3)  if the multibyte sequence converted by the previous
call
                  with the same mbstate_t object generated more than one
                  wide character and not all these characters have yet
                  been stored; the next wide character in the sequence
has
                  now been stored and no bytes from the input have been
                  consumed by this call.

In 7.24.6.3.1#3, add "(size_t)(-3)" to the possible returned values.

Optional extra change for clarity:

In 7.24.6.3.3#4, EITHER add to the end of the first sentence:

    ; this may be 0

OR add a footnote reference to that sentence:

    291A  If the wide character encoding requires two or more wide
    characters to be considered together when doing the conversion, the
    value returned can be 0.

The Rationale could also be amended to address these issues.