A code unit is an integer value of character type (6.8.2). Characters in a character-literal [...] or in a string-literal are encoded as a sequence of one or more code units [...]; this is termed the respective literal encoding. The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.

Then, [lex.string] p10.1 specifies:
The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences (5.13.3), and universal-character-names (5.3) is encoded to a code unit sequence using the string-literal's associated character encoding.

Thus, an encoding for ordinary and wide literals in C++ relates a sequence of characters with a sequence of integer values of the respective character type (char or wchar_t).
As described in RFC 2978:
The term "charset" (referred to as a "character set" in previous versions of this document) is used here to refer to a method of converting a sequence of octets into a sequence of characters.

Thus, an encoding in the IANA registry relates a sequence of octets with a sequence of characters.
Regrettably, the specified encoding forms and encoding schemes have overlapping names: "UTF-16" refers both to an encoding form and to an encoding scheme.
UTF-8 is the UCS encoding form that assigns each UCS scalar value to an octet sequence of one to four octets, as specified in table 2.

The encoding scheme is defined as follows in section 11.2:
The UTF-8 encoding scheme serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.

Thus, for UTF-8, the code units are octets and those octets also constitute the encoding scheme. This encoding does not depend on endianness (byte order in the object representation of an integer) at all.
UTF-16 is the UCS encoding form that assigns each UCS scalar value to a sequence of one to two unsigned 16-bit code units, as specified in table 4.

The encoding scheme called "UTF-16" is specified in section 11.5 as follows:
The UTF-16 encoding scheme serializes a UTF-16 code unit sequence by ordering octets in a way that either the less significant octet precedes or follows the more significant octet. In the UTF-16 encoding scheme, the initial signature read as <FE FF> indicates that the more significant octet precedes the less significant octet, and <FF FE> the reverse. The signature is not part of the textual data. In the absence of signature, the octet order of the UTF-16 encoding scheme is that the more significant octet precedes the less significant octet.

The "initial signature" is otherwise known as a byte order mark (BOM).
The Unicode standard version 14.0.0 specifies in section 3.10:
UTF-16 encoding scheme: The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.

In the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark; it is used to distinguish between the two byte orders. An initial byte sequence <FE FF> indicates big-endian order, and an initial byte sequence <FF FE> indicates little-endian order. The BOM is not considered part of the content of the text.

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Note the caveat of an undefined "higher-level protocol", which does not exist in ISO 10646.
In either standard, there are also encoding schemes UTF-16LE and UTF-16BE that do not interpret a signature (byte order mark) at all, but use the given big-endian or little-endian layout unconditionally.
iconv is a transcoding function specified by POSIX:

size_t iconv(iconv_t cd, char **restrict inbuf, size_t *restrict inbytesleft, char **restrict outbuf, size_t *restrict outbytesleft);

The conversion descriptor (the first argument) is created using

iconv_t iconv_open(const char *tocode, const char *fromcode);

with the following specification:

The iconv_open() function shall return a conversion descriptor that describes a conversion from the codeset specified by the string pointed to by the fromcode argument to the codeset specified by the string pointed to by the tocode argument. [...] Settings of fromcode and tocode and their permitted combinations are implementation-defined.
As a non-normative note, POSIX adds:

The objects indirectly pointed to by inbuf and outbuf are not restricted to containing data that is directly representable in the ISO C standard language char data type. The type of inbuf and outbuf, char **, does not imply that the objects pointed to are interpreted as null-terminated C strings or arrays of characters. Any interpretation of a byte sequence that represents a character in a given character set encoding scheme is done internally within the codeset converters. For example, the area pointed to indirectly by inbuf and/or outbuf can contain all zero octets that are not interpreted as string terminators but as coded character data according to the respective codeset encoding scheme. The type of the data (char, short, long, and so on) read or stored in the objects is not specified, but may be inferred for both the input and output data by the converters determined by the fromcode and tocode arguments of iconv_open().

Thus, POSIX specifies iconv with its char * parameters possibly pointing to objects of other integer types.
P1885 maps the result of the std::text_encoding::wide_literal() function to UTF16. This is user-unfriendly for the following reasons:

- The same name describes serialized UTF-16 network data (objects of type char, a BOM is expected to be present) vs. the UTF16LE/BE text that is produced from a wide literal (objects of type wchar_t, without a BOM).

- iconv always creates a BOM when writing the UTF-16 encoding. If a user were to convert third-party text from e.g. UTF-8 to "UTF16" for use with std::wstring and string literals, BOMs are likely to end up in the middle of a string.

- iconv, presumably one of the premier consumers of the object representation model (see below), was designed with the understanding that the encoding name also conveys the object type for each code unit (e.g. wchar_t). This distinction is lost when both network data (in a char array) and wchar_t literals are expected to be described with the same name.
It is conceivable to introduce a new enumerator that has the value of either of the existing UTF16LE or UTF16BE enumerators (as appropriate) and to return that value from std::text_encoding::wide_literal() on e.g. Windows platforms. This approach, as well as an earlier approach in P1885 that returns either UTF16LE or UTF16BE, but never UTF16, would redundantly represent information about platform endianness in an unrelated part of the standard. Platform endianness should be handled exclusively by the existing targeted facility (see 26.5.8 [bit.endian]).
P1885 also elects to map UTF16LE/BE to UTF16 for the non-wide std::text_encoding::literal() function. Since CHAR_BIT == 8 is required for this function, the ordinary literal encoding can never be UTF-16. If it were, two consecutive char elements would be used to represent a single code unit, but char elements might have the value 0 without representing the null character. This is not a valid encoding per [lex.charset]. The mapping is thus superfluous for the result of std::text_encoding::literal().
Everything said above also applies analogously to UTF-32.
The usage situation is approximately the same as that for UTF-16, yet P1885 does not even attempt to perform any mapping that could be viewed as removing endianness assumptions from the name. Adding to that, the IANA registry appears to define "UCS2" as big-endian, but does not make any allowance for a little-endian UCS-2 encoding scheme. This leaves the relevant (admittedly outdated) Microsoft Windows platforms conceptually unsupported.
When the object representation of values is accessed (e.g. using std::bit_cast), lots of care needs to be applied to properly deal with padding bits, partially uninitialized values, and other obscure situations.
I believe it is a mistake that P1885 talks about specifying the object representation by applying an encoding scheme. The object representation should never be the focus for a user or for the specification of a user-facing facility.
The following alternative model avoids talking about the object representation, naturally supports implementations with CHAR_BIT > 8 or with sizeof(wchar_t) == 1, and allows proper differentiation between literal encodings and network data.
In this model, an encoding relates a sequence of characters with a sequence of code units of the respective character type (char or wchar_t); each octet value of an IANA encoding is thus understood to be a code unit.
This is viable because a character type (including char) is at least 8 bits and thus can hold the value of an octet.
iconv (and likely other implementations) does not currently support the "WIDE.*" names. This can reasonably be expected to change when the names are standardized.
sizeof(wchar_t) > 1.
sizeof(wchar_t) == 1.
CHAR_BIT >= 16. There is no difference regarding the string literal encoding approach between a char with 16 bits and a wchar_t with 16 bits, regardless of whether the latter consists of one or two bytes.
CHAR_BIT >= 16 or
sizeof(wchar_t) == 1.
sizeof(char_type) == 1.
CHAR_BIT == 8 or
sizeof(wchar_t) > 1 (if any).