Deprecate implicit conversions between char8_t and char16_t, char32_t, or wchar_t

Document number: P3695R2
Date: 2025-09-28
Audience: EWG, SG16
Project: ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Author: Jan Schultke <janschultke@gmail.com>
GitHub Issue: wg21.link/P3695/github
Source: github.com/Eisenwave/cpp-proposals/blob/master/src/deprecate-unicode-conversion.cow

Implicit conversions between char8_t and char16_t, char32_t, or wchar_t are bug-prone and thus harmful to the language. I propose to deprecate them.

Contents

1 Revision history
1.1 Changes since R1
1.2 Changes since R0
2 Introduction
2.1 It's not hypothetical. This really happens.
2.2 The underlying problem
3 Scope
3.1 What about "safe" comparisons?
3.2 What about char16_t and char32_t?
3.3 What about char and wchar_t?
3.4 What about conversions with integers?
3.5 What comes after deprecation?
3.6 Why not make these conversions narrowing?
4 Impact on existing code
4.1 Replacement for deprecated behavior
5 Implementation experience
6 Wording
6.1 [conv.integral]
6.2 [expr.arith.conv]
6.3 [expr.static.cast]
6.4 [depr.conv.unicode]
7 References

1. Revision history

1.1. Changes since R1

R1 of the paper was seen by SG16, with the following poll results:

P3695R1: Recommend deprecating conversions between char and the charN_t types.

P3695R1: Recommend deprecating conversions between char8_t and wchar_t.

P3695R1: Recommend deprecating conversions between char16_t and char32_t.

Consequently, the following changes were made:

1.2. Changes since R0

2. Introduction

Implicit conversions between char8_t and char32_t invite bugs:

Until very recently, no major compiler would detect the following "bad comparison":

    constexpr bool contains_oe(std::u8string_view str) {
        for (char8_t c : str)
            if (c == U'ö')
                return true;
        return false;
    }
    static_assert(contains_oe(u8"ö")); // fails?!

c == U'ö' always fails if c is a UTF-8 code unit because it is equivalent to c == char32_t(0xf6), and a UTF-8 code unit cannot have this value.

An even more evil variation is a search which yields false positives:

    constexpr bool contains_nbsp(std::u8string_view str) {
        for (char8_t c : str)
            if (c == U'\N{NO-BREAK SPACE}')
                return true;
        return false;
    }
    static_assert(contains_nbsp(u8"\N{CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK}")); // OK?!

The assertion succeeds because Ԡ (U+0520) is UTF-8 encoded as 0xd4, 0xa0, and NBSP is U+00A0, so the char32_t(0xa0) value matches the second UTF-8 code unit of U+0520.

Such bad comparisons often don't occur directly, but within <algorithm>:

    constexpr bool is_umlaut(char32_t c) {
        return c == U'ä' || c == U'ö' || c == U'ü';
    }
    // ...
    constexpr std::u8string_view umlauts = u8"äöü";
    static_assert(std::ranges::find_if(umlauts, is_umlaut) != umlauts.end()); // fails?!

Note that the "bad comparison" occurs between two char32_t in is_umlaut, which demonstrates that implicit conversions in general are bug-prone, not just comparisons. We obviously don't want to deprecate char32_t == char32_t.

Conversions "the other way" (e.g. char32_t → char8_t) are obviously bug-prone too because information is lost, but such bugs can already be caught by all major compilers' warnings, and they are problematic for the same reason as int → short, not because of anything specific to character types. The listed bugs are interesting precisely because no information is lost.

2.1. It's not hypothetical. This really happens.

These kinds of bugs are not far-fetched hypotheticals either; I have written such bugs myself, and others have contributed them to my syntax highlighter [µlight], which makes extensive use of char8_t and char32_t. Very early in development, I realized how dangerous these implicit conversions are, so most functions in the style of is_umlaut have a deleted overload:

    constexpr bool is_umlaut(char8_t) = delete;
    constexpr bool is_umlaut(char32_t c) {
        return c == U'ä' || c == U'ö' || c == U'ü';
    }

Compilers do have warnings which detect comparisons that are always false, but technically, a char8_t object can hold the values 0xf6 and 0xa0, so these warnings cannot detect the bugs shown above.

Using char may raise more tautology warnings because if char is signed, it can only hold values up to 127, meaning it never compares equal to, e.g. 0xa0.

2.2. The underlying problem

The underlying problem is that char8_t == char32_t is Car == Banana. In general, it is meaningless to compare code units with different encodings.

To be fair, Unicode character types aren't strictly required to store Unicode code units. However, that is their primary purpose, and the assumption holds true for any Unicode character-literal and string-literal.

3. Scope

I propose to deprecate implicit conversions between char8_t and char16_t, char32_t, or wchar_t. As demonstrated above, these are extremely bug-prone. Conversions between char16_t and char32_t are not affected.

3.1. What about "safe" comparisons?

In comparisons between code units, certain ranges of code points yield the expected result. For example, u8'x' == U'x' is true because all Unicode encodings are ASCII-compatible, so any code point in the Basic Latin block (≤ U+007F) has the same single-code-unit value in UTF-8, UTF-16, and UTF-32.

However, even those should be deprecated because:

3.2. What about char16_t and char32_t?

The following explanation assumes that char16_t and char32_t are used to store a UTF-16 and UTF-32 code unit, respectively.

Following some negative feedback on [ClangWarning], the proposal no longer seeks to deprecate conversions between char16_t and char32_t. While these conversions are not guaranteed to be meaningful, there are no false positives in comparisons of UTF-16 and UTF-32 code units, and the comparison is quite likely to be correct.

There are no false positives in char16_t ↔ char32_t because any code point in [U+0000, U+D7FF] or [U+E000, U+FFFF] is encoded using a single code unit equivalent to the code point value, in both UTF-16 and UTF-32.

Other code points are encoded using high surrogates ([0xD800, 0xDBFF]) and low surrogates ([0xDC00, 0xDFFF]). The corresponding surrogate code points in [U+D800, U+DFFF] cannot be encoded in UTF-32.

It is possible to have false negatives when searching for a UTF-32 code unit outside the Basic Multilingual Plane (BMP) in UTF-16 text. However, these searches are tautologically false because values ≥ 0x10000 cannot (usually) be represented by char16_t, so compilers may catch some of them already.
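The following sketch illustrates both cases: a BMP code point has the same value in UTF-16 and UTF-32, while a code point outside the BMP is split into a surrogate pair. The struct surrogate_pair and the function to_surrogates are hypothetical names; the function assumes its argument is in [U+10000, U+10FFFF].

```cpp
#include <cstdint>
#include <string_view>

// BMP code point: a single UTF-16 code unit with the same value as the code point.
constexpr std::u16string_view bmp = u"\u00F6";
static_assert(bmp.size() == 1);
static_assert(static_cast<char32_t>(bmp[0]) == U'\u00F6');

struct surrogate_pair {
    char16_t high;
    char16_t low;
};

// Hypothetical helper: UTF-16 encoding of a code point in [U+10000, U+10FFFF].
constexpr surrogate_pair to_surrogates(char32_t code_point) {
    const std::uint32_t v = static_cast<std::uint32_t>(code_point) - 0x10000;
    return { static_cast<char16_t>(0xD800 + (v >> 10)),     // upper ten bits
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // lower ten bits
}

// U+1F600 lies outside the BMP and takes two UTF-16 code units.
constexpr std::u16string_view astral = u"\U0001F600";
static_assert(astral.size() == 2);
static_assert(astral[0] == to_surrogates(U'\U0001F600').high);
static_assert(astral[1] == to_surrogates(U'\U0001F600').low);
static_assert(astral[0] == 0xD83D && astral[1] == 0xDE00);
```

Neither surrogate value can equal a valid UTF-32 code unit, which is why a mismatch here is a false negative rather than a false positive.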

It is also much less likely that char16_t ↔ char32_t conversions actually manifest as a bug. An application that only uses, say, Basic Latin characters and German or Norwegian umlauts can use char16_t and char32_t interchangeably. By contrast, mixing char8_t with other Unicode character types will almost certainly blow up in the user's face if the application processes any kind of non-ASCII text.

Last but not least, UTF-8 is becoming the "default encoding", especially on the web, while UTF-16 is increasingly becoming a "legacy encoding". This makes it unattractive to raise warnings for char16_t when the surrounding code may exist mostly for compatibility purposes, and C++ users are not interested in sinking much time into its maintenance. Substantially more code may be affected by a char16_t ↔ char32_t deprecation because both types were introduced in C++11, unlike char8_t, which was added in C++20. See also [WikipediaEncodingPopularity]:

Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and dealing with potential encoding errors in the input UTF-8, overwhelms any benefits UTF-16 could offer. So newer software systems are starting to use UTF-8. The default string primitive used in newer programming languages, such as Go, Julia, Rust and Swift 5, assume UTF-8 encoding. PyPy also uses UTF-8 for its strings, and Python is looking into storing all strings in UTF-8. Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface.

In summary, in char16_t ↔ char32_t comparisons, there are no false positives, the only false negatives are tautologically false (warnings exist), bugs are unlikely to manifest because code points outside the BMP are relatively uncommon, and if deprecation warnings were raised, that may happen in low-priority legacy code.

Not one participant in SG16 voted in favor of deprecating char16_t ↔ char32_t conversions; see §1.1. Changes since R1.

3.3. What about char and wchar_t?

As recommended by SG16, I propose to leave char intact, but deprecate conversions between char8_t and wchar_t. Since wchar_t almost certainly has a different encoding than char8_t, this conversion is as problematic as the char16_t and char32_t conversions.

The following conversions are not deprecated:

Furthermore, deprecating any conversion from char to other character types is a bad idea, and was unanimously recommended against by SG16. In some code bases, char is used purely for ASCII characters and strings. In such code bases, comparing char to any other character type is always correct, assuming that an ASCII-compatible encoding is used everywhere.

It may also be possible to deprecate conversions with char depending on ordinary literal encoding, but char is not necessarily using literal encoding, and doing so would invite non-portable code that fails to compile on e.g. EBCDIC platforms, to the great surprise of the author.

3.4. What about conversions with integers?

It is quite common to compare character types to integer types. For example, we may write c <= 0x7f to check whether a character falls into the Basic Latin block. There is nothing exceptionally bug-prone about comparing with, say, 0x00A0 instead of U'\u00A0', so we are not interested in deprecating character/integer conversions.

3.5. What comes after deprecation?

The goal is to eventually remove these conversions entirely. Since the behavior is easily detected (§5. Implementation experience) and easily replaced (§4.1. Replacement for deprecated behavior), removal should be feasible within one or two revisions of the language.

Furthermore, I don't believe that "tombstone behavior" would be necessary, that is, still allowing the conversion to be selected but making the program ill-formed if it is. The reason is that char8_t, char16_t, and char32_t rarely appear in overload sets that include types that are not characters.

Without "tombstone behavior", the following code would eventually change its meaning:

    void f(std::any);
    void f(char32_t);

    int main() {
        // Currently selects f(char32_t), would select f(std::any) in the future.
        f(u8'a');
    }
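Code that depends on the current overload choice can pin it down with an explicit conversion, which keeps its meaning whether or not the implicit conversion exists. The sketch below uses a hypothetical type anything as a constexpr-friendly stand-in for std::any, and a hypothetical enumeration selected to report which overload was chosen.

```cpp
// Hypothetical stand-in for std::any: constructible from any type
// through a user-defined conversion.
struct anything {
    template <typename T>
    constexpr anything(const T&) {}
};

enum class selected { anything_overload, char32_overload };

constexpr selected f(anything) { return selected::anything_overload; }
constexpr selected f(char32_t) { return selected::char32_overload; }

// With an explicit conversion, the argument is an exact match for f(char32_t).
// This holds today and would still hold if the implicit conversion were removed.
static_assert(f(static_cast<char32_t>(u8'a')) == selected::char32_overload);

// An argument without any standard conversion to char32_t
// already falls through to the catch-all overload.
struct widget {};
static_assert(f(widget{}) == selected::anything_overload);
```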

3.6. Why not make these conversions narrowing?

Another possible option (instead of deprecation or following deprecation) is to make the affected char8_t conversions narrowing conversions. This would make char32_t{c} for some char8_t c ill-formed, but the implicit conversion from char8_t to char32_t would remain valid.

There are multiple problems with this approach, which is why it is not proposed:

4. Impact on existing code

It is not trivial to estimate how much code would be affected by a deprecation like this. However, that is ultimately not what makes or breaks this proposal. The goal is not to deprecate a rarely used feature to give it new meaning, like array[0,1] prior to [P1161R3].

The goal is to deprecate a bug-prone and harmful feature to make the language safer.

The longer we wait, the more mistakes will be made by mixing char8_t with other character types. C++ will undoubtedly get improved support for the Unicode character types over time, which will make them more widely used, so it is better to deal with this problem now rather than later.

4.1. Replacement for deprecated behavior

If the new deprecation warnings spot a bug like in §2. Introduction, some work will be required to fix it, but the deprecation will have done its job.

If the comparison is obviously safe, such as c == U'0' with char8_t c, the resolution is usually trivial, like c == u8'0'. This could even be done automatically with tools like clang-tidy.

5. Implementation experience

Corentin Jabot has recently implemented a -Wcharacter-conversion warning in Clang ([ClangWarning]), which is enabled by default. You can test this at [CompilerExplorer].

However, the warning is more conservative than the proposed deprecation; it does not warn on "safe comparisons" (§3.1. What about "safe" comparisons?).

6. Wording

The following changes are relative to [N5014].

6.1. [conv.integral]

Change [conv.integral] paragraph 1 as follows, and split it into two paragraphs:

1 A prvalue of an integer type can be converted to a prvalue of another integer type. The conversion is deprecated ([depr.conv.unicode]) if one of the types involved in the conversion is char8_t, and the other type is char16_t, char32_t, or wchar_t.

[Note: This deprecation also applies to cv-qualified types because prvalues of such types are adjusted to cv-unqualified types ([expr.type]). — end note]

2 A prvalue of an unscoped enumeration type can be converted to a prvalue of an integer type.

6.2. [expr.arith.conv]

Change [expr.arith.conv] paragraph 1 as follows:

Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:

Integral promotion converts character types to int or unsigned int, so if we didn't add this wording, the conversion would not be deprecated.

6.3. [expr.static.cast]

Immediately prior to [expr.static.cast] paragraph 5, insert a new paragraph:

An expression E of type cv char8_t can be explicitly converted to cv char16_t, cv char32_t, or cv wchar_t and vice-versa. The effect is equivalent to explicitly converting to unsigned char, then to the target type.

[Note: Integral conversions ([conv.integral]) between these types have the same effect and are deprecated, unlike this explicit conversion ([depr.conv.unicode]). — end note]

The wording for static_cast needs to permit possibly cv-qualified types with "cv". While cv-qualifications get dropped from expressions ([expr.type]), we still need to say that you can write e.g. static_cast<const char8_t>. Cv-qualifications do not get dropped automatically from the type-id.

Do not change [expr.static.cast] paragraph 5; it is cited here for reference:

Otherwise, an expression E can be explicitly converted to a type T if there is an implicit conversion sequence ([over.best.ics]) from E to T, […]. […], the result object is direct-initialized from E.

6.4. [depr.conv.unicode]

Insert a new subclause in [depr] between [depr.local] and [depr.capture.this], containing a single paragraph:

Unicode character conversions [depr.conv.unicode]

The following conversions are deprecated:

[Example:

    bool is_oe(char8_t c) {
        return c == U'ö';                   // deprecated
    }
    void f() {
        char32_t c0 = u8'x';                // deprecated
        char32_t c1 = u'x';                 // OK, conversion from char16_t to char32_t
        char8_t c2 = 'x';                   // OK, conversion from char to char8_t
        is_oe(U'ö');                        // deprecated
        is_oe(static_cast<char8_t>(U'ö'));  // OK, explicit conversion ([expr.static.cast])
        is_oe((char8_t)U'ö');               // OK, explicit conversion
    }

end example]

7. References

[N5014] Thomas Köppe. Working Draft, Programming Languages — C++ 2025-08-05 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/n5014.pdf
[µlight] Jan Schultke. ascii_chars.hpp utilities in µlight https://github.com/Eisenwave/ulight
[ClangWarning] Corentin Jabot. [Clang] Add warnings when mixing different charN_t types https://github.com/llvm/llvm-project/pull/138708
[CompilerExplorer] Demonstration of -Wcharacter-conversion https://compiler-explorer.com/z/8j9qqe8MY
[P1161R3] Corentin Jabot. Deprecate uses of the comma operator in subscripting expressions https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1161r3.html