|Audience:||Library Working Group|
|Reply-to:||Tom Honermann <firstname.lastname@example.org>|
The support for char8_t as adopted for C++20 via P0482R6 [P0482R6] affects backward compatibility for existing C++17 programs in at least the following ways:
1. Introduction of char8_t as a new keyword and of new char8_t related names in the std namespace.
2. Change of the return type of the std::filesystem::path u8string and generic_u8string member functions.
3. Change of the type of u8 character and string literals from char based types to char8_t based types.
This paper does not further discuss case 1 above. Adding new keywords and new members to the std namespace is business as usual; see SD-8 [SD-8]. It is acknowledged that these additions will affect some code bases. Code surveys have found that these names have generally been used to emulate the set of features introduced with the adoption of P0482R6 [P0482R6]. In some cases, existing code has already been updated to adapt to the new standard features. For example, EASTL will now use the standard provided char8_t type when available instead of the type alias previously used. The pull request for this change can be found at https://github.com/electronicarts/EASTL/pull/239.
Case 2 above is a change that does not fit into the set of standard library rights reserved in SD-8 [SD-8]. This is a cause for concern, but is somewhat mitigated by the fact that std::filesystem is new with C++17 and therefore does not have a long history of use. Some options for dealing with this change are discussed later in this paper.
Case 3 above is the change responsible for most of the backward compatibility impact.
This paper is motivated by three goals:
The following table presents examples of well-formed C++17 code that is either ill-formed or behaves differently in C++20. The table also reflects the intended changes proposed in this paper. Note that most of these examples remain ill-formed with this proposal. This is intentional as the examples reflect problematic code that leads to mojibake in C++17 code due to use of the same type (char) for multiple encodings (execution encoding and UTF-8).
|Code||C++17||C++20 with P0482R6||C++20 with this proposal|
|Initializes p with the address of the UTF-8 encoded string.||Ill-formed.||Ill-formed.|
|Initializes a with the UTF-8 encoded string.||Ill-formed.||Ill-formed.|
|Initializes v with the result of calling operator ""_udl with the UTF-8 encoded string literal.||Ill-formed.||Ill-formed.|
|Initializes s with the UTF-8 encoded string.||Ill-formed.||Ill-formed.|
|Initializes s with the UTF-8 encoded representation of the file path stored in p.||Ill-formed.||Ill-formed.|
|Writes a sequence of UTF-8 code units as characters to stdout. (mojibake if the execution character encoding is not UTF-8)||Writes an integer or pointer value to stdout. (consistent with handling of char16_t and char32_t)||Ill-formed. (for all of char8_t, char16_t, and char32_t)|
|Constructs a std::filesystem::path object from the UTF-8 encoded string.||Ill-formed.||Constructs a std::filesystem::path object from the UTF-8 encoded string.|
u8 string literals were added in C++11, but support for u8 character literals was only added in C++17.
Code surveys have so far revealed little use of u8 literals. Google and Facebook have both reported fewer than 1000 occurrences in their code bases, approximately half of which occur in test code. Representatives of both organizations have stated that, given the actual size of their code bases, this is an insignificant number of occurrences. Likewise, Firefox contains around 100 occurrences; Mozilla engineers have reviewed this paper and have no concerns about addressing the existing uses with the remediation approaches discussed here.
Code surveys have been attempted on GitHub, but GitHub search does not distinguish uses of u8 as an identifier (which are quite common) from uses as a UTF-8 literal prefix. Further, GitHub does not provide a search that filters out duplicate hits for the same source code in different repositories. As a result, finding instances of u8 literals is challenging. Most cases that were identified were in tests included in clones of Clang and gcc.
Searches on Debian code search found uses in only a few packages and, with one exception discussed below, only a small number of uses (mostly single digit counts), most of which occurred in tests.
Survey results for Debian code search (https://codesearch.debian.net) follow. These exclude hits to gcc, clang, packages like fmt and eastl that have already been conditionalized for char8_t in C++20, and C code. Additionally, packages bundled with other packages are omitted to avoid double counting. Search hits reflect number of lines matching the search, not number of occurrences. Search results are categorized by hits in program source files vs hits in test source files.
There is one clear outlier in these search results. Chromium has two orders of magnitude more uses of u8 literals than any other package. The Chromium team was contacted to discuss this. Almost all of the existing uses appear in bulk data source files that the Chromium team has determined can be mechanically addressed.
|Searched for||Debian packages (out of ~18000)|
|char8_t||spring (emulates its own char8_t support)|
|u8string||libopenmpt (defines a mpt::u8string typedef of std::basic_string with a custom char_traits)
spring (defines a std::u8string class that derives from std::string)
chromium (~10310 hits, ~102 files)
firefox (97 hits, 1 file, 3 test files)
icu (83 hits, 1 file, 5 test files)
qbs (56 hits, 1 file, 1 test file)
mongodb (30 hits, 2 test files)
aseba (28 hits, 1 file)
monero (26 hits, 1 file, 1 test file)
nlohmann-json (21 hits, 1 file, 3 test files)
bambootracker (20 hits, 5 files)
capnproto (18 hits, 1 file, 1 test file)
lgogdownloader (11 hits, 1 file)
libosmium (10 hits, 3 test files)
cbmc (8 hits, 3 test files)
maim (8 hits, 1 file)
octave-ltfat (8 hits, 1 file)
praat (8 hits, 2 files)
slop (8 hits, 1 file)
mame (7 hits, 3 files)
nlohmann-json3 (7 hits, 3 test files)
sdcc (7 hits, 1 test file)
antlr4 (3 hits, 1 file)
keyman-keyboardprocessor (3 hits, 2 test files)
scram (3 hits, 2 test files)
tesseract (3 hits, 1 test file)
boost1.67 (2 hits, 1 test file)
freeorion (2 hits, 1 file)
supertux (2 hits, 1 file)
cjs (1 hit, 1 test file)
cpp-hocon (1 hit, 1 test file)
efl (1 hit, 1 file)
gjs (1 hit, 1 test file)
kate4 (1 hit, 1 example file)
kodi (1 hit, 1 test file)
libtcod (1 hit, 1 test file)
retroarch (1 hit, 1 file)
rtags (1 hit, 1 test file)
chromium (940 hits, 2 files, 2 test files)
kate4 (2 hits, 1 example file, 1 test file)
nlohmann-json (2 hits, 1 test file)
nlohmann-json3 (2 hits, 1 test file)
ksyntax-highlighting (1 hit, 1 test file)
ktexteditor (1 hit, 1 test file)
A single approach to addressing backward compatibility impact is unlikely to be the best approach for all projects. This section presents a number of options to address various types of backward compatibility impact. In some cases, the best solution may involve a mix of these options.
Each of these approaches assumes a requirement for continued use of UTF-8 encoded literals with char based types. For most projects, such a requirement is expected to be temporary while the project is fully migrated to C++20. However, some projects may retain a sustained need for such literals. For those projects, the Emulate C++17 u8 literals approach is able to address most cases of backward compatibility impact.
The simplest short-term solution is to disable the new features completely. Clang and gcc will allow disabling char8_t features in both the language and the standard library via a -fno-char8_t option. It is expected that Microsoft and EDG based compilers will offer a similar option.
This option should be considered a short-term solution to enable testing existing C++17 code compiled as C++20 with minimal effort. This isn't a viable long-term option as continued use would potentially complicate composition with code that depends on the new features.
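Code that has to build both with and without char8_t support can detect which mode is active. The following is a minimal sketch, assuming a C++17 or later compiler; the names char8_t_enabled and utf8_char are illustrative and do not come from P0482R6. The __cpp_char8_t feature test macro is only defined when language support for char8_t is enabled.

```cpp
#include <cassert>
#include <type_traits>

// __cpp_char8_t is predefined by the compiler when char8_t language support
// is active (for example, C++20 mode without -fno-char8_t).
#if defined(__cpp_char8_t)
inline constexpr bool char8_t_enabled = true;
#else
inline constexpr bool char8_t_enabled = false;
#endif

// The type of a u8 character literal: char8_t when the feature is enabled,
// char otherwise (the C++17 behavior).
using utf8_char = decltype(u8'x');

static_assert(char8_t_enabled != (std::is_same_v<utf8_char, char>),
              "u8 literals have type char exactly when char8_t is disabled");
```

An alias such as utf8_char lets shared headers name the literal type without committing to either language mode.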
Adding function overloads that accept char8_t based types is an effective step towards full migration to C++20. Ideally, older char based functions would eventually be removed.
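As a sketch of this approach, an existing char based function can gain a char8_t based overload that forwards to it. The function name count_code_points is hypothetical, and the example assumes the existing interface already expects UTF-8 data in its char based argument.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Existing char based interface; by convention of this (hypothetical) code
// base, the argument holds UTF-8 data.
inline std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        // Count every byte that is not a UTF-8 continuation byte (10xxxxxx).
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

#if defined(__cpp_lib_char8_t)
// New overload for char8_t based callers. It forwards to the char based
// implementation; accessing char8_t objects through char is permitted.
inline std::size_t count_code_points(std::u8string_view utf8) {
    return count_code_points(std::string_view(
        reinterpret_cast<const char*>(utf8.data()), utf8.size()));
}
#endif

inline void usage() {
    // Selects the char overload in C++17 and the char8_t overload in C++20.
    assert(count_code_points(u8"h\u00E9") == 2);
}
```

Once all callers pass char8_t based arguments, the forwarding direction can be reversed and the char based overload retired.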
This approach may be a reasonable option when the execution encoding is ASCII based (but not UTF-8; otherwise just use ordinary literals) and characters outside the basic source character set are infrequently used in existing u8 literals. This approach matches how code using UTF-8 had to be written prior to C++11.
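A minimal sketch of such a rewrite follows; the variable name is illustrative. Note that a hexadecimal escape sequence consumes every hexadecimal digit that follows it, so a literal may need to be split where the next character happens to be one of 0-9, a-f, or A-F.

```cpp
#include <cassert>
#include <cstring>

// "école" written with the UTF-8 code units of U+00E9 spelled out.
// C++17 code would have written u8"\u00E9cole". The literal is split after
// the escapes because 'c' would otherwise be read as part of "\xA9c".
constexpr char ecole[] = "\xC3\xA9" "cole";

// Two code units for U+00E9, four ASCII letters, and the terminating null.
static_assert(sizeof(ecole) == 7, "unexpected literal length");

inline void usage() {
    assert(static_cast<unsigned char>(ecole[0]) == 0xC3);
    assert(static_cast<unsigned char>(ecole[1]) == 0xA9);
}
```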
Common uses of u8 literals can be handled in a backward compatible manner through use of reinterpret_cast. Note that use of reinterpret_cast is well-formed in these situations since lvalues of type char may be used to access values of other types. Such code is valid in both C++17 and C++20.
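A minimal sketch of the casts involved, with illustrative function names. In C++17 both casts convert a type to itself; in C++20 they convert from the char8_t based type to the char based type.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Valid in C++17 (const char* to const char*) and in C++20
// (const char8_t* to const char*; char may alias char8_t objects).
inline const char* utf8_text() {
    return reinterpret_cast<const char*>(u8"text");
}

// For character literals a static_cast is sufficient.
inline char utf8_letter() {
    return static_cast<char>(u8'x');
}

inline void usage() {
    std::string s = utf8_text();
    assert(s == "text");
    assert(utf8_letter() == 'x');
}
```

Because reinterpret_cast is not permitted in constant expressions, utf8_text cannot be made constexpr.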
This approach may suffice when there are just a few uses of UTF-8 literals that need to be addressed and the uses do not appear in constexpr context. In general, sprinkling reinterpret_cast all over a code base is not desirable.
The techniques applied here are also applicable to the examples illustrated in the prior section regarding use of reinterpret_cast. This approach makes use of P0732R2 [P0732R2] to enable constexpr UTF-8 encoded char based literals using a user defined literal. The example code below defines overloaded character and string UDL operators named _as_char. These UDLs can then be used in place of existing UTF-8 character and string literals.
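The description above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the helper names char8_t_string_literal, as_char_buffer, and make_as_char_buffer, and the U8 macro, are chosen for this sketch, and the C++20 branch relies on compiler support for class types as non-type template parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

#if defined(__cpp_char8_t)
// Wrapper that allows a u8 string literal to be used as a template argument.
template <std::size_t N>
struct char8_t_string_literal {
    char8_t s[N]{};
    constexpr char8_t_string_literal(const char8_t (&r)[N]) {
        for (std::size_t i = 0; i < N; ++i) s[i] = r[i];
    }
};

// One static array of char per distinct literal, filled at compile time.
template <char8_t_string_literal L, std::size_t... I>
inline constexpr char as_char_buffer[sizeof...(I)] = {
    static_cast<char>(L.s[I])...};

template <char8_t_string_literal L, std::size_t... I>
constexpr auto& make_as_char_buffer(std::index_sequence<I...>) {
    return as_char_buffer<L, I...>;
}

// Character literals: u8'x'_as_char has type char.
constexpr char operator""_as_char(char8_t c) { return static_cast<char>(c); }

// String literals: u8"text"_as_char yields a const char (&)[N].
// sizeof(L.s) equals the element count because sizeof(char8_t) is 1.
template <char8_t_string_literal L>
constexpr auto& operator""_as_char() {
    return make_as_char_buffer<L>(std::make_index_sequence<sizeof(L.s)>());
}

#define U8(x) u8##x##_as_char
#else
// Without char8_t support, u8 literals already have char based types.
#define U8(x) u8##x
#endif

inline void usage() {
    const char* p = U8("text");
    assert(std::strcmp(p, "text") == 0);
}
```

With the macro, U8("text") and U8('x') produce char based values in both language modes.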
When wrapped in macros, the above UDL can be used to retain source compatibility across C++17 and C++20 for all known scenarios except for array initialization.
In C++17, arrays of char may be initialized with u8 string literals, but such initialization is ill-formed in C++20. C++17 behavior can be emulated by substituting a class type with appropriate class template argument deduction guides.
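One possible shape for such a class type is sketched below. It is a simplified illustration, assuming that the array bound is always taken from the initializing literal; the member names are chosen for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

template <std::size_t N>
struct char_array {
    char data[N]{};

    constexpr char_array(const char (&r)[N]) {
        for (std::size_t i = 0; i < N; ++i) data[i] = r[i];
    }
#if defined(__cpp_char8_t)
    // Accept u8 string literals when they have type const char8_t[N].
    constexpr char_array(const char8_t (&r)[N]) {
        for (std::size_t i = 0; i < N; ++i) data[i] = static_cast<char>(r[i]);
    }
#endif
    constexpr std::size_t size() const { return N; }
    constexpr char& operator[](std::size_t i) { return data[i]; }
    constexpr const char& operator[](std::size_t i) const { return data[i]; }
};

// Deduction guides: take the array bound from the initializing literal.
template <std::size_t N>
char_array(const char (&)[N]) -> char_array<N>;
#if defined(__cpp_char8_t)
template <std::size_t N>
char_array(const char8_t (&)[N]) -> char_array<N>;
#endif

inline void usage() {
    // Stands in for: char a[] = u8"text";
    char_array a = u8"text";
    assert(a.size() == 5);
    assert(std::strcmp(a.data, "text") == 0);
}
```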
Explicit conversion functions can be used, in a C++17 compatible manner, to cope with the change of return type to the std::filesystem::path member functions when a UTF-8 encoded path is desired in an object of type std::string. For example:
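A sketch of such conversion functions follows. The names from_u8string and utf8_path_string are illustrative; the overload set is selected with the __cpp_lib_char8_t feature test macro.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <utility>

// C++17: u8string() already returns std::string, so these pass it through.
inline std::string from_u8string(const std::string& s) { return s; }
inline std::string from_u8string(std::string&& s) { return std::move(s); }

#if defined(__cpp_lib_char8_t)
// C++20: u8string() returns std::u8string; copy the code units into a
// std::string without changing their values.
inline std::string from_u8string(const std::u8string& s) {
    return std::string(s.begin(), s.end());
}
#endif

inline std::string utf8_path_string(const std::filesystem::path& p) {
    return from_u8string(p.u8string());
}

inline void usage() {
    std::string s = utf8_path_string(std::filesystem::path("file.txt"));
    assert(s == "file.txt");
}
```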
This naturally incurs a cost when building with char8_t support enabled due to the need to copy the path contents.
Tooling could potentially assist programmers in migrating code. Several of the approaches discussed above could be applied mechanically to an existing code base. For example, re-writing existing u8 literals to ordinary literals with escape sequences, or adding an _as_char UDL suffix to existing literals (inserting include directives as needed).
The following sections summarize options that have been considered to reduce backward compatibility impact. Most of these options are not proposed in this paper because they would actively interfere with a goal of the char8_t proposal: to enable the type system to protect against inadvertent mixing of UTF-8 data and the execution encoding. However, some of these options may be useful for some code bases and could be provided by implementations as opt-in extensions.
Only two of these options (7 and 8) are proposed for inclusion in the standard. In both of these cases, the concern that is addressed was not specifically intended by the changes adopted in P0482R6. These are effectively bug fixes.
Many of the backward compatibility concerns could be avoided by reinstating u8 literals as having type char and introducing a new prefix, for example U8, to specify UTF-8 literals with type char8_t.
The visible difference between u8 and U8 is subtle. Some coding compliance standards, such as MISRA, forbid use of identifiers that differ only in case. It has been suggested that C++11's use of u and U to denote UTF-16 and UTF-32 literals was a mistake because the visual distinction is too subtle. To avoid these subtle visual differences, new literal prefixes such as utf8, utf16, and utf32 could be introduced and the old ones deprecated. The downside of these prefixes is, of course, that they are longer.
Implementing this option would continue enabling problems with encoding confusion that we see today. The execution encoding is not UTF-8 on some popular platforms and continuing to use char based types for execution encoding and UTF-8 (and other untrusted input or encodings) is a recipe for continued occurrences of mojibake in applications. For platforms that use UTF-8 as the execution encoding, ordinary literals are already UTF-8 encoded. This option would introduce three distinct ways of writing UTF-8 literals on such platforms; having two ways to do (almost) the same thing is usually one too many already.
Allowing implicit conversions from char8_t to char was considered with the original P0482 proposal. The concerns with this approach are the same as in option 1; this enables continued, potentially unintended, mixing of UTF-8 data with non-UTF-8 data resulting in mojibake.
Additionally, allowing implicit conversions would not address all compatibility concerns. For example:
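One illustration, which is a sketch and not necessarily the example originally given here: template argument deduction and decltype observe the type of a literal before any conversion is considered, so an implicit conversion from char8_t to char would not restore the C++17 results below. The names deduces_char and u8_literals_are_char are illustrative.

```cpp
#include <cassert>
#include <type_traits>

// Reports whether deduction sees a pointer to char.
template <typename T>
constexpr bool deduces_char(const T*) {
    return std::is_same_v<T, char>;
}

#if defined(__cpp_char8_t)
inline constexpr bool u8_literals_are_char = false;  // T is deduced as char8_t
#else
inline constexpr bool u8_literals_are_char = true;   // T is deduced as char
#endif

// Deduction uses the literal's own type; conversions play no part.
static_assert(deduces_char(u8"text") == u8_literals_are_char);
static_assert((std::is_same_v<decltype(u8'x'), char>) == u8_literals_are_char);
```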
However, such implicit conversions could still be useful for some existing code. Implementations could offer extensions to enable such conversions.
This option would allow the following code to remain well-formed in C++20.
Array initialization is the one context in which neither the previously discussed uses of reinterpret_cast nor the _as_char UDL is an option. This option would allow array initializations to remain well-formed and avoid the need for workarounds like the previously discussed char_array template. However, this option would continue to promote mixing of UTF-8 data with non-UTF-8 data potentially resulting in mojibake.
Implementations could allow these initializations as a conforming extension.
This option would enable use of the previously discussed _as_char UDL to initialize an array without the need for workarounds like the previously discussed char_array template. However, this option would continue to promote mixing of UTF-8 data with non-UTF-8 data potentially resulting in mojibake.
Implementations could allow these initializations as a conforming extension.
This option has been suggested as a way to allow some existing uses of std::string to hold UTF-8 data to remain valid in C++20. For example:
This option constitutes a narrow fix for a few specific use cases within a considerably larger problem space. Further, it would require changes to std::basic_string specifically for its char-based specializations. As with previously discussed options, this would again continue to promote mixing of UTF-8 data with non-UTF-8 data potentially resulting in mojibake.
This option has been suggested as a means to address the backward compatibility impact due to the changes to the std::filesystem::path u8string and generic_u8string member functions. It would allow code like the following to continue to work as expected:
This option is, again, not proposed because it would allow unintended mixing of UTF-8 encoded data and the execution character encoding.
An unintended and silent behavioral change was introduced with the adoption of P0482R6. In C++17, the following code wrote the code units of the literals to stdout. In C++20, this code now writes the character literal as a number, and the address of the string literal, to stdout.
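The affected statements have the form std::cout << u8'x' and std::cout << u8"text". Where the intent is to write the code units unchanged, a rewrite along the following lines produces the C++17 output in either language mode. This is a sketch; write_utf8 is an illustrative name.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

inline void write_utf8(std::ostream& os) {
    // In C++17 these casts convert a type to itself. In C++20 they select
    // the char based inserters instead of the char8_t based operands.
    os << static_cast<char>(u8'x');
    os << reinterpret_cast<const char*>(u8"text");
}

inline void usage() {
    std::ostringstream os;
    write_utf8(os);
    assert(os.str() == "xtext");
}
```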
This is a surprising change that provides no benefit to programmers. Adding deleted ostream inserters would avoid this surprising behavioral change while reserving the possibility to specify behavior for these operations in the future (for example, to specify implicit transcoding to the execution encoding).
Another unintended behavioral change introduced with the adoption of P0482R6 is that the following code is now ill-formed because std::filesystem::u8path requires a range or pair of iterators specifically with a value type of char.
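The affected call has the form std::filesystem::u8path(u8"filename"). As a sketch of a spelling that works in both language modes, the C++20 branch below uses the std::filesystem::path constructor, which treats a char8_t based source as UTF-8 encoded. The function name make_utf8_path is illustrative.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

inline std::filesystem::path make_utf8_path() {
#if defined(__cpp_lib_char8_t)
    // C++20: a char8_t based source is interpreted as UTF-8.
    return std::filesystem::path(u8"filename");
#else
    // C++17: u8path interprets a char based source as UTF-8.
    return std::filesystem::u8path(u8"filename");
#endif
}

inline void usage() {
    assert(make_utf8_path().string() == "filename");
}
```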
std::filesystem::u8path is now deprecated, but since it previously required UTF-8 data, there is no risk of encoding confusion (unlike with many of the other options discussed in this paper). Allowing it to continue to be called with u8 literals (or other char8_t based ranges and iterators) causes no harm other than potentially encouraging continued use of a deprecated interface.
These changes are relative to N4762 [N4762]
Change in table 35 of 16.3.1 [support.limits.general] paragraph 3:
Table 35 — Standard library feature-test macros
|Macro name||Value||Header(s)|
|[…]||[…]||[…]|
|__cpp_lib_char8_t||201811** placeholder **||<atomic> <filesystem> <istream> <limits> <locale> <ostream> <string> <string_view>|
|[…]||[…]||[…]|
Drafting note: the final value for the __cpp_lib_char8_t feature test macro will be selected by the project editor to reflect the date of approval.
Append new paragraphs in [ostream.inserters.character]:
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, wchar_t c) = delete;
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char8_t c) = delete;
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char16_t c) = delete;
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char32_t c) = delete;
6. [ Note: These overloads prevent formatting character values as numeric values. — end note ]
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const wchar_t* s) = delete;
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char8_t* s) = delete;
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char16_t* s) = delete;
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char32_t* s) = delete;
7. [ Note: These overloads prevent formatting strings as pointer values. — end note ]
Change in C.5.11 [diff.cpp17.input.output] paragraph 2:
Affected subclause: [ostream.inserters.character]
Change: Overload resolution for ostream inserters used with UTF-8 literals.
Rationale: Required for new features.
Effect on original feature: Valid ISO C++ 2017 code that passes UTF-8 literals to basic_ostream::operator<< no longer calls character related overloads.
std::cout << u8"text"; // Previously called operator<<(const char*) and printed a string.
                       // Now calls operator<<(const void*) and prints a pointer value.
std::cout << u8'X';    // Previously called operator<<(char) and printed a character.
                       // Now calls operator<<(int) and prints an integer value.
Add a new paragraph after C.5.11 [diff.cpp17.input.output] paragraph 2:
Affected subclause: [ostream.inserters.character]
Change: Overload resolution for ostream inserters used with wchar_t, char16_t, and char32_t types.
Rationale: Removal of surprising behavior.
Effect on original feature: Valid ISO C++ 2017 code that passes wchar_t, char16_t, and char32_t characters or strings to basic_ostream<char, ...>::operator<< is now ill-formed.
std::cout << u"text"; // Previously called operator<<(const void*) and printed a pointer value.
                      // Now ill-formed.
std::cout << u'X';    // Previously called operator<<(int) and printed an integer value.
                      // Now ill-formed.
Change in D.16 [depr.fs.path.factory] paragraph 1:
Requires: The source and [first, last) sequences are UTF-8 encoded. The value type of Source and InputIterator is char. Source meets the requirements specified in [fs.path.req].
"Working Draft, Standard for Programming Language C++", N4762, 2018.
"Permit conversions to arrays of unknown bound", P0388R2, 2018.
"char8_t: A type for UTF-8 characters and strings (Revision 6)", P0482R6, 2018.
Jeff Snyder and Louis Dionne,
"Class Types in Non-Type Template Parameters", P0732R2, 2018.
"SD-8: Standard Library Compatibility", SD-8, 2018.