"In these meetings, these conferences, we only see a little. C++ is not done in the light. The majority of C++ is not done publicly. Most C++ is done privately, in the dark, and that is where it matters most."
– Daniela K. Engert, November 14th, 2019
1. Revision History
1.1. Revision 1 - March 2nd, 2020
- Thoroughly improve § 2 Motivation.
  - Explicitly state goals and non-goals in the § 2.3 Statement of Objectives.
- Rewrite most of the paper to more thoroughly explain the API, especially the § 3.3 High Level section with validate, decode_count, and encode_count.
  - Drastically improve the explanation for the free functions in § 3.3.1 Eager Free Functions.
  - Emphasize the need for ranges in § 3.3.3 Improving Usability for Low-Memory Environments: Ranges.
- Add new descriptions in the low-level API regarding error handling in § 3.2.2.2 Error Handling: Allow All The Options.
- Describe customization points in full in § 3.4.1 Speed and Flexibility for Everyone: Customization Points.
- The Implementation is now hidden, after doing a magic trick. Contact the author for access.
- Add § 5 FAQ.
- Going no-where, targeted at no-one.
1.2. Revision 0 - June 17th, 2019
- 
     Initial release of exploratory paper. 
2. Motivation
It’s 2020 and Unicode is still barely supported in both the C and C++ standards.
From the POSIX standard requiring a single-byte encoding by default, heavy limitations placed in 
This paper aims to explore the design space for both extremely high performing transcoding (encoding and decoding) as well as a flexible one-by-one interface for more careful and meticulous text processing. This proposal arises from industry experience in large codebases and best-practice open source explorations with [libogonek], [icu], [boost.text] and [text_view] while also building on the concepts and design choices found in both [range-v3] and pre-existing text encoding solutions such as Windows’s 
The ultimate goal is to allow an interface that is correct by default but capable of being fast, both through Standard Library implementer efforts and through program-overridable ADL free functions. It will produce interfaces for encoding, decoding, and transcoding in eager and lazy forms.
2.1. The Basic Ideas
While some of these types aren’t contained in this paper, the end goal is to enable the following to be possible:
#include <encoding> // this proposal
#include <text>     // future proposal
#include <cstdint>
#include <iostream>
#include <string_view>

int main(int, char*[]) {
    using namespace std::literals;
    std::text::u8text my_text = std::text::transcode("안녕하세요 👋"sv, std::text::utf8{});
    std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable console
    std::cout << std::hex;
    for (const auto& cp : my_text) {
        std::cout << static_cast<uint32_t>(cp) << " ";
    }
    // 0000c548 0000b155 0000d558 0000c138 0000c694 00000020 0001f44b
    return 0;
}
This paper is in support of reaching this goal. The following examples are more concretely tied to this proposal in particular.
2.1.1. Reading "Execution Encoding" Data
The following is an example of opening a file handle on Windows after converting from the execution encoding of the system 
#define WIN32_LEAN_AND_MEAN 1
#include <windows.h>
#include <encoding> // this proposal
#include <iostream>
#include <memory>
#include <string>
#include <string_view>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Path unspecified: exiting." << std::endl;
        return -1;
    }
    std::wstring path_as_wstr = std::text::transcode(
        std::string_view(argv[1]), std::text::wide_execution{});
    // Interop with Windows
    // FileHandleDeleter is not shown here; it is assumed to be
    // defined elsewhere with HANDLE as its pointer type
    std::unique_ptr<HANDLE, FileHandleDeleter> target_file(
        CreateFileW(path_as_wstr.data(), GENERIC_WRITE, 0, NULL,
            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL));
    if (!target_file) {
        // GetLastError(), etc...
        return -2;
    }
    /* Use File... */
    return 0;
}
This paper directly enables such a use case.
2.1.2. Networking with Boost.Beast
The following is an example using this proposal to do a byte-based read off the network of a UTF-16 Big Endian payload on any machine.
#include <boost/beast.hpp>
#include <boost/beast/http.hpp>
#include <boost/asio/ip/tcp.hpp>
#include <bit>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <memory>
#include <span>
#include <string>
#include <vector>
#include <encoding> // this proposal

namespace beast = boost::beast;
namespace http = beast::http;
using tcp = boost::asio::ip::tcp;
using results_type = tcp::resolver::results_type;

class session : public std::enable_shared_from_this<session> {
    /* ... */
    http::request<http::empty_body> req_;
    std::vector<std::byte> res_body_;
    http::response<http::vector_body<std::byte>> res_;
    std::u8string converted_body_;
    /* ... */
    void on_connect(beast::error_code ec, results_type::endpoint_type);
    void on_resolve(beast::error_code ec, results_type results);
    /* ... */
    void on_read(beast::error_code ec, std::size_t bytes_transferred) {
        if (ec) {
            log_fail(ec, u8"read failed");
            return;
        }
        std::span<std::byte> bytes(res_body_.data(), bytes_transferred);
        std::ranges::unbounded_view output(std::back_inserter(converted_body_));
        // utf16, but big endian
        std::text::encoding_scheme<std::text::utf16, std::endian::big> from_encoding{};
        std::text::utf8 to_encoding{};
        // transcode from bytes that are UTF16, Big Endian,
        // into unbounded output
        std::text::transcode(bytes, output, from_encoding, to_encoding);
        std::clog << converted_body_ << std::endl;
        /* Commit / clean up, etc. */
    }
};
This paper directly enables such a use case.
2.2. Current Problems
I don’t write any software which runs only in English. I’m tired of writing the same code different ways all the time just to display a handful of strings. Lately, I just skip C++ for anything that displays UI -- it’s so much easier in every other modern language.
This is REQUIRED for using C++ with any software which needs to run in multiple languages, without rolling your own code. I’m tired of writing this from scratch for every separate project (cannot share code for most of them), using different underlying libraries for each (as licensing and processing requirements vary, I can’t just pick one library and use it everywhere). Unfortunately, I have no confidence the ISO committee understands the problem well enough, given how it patted itself on the back so much for adding u8"", u"", and U"" a while back. Real-world software which runs in multiple languages never hard-codes strings...
Norway has its own character set which is a variant of ISO-8859-10 with modifications to a couple of characters. This proposal would ease the transition for existing software when C++ gets (better/more coherent) support for Unicode.
The standard : "Oh yeah hey dudes codecvt is deprecated but we didn’t feel like writing an alternative so good luck yolo".
– Herb Sutter’s "Top 5 C++ Proposals" Survey, Survey Respondent
Text in the Standard is a desert wasteland.
After pulling 
The use cases for text encoding are vast. From: basic processing of user-entered data; sanitization of scripts; domain name protection in browsers; text conversions when working with legacy systems or differing new/Unicode systems; supplying the components that can be successfully used with industry-standard FreeType/Harfbuzz and DirectWrite; talking properly to legacy GDI applications; communicating string data in JSON; receiving market data from the Chinese Exchange in GB18030; converting and preserving government data in digital records; handling data generated by logs in a multitude of languages; handling user names without mangling; and hundreds of dozens of other use cases, the need for text practically writes itself.
2.3. Statement of Objectives
Part of this proposal is identifying exactly how those needs should be served. The primary objectives of this proposal, therefore, are as follows:
- 
     Users should be able to define their own encodings for their own needs. Jonathan Wakely’s time is not worth EBCDIC, but IBM will certainly be very invested in making sure EBCDIC and its code pages are well-implemented and optimized. Put another way: company-specific and user-specific problems should be specific to them and not exported to the whole ecosystem, and they should be able to handle their problems effectively and efficiently without throwing the C++ Standard in the trash. 
- 
     Locale-based char wchar_t 
- 
     The standard library should be able to cannibalize all existing legacy encodings and -- by way of leading design -- encourage and promote the use of Unicode in the user’s code. Embrace. Extend. Extinguish. 
- 
     The standard library (and its implementers) do not have time to implement every new, old, and existing encoding. Put bluntly: CJ Johnson’s brilliance and Stephan T. Lavavej’s passion are better spent improving their respective libraries and fixing bugs, not implementing EBCDIC or ISO/IEC 2022 CN, extended variant 2. 
- 
     Unicode is the one and only language the standard speaks in its higher level text algorithms and functionality: legacy encodings must convert to Unicode to work with functionality built beyond this proposal. Future proposals will never need to concern themselves with encodings after this proposal is done. 
- 
     Users may choose not to convert to Unicode, but they will need to spend the time and effort working out that trade off with their environment. The standard library will never have to care about text that willingly and deliberately exits the Unicode system. 
- 
     Safety is not optional. Code that performs unsafe operations should require explicit opt-in and easily searchable patterns and names that make it clear the user has made a deliberate choice to open themselves up to vulnerabilities such as Undefined Behavior. 
- 
     Performance is not optional, and correctness isn’t a tender suggestion achievable with insane workarounds. 
- 
     Simple function calls should be simple, but if the user wants to pry open the details they should be able to do so incrementally with ease. 
- 
     Nobody has time to reimplement all of iconv, especially the library developers. The interface should allow implementers to substitute a backend for certain encodings that takes advantage of pre-existing Operating System, Widely-Available Library, or similar functionality. 
- 
     Users should be able to do everything implementers can without undue clash between user functionality and implementer internal handling and extensions. 
- 
     Octets -- delivered over the network, from IPC, or similar -- are an important input case that must be handled. 
- 
     The design must be viable for low-memory environments, and prioritize zero allocation if a user cares enough to invest the time into the API with that goal. 
- 
     At no point should we be introducing new container types for this functionality. Container wrappers / adaptors and range wrapper / adaptors are enough. 
3. Design
The current design is the culmination of a few years of collaborative and independent research, starting with the earliest papers from Mark Boyall’s [n3574], Tom Honermann’s [p0244r2], study of ICU’s interface, and finally the musings, experience and work of R. Martinho Fernandes in [libogonek]. Current and future optimizations are considered to ensure that fast paths are not blocked in the interface proposed for standardization. With [boost.text] showing an interface with a nailed-down, internally used UTF-8 encoding, Markus Scherer’s participation in SG16 meetings, Henri Sivonen’s feedback on blog posts and mailing lists, and Bob Steagall’s work in writing a fast UTF8 decoder, this paper absorbs a wealth of knowledge to reach a flexible interface that enables high throughput.
In reading, implementing, working with and consuming all of these designs, the author of this paper, independent implementers, and several SG16 members have come to the following core tenets:
- 
     strong types for code units allow selecting proper default encodings for these interfaces; 
- 
     iterators and ranges are a huge interface win for working with text, but on their own cannot provide the fastest possible way to encode/decode/transcode text; 
- 
     and, avoid creating new vocabulary: improve working with original containers and impose well-formedness constraints upon them rather than designing new containers from the ground up. 
Given these tenets, the following interface choices have arisen for this paper. Each section will describe a piece of the interface, its goals, and how it works. A low-level encoding interface and its plumbing and core types will be described first, followed by a high-level interface that makes the low level easy to use. Both are needed to cover the full design space and the use cases that exist today.
3.1. Definitions
Some handy definitions here, which will be liberally applied to template parameters and other things to shorten the specification.
- 
     Unicode Code Point: the 21-bit value (often represented as a 32-bit number for implementation-related reasons) that represents a code point from the Unicode Standard. Specifically, it is the range of integers 0 to 0x10FFFF inclusive. 
- 
     Unicode Scalar Value: the 21-bit value that represents a code point from the Unicode Standard, but without Surrogate Unicode Code Point values. Specifically, it is the ranges of integers 0 to 0xD7FF and 0xE000 to 0x10FFFF inclusive. 
- 
     unicode_code_point char32_t 
- 
     unicode_scalar_value char32_t 
- 
     using UEncoding = std :: remove_cvref_t < Encoding > Encoding 
- 
     using UFromEncoding = std::remove_cvref_t<FromEncoding> FromEncoding 
- 
     using UToEncoding = std::remove_cvref_t<ToEncoding> ToEncoding 
- 
     template <typename T> using encoding_state_t = typename std::remove_cvref_t<T>::state; 
- 
     template <typename T> using encoding_code_unit_t = typename std::remove_cvref_t<T>::code_unit; 
- 
     template <typename T> using encoding_code_point_t = typename std::remove_cvref_t<T>::code_point; code_point T 
- 
     is_self_state_encoding_v<T> 
template <typename T>
inline constexpr bool is_self_state_encoding_v
    = std::is_same_v<std::remove_cvref_t<T>, encoding_state_t<T>>;
- 
     range_of<T> value_type T std::vector<int> int[1] const range_of<int> auto& 
template <typename R, typename T>
concept range_of = std::ranges::range<std::remove_cvref_t<R>>
    && std::is_same_v<std::ranges::range_value_t<std::remove_cvref_t<R>>, T>;
- 
     contiguous_range_of<T> value_type T std::span<double> double[1] const contiguous_range_of<double> auto& 
template <typename R, typename T>
concept contiguous_range_of = std::ranges::contiguous_range<std::remove_cvref_t<R>>
    && std::is_same_v<std::ranges::range_value_t<std::remove_cvref_t<R>>, T>;
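The two numeric definitions at the top of this list (Unicode Code Point and Unicode Scalar Value) translate directly into range checks. The following is a minimal sketch rather than anything specified by this paper: the namespace sketch and the helper names is_unicode_code_point and is_unicode_scalar_value are hypothetical, chosen only for illustration.

```cpp
#include <cassert>

namespace sketch {
    // Hypothetical helper: true for any value in [0, 0x10FFFF].
    constexpr bool is_unicode_code_point(char32_t value) noexcept {
        return value <= 0x10FFFF;
    }

    // Hypothetical helper: a code point that is not a surrogate,
    // i.e. a value in [0, 0xD7FF] or [0xE000, 0x10FFFF].
    constexpr bool is_unicode_scalar_value(char32_t value) noexcept {
        return value <= 0xD7FF || (value >= 0xE000 && value <= 0x10FFFF);
    }

    // A lone surrogate is a code point, but not a scalar value.
    static_assert(is_unicode_code_point(char32_t{0xD800}), "surrogates are code points");
    static_assert(!is_unicode_scalar_value(char32_t{0xD800}), "surrogates are not scalar values");

    inline void usage_example() {
        assert(is_unicode_scalar_value(U'\uFFFD'));         // replacement character
        assert(!is_unicode_code_point(char32_t{0x110000})); // one past the last code point
    }
}
```

The only difference between the two predicates is the surrogate gap, which is why a type such as unicode_scalar_value can carry a stronger guarantee than a plain char32_t.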
3.2. Low-Level
The high-level interfaces must be built on something: they cannot be magically willed into existence. There is quite a bit of plumbing that goes into the low-level interfaces, most of which will be boilerplate to users but will be of keen use and importance to several library developers and standard library implementers.
3.2.1. Error Codes
There is some boilerplate that needs to be taken care of before building our encoding, decoding, transcoding and similar functionality. First and foremost are the error codes and result types that will go in and out of our encoding functions. The error code enumeration is 
namespace std { namespace text {

    enum class encoding_errc : int {
        // just fine
        ok = 0x00,
        // input contains ill-formed sequences
        invalid_sequence = 0x01,
        // input contains incomplete sequences
        incomplete_sequence = 0x02,
        // output cannot receive all the completed
        // code units
        insufficient_output_space = 0x03,
        // sequence can be encoded but resulting
        // code point is invalid (e.g., encodes a lone surrogate)
        invalid_output = 0x04,
        // input contains overlong encoding sequence
        // (e.g. for utf8)
        overlong_sequence = 0x05,
        // leading code unit is wrong
        invalid_leading_sequence = 0x06,
        // leading code units were correct, trailing
        // code units were wrong
        invalid_trailing_sequence = 0x07
    };

}}
The comments give a few examples of what each one means. The reason 0 is used to signal success is very simple: the next part of the API creates an encoding_error_category class and hooks up the machinery for a 
namespace std {

    // a struct, so that the true_type base is publicly accessible
    template <>
    struct is_error_condition_enum<text::encoding_errc> : true_type {};

    class encoding_error_category : public error_category {
    public:
        constexpr encoding_error_category() noexcept;
        virtual const char* name() const noexcept override;
        virtual string message(int condition) const override;
    };

}
This allows the creation of a 
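To make the role of the zero value concrete, the following sketch wires a trimmed copy of the enumeration into std::error_condition using only facilities that already exist in <system_error>. It is an illustration under stated assumptions, not the proposed wording: the namespace sketch, the encoding_category() accessor, and the message strings are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <system_error>

namespace sketch {
    // Trimmed copy of the enumeration; 0 must mean success.
    enum class encoding_errc : int {
        ok = 0x00,
        invalid_sequence = 0x01,
        incomplete_sequence = 0x02,
        insufficient_output_space = 0x03
    };
}

namespace std {
    // Opt the enumeration in to implicit conversion to std::error_condition.
    template <>
    struct is_error_condition_enum<sketch::encoding_errc> : true_type {};
}

namespace sketch {
    class encoding_error_category : public std::error_category {
    public:
        const char* name() const noexcept override { return "encoding"; }

        std::string message(int condition) const override {
            switch (static_cast<encoding_errc>(condition)) {
            case encoding_errc::ok: return "ok";
            case encoding_errc::invalid_sequence: return "ill-formed sequence";
            case encoding_errc::incomplete_sequence: return "incomplete sequence";
            case encoding_errc::insufficient_output_space: return "insufficient output space";
            }
            return "unknown encoding error";
        }
    };

    // Hypothetical accessor for the single category instance.
    inline const std::error_category& encoding_category() noexcept {
        static encoding_error_category category;
        return category;
    }

    // Found through argument-dependent lookup when the enum is converted.
    inline std::error_condition make_error_condition(encoding_errc e) noexcept {
        return std::error_condition(static_cast<int>(e), encoding_category());
    }

    inline void usage_example() {
        std::error_condition fine = encoding_errc::ok;
        std::error_condition bad = encoding_errc::invalid_sequence;
        assert(!fine); // a value of 0 converts to false
        assert(bad);   // any other value converts to true
    }
}
```

Because std::error_condition converts to bool by comparing its value against zero, choosing ok = 0 lets callers write if (result.error()) without naming the enumeration at all.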
3.2.2. Result Types
The result types are the glue that helps users of the low-level interface loop through their text properly. They return updated ranges of both the input and output to indicate how far things have been moved along, on top of an error_code and whether or not the result came from an error being handled:
namespace std { namespace text {

    template <typename Input, typename Output, typename State>
    class encode_result {
    public:
        Input input;
        Output output;
        State& state;
        encoding_errc error_code;
        bool handled_error;

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr encode_result(InRange&& input, OutRange&& output, EncodingState&& state,
            encoding_errc error_code = encoding_errc::ok);

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr encode_result(InRange&& input, OutRange&& output, EncodingState&& state,
            encoding_errc error_code, bool handled_error);

        constexpr std::error_condition error() const;
    };

    template <typename Input, typename Output, typename State>
    class decode_result {
    public:
        Input input;
        Output output;
        State& state;
        encoding_errc error_code;
        bool handled_error;

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr decode_result(InRange&& input, OutRange&& output, EncodingState&& state,
            encoding_errc error_code = encoding_errc::ok);

        template <typename InRange, typename OutRange, typename EncodingState>
        constexpr decode_result(InRange&& input, OutRange&& output, EncodingState&& state,
            encoding_errc error_code, bool handled_error);

        constexpr std::error_condition error() const;
    };

    template <typename Input, typename Output, typename FromState, typename ToState>
    class transcode_result {
    public:
        Input input;
        Output output;
        FromState& from_state;
        ToState& to_state;
        encoding_errc error_code;
        bool handled_error;

        template <typename InRange, typename OutRange, typename FromEncodingState, typename ToEncodingState>
        constexpr transcode_result(InRange&& input, OutRange&& output,
            FromEncodingState&& from_state, ToEncodingState&& to_state,
            encoding_errc error_code = encoding_errc::ok);

        template <typename InRange, typename OutRange, typename FromEncodingState, typename ToEncodingState>
        constexpr transcode_result(InRange&& input, OutRange&& output,
            FromEncodingState&& from_state, ToEncodingState&& to_state,
            encoding_errc error_code, bool handled_error);

        constexpr std::error_condition error() const;
    };

    template <typename Input, typename State>
    struct validate_result {
        Input input;
        bool valid;
        State& state;

        template <typename ArgInput, typename ArgState>
        constexpr validate_result(ArgInput&& input, bool is_valid, ArgState&& state);
    };

    template <typename Input, typename State>
    struct count_result {
        Input input;
        size_t count;
        State& state;
        encoding_errc error_code;
        bool handled_error;

        template <typename ArgInput, typename ArgState>
        constexpr count_result(ArgInput&& input, size_t count, ArgState&& state,
            encoding_errc error_code = encoding_errc::ok);

        template <typename ArgInput, typename ArgState>
        constexpr count_result(ArgInput&& input, size_t count, ArgState&& state,
            encoding_errc error_code, bool handled_error);
    };

}}
There is a lot to unpack here. There are two essentially identical structures: 
Note: Having 2 differently-named types with much the same interface is paramount to allow an 
3.2.2.1. Input and Output Ranges
These are essentially the ranges, moved forward as much or as little as the encoding needed in order to read from the input, convert, and write to the output. This also solves the problem of obtaining maximal speed based on checking if the destination is filled or if the input is exhausted: 
The decoding result and encoding result types both return the input and output range passed to encoding and decoding functions in the structure itself. This represents the changed ranges. In the event that the range cannot be successfully reconstructed from itself using the iterator and sentinel, a 
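The practical effect of returning the advanced ranges is that a caller can feed a result straight back into the next call. The sketch below shows that loop for a toy ASCII-only decoder. Everything in it is an assumption made for illustration: the output_window type, the simplified decode_result (no state, no handled_error), and the function names are hypothetical stand-ins, not the proposed API.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

namespace sketch {
    enum class encoding_errc { ok, invalid_sequence, insufficient_output_space };

    // Hypothetical stand-in for an output range: a [first, last) pointer pair.
    struct output_window {
        char32_t* first;
        char32_t* last;
        bool empty() const noexcept { return first == last; }
    };

    // Simplified stand-in for decode_result.
    struct decode_result {
        std::string_view input;  // the unread remainder of the input
        output_window output;    // the unwritten remainder of the output
        encoding_errc error_code;
    };

    // Decodes at most one ASCII code unit into one code point.
    inline decode_result decode_one_ascii(std::string_view input, output_window output) noexcept {
        if (input.empty()) {
            return { input, output, encoding_errc::ok };
        }
        if (output.empty()) {
            return { input, output, encoding_errc::insufficient_output_space };
        }
        const unsigned char unit = static_cast<unsigned char>(input.front());
        if (unit > 0x7F) {
            return { input, output, encoding_errc::invalid_sequence };
        }
        *output.first = static_cast<char32_t>(unit);
        ++output.first;
        input.remove_prefix(1);
        return { input, output, encoding_errc::ok };
    }

    // Feeds the returned (advanced) ranges back in until the input is
    // exhausted or an error is reported.
    inline decode_result decode_all_ascii(std::string_view input, output_window output) noexcept {
        decode_result result{ input, output, encoding_errc::ok };
        while (!result.input.empty()) {
            result = decode_one_ascii(result.input, result.output);
            if (result.error_code != encoding_errc::ok) {
                break;
            }
        }
        return result;
    }

    inline void usage_example() {
        char32_t buffer[2] = {};
        decode_result result = decode_all_ascii("abc", { buffer, buffer + 2 });
        // Two code points fit; the third stays unread in result.input.
        assert(result.error_code == encoding_errc::insufficient_output_space);
        assert(result.input == "c");
        assert(buffer[0] == U'a' && buffer[1] == U'b');
    }
}
```

When the output fills up, the caller still knows exactly where to resume, because the unread input comes back in the result instead of being lost inside the decoder.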
3.2.2.2. Error Handling: Allow All The Options
This is a low-level interface. As such, accommodating different error handling strategies is necessary. There are several ways to report errors used in both the C and C++ standard libraries, from throwing errors, to 
To accommodate the wide breadth of C++ programming environments and ecosystems, error reporting will be done through an error handler, which can be any type of callable that matches the desired interface. The standard will provide 4 of these error handlers:
namespace std { namespace text {

    class replacement_handler;
    class throw_handler;
    class assume_valid_handler;
    class default_handler;

}}
The interface for an error handler looks like the below example error handler:
namespace std { namespace text {

    class example_error_handler {
    public:
        // called when an encode operation fails
        template <typename Encoding, typename InputRange, typename OutputRange, typename State,
            contiguous_range_of<encoding_code_point_t<Encoding>> Progress>
        constexpr auto operator()(const Encoding& encoding,
            encode_result<InputRange, OutputRange, State> result,
            const Progress& progress) const {
            /* morph result, log, throw error, etc. ... */
            return result;
        }

        // called when a decode operation fails
        template <typename Encoding, typename InputRange, typename OutputRange, typename State,
            contiguous_range_of<encoding_code_unit_t<Encoding>> Progress>
        constexpr auto operator()(const Encoding& encoding,
            decode_result<InputRange, OutputRange, State> result,
            const Progress& progress) const {
            /* morph result, log, throw error, etc. ... */
            return result;
        }
    };

}}
The specification here is a value-based one. 
There are a few things that can be done in the commented code shown above. First and foremost is that someone could look at 
Note: Throwing is explicitly not recommended by default by prominent vendors and implementers (Mozilla, Apple, the Unicode Consortium, WHATWG, etc.). Ill-formed text is common. Text from misbehaving programs -- 40 years of them -- is a frequent kind of user and machine input. It is extremely easy to provoke a Denial of Service Attack (DoS Attack) if an application throws an error on malformed input that the application author did not consider.
The default error handler will be the 
The 
- On a failure in decode_one 
  - If the output is at its end, return the result as-is. 
  - If the expression decltype(auto) replacement_points = encoding.replacement_code_points(); replacement_points replacement_points 
  - Otherwise, if the code_point char32_t unicode_code_point unicode_scalar_value { U'\uFFFD' } 
  - Otherwise, if the expression decltype(auto) replacement_units = encoding.replacement_code_units(); replacement_units auto intermediate_result = encoding.decode_one(replacement_units, result.output, /* implementation-defined pass-through handler */, result.state); intermediate_result.error_code std::text::encoding_errc::ok output 
  - Otherwise, the program is ill-formed. 
- On a failure in encode_one 
  - If the output is at its end, return the result 
  - If the expression decltype(auto) replacement_units = encoding.replacement_code_units(); replacement_units result result.output replacement_units 
  - Otherwise, if the code_point char32_t unicode_code_point unicode_scalar_value { U'\uFFFD' } 
  - Otherwise, if the expression decltype(auto) replacement_points = encoding.replacement_code_points(); replacement_points auto intermediate_result = encoding.encode_one(replacement_points, result.output, /* implementation-defined pass-through handler */, result.state); intermediate_result.error_code std::text::encoding_errc::ok result.output 
  - Otherwise, the program is ill-formed. 
If successful, the error code on the result will be corrected to say "everything is fine" (
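As a rough illustration of the value-based handler idea, the sketch below pairs a toy ASCII decoder with two callables: one that substitutes U+FFFD and marks the error as handled, and one that passes the error through. This is a simplified model under loud assumptions, not the specified behavior: the types here drop the encoding and progress parameters, the output is a growing std::u32string, and skipping exactly one offending code unit is a policy chosen for this example only.

```cpp
#include <cassert>
#include <string>
#include <string_view>

namespace sketch {
    enum class encoding_errc { ok, invalid_sequence };

    // Simplified stand-in for decode_result.
    struct decode_result {
        std::string_view input;
        std::u32string* output;
        encoding_errc error_code;
        bool handled_error;
    };

    // Hypothetical handler: writes U+FFFD, skips the offending code unit,
    // and reports success while recording that an error was handled.
    struct replacement_like_handler {
        decode_result operator()(decode_result result) const {
            result.output->push_back(U'\uFFFD');
            if (!result.input.empty()) {
                result.input.remove_prefix(1);
            }
            result.error_code = encoding_errc::ok;
            result.handled_error = true;
            return result;
        }
    };

    // Hypothetical handler: returns the failing result untouched.
    struct pass_through_handler {
        decode_result operator()(decode_result result) const { return result; }
    };

    // Toy ASCII decoder that defers every failure to the supplied handler.
    // A handler that reports ok must also advance the input, as above.
    template <typename Handler>
    decode_result decode_ascii(std::string_view input, std::u32string& output, Handler&& handler) {
        decode_result result{ input, &output, encoding_errc::ok, false };
        while (!result.input.empty()) {
            const unsigned char unit = static_cast<unsigned char>(result.input.front());
            if (unit > 0x7F) {
                result.error_code = encoding_errc::invalid_sequence;
                result = handler(result);
                if (result.error_code != encoding_errc::ok) {
                    return result; // the handler declined to recover
                }
                continue;
            }
            output.push_back(static_cast<char32_t>(unit));
            result.input.remove_prefix(1);
        }
        return result;
    }

    inline void usage_example() {
        std::u32string text;
        decode_result result = decode_ascii("a\xFF" "b", text, replacement_like_handler{});
        assert(result.error_code == encoding_errc::ok);
        assert(result.handled_error);
        assert(text == U"a\uFFFDb");
    }
}
```

The handled_error flag is what lets a caller distinguish "the text was clean" from "the text was repaired", even though both report ok.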
For performance reasons and flexibility, the error callable must have a way to ensure that the user and implementation can agree on whether or not Undefined Behavior is invoked by assuming that the text is valid. [libogonek] made an object of type 
This is notably important: Rust attempted to force that every string constructed ever was valid UTF-8 and rigorously checked this pre- and post-condition. Doing this check was so obscenely expensive that they needed to introduce a new function to 
3.2.3. The Encoding Object
It is no great surprise that there are not enough library implementers prepared to standardize the entirety of what the WHATWG specifies in its encoding specification, let alone enough to handle every rogue request for a new encoding object type in the C++ Standard. A system must be developed that provides flexibility for the end user and does not require them to write a paper and get into a 1-2 year long process of herding a proposal through the notoriously slow Committee, just to have support for X encoding or Y feature. There is also less and less (read: almost no) tolerance for adding wacky extensions to libraries like libstdc++ or libc++, and MSVC’s standard library has only recently been open-sourced (with no appetite for shoveling more semi-abandonware legacy library extensions into their codebase at the time of writing).
Encoding objects provide flexibility that enables us to consume the entire encoding space without needing to tax the Standard Library. They enable other people to plug into the system with the flexibility they need, so that standardization happens only when interoperability and redundant implementation become a burden to the greater C++ ecosystem. This frees up Billy O’Neal, Jonathan Wakely, Louis Dionne, their successors, and the dozens of other standard library contributors and implementers to focus on producing high quality code, rather than scrambling to implement four or five dozen encodings because one company, somewhere, made an at-the-time-it-seemed-okay choice in 2005 about how to store their text.
Given our result types and error handlers, the interface for the encoding object itself can be defined. Here is the example encoding illustrating the interface:
namespace std { namespace text {

    // NOTE: exemplary encoding
    // for expository purposes
    // containing all the types
    class example_locale_encoding {
    public:
        class example_state {
            std::mbstate_t multibyte_state;
        };

        // REQUIRED: member types and variables
        using code_point = char32_t;
        using code_unit = char;
        using state = example_state;
        static constexpr size_t max_code_unit_sequence = MB_LEN_MAX;
        static constexpr size_t max_code_point_sequence = 1;

        // OPTIONAL: member types and variables
        using is_encoding_injective = std::false_type;
        using is_decoding_injective = std::true_type;

        // REQUIRED: functions
        template <typename In, typename Out, typename Handler>
        decode_result<In, Out, state> decode(In&& in_range, Out&& out_range,
            Handler&& handler, state& current_state);

        template <typename In, typename Out, typename Handler>
        encode_result<In, Out, state> encode(In&& in_range, Out&& out_range,
            Handler&& handler, state& current_state);

        // OPTIONAL: functions
        constexpr const range_of<code_point> auto& replacement_code_points() const noexcept;
        constexpr const range_of<code_unit> auto& replacement_code_units() const noexcept;
    };

}}
There are many pieces of this encoding object. Some of them fit the purposes explained above. As an overview, given an 
- code_unit code_point code_point 
- state 
  - If is_encoding_self_state_t<Encoding> encoding_state_t<Encoding> 
  - If is_encoding_self_state_t<Encoding> 
- max_code_unit_sequence max_code_point_sequence max_code_point_sequence 1 
- decode encode decode code_unit code_point encode code_point code_unit In Out handler 
Optionally, some additional type definitions and functions help with safety, error handling (for replacement), and more:
- is_encoding_injective is_decoding_injective error_handler 
- replacement_code_points decode std::text::default_handler std::text::replacement_handler \uFFFD (�). This can be defined to be an empty range (not recommended, but possible). 
- replacement_code_units encode std::text::default_handler std::text::replacement_handler \uFFFD (�). This can be defined to return an empty range (not recommended, but possible). 
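Since these members are optional, generic code has to detect them rather than assume them. One way this could be done with existing C++17 tools is the detection idiom, sketched below. The trait names decoding_is_injective and has_replacement_code_points, the two example encodings, and the choice to treat a missing typedef as "not injective" are assumptions of this sketch, not wording from the proposal. The sketch also returns the replacement range by value for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace sketch {
    // Hypothetical encoding that declares only the required members.
    struct minimal_encoding {
        using code_point = char32_t;
        using code_unit = char;
        struct state {};
        static constexpr std::size_t max_code_unit_sequence = 1;
        static constexpr std::size_t max_code_point_sequence = 1;
    };

    // Hypothetical encoding that also opts in to the optional members.
    struct annotated_encoding : minimal_encoding {
        using is_decoding_injective = std::true_type;
        using is_encoding_injective = std::false_type;

        constexpr std::array<code_point, 1> replacement_code_points() const noexcept {
            return { { U'\uFFFD' } };
        }
    };

    // Assumed default: an absent typedef is treated as "not injective".
    template <typename T, typename = void>
    struct decoding_is_injective : std::false_type {};

    template <typename T>
    struct decoding_is_injective<T, std::void_t<typename T::is_decoding_injective>>
        : T::is_decoding_injective {};

    // Detects whether encoding.replacement_code_points() is a valid expression.
    template <typename T, typename = void>
    struct has_replacement_code_points : std::false_type {};

    template <typename T>
    struct has_replacement_code_points<T,
        std::void_t<decltype(std::declval<const T&>().replacement_code_points())>>
        : std::true_type {};

    static_assert(!decoding_is_injective<minimal_encoding>::value, "defaults to false");
    static_assert(decoding_is_injective<annotated_encoding>::value, "opted in");
    static_assert(!has_replacement_code_points<minimal_encoding>::value, "not provided");
    static_assert(has_replacement_code_points<annotated_encoding>::value, "provided");

    inline void usage_example() {
        annotated_encoding encoding;
        assert(encoding.replacement_code_points()[0] == U'\uFFFD');
    }
}
```

Defaulting to the conservative answer means an encoding author who writes nothing extra gets the safe behavior, and has to opt in explicitly to claim losslessness.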
3.2.3.1. Encodings Provided by the Standard
The primary reason for the standard to provide an encoding is to ensure that it produces a way for applications to communicate with one another. As a baseline, the standard should support all the encodings it ships with its string literal types. On top of that, there is an important base-level optimization when working with strictly ASCII text that can be implemented with UTF8, which most library implementers would be interested in shipping. This means that the following encodings will be shipped by the standard library:
// header: <encoding>
namespace std { namespace text {

    using unicode_code_point = char32_t;
    class unicode_scalar_value;

    template <typename CharT>
    class basic_utf8;
    template <typename CharT>
    class basic_utf16;
    template <typename CharT>
    class basic_utf32;

    template <typename Encoding, std::endian endianness = std::endian::native, typename Byte = std::byte>
    class encoding_scheme;

    class ascii;
    using utf8 = basic_utf8<char8_t>;
    using utf16 = basic_utf16<char16_t>;
    using utf32 = basic_utf32<char32_t>;
    class narrow_literal;
    class wide_literal;
    class narrow_execution;
    class wide_execution;

}}
All of 
Both 
These represent the core 9 encodings that must be shipped with the standard, no matter what.
3.2.3.2. UTF Encodings: variants?
There are many variants of encodings like UTF8 and UTF16. These include [wtf8] or [cesu8] and are useful for internal processing and interoperability with certain systems, like direct interfacing with Java or communication with an Oracle database. However, almost none of these are publicly recommended as interchange formats: both CESU-8 and WTF-8 are documented and used internally for legacy reasons. In some cases, they also represent security vulnerabilities if they are used in interchange for the internet. This makes them less and less desirable to provide via the standard. However, it is worth acknowledging that supporting WTF-8 and CESU-8 as encodings will ease the burden on individuals who need to roll such encodings for their applications.
More pressingly, there is a wide body of code that operates with 
namespace std { namespace text {

    template <typename CharT, bool encode_null, bool encode_lone_surrogates>
    class basic_utf8;
    using utf8 = basic_utf8<char8_t, false, false>;

    template <typename CharT, bool allow_lone_surrogates>
    class basic_utf16;
    using utf16 = basic_utf16<char16_t, false>;

}}
And externally, libraries and applications could add their own using statements and type definitions for the purposes of internal interoperation:
namespace my_app {

    using compat_utf8 = std::text::basic_utf8<char, false, false>;
    using mutf8 = std::text::basic_utf8<char8_t, true, false>;
    using filesystem16 = std::text::basic_utf16<wchar_t, true>;

}
There is clear utility that can be had here. But, this is not going to be looked into too deeply for the first iterations of this proposal. If there is a need, users are strongly encouraged to chime in (speak up) quickly so that this feature can be added to the proposal before later progression stages.
Finally, there is a plan that for early C++26, the full gamut of WHATWG encodings will be added to the standard, since this covers the minimal viable set of encodings that is required for communicating across the internet and through messaging mediums such as e-mail successfully.
3.2.3.3. Encoding Schemes: Byte-Based
Unicode specifies what are called Encoding Schemes for the encodings whose code unit size exceeds a single byte. This is essentially UTF16 and UTF32, of which there is UTF16 Little Endian (UTF16-LE), UTF16 Big Endian (UTF16-BE), UTF32 Little Endian (UTF32-LE), and UTF32 Big Endian (UTF32-BE). Encoding schemes can be generically handled without creating extremely specific encodings by creating an 
// header: <encoding> namespace std { namespace text { template < typename Encoding , std :: endian endianness = std :: endian :: native , typename Byte = std :: byte > class encoding_scheme ; }} 
This is a transformative encoding type that takes the source endianness and translates it to the native endianness. It has an identical interface to the 
All 
A few SG16 members have frequently advocated that the base input and outputs for all types matching the 
 Writing mostly-duplicate encoding object types for 
This direction requires far less boilerplate and has also already seen implementation experience in [libogonek]'s [libogonek-encoding_scheme] type. Users have not complained. It has also proved to be implementable by simply decomposing the original input/output ranges into their iterators, and wrapping said iterators with a 
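To make the byte-based idea concrete, the following is a minimal sketch, not the proposed encoding_scheme itself, of reassembling UTF16 code units from a byte sequence of known endianness. The names `byte_order` and `utf16_units_from_bytes` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the endianness template parameter.
enum class byte_order { little, big };

// Reassembles 16-bit code units from pairs of bytes.
// A trailing odd byte is ignored in this sketch; a real implementation
// would report it through an error handler instead.
inline std::u16string utf16_units_from_bytes(const std::vector<std::byte>& bytes,
                                             byte_order order) {
    std::u16string units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto first = std::to_integer<std::uint32_t>(bytes[i]);
        const auto second = std::to_integer<std::uint32_t>(bytes[i + 1]);
        const std::uint32_t unit = (order == byte_order::little)
            ? (first | (second << 8))
            : ((first << 8) | second);
        units.push_back(static_cast<char16_t>(unit));
    }
    return units;
}
```

Once the code units are in native order, the wrapped encoding can be applied to them unchanged, which is the reason a single wrapper can serve UTF16-LE, UTF16-BE, UTF32-LE and UTF32-BE alike.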
3.2.3.4. Default Encodings
For interactions with encodings, there are times when a default encoding may be inferred from input and output types in § 3.3 High Level's functions. Thusly, 2 traits provide defaults that can be overridden by the program:
// header: <encoding> namespace std { namespace text { template < typename T > using default_code_unit_encoding_t = /* ... */ ; template < typename T > using default_code_point_encoding_t = /* ... */ ; }} 
The implementation for the standard will attempt to select one of the following, or fail, for 
- std::text::execution, if T is char;
- std::text::wide_execution, if T is wchar_t;
- std::text::utf8, if T is char8_t;
- std::text::utf16, if T is char16_t;
- std::text::utf32, if T is char32_t, std::text::unicode_code_point, or std::text::unicode_scalar_value;
- std::text::encoding_scheme<std::text::utf8>, if T is std::byte.
- Otherwise, the program is ill-formed.
For 
- std::text::utf8, if T is std::text::unicode_code_point, std::text::unicode_scalar_value, or char32_t.
- Otherwise, the program is ill-formed.
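A trait of this shape can be approximated with ordinary explicit specializations. The sketch below is only an illustration under stated assumptions: the `*_tag` types are hypothetical placeholders for the proposed encoding classes, and the char8_t case is guarded so the snippet also compiles before C++20.

```cpp
#include <type_traits>

// Hypothetical tag types standing in for the proposed encoding classes.
struct execution_tag {};
struct wide_execution_tag {};
struct utf8_tag {};
struct utf16_tag {};
struct utf32_tag {};

// The primary template is left undefined, so an unsupported T is ill-formed.
template <typename T> struct default_code_unit_encoding;

template <> struct default_code_unit_encoding<char> { using type = execution_tag; };
template <> struct default_code_unit_encoding<wchar_t> { using type = wide_execution_tag; };
#ifdef __cpp_char8_t
template <> struct default_code_unit_encoding<char8_t> { using type = utf8_tag; };
#endif
template <> struct default_code_unit_encoding<char16_t> { using type = utf16_tag; };
template <> struct default_code_unit_encoding<char32_t> { using type = utf32_tag; };

// cv-qualifiers are stripped so that `const char` selects the same default as `char`.
template <typename T>
using default_code_unit_encoding_t =
    typename default_code_unit_encoding<std::remove_cv_t<T>>::type;
```

A program could override a default in this sketch by adding its own specialization for a program-defined code unit type.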
3.2.4. Stateful Objects, or Stateful Parameters?
Stateful objects are good for encapsulation, reuse and transportation. They have been proven in many C and C++ APIs to provide a good, reentrant API with all relevant details captured on the (sometimes opaque) object itself. After careful evaluation, a stateful parameter, rather than a wholly stateful object, is the better choice for the function calls on encoding and decoding types in this low-level interface. The main benefits of passing the state to the encoding / decoding function calls as a parameter are that it:
- 
     maintains that encoding objects can be cheap to construct, copy and move; 
- 
     improves the general reusability of encoding objects by allowing state to be massaged into certain configurations by users; 
- 
     and, allows users to set the state in a public way without having to prescribe a specific API for all encoders to do that. 
The reason for keeping encoding types cheap is that they will be constructed, copied, and moved a lot, especially in the face of the ranges that SG16 is going to be putting a lot of work into (
Consider the case of execution encoding character sets today, which often defer to the current locale. Locale is inherently expensive to construct and use: if the standard has to have an encoding that grabs or creates a 
In contrast, consider having an explicit parameter. At the cost of making a low-level interface take one more argument, the state can be paid for once and reused in many separate places, allowing a user to front-load the state’s expenses. It also allows users to set or get the locale ahead of time and reuse it consistently, and it allows encoding or decoding operations to be reused or restarted in the case of interruptible or incomplete streams, such as network reading or I/O buffering. These are potent use cases wherein such a design decision becomes very helpful.
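The interruptible-stream case can be sketched with a toy UTF8 decoder whose state lives in a caller-owned object. Everything here (`utf8_decode_state`, `decode_chunk`) is a hypothetical illustration rather than the proposed API, and the decoder deliberately skips overlong and surrogate checks to stay short.

```cpp
#include <string>
#include <string_view>

// State owned by the caller and passed to every call.
struct utf8_decode_state {
    char32_t partial = 0;  // bits accumulated from an unfinished sequence
    int remaining = 0;     // continuation bytes still expected
};

// Decodes as much of `bytes` as possible, appending code points to `out`.
// An unfinished trailing sequence stays in `state` for the next call.
// Returns false on a structurally malformed byte.
inline bool decode_chunk(std::string_view bytes, utf8_decode_state& state,
                         std::u32string& out) {
    for (unsigned char b : bytes) {
        if (state.remaining > 0) {
            if ((b & 0xC0) != 0x80) return false;  // expected a continuation byte
            state.partial = (state.partial << 6) | (b & 0x3F);
            if (--state.remaining == 0) out.push_back(state.partial);
        } else if (b < 0x80) {
            out.push_back(b);
        } else if ((b & 0xE0) == 0xC0) {
            state.partial = b & 0x1F;
            state.remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {
            state.partial = b & 0x0F;
            state.remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {
            state.partial = b & 0x07;
            state.remaining = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
    }
    return true;
}
```

Because the state is a plain parameter, a network reader can call `decode_chunk` once per received buffer with the same state object, and a sequence split across two buffers is completed by the second call.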
Finally, this paradigm makes it far more obvious to the end user when the state is inseparable from the encoding object itself. This is the case with a theoretical 
3.3. High Level
Working with the lower level facilities for text processing is not a pretty sight. Consider the usage of the low-level facilities described above:
#include <encoding>
#include <cassert>
#include <iterator>
#include <span>
#include <string_view>
#include <utility>

int main () {
    std::text::unicode_code_point array_output[41]{};
    std::u8string_view input = u8"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.";
    std::text::utf8 encoding{};
    std::u8string_view working_input = input;
    std::span<std::text::unicode_code_point> working_output(array_output);
    std::text::default_handler handler{};
    std::text::utf8::state encoding_state{};
    for (;;) {
        // decode a single indivisible unit of work per iteration
        auto result = encoding.decode_one(working_input, working_output, handler, encoding_state);
        if (result.error_code != std::text::encoding_errc::ok) {
            // not what we wanted.
            return -1;
        }
        if (std::empty(result.input)) {
            break;
        }
        // continue from where the previous call stopped
        working_input = std::move(result.input);
        working_output = std::move(result.output);
    }
    assert(std::u32string_view(array_output) == U"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.");
    return 0;
}
These low-level facilities -- while powerful and customizable -- do not represent what the average user will -- or should -- be wrangling with. Therefore, the higher-level facilities become incredibly pressing to make these interfaces palatable and sustainable for developers in both the short and long term. Consider the same encoding functionality, boiled down to something far easier to use:
std :: u32string output = std :: text :: decode ( u8"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸." ); assert ( output == U"𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸." ); 
This is much simpler and does exactly the same as the above, without all the setup and boilerplate. Of course, taking only the input and giving the output is too much of a simplification, so there are a few overloads and variants that will be offered. Particularly, there needs to be 3 sets of free functions: 
Note that, at the core of all these functions, the loop as shown above captures the core of the work. All of these abstractions are built on the 7 basis operations specified in § 3.2.3 The Encoding Object. Actually getting additional optimizations is, of course, left to the readers and implementers.
3.3.1. Eager Free Functions
The free functions are written in a way to eagerly consume input and output space, unless given an explicit output container which limits its behavior or an error occurs. This is beneficial because many text processing algorithms receive the bulk of their gains by being able to work on multiple code units / code points. Therefore, this layer of the high level API is provided to satisfy the need where input and output space are of little concern.
3.3.1.1. Free Function decode 
   The 
- 
     Performing an auto result = encoding . decode_one (...) 
- 
     Checking if the return value’s error code is std :: text :: encoding_errc :: ok 
- 
     Checking std :: ranges :: empty ( result . input ) error_code std :: text :: encoding_errc :: ok 
- 
     Otherwise, go to 0 and use the result . input result . output 
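The steps above can be sketched as a plain loop over a one-at-a-time decoding function. This is a simplified model under stated assumptions: `toy_errc`, `decode_one_result`, `toy_ascii` and `decode_all` are hypothetical names, the output is a growing string rather than an output range, and there is no error handler or state parameter.

```cpp
#include <string>
#include <string_view>

enum class toy_errc { ok, invalid_sequence };

// What remains of the input after one step, plus the status of that step.
struct decode_one_result {
    std::string_view input;
    toy_errc error_code;
};

// Hypothetical encoding object: decodes exactly one code point per call.
struct toy_ascii {
    decode_one_result decode_one(std::string_view input, std::u32string& output) const {
        const unsigned char unit = static_cast<unsigned char>(input.front());
        if (unit > 0x7F) return {input, toy_errc::invalid_sequence};
        output.push_back(unit);
        return {input.substr(1), toy_errc::ok};
    }
};

// The eager loop: decode one, check the error code, stop when the input is
// empty, otherwise continue with what is left of the input.
template <typename Encoding>
toy_errc decode_all(std::string_view input, const Encoding& encoding,
                    std::u32string& output) {
    while (!input.empty()) {
        const auto result = encoding.decode_one(input, output);
        if (result.error_code != toy_errc::ok) return result.error_code;
        input = result.input;
    }
    return toy_errc::ok;
}
```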
The surface of the 
// header: <encoding> namespace std { namespace text { template < typename Input , typename Output , typename Encoding , typename State , typename ErrorHandler > constexpr auto decode_into ( Input && input , Encoding && encoding , Output && output , ErrorHandler && error_handler , State & state ); template < typename Input , typename Encoding , typename Output , typename ErrorHandler > constexpr auto decode_into ( Input && input , Encoding && encoding , Output && output , ErrorHandler && error_handler ); template < typename Input , typename Encoding , typename Output > constexpr auto decode_into ( Input && input , Encoding && encoding , Output && output ); template < typename Input , typename Output > constexpr auto decode_into ( Input && input , Output && output ); template < typename Input , typename Encoding , typename ErrorHandler , typename State > constexpr auto decode ( Input && input , Encoding && encoding , ErrorHandler && error_handler , State & state ); template < typename Input , typename Encoding , typename ErrorHandler > constexpr auto decode ( Input && input , Encoding && encoding , ErrorHandler && error_handler ); template < typename Input , typename Encoding > constexpr auto decode ( Input && input , Encoding && encoding ); template < typename Input > constexpr auto decode ( Input && input ); }} 
The order of arguments is chosen based on what users are likely to specify first. In many cases, all that is needed is the input: the encoding can be chosen automatically for the user based on such. For 
- 
     If is_encoding_self_state_t < Encoding > encoding . reset_state (); encoding State & 
- 
     Otherwise, encoding_state_t < Encoding > {} 
The 
Note: in the current running implementation, there are also separate overloads for 
3.3.1.2. Free Function encode 
   The 
- 
     Performing an auto result = encoding . encode_one (...) 
- 
     Checking if the return value’s error code is std :: text :: encoding_errc :: ok 
- 
     Checking std :: ranges :: empty ( result . input ) error_code std :: text :: encoding_errc :: ok 
- 
     Otherwise, go to 0 and use the result . input result . output 
The surface of the 
// header: <encoding> namespace std { namespace text { template < typename Input , typename Output , typename Encoding , typename State , typename ErrorHandler > constexpr auto encode_into ( Input && input , Encoding && encoding , Output && output , ErrorHandler && error_handler , State & state ); template < typename Input , typename Encoding , typename Output , typename ErrorHandler > constexpr auto encode_into ( Input && input , Encoding && encoding , Output && output , ErrorHandler && error_handler ); template < typename Input , typename Encoding , typename Output > constexpr auto encode_into ( Input && input , Encoding && encoding , Output && output ); template < typename Input , typename Output > constexpr auto encode_into ( Input && input , Output && output ); template < typename Input , typename Encoding , typename ErrorHandler , typename State > constexpr auto encode ( Input && input , Encoding && encoding , ErrorHandler && error_handler , State & state ); template < typename Input , typename Encoding , typename ErrorHandler > constexpr auto encode ( Input && input , Encoding && encoding , ErrorHandler && error_handler ); template < typename Input , typename Encoding > constexpr auto encode ( Input && input , Encoding && encoding ); template < typename Input > constexpr auto encode ( Input && input ); }} 
For 
- 
     If std :: is_same_v < typename std :: iterator_traits < std :: ranges :: range_iterator_t < Output >>:: iterator_category , std :: output_iterator_tag > default_code_unit_encoding_t < std :: ranges :: range_value_t < Output >> {} 
- 
     Otherwise, if the iterator category of the iterators of the output range are std :: output_iterator_tag default_code_point_encoding_t < std :: ranges :: range_value_t < Input >> {} 
Otherwise, the user must specify the 
- 
     If is_encoding_self_state_t < Encoding > encoding . reset_state (); encoding State & 
- 
     Otherwise, encoding_state_t < Encoding > {} 
The 
Note: in the current running implementation, there are also separate overloads for 
3.3.1.3. Free Function transcode 
   The 
- 
     Performing an auto d_result = from_encoding . decode_one (...) encoding_code_point_t < FromEncoding > intermediate [ FromEncoding :: max_code_points ]; 
- 
     Checking if the return value’s error code is std :: text :: encoding_errc :: ok 
- 
     Performing an auto e_result = to_encoding . encode_one (...) intermediate 
- 
     Checking if the return value’s error code is std :: text :: encoding_errc :: ok 
- 
     Checking std :: ranges :: empty ( d_result . input ) error_code std :: text :: encoding_errc :: ok 
- 
     Otherwise, go to 0 and use the d_result . input e_result . output 
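As a rough illustration of the decode-then-encode round trip through an intermediate code point, the sketch below transcodes Latin-1 to UTF8. The function names are hypothetical, plain `char` storage is assumed for both encodings, and the encoder only implements the one- and two-byte forms because Latin-1 input cannot produce anything larger.

```cpp
#include <string>
#include <string_view>

// Decodes one Latin-1 code unit; every byte maps directly to a code point.
inline char32_t latin1_decode_one(std::string_view& input) {
    const char32_t code_point = static_cast<unsigned char>(input.front());
    input.remove_prefix(1);
    return code_point;
}

// Encodes one code point as UTF-8 (one- and two-byte forms only).
inline bool utf8_encode_one(char32_t code_point, std::string& output) {
    if (code_point < 0x80) {
        output.push_back(static_cast<char>(code_point));
    } else if (code_point < 0x800) {
        output.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
        output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
    } else {
        return false;  // outside what this sketch supports
    }
    return true;
}

// The transcode loop: decode one unit into the intermediate, encode the
// intermediate, and repeat until the input is exhausted or a step fails.
inline bool transcode_latin1_to_utf8(std::string_view input, std::string& output) {
    while (!input.empty()) {
        const char32_t intermediate = latin1_decode_one(input);
        if (!utf8_encode_one(intermediate, output)) return false;
    }
    return true;
}
```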
The surface of the 
// header: <encoding> namespace std { namespace text { template < typename Input , typename FromEncoding , typename Output , typename ToEncoding , typename FromErrorHandler , typename ToErrorHandler , typename FromState , typename ToState > constexpr auto transcode_into ( Input && input , FromEncoding && from_encoding , Output && output , ToEncoding && to_encoding , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler , FromState & from_state , ToState & to_state ); template < typename Input , typename FromEncoding , typename Output , typename ToEncoding , typename FromErrorHandler , typename ToErrorHandler , typename FromState > constexpr auto transcode_into ( Input && input , FromEncoding && from_encoding , Output && output , ToEncoding && to_encoding , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler , FromState & from_state ); template < typename Input , typename FromEncoding , typename Output , typename ToEncoding , typename FromErrorHandler , typename ToErrorHandler > constexpr auto transcode_into ( Input && input , FromEncoding && from_encoding , Output && output , ToEncoding && to_encoding , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler ); template < typename Input , typename FromEncoding , typename Output , typename ToEncoding , typename FromErrorHandler > constexpr auto transcode_into ( Input && input , FromEncoding && from_encoding , Output && output , ToEncoding && to_encoding , FromErrorHandler && from_error_handler ); template < typename Input , typename Output , typename ToEncoding , typename FromEncoding > constexpr auto transcode_into ( Input && input , Output && output , FromEncoding && encoding , ToEncoding && encoding ); template < typename Input , typename Output , typename ToEncoding > constexpr auto transcode_into ( Input && input , Output && output , ToEncoding && encoding ); template < typename Input , typename FromEncoding , typename ToEncoding , typename 
FromErrorHandler , typename ToErrorHandler , typename FromState , typename ToState > constexpr auto transcode ( Input && input , FromEncoding && from_encoding , ToEncoding && to_encoding , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler , FromState & from_state , ToState & to_state ); template < typename Input , typename FromEncoding , typename ToEncoding , typename FromErrorHandler , typename ToErrorHandler , typename FromState > constexpr auto transcode ( Input && input , FromEncoding && from_encoding , ToEncoding && to_encoding , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler , FromState & from_state ); template < typename Input , typename FromEncoding , typename ToEncoding , typename FromErrorHandler , typename ToErrorHandler > constexpr auto transcode ( Input && input , FromEncoding && from_encoding , ToEncoding && to_encoding , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler ); template < typename Input , typename FromEncoding , typename ToEncoding , typename FromErrorHandler > constexpr auto transcode ( Input && input , FromEncoding && from_encoding , ToEncoding && to_encoding , FromErrorHandler && from_error_handler ); template < typename Input , typename ToEncoding , typename FromEncoding > constexpr auto transcode ( Input && input , FromEncoding && encoding , ToEncoding && encoding ); template < typename Input , typename ToEncoding > constexpr auto transcode ( Input && input , ToEncoding && encoding ); }} 
For 
- 
     If std :: is_same_v < typename std :: iterator_traits < std :: ranges :: range_iterator_t < Output >>:: iterator_category , std :: output_iterator_tag > default_code_point_encoding_t < std :: ranges :: range_value_t < Output >> {} 
- 
     Otherwise, if the iterator category of the iterators of the output range are std :: output_iterator_tag default_code_point_encoding_t < std :: ranges :: range_value_t < Input >> {} 
Otherwise, the user must specify the 
- 
     If is_encoding_self_state_t < Encoding > encoding . reset_state (); encoding State & 
- 
     Otherwise, encoding_state_t < Encoding > {} 
The 
Note: in the current running implementation, there are also separate overloads for 
3.3.1.4. Free Function validate 
   The 
- 
     Performing an auto result = encoding . decode_one (...) 
- 
     Checking if an error occurred, and returning failure if so. 
- 
     Performing an auto intermediate_result = encoding . encode_one (...) 
- 
     Checking if an error occurred, and returning failure if so. 
- 
     Performing a std :: equals 
- 
     If it is not equal, return failure. 
- 
     If std :: ranges :: empty ( result . input ); 
- 
     Go to 0. 
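The round-trip idea behind these steps can be sketched for UTF8 as follows. The sketch assumes a deliberately lenient decoder and a strict encoder, so that the comparison step is what rejects overlong sequences; `decode_one`, `encode_one` and `validate` here are hypothetical free functions, not the proposed interface.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Lenient decoder: returns the number of code units consumed, or 0 on a
// structural error. It does NOT reject overlong forms or surrogates.
inline std::size_t decode_one(std::string_view in, char32_t& cp) {
    if (in.empty()) return 0;
    const unsigned char lead = static_cast<unsigned char>(in[0]);
    std::size_t len = 0;
    if (lead < 0x80) { cp = lead; return 1; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return 0;
    if (in.size() < len) return 0;
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if ((c & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (c & 0x3F);
    }
    return len;
}

// Strict encoder: returns an empty string for surrogates and out-of-range values.
inline std::string encode_one(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Decode one, re-encode it, and compare against the consumed code units.
inline bool validate(std::string_view input) {
    while (!input.empty()) {
        char32_t cp = 0;
        const std::size_t consumed = decode_one(input, cp);
        if (consumed == 0) return false;
        const std::string round_trip = encode_one(cp);
        if (round_trip.empty()) return false;
        if (std::string_view(round_trip) != input.substr(0, consumed)) return false;
        input.remove_prefix(consumed);
    }
    return true;
}
```

The overlong sequence C0 80 decodes to U+0000 but re-encodes as a single byte, so the equality check fails without the decoder needing any dedicated overlong logic.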
The function signature for 
// header: <encoding>
namespace std { namespace text {

    template <typename Input, typename Encoding, typename DecodeState, typename EncodeState>
    constexpr auto validate(Input&& input, Encoding&& encoding, DecodeState& decode_state, EncodeState& encode_state);

    template <typename Input, typename Encoding, typename DecodeState>
    constexpr auto validate(Input&& input, Encoding&& encoding, DecodeState& decode_state);

    template <typename Input, typename Encoding>
    constexpr bool validate(Input&& input, Encoding&& encoding);

    template <typename Input>
    constexpr bool validate(Input&& input);

}}
The order of arguments is chosen based on what users are likely to specify first. In many cases, all that is needed is the input: the encoding can be chosen automatically for the user based on such. For 
- 
     If is_encoding_self_state_t < Encoding > encoding . reset_state (); encoding State & 
- 
     Otherwise, encoding_state_t < Encoding > {} 
Interestingly, we come to a conundrum here with "self-referential" encodings. We cannot use the 
3.3.1.5. Free Functions decode_count encode_count 
   This proposal will not spoon feed the reader everything: the 
3.3.2. Safety with the Free Functions
The second problem is the ability to lose data due to not using lossless encodings. For example, most legacy encodings are lossy when it comes to code points and graphemes outside of their traditional repertoire (e.g., trying to handle Chinese scripts with a latin-1 encoding). Trying to properly encode between this myriad of encodings leaves room for losing information. Even for Wide Character Locale-based (
Therefore, an error at compile-time is wanted if a user uses the above high-level free functions, but does not explicitly specify an error handler in the case where a conversion is lossy. Taking an example from this presentation, this puppy emoji cannot fit in ASCII. In general, most Unicode Code Points cannot fit in an ASCII string: this is a dangerous conversion! So, unless you use a non-default error handler, the library will 
int main(int, char*[]) {
    // Compiler Error: lossy encoding, specify non-default error handler
    std::string ascii_emoji0 = std::text::encode(U"🐶", std::text::ascii{});

    // Compiler Error: lossy encoding, specify non-default error handler
    std::string ascii_emoji1 = std::text::encode(U"🐶", std::text::ascii{}, std::text::default_handler{});

    // Okay: you asked for it!
    std::string ascii_emoji2 = std::text::encode(U"🐶", std::text::ascii{}, std::text::replacement_handler{});
    // ascii_emoji2 contains '?'

    // Okay: undefined behavior, but you asked for it.
    std::string ascii_emoji3 = std::text::encode(U"🐶", std::text::ascii{}, std::text::assume_valid_handler{});
    // ascii_emoji3 has no guarantees
    // at this point: undefined behavior was invoked!
}
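One way such a compile-time check could be arranged is with a static_assert keyed on a property of the encoding and the handler type. The sketch below is an assumption-laden illustration: the member `is_encode_lossless`, the variable template `is_allowed_v`, and the toy types are all hypothetical, and the handler is used purely as a type-level switch.

```cpp
#include <string>
#include <string_view>
#include <type_traits>

struct default_handler {};
struct replacement_handler {};

// Hypothetical encoding that advertises that encoding into it can lose data.
struct toy_ascii {
    using is_encode_lossless = std::false_type;
};

// A call is allowed when the encoding is lossless, or when the user has
// explicitly opted in by choosing something other than the default handler.
template <typename Encoding, typename Handler>
inline constexpr bool is_allowed_v =
    Encoding::is_encode_lossless::value || !std::is_same_v<Handler, default_handler>;

template <typename Handler = default_handler>
std::string encode(std::u32string_view input, toy_ascii, Handler = Handler{}) {
    static_assert(is_allowed_v<toy_ascii, Handler>,
                  "lossy encoding: specify a non-default error handler");
    std::string out;
    for (char32_t code_point : input) {
        // Unrepresentable code points become '?' in this sketch.
        out.push_back(code_point < 0x80 ? static_cast<char>(code_point) : '?');
    }
    return out;
}
```

With these definitions, `encode(U"x", toy_ascii{})` fails to compile with the message above, while passing `replacement_handler{}` compiles and substitutes '?' for anything outside ASCII.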
3.3.3. Improving Usability for Low-Memory Environments: Ranges
One of the biggest problems with 
Most importantly, wrappers around other ranges are employed here. This is important: nobody has time to rewrite all of this functionality just because the API strongly mixed 
3.3.3.1. decode_view decode_iterator 
   
// header: <encoding>
namespace std { namespace text {

    template <typename Encoding,
        typename Range = basic_string_view<encoding_code_unit_t<Encoding>>,
        typename ErrorHandler = default_handler,
        typename State = encoding_state_t<Encoding>>
    class decode_iterator;

    template <typename Encoding,
        typename Range = basic_string_view<encoding_code_unit_t<Encoding>>,
        typename ErrorHandler = default_handler,
        typename State = encoding_state_t<Encoding>>
    class decode_view {
    public:
        using iterator = decode_iterator<Encoding, Range, ErrorHandler, State>;
        using sentinel = decode_sentinel;
        using range_type = Range;
        using encoding_type = Encoding;
        using error_handler_type = ErrorHandler;
        using encoding_state_type = encoding_state_t<encoding_type>;

        constexpr decode_view(range_type range) noexcept;
        constexpr decode_view(range_type range, encoding_type encoding) noexcept;
        constexpr decode_view(range_type range, encoding_type encoding, error_handler_type error_handler) noexcept;
        constexpr decode_view(range_type range, encoding_type encoding, error_handler_type error_handler, encoding_state_type state) noexcept;
        constexpr decode_view(iterator it) noexcept;

        constexpr iterator begin() const & noexcept;
        constexpr iterator begin() && noexcept;
        constexpr sentinel end() const noexcept;

        friend constexpr decode_view reconstruct(::std::in_place_type_t<decode_view>, iterator it, sentinel) noexcept;
    };

}}
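For intuition about how such a view behaves, here is a much-reduced sketch of a lazy decoding view over UTF8 stored in plain `char`. It is not the proposed decode_view: `toy_decode_view` and `decode_lazily` are hypothetical, the encoding and error behavior are hard-coded (malformed input yields U+FFFD and skips one code unit, in the spirit of a replacement handler), and overlong or surrogate sequences are not rejected.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

class toy_decode_view {
public:
    struct sentinel {};

    class iterator {
    public:
        explicit iterator(std::string_view rest) : rest_(rest) { read(); }
        char32_t operator*() const { return current_; }
        iterator& operator++() {
            rest_.remove_prefix(length_);
            read();
            return *this;
        }
        // The end is reached once there is nothing left to decode.
        friend bool operator!=(const iterator& it, sentinel) { return it.length_ != 0; }

    private:
        // Decodes the code point at the front of rest_ without consuming it.
        void read() {
            length_ = 0;
            if (rest_.empty()) return;
            const unsigned char lead = static_cast<unsigned char>(rest_[0]);
            const std::size_t need = lead < 0x80 ? 1
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3
                : (lead & 0xF8) == 0xF0 ? 4 : 0;
            current_ = U'\uFFFD';  // assume an error until proven otherwise
            length_ = 1;
            if (need == 0 || rest_.size() < need) return;
            char32_t cp = (need == 1) ? lead : (lead & (0xFF >> (need + 1)));
            for (std::size_t i = 1; i < need; ++i) {
                const unsigned char c = static_cast<unsigned char>(rest_[i]);
                if ((c & 0xC0) != 0x80) return;  // keep the replacement character
                cp = (cp << 6) | (c & 0x3F);
            }
            current_ = cp;
            length_ = need;
        }

        std::string_view rest_;
        char32_t current_ = 0;
        std::size_t length_ = 0;
    };

    explicit toy_decode_view(std::string_view input) : input_(input) {}
    iterator begin() const { return iterator(input_); }
    sentinel end() const { return {}; }

private:
    std::string_view input_;
};

// Usage: code points are produced one at a time, with no intermediate buffer.
inline std::u32string decode_lazily(std::string_view input) {
    std::u32string out;
    for (char32_t code_point : toy_decode_view(input)) out.push_back(code_point);
    return out;
}
```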
The 
In the case of errors, the standard has a number of well-defined behaviors that prevent the need to add a 
- 
     default_handler replacement_handler 
- 
     throw_handler ++ it * it 
- 
     assume_valid_handler 
Therefore, the only error case wherein 
Note: This differs from how Tom Honermann’s 
A third option is returning a special type which holds the 
Note: It is recognized that the Standard does not bless such implementations. This proposal does not care: the needs of C++'s users greatly outweigh the theoretical purity of the C++ abstract machine, where the cost of all things is equal and does not matter. The standard’s preferred error handling method has a non-zero cost (particularly in binary size) simply to exist, and that cost has not been fully optimized into a "do not pay for what you do not use" state. Furthermore, it is still extremely dubious to throw-by-default on any ill-formed text for reasons mentioned above. Therefore, directions wherein the default is equivalent to throwing are not preferred at this time.
3.3.3.2. encode_view encode_iterator 
   This is identical to § 3.3.3.1 decode_view and decode_iterator, except the name of the view and iterator are 
- 
     The Range basic_string_view < encoding_code_point_t < _Encoding >> 
- 
     The encode_view value_type encoding_code_unit_t < Encoding > Encoding encoding . encode_one 
Everything else is identical in nature to 
3.3.3.3. transcode_view transcode_iterator 
   This is mostly identical to § 3.3.3.1 decode_view and decode_iterator, though there are more apparent changes here.
- 
     The name of the view and iterator types are transcode_view transcode_iterator 
- 
     The template parameters are modified to take a ToEncoding FromEncoding ToErrorHandler FromErrorHandler ToState FromState 
- 
     The Range basic_string_view < encoding_code_unit_t < ToEncoding >> std :: basic_string_view < encoding_code_unit_t < ToEncoding >> 
- 
     The value_type encoding_code_unit_t < ToEncoding > ToEncoding 
Additionally, another important change here is an optimization opportunity. The default implementation of performing a single "
- 
     Take the input range stored in the class, call from_encoding . decode_one 
- 
     Take the intermediate output range for the previous decode_one to_encoding . encode_one 
- 
     Present the output to the user in a suitable manner. 
This is fine, as long as the 
For example, 
3.4. The Need for Speed
Performance is correctness. If these methods and the resulting interface are not fast enough to meet the needs of programmers, there will be little to no adoption over current solutions. Thanks to work by Bob Steagall and Zach Laine, it is an established fact that it is incredibly hard to make a range-based or iterator-based interface which will achieve the text processing speeds that will satisfy users of trivial (
An explicit goal of this library is that there shall be no room for a lower level abstraction or language here, and the first steps to doing that are recognizing the benefits of eager encoding, decoding and transcoding interfaces, as well as pluggable and overridable behavior for the variety of functionality as it relates to higher-level abstractions.
Research and implementation experience with [boost.text], [text_view] and others has made it plainly clear that while iterators and ranges can produce an extremely efficient binary, it is still not the fastest code that can be written to compete with hand-written/vectorized bulk text processing routines made specifically for each encoding. Therefore, it is imperative that lazy ranges cannot be the only solution. The C++ Standard must steadily and nicely supplant the codebase-specific or ad-hoc solutions individuals keep rolling for encoding and decoding operations.
3.4.1. Speed and Flexibility for Everyone: Customization Points
An important part of that is the ability to provide performance for both lazy, range-based iteration as described in § 3.3.3 Improving Usability for Low-Memory Environments: Ranges and fast free functions as described in § 3.3.1 Eager Free Functions. To this end, an ADL free function scheme similar to the Range Access Customization Points (e.g. 
Considering this is going to be one of the most fundamental text layers that sits between typical text and a lot of the new I/O routines, it is imperative that these conversions are not only as fast as possible, but customizable. The user can already customize the encoding by creating their own conforming encoding object, but encodings still do their transformations on a code point-by-code point basis. Therefore, a means of extensibility needs to be chosen for the 
What is not negotiable is that it must be extensible. Users should be able to write fast transcoding functions that the standard picks up for their own encoding types. From GB18030 to other ISO and WHATWG encodings, there will always be a need to extend the fast bulk processing of the standard. Current standard library implementers do not have the time to support every single legacy encoding on the planet, and companies do not have the time to petition each and every standard library to add support for their internal encoding. Similarly, government records kept in legacy encodings for political or organizational reasons cannot be locked out of this world either.
Thusly, the following extension points are provided.
3.4.1.1. One-by-one Transcoding Shortcuts
Using the example of 
- 
     Convert input encoding_code_unit_t < FromEncoding > encoding_code_point_t < FromEncoding > 
- 
     Convert shared encoding_code_point_t < FromEncoding > encoding_code_unit_t < ToEncoding > 
This is accomplished by first calling 
To speed this process up, the free function 
// in any related namespace in which ADL can find it template < typename Input , typename FromEncoding , typename Output , typename ToEncoding , typename FromErrorHandler , typename ToErrorHandler , typename FromState , typename ToState > std :: text :: transcode_result < Input , Output , FromState , ToState > text_transcode_one ( Input input , FromEncoding && from , Output output , ToEncoding && to , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler , FromState & from_state , ToState & to_state ); 
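How a library might detect and prefer such an ADL-found hook can be sketched with the detection idiom. The example below is a simplified model under stated assumptions: it works on a single `char` code unit instead of ranges, omits error handlers and state, and every name in the `lib` and `app` namespaces is hypothetical.

```cpp
#include <type_traits>
#include <utility>

namespace lib {

struct one_result {
    char unit;      // the transcoded code unit
    bool via_hook;  // true when the user-provided shortcut was used
};

// Detects whether text_transcode_one(from, to, unit) can be found by ADL.
template <typename From, typename To, typename = void>
struct has_transcode_one_hook : std::false_type {};

template <typename From, typename To>
struct has_transcode_one_hook<From, To,
    std::void_t<decltype(text_transcode_one(std::declval<const From&>(),
                                            std::declval<const To&>(),
                                            std::declval<char>()))>>
    : std::true_type {};

// Prefers the hook; otherwise falls back to decode_one followed by encode_one.
template <typename From, typename To>
one_result transcode_one(const From& from, const To& to, char unit) {
    if constexpr (has_transcode_one_hook<From, To>::value) {
        return {text_transcode_one(from, to, unit), true};
    } else {
        const char32_t intermediate = from.decode_one(unit);
        return {to.encode_one(intermediate), false};
    }
}

struct toy_ascii {
    char32_t decode_one(char unit) const { return static_cast<unsigned char>(unit); }
    char encode_one(char32_t cp) const { return cp < 0x80 ? static_cast<char>(cp) : '?'; }
};

} // namespace lib

namespace app {

struct legacy_encoding {
    char32_t decode_one(char unit) const { return static_cast<unsigned char>(unit); }
    char encode_one(char32_t cp) const { return cp < 0x80 ? static_cast<char>(cp) : '?'; }
};

// Found by ADL because legacy_encoding lives in namespace app.
inline char text_transcode_one(const legacy_encoding&, const lib::toy_ascii&, char unit) {
    return unit;  // bitwise-compatible shortcut: no intermediate code point
}

} // namespace app
```

The caller writes `lib::transcode_one(from, to, unit)` in both cases; whether the shortcut or the generic round trip runs is decided at compile time from the encoding types alone.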
The following is a complete example of this customization point.
using ascii_to_utf8_result = std::text::transcode_result<
    std::span<char>, std::span<char8_t>,
    std::text::ascii::state, std::text::utf8::state>;

template <typename FromErrorHandler, typename ToErrorHandler>
ascii_to_utf8_result text_transcode_one(
    std::span<char> input, std::text::ascii& from,
    std::span<char8_t> output, std::text::utf8& to,
    FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler,
    std::text::ascii::state& from_state, std::text::utf8::state& to_state) {
    if (input.empty()) {
        // no input: that’s fine
        return ascii_to_utf8_result(input, output, from_state, to_state);
    }
    if (output.empty()) {
        // error: no room!
        return std::text::propagate_transcode_one_error(
            from, input, to, output,
            from_error_handler, to_error_handler, from_state, to_state,
            std::text::encoding_errc::insufficient_output_space,
            std::span<char, 0>{});
    }
    // test only the top bit: any value with it set is outside ASCII
    if ((static_cast<unsigned char>(input[0]) & 0x80) != 0) {
        // error: high bit set in ASCII
        return std::text::propagate_transcode_one_error(
            from, input.subspan<1>(), to, output,
            from_error_handler, to_error_handler, from_state, to_state,
            std::text::encoding_errc::invalid_sequence,
            input.subspan<1, 1>());
    }
    // bitwise compatible
    output[0] = static_cast<char8_t>(input[0]);
    // return result
    return ascii_to_utf8_result(input.subspan<1>(), output.subspan<1>(), from_state, to_state);
}
This is faster than the round trip through 
Note: The function 
Note: This may be an indication that there should be a third kind of error handler for 
It is important to note that the above example customization point only works for 
3.4.1.2. Customizability: Transcoding Free Functions
The free functions are the chance for the user to optimize bulk encoding. This is an area that becomes very important to users all over the world. Many people have already written optimized routines to convert from one encoding to another: it would be a shame if all of this work could not interoperate with the standard as it is. That is why there are 3 ADL-found free functions that are checked for well-formedness, and if so are called by the implementation in 
// in any related namespace in which ADL can find it template < typename Input , typename Encoding , typename Output , typename State , typename ErrorHandler > decode_result < Input , Output , State > text_decode ( Input input , const Encoding & encoding , Output output , State & state , ErrorHandler && error_handler ); template < typename Input , typename Encoding , typename Output , typename State , typename ErrorHandler > encode_result < Input , Output , State > text_encode ( Input input , const Encoding & encoding , Output output , State & state , ErrorHandler && error_handler ); template < typename Input , typename FromEncoding , typename Output , typename ToEncoding , typename FromState , typename ToState , typename FromErrorHandler , typename ToErrorHandler > transcode_result < Input , Output , FromState , ToState > text_transcode ( Input input , const FromEncoding & from_encoding , Output output , const ToEncoding & to_encoding , FromState & from_state , ToState & to_state , FromErrorHandler && from_error_handler , ToErrorHandler && to_error_handler ); 
Each of these is the customization hook that a user can write in a namespace to enable a proper conversion from one encoding to another. Nominally, users would use concrete types in place of templated types like 
template <typename FromErrorHandler, typename ToErrorHandler>
transcode_result<std::span<char>, std::span<char16_t>,
                 win_wrap::windows_1252::state, std::text::utf16::state>
text_transcode(
    std::span<char> input, const win_wrap::windows_1252& from_encoding,
    std::span<char16_t> output, const std::text::utf16& to_encoding,
    win_wrap::windows_1252::state& from_state, std::text::utf16::state& to_state,
    FromErrorHandler&& from_error_handler, ToErrorHandler&& to_error_handler) {
    if (input.empty()) {
        // do nothing
        return transcode_result</*...*/>(/* ... */);
    }
    // first call: ask how many UTF-16 code units the conversion needs
    int Needed = MultiByteToWideChar(1252, 0, input.data(),
        static_cast<int>(input.size()), nullptr, 0);
    if (Needed == 0 || (Needed > static_cast<int>(output.size()))) {
        // handle error ...
        return std::text::propagate_transcode_error(
            input, output, from_error_handler, to_error_handler,
            from_state, to_state,
            std::text::encoding_errc::insufficient_output_space,
            std::span<char, 0>{});
    }
    // second call: perform the conversion into the output span
    int Succ = MultiByteToWideChar(1252, 0, input.data(),
        static_cast<int>(input.size()),
        reinterpret_cast<wchar_t*>(output.data()),
        static_cast<int>(output.size()));
    if (Succ == 0) {
        // handle error ...
        return std::text::propagate_transcode_error(
            input, to_encoding, output, from_encoding,
            transcode_result</*...*/>(/* ... */),
            std::text::encoding_errc::invalid_sequence,
            std::span<char, 0>{});
    }
    return transcode_result</*...*/>(/* ... */);
}
This does not show all the error handling, but it is a full explanation/demonstration of a custom 
Note: Like in § 3.4.1.1 One-by-one Transcoding Shortcuts, the function 
There does exist some concern for individuals who may want to do specializations for the standard’s encodings. The specification will permit someone to write their own 
 Even if this is possible, it is absolutely expected for implementations to optimize common Unicode encoding pairs with OS or library-internal specific algorithms. If a vendor fails to do this, please file a bug against their implementation.
Loudly.
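The dispatch this section describes, where an ADL-found free function is used when the call is well-formed and a generic loop is used otherwise, can be sketched roughly as follows. This is a minimal C++17 illustration and not the proposed specification: the namespaces `sketch`, `demo` and `fast`, the trait `has_bulk_hook`, the members `decode_one` and `encode_one`, and the simplified four-argument `text_transcode` signature over `std::string` are all assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {

// Detects whether an ADL-visible text_transcode(input, from, output, to)
// is well-formed for this pair of encodings.
template <typename From, typename To, typename = void>
struct has_bulk_hook : std::false_type {};

template <typename From, typename To>
struct has_bulk_hook<From, To,
    std::void_t<decltype(text_transcode(
        std::declval<const std::string&>(), std::declval<const From&>(),
        std::declval<std::string&>(), std::declval<const To&>()))>>
    : std::true_type {};

// Prefer the user's bulk routine; otherwise loop one code unit at a time.
template <typename From, typename To>
std::string transcode(const std::string& input, const From& from, const To& to) {
    std::string output;
    if constexpr (has_bulk_hook<From, To>::value) {
        text_transcode(input, from, output, to);
    } else {
        for (char unit : input) to.encode_one(from.decode_one(unit), output);
    }
    return output;
}

}  // namespace sketch

namespace demo {

struct latin1 {
    char32_t decode_one(char unit) const {
        return static_cast<unsigned char>(unit);
    }
};

struct utf8 {
    // Only handles code points below U+0800, which covers Latin-1 input.
    void encode_one(char32_t cp, std::string& out) const {
        assert(cp < 0x800);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
};

}  // namespace demo

namespace fast {

inline int hook_calls = 0;  // lets the example observe which path ran

struct latin1 {
    char32_t decode_one(char unit) const {
        return static_cast<unsigned char>(unit);
    }
};

// The customization hook, found by ADL through fast::latin1.
// A real hook would call an optimized bulk routine here.
inline void text_transcode(const std::string& input, const latin1& from,
                           std::string& output, const demo::utf8& to) {
    ++hook_calls;
    output.reserve(input.size() * 2);
    for (char unit : input) to.encode_one(from.decode_one(unit), output);
}

}  // namespace fast
```

Calling `sketch::transcode` with `demo::latin1` takes the generic loop, while the same call with `fast::latin1` is routed to the hook without the caller changing anything.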
3.4.1.3. Customizability: Validating and Counting Free Functions
The 
// in any related namespace in which ADL can find it
// (the count functions take no output, so no Output parameter is declared:
//  it could never be deduced)
template <typename Input, typename Encoding, typename State, typename ErrorHandler>
count_result<Input, State> text_decode_count(
    Input input, const Encoding& encoding,
    State& state, ErrorHandler&& error_handler);

template <typename Input, typename Encoding, typename State, typename ErrorHandler>
count_result<Input, State> text_encode_count(
    Input input, const Encoding& encoding,
    State& state, ErrorHandler&& error_handler);

template <typename Input, typename Encoding, typename DecodeState, typename EncodeState>
validate_result<Input, DecodeState> text_validate(
    Input input, const Encoding& encoding,
    DecodeState& decode_state, EncodeState& encode_state);

template <typename Input, typename Encoding, typename DecodeState>
validate_result<Input, DecodeState> text_validate(
    Input input, const Encoding& encoding, DecodeState& state);
Notably, there are two 
In this case, we need a customization point wherein such an encoding, using internal/secret knowledge, can do its validation without needing to rely on the 4-argument 
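As an illustration of validation that relies on internal knowledge instead of a decode and encode round trip, here is a minimal sketch of a 3-argument style hook for ASCII. The namespace `my_text`, the `ascii` tag type and this simplified `validate_result` are hypothetical stand-ins, not the types proposed in this paper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

namespace my_text {  // hypothetical user namespace

struct ascii {
    struct state {};  // ASCII is stateless
};

// Hypothetical, simplified stand-in for the paper's validate_result.
struct validate_result {
    bool valid;
    std::size_t valid_prefix;  // number of leading code units known to be valid
};

// Validates without decoding to code points or re-encoding anything.
inline validate_result text_validate(std::string_view input, const ascii&,
                                     ascii::state&) {
    const std::size_t size = input.size();
    std::size_t i = 0;
    // Internal knowledge: ASCII is valid exactly when no byte has its high
    // bit set, so eight bytes can be tested with a single mask.
    for (; i + 8 <= size; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, input.data() + i, sizeof(word));
        if ((word & 0x8080808080808080ull) != 0) break;
    }
    assert(i <= size);  // the word loop never overshoots the input
    // Finish, or locate the offending byte, one code unit at a time.
    for (; i < size; ++i) {
        if ((static_cast<unsigned char>(input[i]) & 0x80u) != 0) {
            return {false, i};
        }
    }
    return {true, size};
}

}  // namespace my_text
```

The word-at-a-time test only narrows down where a problem lies; the scalar loop then reports the exact position of the first invalid code unit.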
4. Implementation Experience
There are implementations of this work, taking some of it in part or in full.
4.1. Previous Work
While the ideas presented in this paper have been explored in various different forms, the ideas have never been succinctly composed into a single distributable library. Therefore, the author of this paper is working on an implementation that synthesizes all of the learning from [icu], [boost.text], [text_view] and [libogonek]. Reportedly, an implementation using a similar system exists in a few Fortune 500 company codebases. [copperspice] also has a somewhat similar implementation, but differs in a few places.
4.2. Current Work
This paper’s r2 hopes to contain benchmarks, initial implementation and usage experience. This paper’s r3 hopes to contain more benchmarks, refined implementation and additional field and usage experience after a more valuable and viable minimum product is established. The current implementation is being incubated in a private implementation in 
5. FAQ
Some commonly asked questions.
5.1. Question: Why is there a max_code_points 
   This is incorrect. There are encodings, such as TSCII, that output multiple Unicode code points at once. The minimum required space must be dictated by the encoding: C++ made the mistake for 
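To make the point about encoding-dictated space concrete, the following sketch uses a made-up encoding in which one code unit can decode to two code points. `toy_ligature_encoding`, its `decode_one` member and `decode_all` are hypothetical names for illustration only; the sketch merely assumes the encoding exposes a `max_code_points` constant that generic code uses to size its scratch buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding in which one code unit may decode to several
// code points.
struct toy_ligature_encoding {
    // Largest number of code points a single decode step can produce.
    static constexpr std::size_t max_code_points = 2;

    // Writes the decoded code points to `out`; returns how many were written.
    static std::size_t decode_one(unsigned char unit, char32_t* out) {
        if (unit == 0x01) {  // made-up "fi" ligature
            out[0] = U'f';
            out[1] = U'i';
            return 2;
        }
        out[0] = unit;
        return 1;
    }
};

// Generic code sizes its scratch space from the encoding, never from a
// hard-coded "one code point is enough" assumption.
template <typename Encoding>
std::u32string decode_all(const std::string& input) {
    std::array<char32_t, Encoding::max_code_points> scratch{};
    std::u32string output;
    for (char unit : input) {
        const std::size_t count = Encoding::decode_one(
            static_cast<unsigned char>(unit), scratch.data());
        assert(count <= Encoding::max_code_points);
        output.append(scratch.data(), count);
    }
    return output;
}
```

A scratch buffer fixed at one code point would overflow on the ligature; reading the bound from the encoding avoids that without special cases in the generic loop.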
5.2. Question: What about Old Unicode Encodings / Private Use Area Encodings?
These are treated like legacy encodings. Someone must convert to "normal" (Unicode vRight-Now) Unicode in order to have higher level algorithms work. If this includes Private Use Area characters, then a person will need the ability to customize the normalization algorithms for use in getting e.g. Medieval Text and Biblical Text to normalize properly. This will be covered in a future paper on a 
5.3. Question: It can be faster to bulk-decode, then bulk-encode instead of one-by-one transcoding. Why not that design?
While this is true, as asserted in the § 3.3.1.3 Free Function transcode section, bulk decoding requires intermediate storage to bulk-decode into. This either imposes an invisible intermediate buffer in the API or requires explicitly allowing the user to pass one in. Furthermore, a user may only want to partially decode, partially encode, and then repeat, because there is some internal memory limit, rather than do a single "complete" bulk conversion.
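The "partially decode, partially encode, and then repeat" pattern can be sketched with a fixed-size intermediate buffer. This is only an illustration under simplifying assumptions: the input is Latin-1 (so every code unit is one code point), the output is UTF-8 held in a `std::string`, and the names `decode_latin1_chunk`, `encode_utf8_chunk` and `transcode_chunked` are hypothetical and not part of the proposal.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Bulk-decodes Latin-1 into at most `capacity` code points and returns how
// many code units were consumed (one code point per code unit).
inline std::size_t decode_latin1_chunk(std::string_view input, char32_t* out,
                                       std::size_t capacity) {
    const std::size_t count = input.size() < capacity ? input.size() : capacity;
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = static_cast<unsigned char>(input[i]);
    }
    return count;
}

// Bulk-encodes code points below U+0800 as UTF-8.
inline void encode_utf8_chunk(const char32_t* code_points, std::size_t count,
                              std::string& out) {
    for (std::size_t i = 0; i < count; ++i) {
        const char32_t cp = code_points[i];
        assert(cp < 0x800);  // Latin-1 input never exceeds U+00FF
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
}

// Repeats decode-then-encode over a caller-chosen, bounded intermediate.
template <std::size_t ChunkSize>
std::string transcode_chunked(std::string_view input) {
    static_assert(ChunkSize > 0, "the intermediate buffer must hold something");
    std::array<char32_t, ChunkSize> intermediate{};
    std::string output;
    while (!input.empty()) {
        const std::size_t consumed = decode_latin1_chunk(
            input, intermediate.data(), intermediate.size());
        assert(consumed > 0);  // guarantees the loop makes progress
        encode_utf8_chunk(intermediate.data(), consumed, output);
        input.remove_prefix(consumed);
    }
    return output;
}
```

The intermediate buffer here is explicit and sized by the caller through `ChunkSize`, which is exactly the kind of decision a purely bulk-decode-then-bulk-encode API would have to either hide or expose.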
A significant amount of thought and experimental implementation went into potentially providing both a 
5.4. Question: Where is the specification for normalization_view < nfkc > normalize (...) 
   Normalization is separable from the low-level transcoding, and even though APIs like 
5.5. Question: Where is the specification for std :: text :: basic_text std :: text :: basic_text_view 
   Those types, as currently imagined, require additional functionality, like normalization and potentially segmentation algorithms (e.g., for making Grapheme Clusters). They will be split off into a separate paper, even if we allude to their existence and use in this proposal.
6. Acknowledgements
Thanks to R. Martinho Fernandes, whose insightful Unicode quips got me hooked on the problem space many, many years ago and helped me develop my first in-house solution for an encoding container adaptor several years ago. Thanks to Mark Boyall, Xeo, and Eric Tremblay for bouncing off ideas, fixes, and other thoughts many years ago when struggling to compile libogonek on a disastrous Microsoft Visual Studio November 2012 CTP compiler.
Thanks to Tom Honermann, who had me present at my second SG16 meeting, before it was SG16, and help represent and carry his papers, which gave me the drive to help fix the C++ standard for text. Many thanks to Zach Laine, whose tireless implementation efforts have given me much insight and understanding into the complexities of Unicode and whose implementation in Boost.Text made clear the tradeoffs and performance issues. Thanks to Mark Zeren, who helped keep me in SG16 and working on these problems.
And thank you to those of you who grew tired of an ASCII-only world and supported this effort.