| Document number: | N3336 = 12-0026 | 
| Date: | 2012-01-13 | 
| Project: | Programming Language C++, Library Working Group | 
| Reply-to: | Beman Dawes <bdawes at acm dot org> | 
Introduction
String conversion safety rationale
Design paths not taken
Acknowledgements
String interoperability problems and proposed solutions
    Problem 1: Strings don't interoperate if encoding differs
    Problem 2: Strings don't interoperate with I/O streams if encoding differs
    Problem 3: String conversion iterators are not provided
  
This paper proposes additions to the C++ standard library/TR2 to ease use of Unicode and other string encodings. The motivation is a series of problems with the C++11 standard library.
The full statement of the problems with proposed solutions is given below in String interoperability problems and proposed solutions.
The C++03 versions of these problems were first encountered while providing Unicode support for the internationalization of commercial GIS software. The problems appeared again while working on the Boost Filesystem Library. These problems have become more apparent as compiler support for C++11's additional Unicode support has made it easier to write programs that run up against current limitations.
The proposed solutions are pure additions to the C++11 standard library. No C++03 or C++11 compliant code is broken or otherwise affected by the additions.
This paper does not provide working paper wording. WP wording will be provided if this proposal is accepted in principle.
A "proof-of-concept" implementation of the proposals (and more) is available at github.com/Beman/string-interoperability.
The proposed solutions below make the assumption that it is safe to convert a string of any type and encoding to another type and encoding. The rationale for that assumption follows.
Conversion in either direction between UTF-8 encoded std::string and UTF-32 encoded std::u32string is safe because it is defined by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.
Conversion in either direction between UTF-16 encoded std::u16string and UTF-32 encoded std::u32string is safe because it is defined by the Unicode Consortium and ISO/IEC 10646 as unambiguous and lossless.
Conversion in either direction between UTF-8 encoded std::string and UTF-16 encoded std::u16string is safe because it can be composed from the two previous known safe conversions via an intermediate conversion to and from UTF-32 encoded char32_t characters.
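The composition argument above can be illustrated with a minimal sketch. It is not taken from the proof-of-concept implementation; the function names utf16_to_utf32, utf32_to_utf8, utf8_tail, and utf16_to_utf8 are hypothetical, and malformed input is simply reported by throwing.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-16 into UTF-32, combining surrogate pairs.
// Throws std::invalid_argument on an unpaired surrogate.
inline std::u32string utf16_to_utf32(const std::u16string& in)
{
  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t c = in[i];
    if (c >= 0xD800 && c <= 0xDBFF) {               // high surrogate
      if (i + 1 == in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
        throw std::invalid_argument("unpaired high surrogate");
      c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      ++i;
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      throw std::invalid_argument("unpaired low surrogate");
    }
    out.push_back(c);
  }
  return out;
}

// Continuation byte carrying the low six bits of c.
inline char utf8_tail(char32_t c)
{
  return static_cast<char>(0x80 | (c & 0x3F));
}

// Encode UTF-32 as UTF-8.
// Throws std::invalid_argument if a value is not a Unicode scalar value.
inline std::string utf32_to_utf8(const std::u32string& in)
{
  std::string out;
  for (char32_t c : in) {
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
      throw std::invalid_argument("not a Unicode scalar value");
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else if (c < 0x800) {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += utf8_tail(c);
    } else if (c < 0x10000) {
      out += static_cast<char>(0xE0 | (c >> 12));
      out += utf8_tail(c >> 6);
      out += utf8_tail(c);
    } else {
      out += static_cast<char>(0xF0 | (c >> 18));
      out += utf8_tail(c >> 12);
      out += utf8_tail(c >> 6);
      out += utf8_tail(c);
    }
  }
  return out;
}

// UTF-16 to UTF-8, composed from the two conversions above via UTF-32.
inline std::string utf16_to_utf8(const std::u16string& in)
{
  return utf32_to_utf8(utf16_to_utf32(in));
}
```

This sketch materializes an intermediate std::u32string for clarity; avoiding that temporary is the subject of Problem 3.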
The cases of std::string and std::wstring are more complex in that the encoding is not implied by the char and wchar_t value types. It is not necessary, however, to know the encoding of these string types in advance as long as it is known how to convert them to one of the string types with a known encoding. The C++11 standard library requires codecvt<char32_t,char,mbstate_t> and codecvt<wchar_t,char,mbstate_t> facets, so such conversions are always possible using the standard library. In practice, library implementations have additional knowledge that allows such conversions to be more efficient than just calling codecvt facets. To ensure safety, however, error handling does need to be provided, as conversions involving some char and wchar_t encodings can encounter errors. See Problem 3 below for some requested error handling approaches.
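As an illustration of why error handling matters, the following minimal sketch decodes a narrow string that is assumed to be UTF-8 encoded and lets the caller choose between two policies. The on_error enumeration and the utf8_to_utf32 function are hypothetical names invented for this example; they are not part of the proposal or of the proof-of-concept implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical error policies: throw, or substitute U+FFFD.
enum class on_error { throw_exception, replace };

// Decode a std::string assumed to hold UTF-8 into UTF-32.
inline std::u32string utf8_to_utf32(const std::string& in,
                                    on_error policy = on_error::throw_exception)
{
  // Smallest code point that legitimately needs a sequence of each length.
  static const char32_t min_value[] = {0, 0, 0x80, 0x800, 0x10000};

  std::u32string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char b = static_cast<unsigned char>(in[i]);
    // Sequence length implied by the lead byte; 0 means an invalid lead byte.
    const std::size_t len = b < 0x80 ? 1
                          : (b >> 5) == 0x06 ? 2
                          : (b >> 4) == 0x0E ? 3
                          : (b >> 3) == 0x1E ? 4 : 0;
    char32_t c = (len == 1) ? b : (b & (0x7F >> len));
    bool ok = len != 0 && i + len <= in.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      const unsigned char t = static_cast<unsigned char>(in[i + k]);
      ok = (t & 0xC0) == 0x80;                 // must be a continuation byte
      c = (c << 6) | (t & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (ok && (c < min_value[len] || c > 0x10FFFF
               || (c >= 0xD800 && c <= 0xDFFF)))
      ok = false;

    if (ok) {
      out.push_back(c);
      i += len;
    } else if (policy == on_error::replace) {
      out.push_back(0xFFFD);                   // one replacement per bad byte
      ++i;
    } else {
      throw std::invalid_argument("malformed UTF-8");
    }
  }
  return out;
}
```

The replacement policy here emits one U+FFFD per offending byte, which is only one of several reasonable choices.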
Implicit conversion between single characters of different types, as opposed to strings, may require multi-character sequences. No such single character implicit conversions are proposed here.
This proposal deals with C++11 std::basic_string and 
character types, and with their encodings. The deeper attributes of Unicode 
characters are not addressed. See Mathias Gaunard's 
  Unicode project for an example of deeper Unicode support.
This proposal does not suggest providing a string type guaranteed to provide 
UTF-8 encoding.  Although experiments with typedef 
basic_string<unsigned char> u8string; worked well, benefits would be 
speculative and not based on existing practice.
Another approach would be to provide a utf8_char_traits class 
and then typedef 
basic_string<char, utf8_char_traits> u8string;. This approach has 
not been investigated. 
Peter Dimov inspired the idea of string interoperability by arguing that the Boost Filesystem library should treat a path as a single type (i.e. not a template) regardless of character size and encoding.
John Maddock's Unicode conversion iterators demonstrated an 
  easier-to-use, more efficient, and STL friendlier way to perform character 
type and encoding conversions as an alternative to standard library 
codecvt facets.
The C++11 standard deserves acknowledgement as it provides the underlying language and library features that allow Unicode string interoperability:
  char16_t and char32_t provide Unicode character types and null-terminated character strings with guaranteed encodings.
  std::u16string and std::u32string provide library support for Unicode character types and encodings.
  u8, u, and U character and string literals ease programming with Unicode character types and encodings.
Standard library strings with different character encodings have different types that do not interoperate.
u16string s16 = u"您好世界";
u32string s32;
s32 = s16;                               // error!
s32 = "foo";                             // error!
s32 = s16.c_str();                       // error!
s32.assign(s16.cbegin(), s16.cend());    // error!
void f(const string&);
f(s32);                                  // error!
The encoding of basic_string instantiations can be determined for the types under discussion. It is either implicit in the string's value_type or can be determined via the locale.
In Boost Filesystem Version 3, and in the filesystem proposal before the C++ committee, class path solves some of the string interoperability problems, albeit in a limited context. A function that is declared like this:
void f(const path&);
can be called like this:
f("Meow");
f(L"Meow");
f(u8"Meow");
f(u"Meow");
f(U"Meow");
// ... many additional variations such as basic_strings and iterators
This string interoperability support has been a success. It does, however, 
raise the question of why std::basic_string isn't providing the 
interoperability support. Users are misusing paths as general string containers 
because they provide interoperability. The string interoperability cat is out of the bag. 
The toothpaste is out of the tube.
See Boost.Filesystem V3 class path for an example of how such interoperability might be achieved.
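The general technique can be sketched with a small class that accepts several character types and stores a single internal encoding. This is not the Boost.Filesystem implementation; the class utf8_text and the function byte_length are hypothetical, narrow input is assumed to be UTF-8 already, and error handling for unpaired surrogates is omitted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical class: accepts char, char16_t, and char32_t strings and
// stores the text internally as UTF-8.
class utf8_text
{
public:
  utf8_text(const char* s) : data_(s) {}        // assumed to be UTF-8 already
  utf8_text(const char32_t* s) { for (; *s; ++s) append(*s); }
  utf8_text(const char16_t* s)
  {
    for (; *s; ++s) {
      char32_t c = *s;
      // s[1] is at worst the terminating null, so reading it is safe.
      if (c >= 0xD800 && c <= 0xDBFF && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (s[1] - 0xDC00);
        ++s;
      }
      append(c);
    }
  }

  const std::string& str() const { return data_; }

private:
  static char tail(char32_t c) { return static_cast<char>(0x80 | (c & 0x3F)); }

  // Append one code point, encoded as UTF-8.
  void append(char32_t c)
  {
    if (c < 0x80) {
      data_ += static_cast<char>(c);
    } else if (c < 0x800) {
      data_ += static_cast<char>(0xC0 | (c >> 6));
      data_ += tail(c);
    } else if (c < 0x10000) {
      data_ += static_cast<char>(0xE0 | (c >> 12));
      data_ += tail(c >> 6);
      data_ += tail(c);
    } else {
      data_ += static_cast<char>(0xF0 | (c >> 18));
      data_ += tail(c >> 12);
      data_ += tail(c >> 6);
      data_ += tail(c);
    }
  }

  std::string data_;
};

// A function declared once can be called with any supported literal type.
inline std::size_t byte_length(const utf8_text& t) { return t.str().size(); }
```

With these converting constructors, byte_length("Meow"), byte_length(u"Meow"), and byte_length(U"Meow") all compile and refer to the same stored UTF-8 text.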
Experience with Boost.Filesystem V3 class path has demonstrated that string interoperability brings a considerable simplification and improvement to internationalized user code, but that having to provide interoperability without the resolution of the issues presented here is a band-aid.
String interoperability will be easier to specify, implement, and use if the string interoperability iterators proposed below are accepted.
The approach is to add additional std::basic_string overloads to the functions most likely to benefit from interoperability. The overloads take the form of function templates with sufficient restrictions on overload resolution participation (i.e. enable_if) that the existing C++11 functions are always selected if the value type of the argument is the same as or convertible to the std::basic_string type's value_type. The semantics of the added signatures are the same as those of the original signatures, except that arguments of the template parameter type have their value converted to the type and encoding of basic_string::value_type.
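A minimal sketch of the enable_if technique follows. Because overloads cannot be added to std::basic_string from user code, the example uses a hypothetical wrapper class u32text as a stand-in. For simplicity the added overload is restricted with is_same rather than the "same as or convertible to" rule described above, and it only converts from UTF-16; both simplifications are assumptions of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical stand-in for std::u32string with one interoperable overload.
class u32text
{
public:
  // Existing-style signature: same value type, plain assignment.
  u32text& operator=(const std::u32string& s) { data_ = s; return *this; }

  // Added signature: participates in overload resolution only when the
  // argument's value type differs from char32_t.
  template <class String>
  typename std::enable_if<
      !std::is_same<typename String::value_type, char32_t>::value,
      u32text&>::type
  operator=(const String& s)
  {
    static_assert(std::is_same<typename String::value_type, char16_t>::value,
                  "this sketch only converts from UTF-16");
    data_.clear();
    for (std::size_t i = 0; i < s.size(); ++i) {
      char32_t c = s[i];
      // Combine a surrogate pair into one code point.
      if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
          && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
        ++i;
      }
      data_.push_back(c);
    }
    return *this;
  }

  const std::u32string& str() const { return data_; }

private:
  std::u32string data_;
};
```

Assigning a std::u32string selects the non-template signature, while assigning a std::u16string selects the converting template.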
The std::basic_string functions given additional overloads are:
  operator=, operator+=, append, and assign signatures.
  template <class T> unspecified_iterator c_str(), returning an unspecified iterator with value_type of T.
  begin() and end(). Similar to c_str().
To keep the number and complexity of overloads manageable, the proof-of-concept implementation does not provide any way to specify error handling policies, or string and wstring encoding. Not every one of the added signatures needs to be able to control error handling and encoding. The need is particularly rare in environments where UTF-8 is the narrow character encoding and UTF-16 is the wide character encoding. A subset, possibly just c_str(), begin(), and end(), with error handling and encoding parameters or arguments, suitably defaulted, may well be sufficient.
Because full implicit interoperability requires many additional signatures to be added to basic_string, it will certainly be appropriate to discuss limiting changes to the key areas of need. For example, constructors and operator= are much more likely to need interoperability than operator+=, append, or assign signatures.
I/O streams do not accept strings of different character types
A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:
  #include <iostream>
int main()
{
  std::cout << U"您好世界";   // error in C++11!
}
This code should 
"just work", even though the type of U"您好世界" is const 
char32_t*, not const char*, as long as the encoding of char 
supports 您好世界. Even if those characters are not 
supported by default encodings, alternatives like UTF-8 are available. 
The code does "just work" with the proof-of-concept implementation of this 
proposal. On Linux, with default char encoding of UTF-8,  execution 
produces the expected 您好世界 output. On Windows, the 
console doesn't support full UTF-8, so the output can be piped to a file or to a 
program which does handle UTF-8 correctly. And, yes, that does work correctly 
with the proof-of-concept implementation of this proposal.
Add additional function templates to those in 27.7.3.6.4 [ostream.inserters.character],
Character inserter function templates, to cover the case where the 
argument character type differs from charT and is not char, 
signed char, unsigned char, const char*, 
const signed char*, or const unsigned char*.  (The 
specified types are excluded because they are covered by existing signatures.) 
The semantics of the added signatures are the same as those of the original signatures, except that arguments shall be converted to the type and encoding of the stream.
Do the same for the character extractors in 27.7.2.2.3 [istream::extractors], basic_istream::operator>>.
Do the same for the two std::basic_string inserters and 
extractors in 21.4.8.9 [string.io], Inserters and extractors.
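As a rough illustration of what such an inserter might do, the sketch below defines a user-level operator<< that writes a std::u32string to a narrow stream. It assumes the stream's narrow encoding is UTF-8, which the proposal itself does not require, and the helper to_narrow is a hypothetical name used only to exercise the inserter.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical inserter: writes a UTF-32 string to a narrow stream,
// assuming the stream's encoding is UTF-8.
inline std::ostream& operator<<(std::ostream& os, const std::u32string& s)
{
  for (char32_t c : s) {
    char buf[4];
    int n = 0;
    if (c < 0x80) {
      buf[n++] = static_cast<char>(c);
    } else if (c < 0x800) {
      buf[n++] = static_cast<char>(0xC0 | (c >> 6));
      buf[n++] = static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      buf[n++] = static_cast<char>(0xE0 | (c >> 12));
      buf[n++] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      buf[n++] = static_cast<char>(0x80 | (c & 0x3F));
    } else {
      buf[n++] = static_cast<char>(0xF0 | (c >> 18));
      buf[n++] = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
      buf[n++] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      buf[n++] = static_cast<char>(0x80 | (c & 0x3F));
    }
    os.write(buf, n);
  }
  return os;
}

// Convenience wrapper that formats through the inserter above.
inline std::string to_narrow(const std::u32string& s)
{
  std::ostringstream os;
  os << s;
  return os.str();
}
```

A standardized version would presumably consult the stream's locale rather than hard-coding UTF-8, and would cover character pointers and extractors as well.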
Conversion between character types and their encodings using current standard 
library facilities such as std::codecvt, std::locale, 
and 
std::wstring_convert has multiple problems:
  codecvt facets don't easily compose into a complete conversion from one encoding to another. Such composition is existing practice in C libraries like ICU. UTF-32 is the obvious choice for the common encoding to pass between codecs.
  [...] std::locale and code conversion, even when these are implementation details that should be hidden from the application.
The generalization of the std::basic_string function c_str is:
template <class T> unspecified_iterator c_str() const;
Given a std::string named s8, this allows a user to write s8.c_str<char16_t>() to obtain an iterator with a value type of char16_t. To implement this function generically using the current standard library would be difficult, and would involve the creation of a temporary string. The full implementation with the proposed solution is simply:
template <class T>
converting_iterator<const_iterator, value_type, by_range, T> c_str() const
{
  return converting_iterator<const_iterator,
    value_type, by_range, T>(cbegin(), cend());
}
  
  No temporary string is created, and none of the other problems listed above are present either. The solution is generally useful for user defined types, and not just for implementations of the standard library.
Other problems become easier to solve with converting_iterator. 
  For example, the Filesystem library's class path in
  
  N3239 has many functions with an argument in the form const 
  codecvt_type& cvt=codecvt() that could be eliminated by either direct 
  or indirect use of converting_iterator.
Boost Regex for many years has included a set of Unicode conversion iterators as an implementation detail. Although these do not provide composition, they do demonstrate the technique of using encoding conversion iterators to avoid creation of temporary strings.
This solution is based on the proof-of-concept implementation. Input iterator requirements can probably be loosened to bidirectional, but that hasn't been tested yet.
The preliminaries begin with end-detection policy classes, since strings use null termination, size, or half-open ranges to determine the end of a sequence.
template <class InputIterator> class by_null;
template <class InputIterator> class by_size;
template <class InputIterator> class by_range;
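The paper gives only the declarations of these policies. The sketch below shows one way they could be defined; the members is_end and advance, and the helper count_elements, are hypothetical and are not taken from the proof-of-concept implementation.

```cpp
#include <cstddef>

// Null termination: the sequence ends at the first zero element.
template <class InputIterator>
class by_null
{
public:
  bool is_end(InputIterator it) const { return *it == 0; }
  void advance() {}
};

// Size: the sequence ends after a fixed number of elements.
template <class InputIterator>
class by_size
{
public:
  explicit by_size(std::size_t n) : remaining_(n) {}
  bool is_end(InputIterator) const { return remaining_ == 0; }
  void advance() { --remaining_; }
private:
  std::size_t remaining_;
};

// Half-open range: the sequence ends when the iterator reaches end.
template <class InputIterator>
class by_range
{
public:
  explicit by_range(InputIterator end) : end_(end) {}
  bool is_end(InputIterator it) const { return it == end_; }
  void advance() {}
private:
  InputIterator end_;
};

// Walks a sequence using whichever policy is supplied and counts elements.
template <class InputIterator, class EndPolicy>
std::size_t count_elements(InputIterator it, EndPolicy policy)
{
  std::size_t n = 0;
  for (; !policy.is_end(it); ++it, policy.advance())
    ++n;
  return n;
}
```

The point of the policy split is that the same traversal code can serve c_str()-style pointers, pointer-plus-length arguments, and iterator pairs.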
Codec templates handle actual conversion to and from UTF-32. The primary templates are:
template <class InputIterator, class FromCharT, template<class> class EndPolicy>
  class to32_iterator;
template <class InputIterator, class ToCharT>
  class from32_iterator;
The standard library would provide specializations for char, wchar_t, char16_t, and char32_t. Presumably users could provide specializations for UDTs, but that hasn't been tested yet. The char and wchar_t specializations provide mechanisms to select the encoding. Since this is a new component, the char default encoding could be UTF-8 rather than locale-based, and no existing code would be broken.
The actual converting_iterator primary template is 
  simply:
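template <class InputIterator, class FromCharT, template<class> class EndPolicy,
          class ToCharT>
class converting_iterator
  : public from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>,
      ToCharT>
{
  // The base class is a dependent type, so it is named through a typedef.
  typedef from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>,
      ToCharT> base;
public:
  using base::base;   // inherit the base class constructors
};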
  
  Specializations may be provided, but aren't required. The proof-of-concept implementation doesn't use inherited constructors because of lack of compiler support.
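The composition idea can be sketched with two simplified iterator adaptors, one decoding to UTF-32 and one encoding from UTF-32, stacked so that no intermediate UTF-32 string is created. The class names to32_from_utf16 and from32_to_utf8 and the function utf16_to_utf8 are hypothetical; the adaptors take an explicit end iterator instead of an end policy, require a copyable underlying iterator, and omit error handling and the full iterator interface.

```cpp
#include <string>

// Stage 1 (hypothetical): UTF-16 code units in [it, end) to UTF-32 code points.
template <class Iter>
class to32_from_utf16
{
public:
  to32_from_utf16(Iter it, Iter end) : it_(it), end_(end) {}
  char32_t operator*() const
  {
    char32_t c = *it_;
    Iter next = it_;
    if (is_high(c) && ++next != end_ && is_low(*next))
      c = 0x10000 + ((c - 0xD800) << 10) + (*next - 0xDC00);
    return c;
  }
  to32_from_utf16& operator++()
  {
    const char32_t c = *it_++;
    if (is_high(c) && it_ != end_ && is_low(*it_)) ++it_;  // skip low surrogate
    return *this;
  }
  bool operator==(const to32_from_utf16& rhs) const { return it_ == rhs.it_; }
  bool operator!=(const to32_from_utf16& rhs) const { return it_ != rhs.it_; }
private:
  static bool is_high(char32_t c) { return c >= 0xD800 && c <= 0xDBFF; }
  static bool is_low(char32_t c) { return c >= 0xDC00 && c <= 0xDFFF; }
  Iter it_, end_;
};

// Stage 2 (hypothetical): UTF-32 code points to UTF-8 code units.
// Buffers the up-to-four bytes of the current code point.
template <class Iter32>
class from32_to_utf8
{
public:
  from32_to_utf8(Iter32 it, Iter32 end)
    : it_(it), end_(end), buf_(), pos_(0), len_(0) { fill(); }
  char operator*() const { return buf_[pos_]; }
  from32_to_utf8& operator++()
  {
    if (++pos_ == len_) { ++it_; fill(); }   // move on to the next code point
    return *this;
  }
  bool operator!=(const from32_to_utf8& rhs) const
  { return it_ != rhs.it_ || pos_ != rhs.pos_; }
private:
  static char tail(char32_t c) { return static_cast<char>(0x80 | (c & 0x3F)); }
  void fill()
  {
    pos_ = len_ = 0;
    if (it_ == end_) return;
    const char32_t c = *it_;
    if (c < 0x80) {
      buf_[len_++] = static_cast<char>(c);
    } else if (c < 0x800) {
      buf_[len_++] = static_cast<char>(0xC0 | (c >> 6));
    } else if (c < 0x10000) {
      buf_[len_++] = static_cast<char>(0xE0 | (c >> 12));
      buf_[len_++] = tail(c >> 6);
    } else {
      buf_[len_++] = static_cast<char>(0xF0 | (c >> 18));
      buf_[len_++] = tail(c >> 12);
      buf_[len_++] = tail(c >> 6);
    }
    if (c >= 0x80) buf_[len_++] = tail(c);   // final continuation byte
  }
  Iter32 it_, end_;
  char buf_[4];
  unsigned pos_, len_;
};

// Composition: stage 2 reads directly from stage 1.
inline std::string utf16_to_utf8(const std::u16string& s)
{
  typedef to32_from_utf16<std::u16string::const_iterator> stage1;
  typedef from32_to_utf8<stage1> converter;
  const stage1 first(s.begin(), s.end()), last(s.end(), s.end());
  std::string out;
  for (converter it(first, last), end(last, last); it != end; ++it)
    out += *it;
  return out;
}
```

The only string built here is the final result; each UTF-32 code point exists only transiently inside the stacked iterators.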