“We demand rigidly defined areas of doubt and uncertainty!”
― Douglas Adams
1. Introduction
A new text formatting facility ([P0645]) was adopted into the draft standard for C++20 in Cologne. Unfortunately, it left the units of width and precision unspecified, which created an ambiguity for string arguments in variable-width encodings ([LWG3290]). This paper proposes fixing this shortcoming by specifying width and precision in a way that satisfies the following goals:
- addressing the main use case
- locale-independence by default
- Unicode support
- ordinary and wide execution encoding support
- consistency with SG16’s long-term direction
- following existing practice
- ease of implementation
2. Changes
- Replace "required to display it in a terminal" with "appropriate for displaying it in a terminal".
- Replace
  [ Note: the implementations are encouraged to use Unicode on platforms capable of displaying Unicode text in a terminal which is the case for Windows-based and many POSIX-based operating systems. — end note ]
  with
  Implementations should use Unicode on platforms capable of displaying Unicode text in a terminal. [ Note: This is the case for Windows-based and many POSIX-based operating systems. — end note ]
- Replace "Width of a string in a Unicode encoding is the sum" with "Implementations should estimate the width of a string in a Unicode encoding as the sum".
- Remove the note saying "The method of estimated width computation is subject to change. — end note".
- Add "Unicode® 12.1.0 Standard Annex #29 Unicode Text Segmentation" to Normative references [intro.refs].
- Replace
  For string types this field specifies the maximum width. Trailing grapheme clusters or implementation-defined units of width that exceed the given precision are ignored.
  with
  For string types, this field provides an upper bound for the estimated width of the prefix of the input string that is copied into the output. For a string assumed to be in a Unicode encoding, the longest prefix of whole extended grapheme clusters whose estimated width is no greater than the precision is copied to the output.
- Fix minor wording issues.
- Replace
  For the purposes of width computation the string is assumed to be in a fixed operating system dependent encoding. If the operating system is capable of displaying Unicode text in a terminal both ordinary and wide encodings are Unicode encodings such as UTF-8 and UTF-16, respectively. [ Note: this is the case for Windows-based and many POSIX-based operating systems. — end note ] Otherwise, the encodings are implementation-defined.
  with
  For the purposes of width computation the string is assumed to be in a locale-independent implementation-defined encoding. [ Note: the implementations are encouraged to use Unicode on platforms capable of displaying Unicode text in a terminal which is the case for Windows-based and many POSIX-based operating systems. — end note ]
- Remove the "Optional (possibly in C++23)" section.
- Add poll results.
3. SG16 poll results
Do we want to address this problem in C++20?
    SF  F  N  A  SA
    10  1  2  0  0
Width / precision calculation should be computed in terms of display width?
    SF  F  N  A  SA
     9  3  1  0  0
Resolve US228 with P1868R0 modified to specify the encoding used to interpret the input text for the purposes of width and precision computation be implementation defined, but not locale sensitive, with non-normative guidance to use Unicode if possible?
    SF  F  N  A  SA
     7  4  1  0  0
4. LEWG poll results
Forward D1868R1 to LWG for C++20.
    SF  F  N  A  SA
     5  8  1  0  0
5. Motivating example
To the best of our knowledge, the main use case for the string width and precision format specifiers is to align text when displayed in a terminal with a monospaced font. The motivating example is a columnar view in a typical command-line interface:
We would like to be able to produce similar or better output with the C++20 formatting facility using the most natural API, namely dynamic width:
    // Prints names in num_cols columns of col_width width each.
    void print_columns(const std::vector<std::string>& names, int num_cols,
                       int col_width) {
      for (size_t i = 0, size = names.size(); i < size; ++i) {
        std::cout << std::format("{0:{1}}", names[i], col_width);
        if (i % num_cols == num_cols - 1 || i == size - 1) std::cout << '\n';
      }
    }

    std::vector<std::string> names = {
      "Die Allgemeine Erklärung der Menschenrechte",
      "『世界人権宣言』",
      "Universal Declaration of Human Rights",
      "Всеобщая декларация прав человека",
      "世界人权宣言",
      "ΟΙΚΟΥΜΕΝΙΚΗ ΔΙΑΚΗΡΥΞΗ ΓΙΑ ΤΑ ΑΝΘΡΩΠΙΝΑ ΔΙΚΑΙΩΜΑΤΑ"
    };

    print_columns(names, 2, 60);
Desired output:
(Note that the spacing in front of 『 is part of the character and it is aligned correctly both in the code and in the output.)
6. Prior art
Display width is a well-established concept. In particular, POSIX defines the
wcswidth function ([WCSWIDTH]) that has the required semantics:
The wcswidth() function shall determine the number of column positions required for n wide-character codes (or fewer than n wide-character codes if a null wide-character code is encountered before n wide-character codes are exhausted) in the string pointed to by pwcs.
Many languages have implementations of this or similar functionality.
Here is an incomplete list of them:
- C: wcwidth ([MGK25])
- Go: go-runewidth ([RUNEWIDTH])
- JavaScript: wcwidth.js ([WCWIDTH-JS])
- Julia: Base.UTF8proc.charwidth, Base.strwidth ([CHARWIDTH]) - a part of the standard library
- Perl: Text::CharWidth ([TCW])
- Python: wcwidth ([WCWIDTH-PY]) - used by over 60,000 projects according to the GitHub dependency graph
- Ruby: unicode-display_width ([UDW]) - used by over 170,000 projects according to the GitHub dependency graph
GitHub code search returns over 500,000 results for "wcwidth" and 180,000 results for "wcswidth".
The number of implementations of this facility, together with the large usage numbers, indicates that it is an important use case. All of the above implementations work exclusively with Unicode.
7. Locale and execution encodings
One of the major design features of the C++20 formatting facility ([P0645]) is
locale independence by default, with locale-aware formatting available as an
opt-in via separate format specifiers. This has an important safety property:
the computed size of the formatted output does not depend on the global locale
by default, so a buffer allocated with this size can be passed safely to a
formatting function even if the locale has been changed in the meantime,
possibly from another thread. It is desirable to preserve this property for
strings for both safety and consistency reasons.
Another observation is that the terminal’s encoding is independent of the
execution encoding. For example, on Windows it’s possible to change the
console’s code page ([SCOCP]) independently of the active code page or the
global locale. It is also possible to write Unicode text to a console
regardless of both the active code page and the console code page. On macOS and
Linux, the terminal’s encoding is determined by the settings of the terminal
emulator application and normally defaults to UTF-8. Changing the encoding is
possible but has severe limitations. For example, if you change the terminal
encoding from UTF-8 to KOI8-R on macOS, ls can no longer display even Cyrillic
paths, even though both UTF-8 and KOI8-R support Cyrillic. Here’s an example
output with the terminal encoding set to KOI8-R:
$ ls Die Allgemeine Erklц╓rung der Menschenrechte Universal Declaration of Human Rights Д╦√Г•▄Д╨╨Ф²┐Е╝ёХ╗─ н÷н≥н н÷н╔н°н•н²н≥н н≈ н■н≥н▒н н≈н║н╔н·н≈ н⌠н≥н▒ н╓н▒ н▒н²н≤н║н╘н═н≥н²н▒ н■н≥н н▒н≥н╘н°н▒н╓н▒ п▓я│п╣п╬п╠я┴п╟я▐ п╢п╣п╨п╩п╟я─п╟я├п╦я▐ п©я─п╟п╡ я┤п╣п╩п╬п╡п╣п╨п╟ ь╖ы└ь╔ь╧ы└ь╖ы├ ь╖ы└ь╧ь╖ы└ы┘ы┼ ы└ь╜ы┌ы┬ы┌ ь╖ы└ь╔ы├ьЁь╖ы├ $ LC_ALL=ru.RU.KOI8-R ls Die Allgemeine Erkl??rung der Menschenrechte Universal Declaration of Human Rights ?????????????????????? ?????????????????? ?????? ???? ?????????????????? ???????????????????? ???????????????? ???????????????????? ???????? ???????????????? ?????????????? ?????????????? ?????????? ?????????????? ??????????????????
Therefore, for the purposes of specifying width, the output of the formatting
facility shouldn’t dynamically depend on the locale’s encoding by default. As
with other argument types, a separate format specifier can be added to opt into
locale-specific behavior to support execution encodings and legacy code.
8. Windows
According to the Windows documentation ([WINI18N]):
Most applications written today handle character data primarily as Unicode, using the UTF-16 encoding.
and
New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization.
Code pages are used primarily by legacy applications or those communicating with legacy applications such as older mail servers.
Since the C++20 formatting facility is a completely new API that is not a
drop-in replacement for anything in the standard library today and can
therefore only be used in new code, we think that it should be consistent with
the Windows guidelines and use Unicode by default on this platform.
Additionally, it should provide an opt-in mechanism to communicate with legacy
applications.
9. Precision
Precision, when applied to a string argument, specifies how many characters will
be used from the string. It can be used to truncate long strings in the columnar
output as in the motivating example shown earlier. Because it works with a
single argument and only for some argument types, it is not particularly useful
for truncating output to satisfy storage requirements; a facility that limits
the size of the whole output should be used for the latter instead. The
semantics of floating-point precision are also unrelated to storage.
Since precision and width address the same use case, we think that they should be measured in the same units.
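A minimal sketch of what precision measured in display width could look like: keep the longest prefix of whole clusters whose estimated width does not exceed the precision. The input is assumed to be already decoded into code points, and the names truncate_to_width, first_code_point_width and extends_cluster are hypothetical helpers, not part of any proposed API. The width table and the clustering rule are deliberately reduced (a few East Asian wide blocks, combining diacritical marks only), so this illustrates the rule rather than implementing [UAX29].

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: 2 columns for a few East Asian wide blocks, else 1.
// The ranges are an illustrative subset, not a complete table.
inline int first_code_point_width(char32_t cp) {
  bool wide = (cp >= 0x1100 && cp <= 0x115F) ||
              (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) ||
              (cp >= 0xAC00 && cp <= 0xD7A3) ||
              (cp >= 0xFF00 && cp <= 0xFF60);
  return wide ? 2 : 1;
}

// Simplification: only combining diacritical marks extend a cluster.
inline bool extends_cluster(char32_t cp) {
  return cp >= 0x0300 && cp <= 0x036F;
}

// Returns the longest prefix of whole clusters whose estimated width
// is no greater than precision.
inline std::u32string_view truncate_to_width(std::u32string_view s,
                                             int precision) {
  std::size_t kept = 0;  // end of the last cluster that still fits
  int width = 0;
  for (std::size_t i = 0; i < s.size();) {
    // Find the end of the cluster that starts at position i.
    std::size_t end = i + 1;
    while (end < s.size() && extends_cluster(s[end])) ++end;
    int w = first_code_point_width(s[i]);
    if (width + w > precision) break;  // the whole cluster is dropped
    width += w;
    kept = end;
    i = end;
  }
  return s.substr(0, kept);
}

// Usage: a combining accent is never split from its base character.
inline void truncation_examples() {
  assert(truncate_to_width(U"A\u0301BC", 2) == U"A\u0301B");
  assert(truncate_to_width(U"\u754C\u754C", 3) == U"\u754C");
}
```

Note that a wide character that would only half fit is dropped entirely, so the result can be narrower than the requested precision.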
10. Proposal
To address the main use case, we propose using the display width of a string, i.e. the number of column positions needed to display the string in a terminal, for both width and precision.
There is a spectrum of solutions to the problem of estimating display width,
from always wrong (return 42 times the number of code units) and almost always
wrong (code units and
) to always correct (model the terminal’s logic of
width computation). We would like to take a pragmatic approach leaning towards
the correct side of the spectrum but without introducing too much complexity.
This can be accomplished by defining ranges of characters that are guaranteed to
be handled correctly on a capable terminal with an option of refining the
definition as technology matures and Unicode handling bugs observed today are
fixed. With our approach a program can produce high-quality output which is
always correct by escaping characters for which width computation is not
supported and is readable in many common cases, greatly improving on
.
To satisfy the locale-independence property we propose that for the purposes
of display width computation the default should be Unicode on systems that
support display of Unicode text in a terminal or fixed implementation-defined
encodings otherwise. In particular this allows using EBCDIC on z/OS and ASCII on
resource-constrained embedded systems that may not want to provide even minimal
Unicode handling capabilities.
On Unicode-capable systems both ordinary and wide strings should use Unicode
encodings (e.g. UTF-8 and UTF-16 respectively) by default. This will enable
portable code with optional transcoding at the system API boundaries (see [P1238]) and seamless integration with Unicode-capable APIs on Windows without data loss.
Using a fixed system encoding is completely safe because formatting functions
don’t do any transcoding. So the worst thing that can happen is that the display
width will be estimated incorrectly, leading to misaligned text, which is what
already happens when a variable-width string is formatted with a width measured
in code units. This is also not novel; for example, the filesystem library in
the C++ standard also acknowledges the existence of system-dependent encodings:
The native encoding of an ordinary character string is the operating system dependent current encoding for pathnames.
For Unicode, the first step in computing width is to break the string into
grapheme clusters because the latter correspond to user-perceived characters
([UAX29]). Then the width should be adjusted to account for graphemes that
take two column positions, as is done, for example, in the Unicode
implementation of wcwidth by Markus Kuhn ([MGK25]). Non-printable
characters such as control characters do not contribute to width, and it should
be the user’s responsibility to ensure that the input string does not contain
such characters, as well as leading combining characters and modifier letters
that may compose after concatenation.
Width estimation can be done efficiently with a single pass over the input and optimized for the case of no variable-width characters. It has zero overhead when no width is specified or when formatting non-string arguments.
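The single-pass estimation described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed wording: the input is assumed to be well-formed UTF-8, clustering is reduced to "combining diacritical marks and zero-width-joiner sequences attach to the preceding character", and the table of two-column ranges is an illustrative subset loosely modeled on [MGK25] plus some emoji blocks. All function names (decode_utf8, is_wide, is_extending, estimate_width, pad_right) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decodes one code point from well-formed UTF-8 at s[i] and advances i.
// No validation is performed (assumption: the input is valid).
inline char32_t decode_utf8(std::string_view s, std::size_t& i) {
  unsigned char b = static_cast<unsigned char>(s[i++]);
  if (b < 0x80) return b;
  int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
  char32_t cp = b & (0x3F >> extra);
  while (extra-- > 0 && i < s.size())
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  return cp;
}

// Illustrative subset of ranges displayed in two columns.
inline bool is_wide(char32_t cp) {
  return (cp >= 0x1100 && cp <= 0x115F) || (cp >= 0x2329 && cp <= 0x232A) ||
         (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) ||
         (cp >= 0xAC00 && cp <= 0xD7A3) || (cp >= 0xF900 && cp <= 0xFAFF) ||
         (cp >= 0xFE30 && cp <= 0xFE6F) || (cp >= 0xFF00 && cp <= 0xFF60) ||
         (cp >= 0xFFE0 && cp <= 0xFFE6) || (cp >= 0x1F300 && cp <= 0x1F64F) ||
         (cp >= 0x1F900 && cp <= 0x1F9FF) || (cp >= 0x20000 && cp <= 0x3FFFD);
}

// Simplified clustering: combining diacritical marks and ZWJ extend a cluster.
inline bool is_extending(char32_t cp) {
  return (cp >= 0x0300 && cp <= 0x036F) || cp == 0x200D;
}

// Sums the width of the first code point of each (simplified) cluster.
inline int estimate_width(std::string_view s) {
  int width = 0;
  bool after_zwj = false;
  for (std::size_t i = 0; i < s.size();) {
    char32_t cp = decode_utf8(s, i);
    bool starts_cluster = !is_extending(cp) && !after_zwj;
    if (starts_cluster) width += is_wide(cp) ? 2 : 1;
    after_zwj = (cp == 0x200D);
  }
  return width;
}

// Pads s with spaces on the right up to col_width display columns.
inline std::string pad_right(std::string_view s, int col_width) {
  std::string out(s);
  int w = estimate_width(s);
  if (w < col_width) out.append(static_cast<std::size_t>(col_width - w), ' ');
  return out;
}

// Usage: "A" followed by a combining acute accent occupies one column,
// while a CJK ideograph occupies two.
inline void width_examples() {
  assert(estimate_width("\x41\xCC\x81") == 1);
  assert(estimate_width("\xe7\x95\x8c") == 2);
}
```

An ASCII-only fast path could skip decoding entirely, which is one way to optimize for the case of no variable-width characters.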
We also propose adding a new format specifier in C++23 for computing display width of a string argument based on the locale’s encoding, for example:
    std::locale::global(std::locale("ru_RU.KOI8-R"));
    std::string message = std::format("{:6ls}", "\xd4\xc5\xd3\xd4");  // "тест" in KOI8-R
    // message == "\xd4\xc5\xd3\xd4  " ("тест  " in KOI8-R)
This will support display width estimation for ordinary and wide execution
encodings.
We think that the current proposal is in line with the SG16: Unicode Direction
([P1238]) goal of "Designing for where we want to be and how to get there"
because it creates a clear path for future overloads to have the desired
behavior and be consistent with the C++20 formatting facility, which currently
supports ordinary and wide character strings.
11. Why not code units?
It might seem tempting at first to measure width in code units because
it is simple and avoids the encoding question. However, it is not very useful in
addressing practical use cases. Also it is an evolutionary dead end because
standardizing code units for
and
overloads by default would
create an incentive for doing the same in
overloads or introduce a
confusing difference in behavior.
One might argue that if we do the latter it may push users to the
overloads but intentionally designing an inferior API and creating inconvenience
for users for a goal that may never be realised seems wrong.
Measuring width in code units in the fmt library was surprising to some users
resulting in bug reports and eventually switching to higher-level units.
Code units are even less adequate for precision, because they can result in invalid output. For example,
    std::string s = std::format("{:.2}", "\x41\xCC\x81");
would result in s containing "\x41\xCC" if precision was measured in code
units, which is clearly broken. In Python’s string formatting, precision is
measured in code points, which prevents this issue.
printf, which works with code units, can only handle basic Latin in UTF-8, so
even formatting of common English words containing accents is problematic.
For example:
    printf("%10s - %s\n", "bistro", "a small or unpretentious restaurant");
    printf("%10s - %s\n", "café",
           "a usually small and informal establishment serving various refreshments");
prints
        bistro - a small or unpretentious restaurant
         café - a usually small and informal establishment serving various refreshments
or
        bistro - a small or unpretentious restaurant
        café - a usually small and informal establishment serving various refreshments
depending on how é is represented.
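The misalignment follows directly from printf counting bytes when it pads. The sketch below makes that visible by formatting into a buffer with std::snprintf and inspecting the result; printf_field10 is a hypothetical helper introduced only for this illustration, and the input strings are assumed to be UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats s right-aligned in a field of width 10 using printf semantics,
// where the width is counted in bytes rather than display columns.
inline std::string printf_field10(const char* s) {
  char buf[128];
  int n = std::snprintf(buf, sizeof buf, "%10s", s);
  if (n < 0) return std::string();
  std::size_t len = static_cast<std::size_t>(n);
  if (len >= sizeof buf) len = sizeof buf - 1;  // output was truncated
  return std::string(buf, len);
}

// Usage: "bistro" is 6 bytes and gets 4 spaces of padding, while
// precomposed "café" is 5 bytes and gets 5 spaces for only 4 columns of text.
inline void padding_examples() {
  assert(printf_field10("bistro") == "    bistro");
  assert(printf_field10("caf\xC3\xA9") == "     caf\xC3\xA9");
}
```

With a decomposed é (e followed by U+0301) the string is 6 bytes, so it receives the same 4 spaces as "bistro" even though it occupies two fewer columns.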
If we want to truncate the output
    printf("%.4s...\n", "bistro");
    printf("%.4s...\n", "café");
the result is even worse:
    bist...
    caf<C3>...
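The stray <C3> byte above is the first half of a two-byte UTF-8 sequence. A small sketch, assuming UTF-8 input, shows how truncation by code units produces such incomplete sequences; is_complete_utf8 and truncate_code_units are hypothetical helpers, and the check is structural only (it does not reject overlong forms or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns true if s consists only of complete UTF-8 sequences.
// Structural check only: overlong forms and surrogates are not rejected.
inline bool is_complete_utf8(std::string_view s) {
  for (std::size_t i = 0; i < s.size();) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len = b < 0x80           ? 1
                      : (b >> 5) == 0x06 ? 2
                      : (b >> 4) == 0x0E ? 3
                      : (b >> 3) == 0x1E ? 4
                                         : 0;
    if (len == 0 || i + len > s.size()) return false;
    for (std::size_t k = 1; k < len; ++k)
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
    i += len;
  }
  return true;
}

// Truncation by code units, which is what a "%.Ns" conversion does.
inline std::string truncate_code_units(std::string_view s, std::size_t n) {
  return std::string(s.substr(0, n));
}

// Usage: cutting "café" after 4 bytes splits the two-byte encoding of é.
inline void truncation_validity_examples() {
  assert(is_complete_utf8("caf\xC3\xA9"));
  assert(!is_complete_utf8(truncate_code_units("caf\xC3\xA9", 4)));
}
```

The same happens with the earlier example: the first two code units of "\x41\xCC\x81" end in the middle of the combining accent.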
12. Limitations
Unlike terminals, GUI editors often use proportional fonts or fonts that claim to be monospaced but treat some characters such that their width is not an integer multiple of the other. Therefore width, regardless of how it is defined, is inherently limited there. However, it can still be useful if the input domain is restricted. Possible use cases are aligning numbers, text in ASCII or other subset of Unicode, or adding code indentation:
    // Prints text prefixed with indent spaces.
    void print_indented(int indent, std::string_view text) {
      std::cout << fmt::format("{0:>{1}}{2}\n", "", indent, text);
    }
Our definition of width fully supports these use cases and gives better results
than
for Unicode subranges.
13. Examples
    #include <format>
    #include <iostream>
    #include <stdio.h>

    struct input {
      const char* text;
      const char* info;
    };

    int main() {
      input inputs[] = {
        {"Text", "Description"},
        {"-----",
         "------------------------------------------------------------------------"
         "--------------"},
        {"\x41", "U+0041 { LATIN CAPITAL LETTER A }"},
        {"\xC3\x81", "U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }"},
        {"\x41\xCC\x81",
         "U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }"},
        {"\xc4\xb2", "U+0132 { LATIN CAPITAL LIGATURE IJ }"},          // Ĳ
        {"\xce\x94", "U+0394 { GREEK CAPITAL LETTER DELTA }"},         // Δ
        {"\xd0\xa9", "U+0429 { CYRILLIC CAPITAL LETTER SHCHA }"},      // Щ
        {"\xd7\x90", "U+05D0 { HEBREW LETTER ALEF }"},                 // א
        {"\xd8\xb4", "U+0634 { ARABIC LETTER SHEEN }"},                // ش
        {"\xe3\x80\x89", "U+3009 { RIGHT-POINTING ANGLE BRACKET }"},   // 〉
        {"\xe7\x95\x8c", "U+754C { CJK Unified Ideograph-754C }"},     // 界
        {"\xf0\x9f\xa6\x84", "U+1F984 { UNICORN FACE }"},              // 🦄
        {"\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d"
         "\xf0\x9f\x91\xa7\xe2\x80\x8d\xf0\x9f\x91\xa6",
         "U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466 "
         "{ Family: Man, Woman, Girl, Boy } "}                         // 👨👩👧👦
      };

      std::cout << "\nstd::format with the current proposal:\n";
      for (auto input : inputs) {
        std::cout << std::format("{:>5} | {}\n", input.text, input.info);
      }

      std::cout << "\nprintf:\n";
      for (auto input : inputs) {
        printf("%5s | %s\n", input.text, input.info);
      }
    }
Output on macOS Terminal: