ASCII character utilities

- Document number: P3688R0
- Date: 2025-05-19
- Audience: LEWG, SG16
- Project: ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-To: Jan Schultke <janschultke@gmail.com>
- Co-Authors: Corentin Jabot <corentin.jabot@gmail.com>
- GitHub Issue: wg21.link/P3688R0/github
- Source: github.com/Eisenwave/cpp-proposals/blob/master/src/ascii.cow
Functions like std::isalnum or std::iswalnum are locale-specific, not constexpr, and provide no support for Unicode character types. We propose lightweight, locale-independent alternatives.
Contents

- 1. Introduction
  - 1.1. Can't you implement this trivially yourself?
- 2. Design
  - 2.1. List of proposed functions
  - 2.2. is_ascii
  - 2.3. base parameter in is_ascii_digit
  - 2.4. is_ascii_bit and is_ascii_octal_digit
  - 2.5. Case-insensitive comparison functions
  - 2.6. Why no function objects?
  - 2.7. What to do for ASCII-incompatible char and wchar_t
    - 2.7.1. Conditionally supported char overloads
    - 2.7.2. Transcode char to ASCII
    - 2.7.3. Treat the input as ASCII, regardless of the literal encoding
  - 2.8. What if the input is a non-ASCII code unit?
  - 2.9. Why not accept any integer type?
  - 2.10. ASCII case-insensitive views and case transformation algorithms
  - 2.11. Why just ASCII?
- 3. Implementation experience
- 4. Wording
- References
1. Introduction
Testing whether a character falls into a specific subset of ASCII characters or performing simple transformations on it are common tasks in text processing. For example, applications may need to check whether identifiers consist of alphanumeric ASCII characters or underscores; Unicode properties are not relevant to this task, and usually, neither are locales.
Unfortunately, these common and simple tasks are only supported through functions in the <cctype> and <cwctype> headers, such as std::isdigit, std::isalnum, and std::tolower.
The <cctype> functions in particular are riddled with problems:
- There is no support for Unicode character types (char8_t, char16_t, and char32_t).
- These functions are not constexpr, but performing basic character tests would be useful at compile time.
- There are distinct function names for char and wchar_t, such as std::isalnum and std::iswalnum, making generic programming more difficult.
- If char is signed, these functions can easily result in undefined behavior because the input must be representable as unsigned char or be EOF. If char represents a UTF-8 code unit, passing any non-ASCII code unit into these functions has undefined behavior.
- These functions violate the zero-overhead principle by also handling an EOF input; in many use cases, EOF will never be passed into these functions anyway, and the caller can easily deal with EOF themselves.
- The return type of character tests is int, where a nonzero return value indicates that a test succeeded. This is very unnatural in C++, where bool is more idiomatic.
- Some functions use the currently installed locale rather than the "C" locale, which makes their use questionable for high-performance tasks because each invocation is typically an opaque call that checks the current locale.
We propose lightweight replacement functions which address all these problems.
There are also overloads in <locale>, but their locale dependence makes them unfit for what this proposal aims to achieve:
testing whether a char (assumed to be a UTF-8 code unit) is an ASCII digit is obviously a locale-independent task.
1.1. Can't you implement this trivially yourself?
It is worth noting that some of the proposed functions can be implemented very easily by the user.
For example, existing code may already use a check like c >= '0' && c <= '9' to test for ASCII digits,
and our proposed is_ascii_digit does just that.
However, not all of the proposed functions are this simple.
For example, checking whether a char is an ASCII punctuation character ('!', '#', etc.)
would require many separate range checks when done naively.
In the standard library, it can be efficiently implemented using a 128-bit or 256-bit bitset.
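A rough sketch of that technique, assuming a char8_t overload and the illustrative name is_ascii_punctuation (the signature and bit masks are this sketch's assumptions, not the proposed specification):

```cpp
#include <cstdint>

// Sketch: a 128-bit lookup table, one bit per ASCII code point, with bits set
// for the punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ .
constexpr bool is_ascii_punctuation(char8_t c) noexcept
{
    if (c > 0x7F)
        return false;
    // Two 64-bit halves of the 128-bit bitset, indexed by code point.
    constexpr std::uint64_t low  = 0xFC00'FFFE'0000'0000; // U+0000..U+003F
    constexpr std::uint64_t high = 0x7800'0001'F800'0001; // U+0040..U+007F
    const std::uint64_t half = c < 64 ? low : high;
    return (half >> (c % 64)) & 1;
}

static_assert(is_ascii_punctuation(u8'!'));
static_assert(!is_ascii_punctuation(u8'a'));
```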
Even if all proposed functions were trivial to implement, working with ASCII characters is such an overwhelmingly common use case that it's worth supporting in the standard library.
2. Design
All proposed functions are constexpr,
locale-independent,
overloaded (i.e. no separate name for separate input types),
and accept any character type
(char, wchar_t, char8_t, char16_t, and char32_t).
Furthermore, all function names contain ascii to make their ASCII-only scope obvious, while otherwise staying close to the familiar <cctype> names.
For example, is_ascii_digit is declared along the following lines.
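A sketch of what that declaration amounts to; the bool return type and constexpr follow the stated design goals, while noexcept is an assumption of this sketch:

```cpp
// Sketch of the overload set that a single placeholder declaration stands for.
constexpr bool is_ascii_digit(char c) noexcept;
constexpr bool is_ascii_digit(wchar_t c) noexcept;
constexpr bool is_ascii_digit(char8_t c) noexcept;
constexpr bool is_ascii_digit(char16_t c) noexcept;
constexpr bool is_ascii_digit(char32_t c) noexcept;
```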
The placeholder character type means that there exists an overload set where
this placeholder is replaced with each of the character types.
This design is more consistent with the existing <cctype> and <cwctype> functions
than, say, a function template.
Equivalent functions could also be added to C, if there is interest.
This signature also allows use with types that are convertible to a specific character type.
2.1. List of proposed functions
Find below a list of proposed functions.
Note that regex-like character-set notation such as [0-9] is used in the table.

| <cctype> equivalent | Proposed name | Returns (given ASCII character c) |
|---|---|---|
| N/A | is_ascii | true if c is in the ASCII range, otherwise false |
| isdigit | is_ascii_digit | true if c is in [0-9], otherwise false |
| N/A | is_ascii_bit | true if c is in [01], otherwise false |
| N/A | is_ascii_octal_digit | true if c is in [0-7], otherwise false |

Counterparts are likewise proposed for the remaining <cctype> classification functions,
for tolower/toupper-style transformations
(which return the respective lower-case or upper-case character if c is a letter of the opposite case, and c unchanged otherwise),
and for the case-insensitive comparison functions described in §2.5.
Alternative names could also be used.
isblank should perhaps have no new version:
it is of questionable use,
and neither the old name nor a new one is obvious.
In the default "C" locale, isblank is simply isspace without '\n', '\v', '\f', and '\r'.
Similarly, a few other <cctype> functions should perhaps have no new versions either.
This proposal simply has a new version for every <cctype> function;
if need be, they are easy to remove.
2.2. is_ascii
This additional function is mainly useful for checking whether a character "is ASCII", i.e. falls into the Basic Latin block, before performing an ASCII-only evaluation.
The implementation for the wider character types can delegate
to the narrow-character implementation to avoid repeating its logic.
The is_ascii check is needed because an unconditional conversion to the narrower type
may result in treating U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE as U+0030 DIGIT ZERO.
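A minimal sketch of this approach, assuming the names is_ascii and is_ascii_digit (this is an illustration, not the proposed wording):

```cpp
// The character is ASCII if it lies in the Basic Latin block.
constexpr bool is_ascii(char32_t c) noexcept
{
    return c <= 0x7F;
}

constexpr bool is_ascii_digit(char8_t c) noexcept
{
    return c >= u8'0' && c <= u8'9';
}

// The wider overload first checks is_ascii, then delegates to the narrow one;
// without that check, truncating U+0130 to 8 bits would yield 0x30 ('0').
constexpr bool is_ascii_digit(char32_t c) noexcept
{
    return is_ascii(c) && is_ascii_digit(static_cast<char8_t>(c));
}
```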
2.3. base parameter in is_ascii_digit
Similar to std::from_chars, is_ascii_digit can also take a base parameter, for example as sketched below.
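A sketch of what such an overload might look like; the exact signature is an assumption, and the digit/letter handling follows the description below:

```cpp
// Sketch: tests whether c is a valid digit in the given base (2 to 36).
// For base <= 10, only a prefix of [0-9] is accepted; for larger bases,
// letters starting at 'a'/'A' are accepted as well.
constexpr bool is_ascii_digit(char8_t c, int base) noexcept
{
    if (c >= u8'0' && c <= u8'9')
        return (c - u8'0') < base;
    if (c >= u8'a' && c <= u8'z')
        return (c - u8'a') + 10 < base;
    if (c >= u8'A' && c <= u8'Z')
        return (c - u8'A') + 10 < base;
    return false;
}

static_assert(is_ascii_digit(u8'f', 16));
static_assert(!is_ascii_digit(u8'8', 8));
```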
If base ≤ 10,
the range of valid ASCII digit characters is simply limited.
For a greater base, a subset of alphabetic characters is also accepted,
starting with a or A.
Such a function is useful when parsing numbers with a base of choice,
which is what std::from_chars does, for example.
2.4. is_ascii_bit and is_ascii_octal_digit
C++ and various other programming languages support binary and octal literals,
so it seems like an arbitrary choice to only have dedicated overloads for (hexa)decimal digits.
is_ascii_bit may be especially useful,
such as when dealing with bit-strings like the ones accepted by std::bitset constructors.
In conclusion, we may as well have functions for bases 2, 8, 10, and 16; they're not doing much harm, they're trivial to implement, and some users may find them useful.
Alternatively, we could drop is_ascii_bit and is_ascii_octal_digit,
and even remove the dedicated hexadecimal digit test,
leaving only the multi-base is_ascii_digit.
2.5. Case-insensitive comparison functions
As shown in the table above, we also propose the case-insensitive comparison functions.
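The following sketch illustrates one possible shape of such functions; the names to_ascii_lower, ascii_ci_equals, and ascii_ci_compare are placeholders for illustration, not the proposed names:

```cpp
#include <compare>

// Assumed helper: maps 'A'-'Z' to 'a'-'z', leaves everything else unchanged.
constexpr char8_t to_ascii_lower(char8_t c) noexcept
{
    return (c >= u8'A' && c <= u8'Z') ? char8_t(c + 32) : c;
}

// Equality ignoring ASCII case.
constexpr bool ascii_ci_equals(char8_t a, char8_t b) noexcept
{
    return to_ascii_lower(a) == to_ascii_lower(b);
}

// Three-way comparison ignoring ASCII case.
constexpr std::strong_ordering ascii_ci_compare(char8_t a, char8_t b) noexcept
{
    return to_ascii_lower(a) <=> to_ascii_lower(b);
}
```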
2.6. Why no function objects?
For case-insensitive comparisons and for character tests in general, function objects may be convenient because they can be more easily used in algorithms:
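For instance, consider finding the first ASCII digit in a string (a sketch; the stand-in overloads mimic the proposed is_ascii_digit):

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Minimal stand-ins for the proposed overload set.
constexpr bool is_ascii_digit(char8_t c) noexcept { return c >= u8'0' && c <= u8'9'; }
constexpr bool is_ascii_digit(char32_t c) noexcept { return c >= U'0' && c <= U'9'; }

std::size_t first_digit_position(std::u8string_view text)
{
    // Because is_ascii_digit names an overload set, it cannot be passed to an
    // algorithm directly; a lambda (or a function object) is needed.
    auto it = std::ranges::find_if(text, [](char8_t c) { return is_ascii_digit(c); });
    return static_cast<std::size_t>(it - text.begin());
}
```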
However, there is no reason why a function such as is_ascii_digit needs to be a function object.
It is not a customization point, but a plain function.
Furthermore, defining function objects for this purpose may be obsoleted by
[P3312R1] Overload Set Types.
2.7. What to do for ASCII-incompatible char and wchar_t
Not every ordinary and wide character encoding is ASCII-compatible;
examples include EBCDIC, Shift-JIS, and the (defunct) national ISO-646 variants,
i.e. encodings in which code units ≤ 0x7F do not all represent the same characters as ASCII.
This raises the question:
what should, say, is_ascii_digit('0') do on an EBCDIC platform,
where this call receives the code unit 0xF0 rather than 0x30?
We have three options, discussed below.
For example, is_ascii_digit(u8'0') is equivalent to is_ascii_digit(U'0') on any platform.
In general, the behavior for Unicode character types is obvious,
unlike that for char and wchar_t.
2.7.1. Conditionally supported char overloads
We could mandate that the ordinary literal encoding is an ASCII superset
for the char overload to exist.
This would force a cast (to char8_t, for example) to use the functions on EBCDIC platforms.
It is not clear how implementations would treat Shift-JIS;
GCC assumes it to be ASCII-compatible,
so this option may not be enough to alleviate
the awkwardness of such encodings.
Also, this option is not very useful.
It is reasonable to have UTF-8 data stored in a char-based string on EBCDIC platforms,
and having to perform casts to char8_t would be awkward.
2.7.2. Transcode char to ASCII
We could transcode from the ordinary literal encoding
to ASCII and produce an answer for the result of that transcoding.
This would be a greater burden for implementations,
especially on EBCDIC platforms.
The benefit is that is_ascii_digit('0') is always true,
although the result for a particular code unit value such as 0x30 may differ between platforms.
However, is_ascii_digit(u8'0') is always true in any case.
It probably does not solve the Shift-JIS case,
as implementers may keep transcoding Shift-JIS and ASCII in the same way.
It would also give incorrect answers for stateful encodings.
Furthermore, there are EBCDIC control characters that do not have an ASCII equivalent,
so if we were to do conversions, we would have to decide what,
for example, the case transformation functions should produce for such a character.
2.7.3. Treat the input as ASCII, regardless of the literal encoding
This is our proposed behavior.
The simplest option is to ignore the literal encoding entirely
and assume that char and wchar_t inputs are ASCII-encoded.
The greatest downside is that, depending on the encoding, is_ascii_digit('0') may be false,
which may be surprising to the user.
However, the main purpose of these functions is to be called with characters taken from ASCII text,
so what results they yield when passed literals is not so important.
There are use cases for this behavior on EBCDIC platforms.
A lot of protocols (HTTP, POP) and file formats (JSON, XML) are ASCII/UTF-8-based
and need to be supported on EBCDIC systems,
making these functions universally useful,
especially as the <cctype> functions cannot easily be used to deal with ASCII on these platforms.
Ultimately, do we want functions to deal with ASCII or the literal encoding?
If we want them to be a general way to query the ordinary literal encoding,
is_ascii is a terrible name,
and finding a more general name would prove difficult.
That direction would be better served by a (code unit) → (code point) conversion function,
although that may be outside the scope of this proposal.
2.8. What if the input is a non-ASCII code unit?
Text input is rarely guaranteed to be pure ASCII,
i.e. some code units may be greater than 0x7F.
However, we're still interested in ASCII characters within that input.
For example, we may
- parse pure ASCII numbers like 123 in a UTF-8 JSON (or other config) file,
- trim ASCII whitespace in HTTP headers, which are encoded with ISO-8859-1,
- parse ASCII-alphanumeric variable names in Lua scripts, where non-ASCII characters can appear (in comments and strings),
- ...
It is possible (and expected) that the user calls, say, is_ascii_digit with such a non-ASCII code unit, at least indirectly.
For the sake of convenience, all proposed functions should handle such inputs by
- returning false in the case of all testing functions, and
- applying an identity transformation in transformation/case-insensitive comparison functions.
If is_ascii_digit doesn't simply return false on non-ASCII inputs,
the proposal is useless for the common use case where some non-ASCII characters exist in the input.
The proposed behavior also works excellently with any ASCII-compatible encoding, such as UTF-8.
Code units belonging to multi-byte (non-ASCII) sequences in UTF-8 are all greater than 0x7F,
so if we implement, say, is_ascii_digit naively by checking c >= '0' && c <= '9', it "just works".
2.9. Why not accept any integer type?
Some people argue that a test like is_ascii_digit is a purely numerical test using the ASCII table,
and so passing an int should also be valid.
However, this permissive interface would invite bugs.
For example, the difference of two characters (such as 'z' - 'a')
is the distance between ASCII characters, not an ASCII character,
so passing it into is_ascii_digit would be nonsensical.
Static type systems exist for a reason:
to protect us from stupid mistakes.
While char, wchar_t,
etc. are not required to be ASCII-encoded,
they are at least characters,
so passing them into our functions is likely something the user intended to do,
which we cannot say with confidence about int, long, etc.
Additionally, if we allowed passing signed integers,
we may want to make the behavior erroneous or undefined for negative inputs,
because a negative input is most likely a developer mistake.
Our interface is very simple:
it has a wide contract and almost all functions are noexcept.
Let's keep it that way!
Lastly, even proponents of passing integer types would not want calls with bool or floating-point arguments to be valid.
2.10. ASCII case-insensitive views and case transformation algorithms
Ignoring or transforming ASCII case in algorithms is a fairly common problem.
Therefore, it may be useful to provide views and algorithms for ASCII case transformation and case-insensitive comparison of entire ranges.
To identify a range element irrespective of ASCII case, it would be nice if the user could write something like the sketch below.
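A hypothetical sketch of such usage; the names to_ascii_lower and ascii_ci_equals are assumptions carried over from the sketch in §2.5, not proposed names:

```cpp
#include <algorithm>
#include <string_view>
#include <vector>

// Stand-ins with assumed names (see the sketch in §2.5).
constexpr char8_t to_ascii_lower(char8_t c) noexcept
{
    return (c >= u8'A' && c <= u8'Z') ? char8_t(c + 32) : c;
}
constexpr bool ascii_ci_equals(char8_t a, char8_t b) noexcept
{
    return to_ascii_lower(a) == to_ascii_lower(b);
}

// Does any header name equal "content-type", ignoring ASCII case?
bool has_content_type(const std::vector<std::u8string_view>& header_names)
{
    constexpr std::u8string_view wanted = u8"content-type";
    return std::ranges::any_of(header_names, [&](std::u8string_view name) {
        return std::ranges::equal(name, wanted, ascii_ci_equals);
    });
}
```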
While case transformations can be implemented naively by transforming one character at a time,
dedicated functions would allow an efficient vectorized implementation for contiguous ranges,
which can be many times faster ([AvoidCharByChar], [AVX-512CaseConv]).
Similarly, a case-insensitive comparison function can be vectorized.
In fact, POSIX's strncasecmp has been heavily optimized in glibc ([AVX2strncasecmp]),
and providing range-based interfaces would allow delegating to these heavily optimized functions.
We intend to propose such utilities in a future paper or revision of this paper. Currently, this proposal is focused exclusively on operations involving character types.
2.11. Why just ASCII?
It may be tempting to generalize the proposed utilities beyond ASCII, e.g. to UTF-8. However, this is not proposed for multiple reasons:
- You cannot pass a char8_t into a UTF-8 is_upper function and expect meaningful results. In general, operations on variable-length encodings require sequences of code units. The interface we propose only makes sense for ASCII.
- Unicode utilities are tremendously more complex than ASCII utilities. Some Unicode case conversions even require multi-code-point changes.
3. Implementation experience
A naive implementation of all proposed functions can be found at [CompilerExplorer], although these are implemented as function templates, not as overload sets (as proposed).
A more advanced implementation of some functions can be found in [µlight]. Character tests can be optimized using 128-bit or 256-bit bitsets.
4. Wording
The wording changes are relative to [N5008].
In subclause [version.syn], update the synopsis as follows:
In Clause [text], append a new subclause:
ASCII utilities [ascii]
Subclause [ascii] describes components for dealing with characters that are encoded using ASCII or encodings that are ASCII-compatible, such as UTF-8.
Recommended practice:
Implementations should emit a warning when a function in this subclause is invoked
using a value produced by an ordinary or wide character literal
when the corresponding literal encoding is not ASCII-compatible.
[Example: is_ascii_digit('0') is false if the
ordinary literal encoding ([lex.charset]) is EBCDIC
or some other ASCII-incompatible encoding,
which can be surprising to the user.
However, is_ascii_digit(u8'0') is true
regardless of ordinary literal encoding.
— end example]
Header <ascii> synopsis [ascii.syn]
When a function is specified with a type placeholder,
the implementation provides overloads for all character types ([basic.fundamental])
in lieu of that placeholder.
ASCII character testing [ascii.chars.test]
Returns:
.
Preconditions:
base has a value between 2 and 36 (inclusive).
Returns:
Remarks: A function call expression that violates the precondition in the Preconditions: element is not a core constant expression.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
Returns:
.
ASCII character transformation [ascii.chars.transform]
Returns:
.
Returns:
.
ASCII case-insensitive character comparison [ascii.chars.case.compare]
Returns:
.
Returns:
.
Strictly speaking, some of the conversions used above are unnecessary to describe semantics;
the spellings with and without them are equivalent.
However, these uses of explicit conversions may improve readability and avoid
relying on behavior which is proposed to be deprecated in [P3695R0].