More named universal character escapes
- Document number:
- P3733R1
- Date:
- 2025-09-27
- Audience:
- EWG
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to:
- Jan Schultke <janschultke@gmail.com>
- GitHub Issue:
- wg21.link/P3733/github
- Source:
- github.com/Eisenwave/cpp-proposals/blob/master/src/more-unicode-escapes.cow
Contents
Revision history
Changes since R0
Introduction
History
Inconsistency with other languages
Motivation
Abbreviations
Figments
Category stability guarantees
Proposed change
Impact on implementations
Wording
References
1. Revision history
1.1. Changes since R0
The paper was seen by SG16 and discussed, with the following poll results:
Poll 1: P3733R0: Forward to EWG with a recommendation to adopt this feature as a defect report.
- Attendees: 10
SF | F | N | A | SA |
---|---|---|---|---|
7 | 3 | 0 | 0 | 0 |
- Unanimous consensus.
The following changes were made:
- Added §3.3. Category stability guarantees
- Provided rationale for adopting this feature as a defect report in §4. Proposed change
- Rebased §6. Wording on [N5014] (no changes needed)
2. Introduction
2.1. History
[P2071R2] introduced named universal character escapes of the form \N{...} within string or character literals.
These provide much-needed clarity as compared to numeric escapes such as \u00A0.
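Some code points additionally or exclusively have aliases.
For example, U+0000 has no character name of its own and can only be named through aliases such as NULL.
SG16 voted unanimously to support aliases within a named-universal-character: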
Match name aliases?
SF | F | N | A | SA |
---|---|---|---|---|
8 | 2 | 0 | 0 | 0 |
EWG reaffirmed that decision at the same meeting:
This [named universal character escapes] should further support aliases
SF | F | N | A | SA |
---|---|---|---|---|
18 | 2 | 1 | 0 | 0 |
However, some categories of aliases are disallowed in C++, as explained in [P2071R2] §8.2 Name sources:
Unicode aliases provide another critical service. As mentioned above, once assigned, names are immutable. Corrections are only offered by providing an alias. Aliases, according to the NamedAliases tables in the Unicode Character Database, come in five varieties:
- correction: Aliases for cases where an incorrect assigned name was published. For example, U+FE18 has an assigned name of PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET and a correction alias of PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET (note the typo correction).
- control: Aliases for various control characters. For example, NULL for U+0000.
- alternate: Aliases for widely used alternate names. For example, BYTE ORDER MARK for U+FEFF.
- figment: Aliases for names that were documented, but never accepted in a standard. For example, HIGH OCTET PRESET for U+0081.
- abbreviation: Aliases for common abbreviations. For example, NBSP for U+00A0.
The intent is to use the aliases classified as correction, control, and alternate as recognized names.
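While the paper does not make it obvious why abbreviation and figment aliases are excluded, the restriction appears to be tied to ISO/IEC 10646, which the C++ standard referenced at the time.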
Following [P2736R2], the C++ standard references the Unicode standard instead of ISO/IEC 10646, and such a restriction is no longer motivated.
2.2. Inconsistency with other languages
Many design choices of [P2071R2] are ultimately motivated by existing practice; see [P2071R2] §8.5 Existing practice.
For example, the \N{...} syntax in C++ is identical to Python and Perl.
While C++ shares a syntax,
it does not permit the same categories of aliases:
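Alias category | Example | C++ | Python | Perl |
---|---|---|---|---|
correction | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET (U+FE18) | ✅ | ✅ | ✅ |
control | NULL (U+0000) | ✅ | ✅ | ✅ |
alternate | BYTE ORDER MARK (U+FEFF) | ✅ | ✅ | ✅ |
figment | HIGH OCTET PRESET (U+0081) | ❌ | ✅ | ✅ |
abbreviation | NBSP (U+00A0) | ❌ | ✅ | ✅ |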
Although this made historical sense, it now feels like an arbitrary restriction.
3. Motivation
Besides the inconsistency with other languages above, there are good reasons to allow abbreviations and figments in C++ (or more generally, allow any category of alias).
3.1. Abbreviations
While the usefulness of some abbreviations is debatable, some of them significantly shorten the spelling of commonly used code points.
For example, NBSP is much shorter than NO-BREAK SPACE for U+00A0.
All that is to say that some abbreviations are well-motivated.
3.2. Figments
There are currently only three aliases classified as figment:
- U+0080 PADDING CHARACTER
- U+0081 HIGH OCTET PRESET
- U+0099 SINGLE GRAPHIC CHARACTER INTRODUCER
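While there also exist abbreviation aliases for these code points (PAD, HOP, and SGC), those are currently disallowed as well, which leaves these three code points without any name that C++ recognizes.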
3.3. Category stability guarantees
While the Unicode standard guarantees that aliases are not changed or removed,
it does not have a stability guarantee for the alias categories.
It is theoretically possible for one alias category to be merged into another, for example,
although this seems implausible and unmotivated at this time.
In any case, simply permitting any category of alias removes a reliance on an unstable property of the Unicode standard.
4. Proposed change
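The restriction on alias categories within a named-universal-character should be removed, so that aliases of any category, including abbreviation and figment, are recognized.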
As recommended by SG16, this change should be applied as a defect report.
It would be quite inconvenient to the user if they had to refrain from using a name such as \N{NBSP}
if they wanted their code to be compatible with C++23.
Since the motivation for disallowing that name disappeared with [P2736R2]
(still in C++23),
the proposed change is arguably fixing a defect in the language.
5. Impact on implementations
Permitting abbreviations and figments is essentially trivial. [UnicodeNameAliases] contains a list of all aliases, with 354 abbreviations and 3 figments. This is a drop in the ocean compared to the existing set of names.
Furthermore, the same guarantees of
uniqueness (will never conflict with other names)
and immutability (will never change)
are provided for abbreviations and figments as for the other alias categories.
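To make the size of the change concrete, here is a minimal sketch of alias lookup, assuming a table of records in the "code point;alias;type" shape of [UnicodeNameAliases]. The AliasEntry type, the five-entry alias_table, and the functions permitted_today, lookup_alias, and demo_alias_lookup are hypothetical names chosen for illustration; they do not describe any existing compiler. Under these assumptions, the proposal amounts to dropping the type filter:

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical in-memory form of one alias record, e.g. "00A0;NBSP;abbreviation".
struct AliasEntry {
    char32_t code_point;
    std::string_view alias;
    std::string_view type;
};

// A tiny excerpt for illustration; a real table would hold every alias.
constexpr std::array<AliasEntry, 5> alias_table{{
    {0xFE18, "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET", "correction"},
    {0x0000, "NULL", "control"},
    {0xFEFF, "BYTE ORDER MARK", "alternate"},
    {0x0081, "HIGH OCTET PRESET", "figment"},
    {0x00A0, "NBSP", "abbreviation"},
}};

// Current rule: only three alias types are recognized.
constexpr bool permitted_today(std::string_view type) {
    return type == "control" || type == "correction" || type == "alternate";
}

// Looks up an alias. filter_by_type == true models the current rule,
// filter_by_type == false models the proposed rule (any alias type).
constexpr std::optional<char32_t> lookup_alias(std::string_view name, bool filter_by_type) {
    for (const AliasEntry& entry : alias_table) {
        if (entry.alias == name && (!filter_by_type || permitted_today(entry.type))) {
            return entry.code_point;
        }
    }
    return std::nullopt;
}

// Usage: the abbreviation is rejected with the filter and found without it.
inline void demo_alias_lookup() {
    assert(!lookup_alias("NBSP", true));
    assert(lookup_alias("NBSP", false) == char32_t{0x00A0});
}
```

In this model, an implementation that already stores the alias type only to reject two categories could simply stop consulting it.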
6. Wording
The following change is relative to [N5014].
Change [lex.universal.char] paragraph 3 as follows:
A of type “control”, “correction”, or “alternate”;
otherwise, the program is ill-formed.
[Note:
These aliases are listed in the Unicode Character Database's
Change a feature-test macro in [tab:cpp.predefined.ft] as follows:
Macro name | Value |
---|---|
[…] | […] |
__cpp_named_character_escapes | |
[…] | […] |
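As an illustration of how user code might consult the feature-test macro, the following sketch falls back to a numeric escape when named escapes are unavailable. It assumes the macro in question is __cpp_named_character_escapes, the macro introduced alongside [P2071R2]. The updated value is not reproduced here, so the sketch only tests for the macro's presence and uses a character name that is already valid. The function names no_break_space and demo_no_break_space are hypothetical.

```cpp
#include <cassert>

// Returns U+00A0 NO-BREAK SPACE, preferring the named spelling when the
// compiler advertises support for named universal character escapes.
constexpr char32_t no_break_space() {
#if defined(__cpp_named_character_escapes)
    return U'\N{NO-BREAK SPACE}';  // character name, accepted since [P2071R2]
#else
    return U'\u00A0';              // numeric fallback for older compilers
#endif
}

// Both spellings designate the same code point.
inline void demo_no_break_space() {
    assert(no_break_space() == char32_t{0x00A0});
}
```

With the proposed change, code that wants the shorter spelling \N{NBSP} could compare the macro against the updated value in the same way before relying on abbreviation aliases.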