N2828
Unicode Sequences More Than 21 Bits are a Constraint Violation

Published Proposal,

Previous Revisions:
None
Author:
Paper Source:
GitHub
Issue Tracking:
GitHub
Project:
ISO/IEC JTC1/SC22/WG14 9899: Programming Language — C
Proposal Category:
Change Request
Target:
General Developers

Abstract

Universal character names with a value larger than hexadecimal 10FFFF do not make sense.

1. Changelog

1.1. Revision 0 - October 15th, 2021

2. Introduction & Motivation

What does u8"\UFFFFFFFF" yield? By a strict interpretation of the standard, it produces a 32-bit code point whose value is 0xFFFFFFFF. But this code point has no meaning as an ISO/IEC 10646 code point or Universal Character Sequence. The maximum is 0x10FFFF, and that 21-bit limit is baked into ISO/IEC 10646 and is also a required property of Unicode as part of its Stability Guarantees.

The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.

We should make this requirement explicit, rather than have developers maybe-or-maybe-not derive it from the ISO/IEC 10646 definition. Up to the 2003 edition, ISO/IEC 10646 described a complicated scheme that allowed for up to 31 bits of data, which made UTF-8 encodings that use 5- or 6-byte sequences to denote code points beyond 21 bits legal. Since that edition, Unicode has settled on the 21-bit guarantee and put it into its future compatibility promises. It is safe to standardize this because the invariant has been maintained for the past 18 years, and we are nowhere near exhausting the 21-bit code point space Unicode has allotted. (Notably, the Unicode Consortium has fielded proposals for fictional scripts like Klingon, because at this point it is just whoever has the energy to show up and make proposals for specific scripts, and there are hundreds of thousands of code points left.)
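To make the bit counts concrete, the following is a minimal sketch of how many bytes the older 31-bit UTF-8 scheme would need for a given value. The function name `legacy_utf8_length` and the macro `UNICODE_MAX` are illustrative names chosen for this example, not anything from the standard or from ISO/IEC 10646.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constant: the largest code point under the 21-bit guarantee. */
#define UNICODE_MAX 0x10FFFFu

/* Hypothetical helper: number of bytes the pre-2003, 31-bit UTF-8 scheme
   would use for `value`, or 0 if the value needs more than 31 bits. */
size_t legacy_utf8_length(uint32_t value) {
    if (value <= 0x7Fu)       return 1; /*  7 payload bits */
    if (value <= 0x7FFu)      return 2; /* 11 payload bits */
    if (value <= 0xFFFFu)     return 3; /* 16 payload bits */
    if (value <= 0x1FFFFFu)   return 4; /* 21 payload bits */
    if (value <= 0x3FFFFFFu)  return 5; /* 26 payload bits */
    if (value <= 0x7FFFFFFFu) return 6; /* 31 payload bits */
    return 0;                           /* not encodable even in the old scheme */
}

/* Hypothetical helper: nonzero if `value` respects today's upper limit. */
int within_unicode_max(uint32_t value) {
    return value <= UNICODE_MAX;
}
```

Under these assumptions, 0x10FFFF still fits in a 4-byte sequence, while a value such as 0x49584958 would have needed a 6-byte sequence that is no longer legal UTF-8, and 0xFFFFFFFF was never encodable at all.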

3. Design

The specification is changed so that a universal character name with a value greater than 10FFFF is a constraint violation. Users can still request strange values in their strings by using numeric escape sequences specifically (\x47593749478 or \0473847439574398). This follows the existing practice of most compilers, which warn or error (with no additional flags) when given this code:

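int main(void) {
    /* 0x49584958 is far above 0x10FFFF, so this universal character name
       designates no ISO/IEC 10646 code point. */
    const char meow[] = u8"\U49584958";
    return 0;
}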

(Godbolt Link to demonstrate here.)

Some compilers, such as TCC, do not warn and compile the code without any error. However, the output they produce does not appear to be a valid sequence of UTF-8 code units.
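As a sketch of the escape hatch that remains, the bytes of an out-of-range value can still be spelled one at a time with hexadecimal escape sequences, each of which fits in a single byte. The example assumes the four-byte UTF-8 bit pattern applied to 0x110000 (one past the maximum), which gives the bytes F4 90 80 80. The identifiers are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Hypothetical example data: the four-byte UTF-8 bit pattern applied to
   0x110000. Decoders that enforce the 21-bit limit reject this sequence,
   but the literal itself contains only in-range hexadecimal escapes. */
static const char beyond_max[] = "\xF4\x90\x80\x80";

/* Number of bytes in the literal, excluding the terminating null. */
size_t beyond_max_length(void) {
    return sizeof beyond_max - 1;
}

/* Byte at position `i`, viewed as unsigned so the value is comparable
   with the hexadecimal spelling above. */
unsigned char beyond_max_byte(size_t i) {
    return (unsigned char)beyond_max[i];
}
```

The design choice here is that the programmer who writes such bytes has to opt in explicitly, byte by byte, rather than getting them silently from a universal character name.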

4. Wording

The following wording is relative to N2596.

4.1. Modify "§6.4.3 Universal character names ", paragraph 2

Existing text, to be removed: A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Replacement text: A universal character name shall not designate a code point where the hexadecimal value is:
— less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`);
— in the range D800 through DFFF inclusive; or
— greater than 10FFFF. 78)
78) The disallowed characters are the characters in the basic character set and the code positions reserved by ISO/IEC 10646 for control characters, the character DELETE, and the S-zone (reserved for use by UTF-16), and characters too large to be encoded by ISO/IEC 10646. Disallowed universal character escape sequences can still be specified with hexadecimal and octal escape sequences (6.4.4.4).
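The three constraints above can be sketched as a single predicate, roughly as an implementation might check the value of a universal character name after parsing its hexadecimal digits. The function name `ucn_is_allowed` is hypothetical and is not taken from any particular compiler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `value` may be designated by a
   universal character name under the proposed wording. */
bool ucn_is_allowed(uint32_t value) {
    /* Less than 00A0, other than 0024 ($), 0040 (@), or 0060 (`). */
    if (value < 0x00A0u &&
        value != 0x0024u && value != 0x0040u && value != 0x0060u) {
        return false;
    }
    /* In the range D800 through DFFF inclusive (surrogates). */
    if (value >= 0xD800u && value <= 0xDFFFu) {
        return false;
    }
    /* Greater than 10FFFF: the new constraint added by this proposal. */
    if (value > 0x10FFFFu) {
        return false;
    }
    return true;
}
```

With this sketch, `ucn_is_allowed(0x10FFFF)` holds, while `ucn_is_allowed(0x110000)` and `ucn_is_allowed(0x49584958)` do not.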