Proposal for C2y
WG14 3193

Title:               Obsolete implicitly octal literals
Author, affiliation: Alex Celeste, Perforce
Date:                2023-11-30
Proposal category:   Clarification/enhancement
Target audience:     Compiler implementers, users

Abstract

The use of base-8 instead of base-10 by integer literals that begin with a zero digit is the source of frequent confusion. We propose marking the use of such literals as obsolete in order to encourage a warning that will prompt rewrites, and the introduction of a new prefix to explicitly mark literals that are genuinely intended to be in base-8.


Obsolete implicitly octal literals

Reply-to:     Alex Celeste (aceleste@perforce.com)
Document No:  N3193
Revises:      (n/a)
Date:         2023-11-30

Summary of Changes

N3193

Introduction

In C23, binary and hexadecimal integer constants are prefixed to indicate that they describe a value using a base other than the default base-10. An unprefixed number is usually therefore implicitly a decimal number, and is naturally read by most human readers in that base.

However, a leading zero digit implicitly serves as a prefix that tells the lexer to interpret the literal as a value described by base-8, rather than by base-10. This is "obvious" enough to a human reader who is completely familiar with the rules, but in practice is unexpected by most users and is also easy for even an advanced user to miss.

Many users do not expect leading zeroes to be significant and like to use them as visual padding. This can lead to unexpected values, or to unclear error messages if they try to "pad" a literal that contains the digits 8 or 9 (though an error is the better result here). The error can go unnoticed for some time if, by coincidence, the user only has a sparse set of values, such as only values smaller than 8, or if all the other "true" decimal literals in use are large enough to begin with a nonzero digit. The fact that other languages may allow these literals to be base-10 adds to the confusion for non-expert users.
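As a minimal sketch of the padding hazard described above (the table and the function name padded_sum are illustrative only and do not come from any real code base), consider a column of constants aligned with leading zeroes:

```c
#include <assert.h>

/* Each entry looks like a zero-padded decimal value, but the leading 0
   selects base 8 for the first two entries. */
static const int padded[] = { 001, 010, 100 };

static_assert(010 == 8, "a leading zero selects base 8");
static_assert(0100 == 64, "every digit is weighted by a power of 8");

/* Returns 1 + 8 + 100 = 109, not the 111 a reader may expect. */
int padded_sum(void)
{
    int sum = 0;
    for (unsigned i = 0; i < sizeof padded / sizeof padded[0]; ++i)
        sum += padded[i];
    return sum;
}
```

The first entry hides the problem because 001 has the same value in both bases. Only the second entry is silently reinterpreted.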

MISRA C 2023 (and prior versions) prohibits the use of octal literals entirely (Rule 7.1, Required) on the grounds that this is so unclear that it is more likely to be misunderstood than not. There is an exception for a literal zero spelled with a single digit, which is technically an octal rather than decimal literal but the distinction is not meaningful in practice.

Example

This example is lifted directly from C++ document p0085r0:

// The following literals all specify the same number.

int literal_octal_prefered          = 0o52;
int literal_octal_to_be_deprecated  = 052;
int literal_decimal                 = 42;
int literal_hex                     = 0x2A;
int literal_binary                  = 0b00101010;

This is intended to highlight that the distinction between decimal and the prefixed octal literal syntax is clearer than the distinction between the decimal and traditional octal syntaxes.

Proposal

We propose that a new syntax is added for explicit octal constants, with a new prefix 0o or an alternative spelling to mark the beginning of a base-8 literal. The old syntax should be retained and marked as obsolescent to avoid breaking the meaning of existing code.

We do not propose that leading-zero ever change meaning to be accepted as a base-10 literal. This syntax should remain obsolescent or be fully deprecated and removed, but cannot be recycled safely.

We separately propose changing escape sequences within literals at this time. Escape sequences are visually prefixed with a \ and are therefore much less subject to this issue. As with the existing hexadecimal escape syntax, there is no leading zero on the prefix, as this would interfere with a string that intentionally contained the nul character.

We do separately propose allowing the strto_l function family to recognize prefixed octal digit sequences, whereas before they would have returned the value zero.
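The intended behaviour can be approximated on top of today's library. The following is only a sketch: parse_octal is a hypothetical helper, not a proposed interface. It assumes the caller passes a string with no leading white space, and it leaves overflow reporting to strtol. With an unmodified strtol and a base of 8, the input "0o17" is converted as the single digit 0, with conversion stopping at the o.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of any library: accepts an optional sign
   and an optional 0o/0O prefix, then defers to strtol with base 8. */
long parse_octal(const char *s, char **end)
{
    assert(s != NULL);

    const char *p = s;
    int negative = 0;

    if (*p == '+' || *p == '-') {
        negative = (*p == '-');
        ++p;
    }

    /* Skip the prefix only when an octal digit follows it, so that "0o"
       on its own still converts as the digit 0 followed by 'o'. */
    if (p[0] == '0' && (p[1] == 'o' || p[1] == 'O')
        && p[2] >= '0' && p[2] <= '7')
        p += 2;

    /* No digits at all: report that no conversion was performed. */
    if (*p < '0' || *p > '7') {
        if (end != NULL)
            *end = (char *)s;
        return 0;
    }

    /* p now points at an octal digit, so strtol sees no sign or prefix. */
    long value = strtol(p, end, 8);
    return negative ? -value : value;
}
```

With this sketch, "0o17", "0O17", and "017" all convert to 15, which is the equivalence the change is meant to provide.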

We do not propose any changes to the formatted input or output functions here. Printing a prefix is already a matter of user choice as the prefix is not part of the functionality. The formatted input functions are defined in terms of the strto_l function family and are therefore covered by the change above.
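To illustrate why no output change is needed, a user who wants the new spelling in printed text can already write the prefix as literal characters in the format string. The helper name format_octal_prefixed below is an assumption made for this sketch.

```c
#include <stdio.h>

/* Hypothetical helper: writes value into buf as a 0o-prefixed octal string.
   Returns the result of snprintf, i.e. the number of characters that the
   full output requires, not counting the terminator. */
int format_octal_prefixed(char *buf, size_t size, unsigned value)
{
    /* The prefix is ordinary text in the format; %o prints only digits. */
    return snprintf(buf, size, "0o%o", value);
}
```

By contrast, the existing # flag with the o conversion produces the traditional leading-zero spelling, so the value 42 is printed as 052.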

Choice of prefix

The character o is the most obvious choice for the prefix and is in common use in other languages.

However, it has a major flaw for readability: depending on the typeface, the uppercase form 0O123 may not be visually distinct enough to achieve the primary goal of the change, which is enhanced readability. This may even be a problem with the lowercase form, depending on the user's setup.

Alternative proposals might be to use c, which is not in use, and visually tends towards being read "oc-"; or t, which is also not yet in use and has a similar tendency. Overall, this proposal should be read as allowing any of these letters to be used as the new prefix, or any other letter the Committee prefers.

Status of zero

Zero remains a traditional octal constant because the rule defining decimal constants requires them to begin with a nonzero digit.

This can be changed, but as long as traditional octal constants remain in the language, the definition of decimal constants has to be complicated in one way or another. Therefore, for the time being, we leave this as-is.

However, any tool that depends on this distinction is hiding a silent logical error.

Prior Art

A very similar change was proposed for C++ as document p0085r0.

This proposal also added the prefixed form without removing the traditional literal syntax, and a new syntax for octal escape sequences in literals.

There does not appear to be a record of this proposal being discussed by WG21 and the change was not adopted.

Impact

There is no impact to existing code, other than new deprecation warnings if the user has this functionality enabled in their tool (obsolescence is not a constraint violation and these warnings are not mandatory).

Causing tools to emit these warnings if they were not already doing so (any tool aiming to check for MISRA C or similar Guidelines compliance is already warning on any use of octal) is considered a goal of the proposal and is not a compatibility failure.

The proposed spelling is not currently valid in C and therefore use of the new octal literal format would not break existing code. However, whichever letter the Committee chooses as a prefix is effectively ruled out from also serving as a suffix. Therefore, care is warranted to not rule out a different good design for an unrelated feature in the future.

Future directions

If the Standard evolves to incorporate a distinction between deprecation and obsolescence, we would prefer implicit octal syntax to be marked as fully deprecated in that version of the Standard. This would allow for its eventual removal, and presumably require a stronger class of warning message (such as a mandatory warning against uses, rather than the current opt-in for uses of obsolescent features).

Octal escape sequences within character or string literals have an outstanding issue that the end of the sequence is not clear:

"\1234"  // two characters, \123 followed by 4
"\1289"  // three characters, \12 followed by 8 and 9

A future change should try to normalize this. For the time being we simply allow the prefix to appear here for visual clarity, but this could be improved by forcing prefixed octal escape sequences to have a fixed width. (The C++ proposal did the opposite here.)

Apart from their variable length, octal escape sequences seem well-understood compared to integer literals and their use does not seem to be confusing in practice.
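The variable length of existing octal escape sequences can be checked with today's syntax. This is a small sketch: the function name after_escape is illustrative, and the size checks rely on sizeof counting the terminating null character of a string literal.

```c
#include <assert.h>

/* \123 consumes the maximum of three octal digits, so 4 is a separate character. */
static_assert(sizeof "\1234" == 3, "\\123, '4', and the terminator");

/* 8 is not an octal digit, so the escape ends after \12. */
static_assert(sizeof "\1289" == 4, "\\12, '8', '9', and the terminator");

/* Returns the character that follows the octal escape in "\1234". */
char after_escape(void)
{
    return "\1234"[1];
}
```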

Proposed wording

The proposed changes are based on the latest public draft of C23, which is N3096. Bolded text is new text when inlined into an existing sentence. These changes are not compatible with the words from p0085r0, which describe a different Standard (C++).

Integer constants

Within 6.4.4.1 "Integer constants", Syntax, paragraph 1 (the grammar):

Replace the existing octal-constant rule with a new rule:

octal-constant:
prefixed-octal-constant
unprefixed-octal-constant

Rename the original octal-constant rule to unprefixed-octal-constant:

unprefixed-octal-constant:
0
unprefixed-octal-constant ' opt octal-digit

Add a new rule prefixed-octal-constant immediately below unprefixed-octal-constant:

prefixed-octal-constant:
octal-prefix octal-digit
prefixed-octal-constant ' opt octal-digit

Add a new rule octal-prefix immediately below binary-constant:

octal-prefix: one of
0o 0O

Modify paragraph 4:

A decimal constant begins with a nonzero digit and consists of a sequence of decimal digits. An octal constant consists of the prefix 0o or 0O followed by a sequence of the digits 0 through 7 only. A hexadecimal constant consists of the prefix 0x or 0X followed by a sequence of the decimal digits and the letters a (or A) through f (or F) with values 10 through 15 respectively. A binary constant consists of the prefix 0b or 0B followed by a sequence of the digits 0 or 1.

Add a new paragraph immediately after paragraph 4:

An unprefixed octal constant begins with the digit 0 optionally followed by a sequence of the digits 0 through 7 only. Use of an unprefixed octal constant with more than one digit is an obsolescent feature.
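For implementers, the prefixed form can be recognized with a short scanner. This is a sketch rather than normative text. The name is_prefixed_octal_constant is hypothetical and integer suffixes are not handled. It assumes that the recursive alternative extends a prefixed-octal-constant by one digit, optionally preceded by a digit separator, by analogy with binary-constant.

```c
#include <stdbool.h>

static bool is_octal_digit(char c)
{
    return c >= '0' && c <= '7';
}

/* Hypothetical helper: returns true when the whole string s matches
   prefixed-octal-constant with the 0o or 0O prefix. */
bool is_prefixed_octal_constant(const char *s)
{
    /* octal-prefix: one of 0o 0O */
    if (s[0] != '0' || (s[1] != 'o' && s[1] != 'O'))
        return false;
    s += 2;

    /* At least one octal digit must follow the prefix directly. */
    if (!is_octal_digit(*s))
        return false;
    ++s;

    /* Each further digit may be preceded by a single digit separator. */
    while (*s != '\0') {
        if (*s == '\'')
            ++s;
        if (!is_octal_digit(*s))
            return false;
        ++s;
    }
    return true;
}
```

Under these assumptions 0o52 and 0O7'7 are accepted, while 052, 0o, 0o8, and 0o5' are rejected.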

Escape sequences

Within 6.4.4.4 "Character constants", Syntax, paragraph 1 (the grammar):

Modify the existing octal-escape-sequence rule:

octal-escape-sequence:
\ o opt octal-digit
\ o opt octal-digit octal-digit
\ o opt octal-digit octal-digit octal-digit

Modify the list in paragraph 3:

octal character \octal digits or \o octal digits

Modify the beginning of paragraph 5:

The octal digits that follow the escape in an octal escape sequence are taken to be part of ...

Optionally, modify the second sentence of example 3:

To specify an integer character constant containing the two characters whose values are ’\x12’ and ’3’, the construction ’\o0223’ can be used, since an octal escape sequence is terminated after three octal digits.

6.4.5 does not need to be modified because it refers back to 6.4.4.4 for the meaning of escape sequences.

Future language directions

Add a new entry between 6.11.3 "External names" and 6.11.4 "Character escape sequences":

6.11.x Octal integer constants

The use of octal integer constants without the prefix 0o or 0O is an obsolescent feature, except for the constant 0.

(no entry is intended for octal escape sequences here)

The strto_l functions

Add a new sentence near the end of 7.24.1.7 paragraph 3:

If the value of base is 2, the characters 0b or 0B may optionally precede the sequence of letters and digits, following the sign if present. If the value of base is 8, the characters 0o or 0O may optionally precede the sequence of letters and digits, following the sign if present. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.

Questions for WG14

Does WG14 want to add the new spelling for base-8 integer literals with an explicit prefix?

Does WG14 want to mark the use of unprefixed base-8 integer literals, apart from zero itself, as obsolescent?

Does WG14 prefer the character o, c, or t for the prefix character?

Does WG14 want to add the new spelling for octal escape sequences in character and string literals?

Would WG14 prefer prefixed octal escape sequences to have a fixed width of three digits?

Does WG14 want to change the behaviour of the strto_l function family to allow them to interpret the new octal prefix, rather than returning zero?

References

C23 latest public draft

Wikipedia on use of octal

C++ proposal P0085R0

MISRA C 2023