This issue has been automatically converted from the original issue lists and some formatting may not have been preserved.
Authors: Terence David Carroll, WG14
Date: 1992-12-10
Reference document: X3J11/90-011
Submitted against: C90
Status: Closed
Converted from: dr.htm, dr_003.html
Subclause 6.1.8: Preprocessing numbers I note from the rationale document of
November 1988, X3J11 Document Number 88-151, that the following problem has been
observed. I am surprised at the Committee's decision to allow such a loose
description. Under the grammar given for a pp-number
0xEE+23 0x7E+macro 0x100E+value-macro
are preprocessing numbers and as such a conforming C compiler would be required
to generate an error when it failed to successfully convert them to actual C
language number tokens. The solution is simply to restrict the inclusion of
[eE
][+-
] within a pp-number
to situations where the e
or E
is
the first non-digit
in the character sequence composing the preprocessing
number. This can be easily implemented in a variety of methods; the informal
description above gives perhaps a better guide to efficient implementation than
the following revised grammar:
pp-number:
pp-float
pp-number digit
pp-number .
pp-number non-digit /* A non-digit is a letter or underscore */
pp-float:
pp-real
pp-real E sign pp-real e sign
pp-real:
digit
.
pp-real digit
pp-real .
It is unbelievable that a standards committee could so lose sight of its objective that it would, in full awareness, make simple expressions illegal.
To illustrate the absurdity of the rationale document's claim that the faulty
grammar was felt to be easier to implement, why not adopt the following grammar
for a pp-number
and really make life simple; after all, who wants to have
their preprocessor slowed down by checking whether the +
or -
was preceded
by a n e
or an E
?
pp-number:
digit
.
pp-number digit
pp-number .
pp-number non-digit
pp-number sign
Comment from WG14 on 1997-09-23:
The Committee reasserts that the grammar and/or semantics of preprocessing as
they appear in the standard are as intended. We are attaching a copy of the
previous responses to this item from David F. Prosser. The Committee endorses
the substance of these responses, which follow: In response to your first
suggested grammar: This grammar doesn't include all valid numeric constants and
exclude other important tokens. For example, .
is derivable. But let's assume
that you intended something like
pp-number:
pp-float
digit
pp-number digit
pp-number non-digit
pp-float:
pp-real
pp-float E sign
pp-float e sign
pp-float digit
pp-float .
pp-float non-digit
pp-real:
digit
. digit
pp-real digit
pp-real .
This grammar is certainly more complicated than the one-level construction in
the C Standard, and consequently harder to understand. That's a strike against
it. Another strike is that, while it does mimic the two major numeric
categories, it still doesn't include all sequences covered by the existing
grammar, save those that would otherwise be valid by the stricter tokenization
rules. For example, 0b0101e+17
might be someone's future notion of a binary
floating constant. Finally, it suffers from a great deal of reduce/reduce
conflicts, making the implementation and specification less likely to be
understood and implemented as intended. In response to your second suggested
grammar: This could have been done. But the Committee chose a compromise at a
different point - one that restricts the inappropriate gobbling of characters
to +
and -
immediately after E
or e
. This was all that was necessary to
cover all valid numeric constants in as simple a grammar as was possible. For
more background, you'd need to know the state of the proposed standard a few
years before this grammar was voted in. The Committee had stated its intent that
“garbage” character sequences that began like a numeric constant were to be
tokenized as a single sequence. This was to prevent situations in which this
“garbage” would be turned into valid C code through obscure macro replacements,
among more minor reasons. This was, unfortunately, very poorly stated in the
draft. As I recall, it was placed in the constraints for subclause 6.1. It was
something like “Each pair of adjacent tokens that are both keywords,
identifiers, and/or constants must be separated by white space.” [As “improved”
for the May 1, 1986 draft proposed standard, subclause 6.1 Constraints
consisted of the single sentence: “Each keyword, identifier, or constant shall
be separated by some white space from any otherwise adjacent keyword,
identifier, or constant.”] As you can see, this constraint neither presented
the intent of the Committee nor caused implementations to behave in any sort of
consistent manner with respect to tokenization. Finally a letter writer
understood the issue well enough to suggest a grammar along the lines of the
current subclause 6.1.8. It, contrary to your opening remarks on this topic, is
not a “loose description,” and it finally stated in a precise way the intent
of the tokenization rules. The benefits of this construction were that all
tokenization for all implementations would now be the same, no “garbage”
character sequences would be able to be converted to valid C code, skipped
blocks of code could silently be scanned without generating needless and
unnecessary tokenization errors, the preprocessing tokenization of numeric
tokens would be greatly simplified, and room for future expansion of C's numeric
tokens was reserved. That's a lot of good. The down side was that certain
sequences now would require some white space to cause them to be tokenized as
the programmer intended. As noted in the rationale document, there are other
parts in C that require white space for tokenization to be controlled, and this
was found to be one more. Since the “mistokenization” of such sequences must
result in some diagnostic noise from the compiler, and since the fix is so mild,
the Committee agreed that the proposed standard is still much better with this
grammar than with any of the other suggestions. Personally, I think that the
biggest surprise “win” was the reservation of future numeric token “name space.”
I would not be at all surprised to find binary constants (that begin with 0b
)
in newer C implementations.