Doc. No.:	WG14/N1068
Date:	2004-07-29
Reply to:	Clark Nelson
Phone:	+1-503-712-8433
Email:	clark.nelson@intel.com

Changes from the C++ preprocessor description

In carefully considering all the differences between the descriptions of the preprocessor and translation phases from the C and C++ standards, for the purpose of re-synchronizing C++ with C, I discovered several changes -- almost all editorial -- which were made to the C++ standard, and which could apply equally well to C. Many of these were made in response to public review comments, and were apparently considered by some reviewers to be improvements over the original wording.

In this document, I present these changes to WG14 for their consideration, as part of the synchronization effort. Changes which are acknowledged to be an improvement on the current words of the C standard will be retained in the working draft of the C++ standard, and will hopefully be incorporated into the C standard -- at the appropriate time, and in the appropriate way. For changes deemed not worthwhile, a recommendation will be made to the C++ committee that the words be returned to the state in the C standard.

It may help to know that I am presenting these changes not specifically because I think they are all good ideas. I am presenting them because WG21 believed them to be good ideas; good enough, in fact, to incorporate them into the formally-approved C++ standard. That notwithstanding, if WG14 agrees that some change corresponds to a problem which should be solved, but believes a different solution would be more appropriate, I am more than willing to present modified wording back to WG21 for their approval as well.

Change §6.10.2¶5:

The implementation shall provide unique mappings for sequences consisting of one or more ~~letters or digits (as defined in 5.2.1)~~ nondigits or digits (6.4.2.1) followed by a period (.) and a single ~~letter~~ nondigit. The first character shall ~~be a letter~~ not be a digit. The implementation may ignore the distinctions of alphabetical case and restrict the mapping to eight significant characters before the period.

Note: This is the only proposed technical change. It is also the only change that is not verbatim from the C++ standard.

Rationale: In analyzing the differences in this paragraph between C99 and C++, I discovered that C89 admitted only letters in header names with guaranteed unique mappings. C99 later added digits, while C++ independently added underscores. I personally don't recall any discussion or rationale behind either decision. It's clear that simply synchronizing C++ to C99 would be a technical change, and could (from a pedantic perspective) invalidate some existing code. The only way to synchronize the two standards without invalidating any existing code would be to allow underscores and digits in both standards. This may be considered a good thing in any event.

Note that the terms of reference have changed slightly. That's because, in the C++ standard, for good or ill, the terms letter and digit aren't defined in the (earlier) section describing the character set, as they are in C99 (whereas in C89, the terms appeared there without being definitions). In C++, letter is defined in the (later) library section, and digit is defined only as a non-terminal. It would of course be possible to rearrange things in the C++ standard to more closely match the C standard, but synchronizing things in this way would be much easier. And again, using the non-terminal symbols in this context may be considered an improvement in itself.
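
For illustration only, the proposed rule can be sketched as a small C predicate. The names has_guaranteed_mapping, is_nondigit and is_digit are hypothetical, universal character names are ignored, and the permitted case folding and eight-character truncation are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* nondigit in the sense of 6.4.2.1: underscore or a basic Latin letter. */
static bool is_nondigit(char c)
{
    static const char nondigits[] =
        "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return c != '\0' && strchr(nondigits, c) != NULL;
}

static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* True if name has the form covered by the proposed wording: one or more
   nondigits or digits, the first not a digit, followed by a period and
   exactly one nondigit. */
bool has_guaranteed_mapping(const char *name)
{
    size_t i = 0;

    if (!is_nondigit(name[0]))
        return false;   /* empty, or first character is a digit or other */
    while (is_nondigit(name[i]) || is_digit(name[i]))
        i++;
    return name[i] == '.' && is_nondigit(name[i + 1]) && name[i + 2] == '\0';
}
```

Under this sketch a name such as my_header2.h is accepted, although its underscore puts it outside the C99 guarantee and its digit puts it outside the C++ one.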

Change §6.10¶2:

A preprocessing directive consists of a sequence of preprocessing tokens ~~that begins with~~. The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character~~, and is ended by the next~~. The last token in the sequence is the first new-line character that follows the first token in the sequence.¹⁴⁰⁾ A new-line character ends the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.

Rationale: This breaks up a fearsomely long sentence. However, I note that TC2 changes this sentence into the definition of the term "preprocessing directive", which may change the dynamics here.
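
The last sentence of the paragraph can be illustrated with a minimal sketch. The macro ADD and the other names are illustrative only and do not come from either standard. Outside a directive, an invocation may span several lines. Inside a directive, it has to stay on one logical line.

```c
#define ADD(a, b) ((a) + (b))

/* Outside a directive, the arguments of an invocation may span lines. */
int spread_result(void)
{
    return ADD(1,
               2);
}

/* Inside a directive, a new-line would end the directive in the middle of
   the invocation. Line splicing (translation phase 2) joins the two
   physical lines below before the directive is recognized in phase 4. */
#if ADD(1, \
        2) == 3
#define DIRECTIVE_SAW_WHOLE_INVOCATION 1
#else
#define DIRECTIVE_SAW_WHOLE_INVOCATION 0
#endif

int directive_result(void)
{
    return DIRECTIVE_SAW_WHOLE_INVOCATION;
}
```

Without the line splice, the #if line would end after the comma and the controlling expression would be ill-formed.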

Insert new constraint paragraph after §6.10.1¶1:

Each preprocessing token that remains after all macro replacements have occurred shall be in the lexical form of a token (6.4).

Rationale: According to §6.10.1¶3, "each preprocessing token [in a #if directive] is converted into a token." But what if, for example, the line contains an unmatched quote mark, or a preprocessing number like 4hello? How is such a preprocessing token converted into a token? No indication is given that the conversion may fail. The added sentence clarifies the intent considerably.
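
A minimal sketch of the distinction the constraint relies on, using illustrative names (STR, spelled, spelling_matches) that appear in neither standard: 4hello is a perfectly good preprocessing token, and only its conversion into a token is impossible.

```c
#include <string.h>

#define STR(x) #x

/* 4hello is a single preprocessing token (a pp-number), so it can be
   stringized even though it has no valid form as a token. */
static const char spelled[] = STR(4hello);

/* In a skipped group no conversion is attempted, so this is accepted. */
#if 0
4hello
#endif

/* By contrast, the proposed constraint would be violated by a directive
   such as
       #if 4hello
   because the pp-number would remain after macro replacement and could
   not be converted into a token. */

int spelling_matches(void)
{
    return strcmp(spelled, "4hello") == 0;
}
```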

Change §6.10.1¶3:

«...» After all replacements due to macro expansion and the defined unary operator have been performed, all remaining identifiers and keywords are replaced with the pp-number 0, and then each preprocessing token is converted into a token. «...»

Rationale: This just clarifies that keywords are not treated specially in this regard. (In C++, the keywords true and false are treated specially in this regard; I suspect that someone didn't want the sentence to read, "... all remaining identifiers, except for true and false, are replaced ...", for reasons which seem fairly obvious to me.)
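
As a sketch of what the clarified wording means in practice for C (the macro names below are illustrative, and NOT_DEFINED_ANYWHERE is assumed not to be defined as a macro), both an ordinary undefined identifier and a keyword evaluate as 0 in a controlling expression:

```c
/* An identifier that is not a macro name is replaced with 0. */
#if NOT_DEFINED_ANYWHERE == 0
#define UNDEFINED_IDENTIFIER_IS_ZERO 1
#else
#define UNDEFINED_IDENTIFIER_IS_ZERO 0
#endif

/* The preprocessor does not treat int specially: it is replaced with 0
   in exactly the same way. */
#if int == 0
#define KEYWORD_IS_ZERO 1
#else
#define KEYWORD_IS_ZERO 0
#endif

int identifier_result(void) { return UNDEFINED_IDENTIFIER_IS_ZERO; }
int keyword_result(void)    { return KEYWORD_IS_ZERO; }
```

Some compilers can warn about such uses when asked to (for example, with an option that diagnoses undefined identifiers in #if), but the expression is still evaluated as described.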

Add a new sentence to the end of §6.10.3¶9:

A preprocessing directive of the form
```
# define identifier replacement-list new-line
```
defines an object-like macro that causes each subsequent instance of the macro name¹⁴⁵⁾ to be replaced by the replacement list of preprocessing tokens that constitute the remainder of the directive. The replacement list is then rescanned for more macro names as specified below.

Rationale: I suspect that this was introduced as a result of a public comment from someone who was confused (honestly or perversely) about §16.3.4¶1: "After all parameters in the replacement list have been substituted, the resulting preprocessing token sequence is rescanned «...»" (emphasis added). This clearly describes the rescanning of function-like macros, but because of the reference to parameters, may be taken as not applying to object-like macros.
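
A minimal sketch of rescanning applied to an object-like macro, with illustrative names OUTER and INNER: the replacement list of OUTER has no parameters to substitute, yet it is still rescanned, so the INNER it contains is replaced in turn.

```c
/* OUTER is defined before INNER. That is fine: the replacement list is
   stored as written and rescanned for macro names at each use of OUTER. */
#define OUTER (INNER + 1)
#define INNER 41

int outer_value(void)
{
    return OUTER;   /* OUTER -> (INNER + 1) -> (41 + 1) */
}
```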

Change §6.10.3¶10:

A preprocessing directive of the form
```
# define identifier lparen identifier-list_opt ) replacement-list new-line
# define identifier lparen ... ) replacement-list new-line
# define identifier lparen identifier-list , ... ) replacement-list new-line
```
defines a function-like macro with ~~arguments~~ parameters, similar syntactically to a function call.

Rationale: Obviously, what appear in the definition syntax of a function-like macro are not its arguments, but its parameters. On the other hand, what is similar syntactically to a function call is obviously the invocation of the macro, not its definition. Clearly, there is confusion about whether this sentence is talking about the definition or an invocation.

Perhaps it would be clearer yet to say something like, "a function-like macro which takes arguments, similarly syntactically to a function call".
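
The distinction can be sketched with one macro for each of the three definition forms. All names below (MAX, SUM3, FIRST_OF, sum3 and the wrapper functions) are illustrative. Parameters appear in the definitions, while arguments appear in the invocations, and only the invocations look like function calls.

```c
static int sum3(int a, int b, int c)
{
    return a + b + c;
}

/* First form: a and b are the parameters of MAX. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Second form: only a variable argument list, named __VA_ARGS__. */
#define SUM3(...) sum3(__VA_ARGS__)

/* Third form: a named parameter followed by a variable argument list. */
#define FIRST_OF(first, ...) (first)

/* In each invocation below, the numbers are the arguments. */
int largest(void)        { return MAX(3, 7); }
int sum_of_three(void)   { return SUM3(1, 2, 3); }
int first_of_three(void) { return FIRST_OF(10, 20, 30); }
```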

Change footnote 5 (§5.1.1.2¶1):

Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice. Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation.

Rationale: I do not recall the specific motivation for adding this note, but it's certainly true, and it seems harmless.

It should be noted that in the C++ standard, this text is an embedded non-normative note at the end of the description of phase 7 (parsing and semantic analysis). But the C standard does not have embedded notes, and the note is not actually specific to phase 7 (which talks principally about tokens, without even mentioning files). Adding the text to this footnote, which already points out an implication of the as-if rule for the phases of translation, would seem to be the ideal solution.

Change §5.2.1.1¶1:

~~All occurrences in a source file~~ Before any other processing takes place, each occurrence of one of the following sequences of three characters (called trigraph sequences¹²⁾) ~~are~~ is replaced with the corresponding single character.

Rationale: Obviously, "before any other processing" is already implied by the phases of translation, but it doesn't hurt to point out the implication here. And in general, a distributive description ("each" with singular) tends to be less ambiguous than a collective one ("all" with plural). The motivation for deleting the phrase "in a source file" is, as far as I can see, weak at best.

Add new example paragraph before §5.2.1.1¶2:

EXAMPLE 1:
```
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
```
becomes
```
#define arraycheck(a,b) a[b] || b[a]
```

Rationale: The existing corner case example is a good one. For some reason it was removed from C++, and I will propose that it be restored. But in general there are very few cases where the only example presented is a corner case. If trigraphs make any sense at all, then perhaps it would make sense to present a more realistic example (possibly even more realistic than this example).
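
For readers who want to check the example mechanically, the replacement described in §5.2.1.1 can be sketched as a small C function. This is only an illustration under stated assumptions: replace_trigraphs is a hypothetical name, the caller must supply an output buffer at least as large as the input string including its terminator, and the source deliberately contains no literal trigraph sequences so that it compiles identically whether or not the compiler performs trigraph replacement.

```c
#include <stddef.h>
#include <string.h>

/* The two strings are parallel: the n-th character of third_chars is the
   third character of a trigraph sequence, and the n-th character of
   replacements is the single character that the sequence becomes. */
void replace_trigraphs(const char *in, char *out)
{
    static const char third_chars[]  = "=(/)'<!>-";
    static const char replacements[] = "#[\\]^{|}~";

    while (*in != '\0') {
        const char *hit = NULL;

        /* Two question marks start a trigraph only if the third
           character is one of the nine listed above. */
        if (in[0] == '?' && in[1] == '?' && in[2] != '\0')
            hit = strchr(third_chars, in[2]);

        if (hit != NULL) {
            *out++ = replacements[hit - third_chars];
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

When passing the text of EXAMPLE 1 to this function from C source, the second question mark of each pair can be written with the escape sequence \? so that the string literal itself is not subject to replacement.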