Doc. No.: WG21/N1566, J16/04-0006
Date: 2004-02-05
Reply to: Clark Nelson
Phone: +1-503-712-8433
Email: clark.nelson@intel.com

Synchronizing the C++ preprocessor with C99

This document is intended as a basis for discussion of the details of adopting text from C99 to describe the C++ preprocessor. This was proposed at the Kona meeting, and was supported almost unanimously by the Evolution working group.

This paper summarizes every change that was made to the preprocessor section of either the C++ standard (as of 2003) or the C standard (as of 2001), taking the 1989 C standard as the base. The descriptions of the phases of translation and of trigraphs are also covered; they were explicitly mentioned in the original straw vote from the first Nashua meeting (1991-03) on the subject of incorporating text from the C standard.

Differences in the area of universal character names are also mentioned, as they affect the phases of translation. UCNs were introduced concurrently in both standards; nevertheless, there is considerable variance in the way they are described. Unfortunately, UCNs were not specifically mentioned at Kona, so it is not yet clear that there is consensus to synchronize with C, nor which direction would be favored for resolving discrepancies.

References are to the C++ standard, with the corresponding C standard section number in parentheses. A complete C reference is used when a paragraph or section has been added to the C standard.

Changes fall into four major categories:

Universal character name differences
    These are treated separately, because they are a new topic.
Technical changes
    These correspond or relate to language features and other substantive committee decisions.
Terminology changes
    Where different committees and editors have at different times used different terms of reference.
Editorial changes
    Clarifying changes, large and small. (Some small editorial changes to non-normative text, and to text also modified more significantly, are ignored in this paper.)

Universal character names

I expect that everyone would agree that C and C++ should be synchronized with respect to universal character names. I am less certain that everyone will agree where changes should be made to effect synchronization. Personally, I believe that the model described in the C standard is at least as good as that in C++. Therefore, I recommend that the C++ standard be changed to match C: the changes from C99 should be adopted, and the changes from C++ should be abandoned.

It should be noted that this list of changes is not complete. Universal character names are also mentioned elsewhere in both standards: where they are defined, in the descriptions of identifiers and string and character literals, and in annexes specifying which characters are permitted in identifiers. More work in this area will be needed, especially if the committee prefers close synchronization.

Changes made to C99

§16.3.2¶2 (§6.10.3.2): Added a statement that stringizing a string literal containing a UCN is implementation-defined.

Changes made to C++

§2.1 phase 1 (§5.1.1.2): Processing of characters not in the basic source character set is described in terms of universal character names.

§2.1 phase 2 (§5.1.1.2): A universal character name may not be split by an escaped new-line.

§2.1 phase 5 (§5.1.1.2): Universal character names are mapped onto the execution character set. [In C, no change is needed here because of a terminology difference: a universal character name is described as an escape sequence, which is already mentioned.]

Technical changes

The technical changes are presented roughly in order of decreasing controversy (in my best guess).

Standard pragmas

This represents quite a lot of technical work, in both specification and implementation. I am not prepared to make a recommendation at this time.

Changes made to C99

§16.6¶1 (§6.10.6): Added statements distinguishing non-standard pragmas from standard pragmas.

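§16.6 -- new paragraph (§6.10.6¶2): Added introductions of the standard pragmas:

#pragma STDC FP_CONTRACT on-off-switch
#pragma STDC FENV_ACCESS on-off-switch
#pragma STDC CX_LIMITED_RANGE on-off-switch

Their semantics are specified elsewhere.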

Predefined macros

Although the conditionally-defined macros added to C99 represent a fair amount of work in specification and/or coordination, my recommendation would be to adopt them into C++. __STDC_HOSTED__ is comparatively easy, and makes as much sense for C++ as it does for C.

Clearly the description of __cplusplus in the C++ standard should not be synchronized with C. __STDC_VERSION__ might be trivially (and usefully?) defined to have the same value as __cplusplus. With respect to __STDC__, perhaps existing practice should be surveyed.

Changes made to C99

§16.8¶1 (§6.10.8): New macros were added:

__STDC_HOSTED__
    Indicates whether or not the implementation is hosted.
__STDC_VERSION__
    A year-month standard version number.

Several editorial clarifications are also applied.

§16.8 -- new paragraph (§6.10.8¶2): New conditionally-defined macros were added:

__STDC_IEC_559__
    Indicates whether or not floating-point arithmetic conforms to IEC 60559 (a.k.a. IEEE 754).
__STDC_IEC_559_COMPLEX__
    Indicates whether or not complex arithmetic conforms to IEC 60559.
__STDC_ISO_10646__
    A year-month number of the version of ISO 10646 encoded by wchar_t.

§16.8 -- new paragraph (§6.10.8¶5): Added a prohibition against predefining or defining __cplusplus. [This was added more or less as a courtesy, to ensure that __cplusplus could be used to distinguish reliably between C and C++.]

Changes made to C++

§16.8¶1 (§6.10.8):

__STDC__
    The state was changed to be implementation-defined.
__cplusplus
    Added as a year-month version number.

In addition, a restriction on the spellings of any other predefined macros (i.e. that they must begin either with two underscores or an underscore and a capital letter) was deleted. [I believe this was removed due to a general reluctance to state restrictions on implementations using the word "shall". Other such instances were rephrased, not deleted. It is not clear to me that this particular change is worth preserving.]
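For illustration, here is a minimal sketch of how a translation unit might probe the macros discussed in this section. The helper names (hosted_flag, language_version, stdc_is_defined, check_predefined_macros) are hypothetical and appear in neither standard. The sketch assumes only that the implementation predefines __cplusplus as a year-month number, and treats every other macro as possibly undefined.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only.

// 1 = hosted, 0 = freestanding, -1 = the implementation does not define
// __STDC_HOSTED__ at all.
inline int hosted_flag() {
#if defined(__STDC_HOSTED__)
    return __STDC_HOSTED__;
#else
    return -1;
#endif
}

// Year-month version number of the C++ standard in effect, e.g. 199711L.
inline long language_version() {
    return __cplusplus;
}

// Whether __STDC__ is defined in C++ is implementation-defined, so this
// only reports its presence and makes no claim about its value.
inline bool stdc_is_defined() {
#if defined(__STDC__)
    return true;
#else
    return false;
#endif
}

inline void check_predefined_macros() {
    // Assumption: the implementation reports at least the 1998 standard.
    assert(language_version() >= 199711L);
    const int hosted = hosted_flag();
    assert(hosted == -1 || hosted == 0 || hosted == 1);
}
```

Because every probe other than __cplusplus is guarded by defined, the sketch compiles whether or not the C99 macros are adopted into C++.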

Extended integer types

Probably every hosted C++ implementation already supports 64-bit integers, most by the name long long. So adopting it, along with the other <stdint.h> changes, would amount to codification of existing practice. I recommend it.

Changes made to C99

§16.1¶4 (§6.10.1): long and unsigned long were replaced by intmax_t and uintmax_t, respectively. Also, integer literals can have other widths than int or long.
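The practical effect of evaluating condition directives in intmax_t can be sketched as follows. The macro and function names are hypothetical. The sketch assumes an implementation that follows the C99 rule, where intmax_t is at least 64 bits wide; under the older rule, an implementation with a 32-bit long could not represent the sum at all.

```cpp
#include <cassert>

// With arithmetic carried out in intmax_t, 2147483647 + 1 does not
// overflow and the comparison is true.
#if 2147483647 + 1 > 0
#  define PP_ARITHMETIC_IS_WIDE 1
#else
#  define PP_ARITHMETIC_IS_WIDE 0
#endif

// Hypothetical helper exposing the preprocessor's verdict at run time.
inline bool preprocessor_arithmetic_is_wide() {
    return PP_ARITHMETIC_IS_WIDE == 1;
}

inline void check_wide_arithmetic() {
    assert(preprocessor_arithmetic_is_wide());
}
```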

Pragma operator

This is a very simple change; there is no interaction with the rest of the language. It should be adopted.

Changes made to C99

§16.3.4¶3 (§6.10.3.4): Added a statement that pragma operators are processed after macro expansion.

§16.9 -- new section (§6.10.9): Added description of pragma operator:

_Pragma ( string-literal )

§2.1 phase 4 (§5.1.1.2): Added a statement that pragma operators are interpreted.
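As a sketch of why the operator is useful: unlike a #pragma directive, _Pragma can be produced by macro expansion. The macro names below are hypothetical. The diagnostic pragmas are a GCC and Clang extension chosen only as a convenient payload, and are an assumption about the compiler in use; on other compilers the macros expand to nothing.

```cpp
#include <cassert>

// Stringize the argument and hand the resulting string literal to _Pragma.
#define DO_PRAGMA(x) _Pragma(#x)

#if defined(__GNUC__)  // GCC and Clang; these pragmas are their extension
#  define UNUSED_WARNING_OFF \
       DO_PRAGMA(GCC diagnostic push) \
       DO_PRAGMA(GCC diagnostic ignored "-Wunused-variable")
#  define UNUSED_WARNING_ON DO_PRAGMA(GCC diagnostic pop)
#else
#  define UNUSED_WARNING_OFF
#  define UNUSED_WARNING_ON
#endif

UNUSED_WARNING_OFF
inline int answer() {
    int scratch = 0;  // deliberately unused; the pragmas are meant to silence the warning
    return 42;
}
UNUSED_WARNING_ON

inline void check_pragma_operator() {
    assert(answer() == 42);
}
```

The pragmas affect diagnostics only; the function behaves the same with or without them.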

Variadic macros and empty macro arguments

Paul Mensonides made this proposal in isolation at the Kona meeting. I trumped it by suggesting this grander unification before many people had a chance to comment on this aspect specifically. This is unquestionably the largest change under consideration. Along with Paul, I recommend it.

Changes made to C99

§16 control-line grammar rule (§6.10): Alternatives were added with an ellipsis before the close parenthesis.

§16.3¶4 (§6.10.3): A variadic macro may be invoked with more arguments than the definition has parameters.

§16.3 -- new paragraph (§6.10.3¶5): __VA_ARGS__ may be used only in the definition of a variadic macro.

§16.3¶9 (§6.10.3): Alternatives were added with an ellipsis before the close parenthesis.

§16.3¶10 (§6.10.3): Removed statement that empty macro arguments yield undefined behavior.

§16.3 -- new paragraph (§6.10.3¶12): Added description of argument collection for variadic macros.

§16.3.1 -- new paragraph (§6.10.3.1¶2): __VA_ARGS__ is an implicit parameter of a variadic macro.

§16.3.2¶2 (§6.10.3.2): Added definition of the result of stringizing an empty macro argument.

§16.3.3¶2-3 (§6.10.3.3): Added definition of token-pasting with an empty macro argument.

§16.3.3 -- new paragraph (§6.10.3.3¶4): A token-pasting example was added.

§16.3.5¶5 (§6.10.3.5): Examples of token-pasting and stringizing with empty macro arguments were added.

§16.3.5 -- new paragraph (§6.10.3.5¶7): More examples of token-pasting with empty macro arguments.

§16.3.5 -- new paragraph (§6.10.3.5¶9): Examples using variadic macros.
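The changes listed above can be illustrated with a short sketch, assuming an implementation that has adopted the C99 rules. All macro and function names are hypothetical. FORMAT_INTO forwards its trailing arguments through __VA_ARGS__, and the remaining macros show stringizing and token-pasting with an empty argument.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

#define STRINGIZE(x) #x
#define PASTE(a, b) a##b
// Variadic: everything after `buffer` is collected into __VA_ARGS__.
#define FORMAT_INTO(buffer, ...) std::snprintf(buffer, sizeof buffer, __VA_ARGS__)
// Stringizing __VA_ARGS__ keeps the separating commas.
#define LIST_AS_TEXT(...) #__VA_ARGS__

inline int format_pair(char (&out)[32], int a, int b) {
    // Expands to std::snprintf(out, sizeof out, "%d-%d", a, b).
    return FORMAT_INTO(out, "%d-%d", a, b);
}

inline int paste_with_empty_argument() {
    int value = 7;
    return PASTE(value, );  // the empty argument pastes to nothing: value
}

inline const char* stringize_empty_argument() {
    return STRINGIZE();  // an empty argument stringizes to ""
}

inline void check_variadic_and_empty_arguments() {
    char text[32];
    assert(format_pair(text, 1, 2) == 3);
    assert(std::strcmp(text, "1-2") == 0);
    assert(std::strcmp(stringize_empty_argument(), "") == 0);
    assert(paste_with_empty_argument() == 7);
    assert(std::strcmp(LIST_AS_TEXT(1,2,3), "1,2,3") == 0);
}
```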

String literal concatenation

This change should be adopted. Note that, since the Technical Report on extensions for new character data types (WG14/N1040) has new kinds of string literals, its rules are slightly different, although analogous.

Changes made to C99

§2.1 phase 6 (§5.1.1.2): If adjacent string literals are of different types, the result of concatenation is a wide string literal.
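A minimal sketch of the C99 rule, assuming a compiler that applies it; under the 2003 C++ standard the same concatenation has undefined behavior. The function names are hypothetical.

```cpp
#include <cassert>
#include <cwchar>

// A wide literal followed by a narrow one: under the C99 rule the whole
// concatenation is a single wide string literal.
inline const wchar_t* mixed_greeting() {
    return L"wide, " "then narrow";
}

inline void check_mixed_concatenation() {
    assert(std::wcscmp(mixed_greeting(), L"wide, then narrow") == 0);
}
```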

Header and include file names

It is interesting to note that C89 explicitly allowed only letters in header and include file names. C++ added underscores, and C99 added digits. Probably both standards should allow both.

I have no idea why C99 dropped the requirement that the implementation document the mapping to external file names. But there is probably no practical impact, so by default C++ should probably drop it as well.

Changes made to C99

§16.2¶5 (§6.10.2): The mapping from header or source file name syntax to external source file names is no longer implementation-defined.

§16.2¶5 (§6.10.2): Non-initial digits are now allowed in include syntax.

Changes made to C++

§16.2¶5 (§6.10.2): Underscores are allowed.

Translation limit changes

There is probably no support for adopting the lower limit on the significance of a header or include file name from C, even though it has now been increased.

On the other hand, I imagine it was only by oversight that the limitation to 15-bit numbers in a #line directive survived into C++. There is certainly no need to preserve it.

Changes made to C99

§16.2¶5 (§6.10.2): The lower limit on the significant characters of an include file or header name was raised to eight.

§16.4¶2 (§6.10.4): The lower limit on the number that can be specified in a #line directive was raised to 2147483647.

Changes made to C++

§16.2¶5 (§6.10.2): The standard does not explicitly grant license to limit the number of significant characters in the name of an included file or header.
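The #line limit can be illustrated with a small sketch, assuming an implementation that accepts line numbers beyond the 15-bit range. The function names are hypothetical. Every line after the directive is renumbered, so diagnostics for the rest of the file report the new numbers.

```cpp
#include <cassert>

// 100000 exceeds the 15-bit limit of 32767 but is well within the C99
// limit of 2147483647. The source line that follows the directive is
// treated as line 100000.
#line 100000
inline long line_after_directive() { return __LINE__; }

inline void check_line_directive() {
    assert(line_after_directive() == 100000L);
}
```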

Alternative tokens

This is a considered difference from C, in which these identifier-like alternative token spellings are explicitly implemented as macros. It should be preserved.

Changes made to C++

§16.1¶4 (§6.10.1): Added a footnote clarifying that an identifier-like spelling of an alternative token is not replaced by zero in a condition directive.
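The difference can be sketched as follows, assuming a C++ implementation that recognizes the alternative tokens in condition directives. The macro and function names are hypothetical. A C preprocessor that has not seen <iso646.h> would treat the same spellings as ordinary identifiers, replace each by zero, and then reject the directive as a syntax error.

```cpp
#include <cassert>

// In C++, `not`, `and` and `bitor` are operators here, so the condition
// reads as: !0 && (1 | 2) == 3.
#if not 0 and (1 bitor 2) == 3
#  define ALTERNATIVE_TOKENS_ARE_OPERATORS 1
#else
#  define ALTERNATIVE_TOKENS_ARE_OPERATORS 0
#endif

// Hypothetical helper exposing the preprocessor's verdict at run time.
inline bool alternative_tokens_are_operators() {
    return ALTERNATIVE_TOKENS_ARE_OPERATORS == 1;
}

inline void check_alternative_tokens() {
    assert(alternative_tokens_are_operators());
}
```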

bool data type

Although C now has a Boolean type, Boolean-valued operators are still specified as having int results, unlike in C++. Also, in C++ true and false are not defined as macros. So this difference is still justified.

Changes made to C++

§16.1¶4 (§6.10.1): In a condition directive, true and false are not replaced by zero, and bool-typed subexpressions are immediately integral-promoted.
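A short sketch of the C++ behavior described above. All names are hypothetical, including the deliberately undefined identifier in the third directive. In C, true and false would be replaced by zero like any other identifier unless <stdbool.h> had defined them as macros.

```cpp
#include <cassert>

// true and false keep their values in a C++ condition directive.
#if true && !false
#  define BOOL_LITERALS_KEEP_VALUES 1
#else
#  define BOOL_LITERALS_KEEP_VALUES 0
#endif

// bool operands are promoted to an integer type, so true + true is 2.
#if true + true == 2
#  define BOOL_OPERANDS_ARE_PROMOTED 1
#else
#  define BOOL_OPERANDS_ARE_PROMOTED 0
#endif

// For contrast: any other identifier that is not a macro becomes 0.
#if hypothetical_unknown_identifier == 0
#  define OTHER_IDENTIFIERS_BECOME_ZERO 1
#else
#  define OTHER_IDENTIFIERS_BECOME_ZERO 0
#endif

inline bool bool_literals_keep_values()     { return BOOL_LITERALS_KEEP_VALUES == 1; }
inline bool bool_operands_are_promoted()    { return BOOL_OPERANDS_ARE_PROMOTED == 1; }
inline bool other_identifiers_become_zero() { return OTHER_IDENTIFIERS_BECOME_ZERO == 1; }

inline void check_bool_in_conditions() {
    assert(bool_literals_keep_values());
    assert(bool_operands_are_promoted());
    assert(other_identifiers_become_zero());
}
```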

Template instantiation

This change is obviously still justified.

Changes made to C++

§2.1 phase 8 -- new phase (§5.1.1.2): Template instantiation was inserted between parsing/translation and linking.

Terminology changes

Unless someone would like to convince either committee to adopt terms from the other, these are simply areas where the committees have agreed to disagree. I recommend no changes.

Changes made to C99

"integral constant expression" was changed to "integer constant expression".

"comprise" was changed to "compose".

"preprocessing translation unit" was added, referring to a translation unit before macro expansion.

Changes made to C++

"character constant" was changed to "character literal".

The implication of "shall" in a Semantics paragraph of the C standard is spelled out as "undefined behavior".

When "shall" was used to express a requirement on an implementation, the requirement was rewritten.

§16.3¶2-3 (§6.10.3): Constraints on macro redefinition were made explicit using "ill-formed".

Editorial changes

Although I frankly do not see the point of a few of the changes made to C99, for simplicity I recommend that they all be adopted, including the small edits.

The changes made to C++ should be forwarded to the C committee for their consideration.

Changes made to C99

§16¶1 (§6.10): Clarifications were added with respect to translation phases (specifically, processing of comments and expansion of macros). An accompanying example was added as a new paragraph immediately before §16.1.

§16 grammar rules (§6.10): New rules were added for text-line and non-directive, and group-part was changed to use them, to clarify (for example) that any line beginning with # is interpreted as a directive (even though it also matches the grammar of a non-directive line). Two new accompanying text paragraphs were also added before §16¶2.

§16.3 -- new paragraph (§6.10.3¶3): Added a requirement for white-space after the macro name in an object-like macro definition.

§16.3.4¶1 (§6.10.3.4): Added clarification that token-pasting and stringizing precede rescanning. Also minor editorial changes.

§16.3.5¶1 (§6.10.3.5): Added clarification that macros are not used after translation phase 4.

§16.6¶1 (§6.10.6): Added clarification that (non-standard) pragmas may cause translation failure or non-conforming behavior.

§2.1 phase 1 (§5.1.1.2): Clarified that source may contain multibyte characters.

§2.1 phase 2 (§5.1.1.2): Clarified that a line that ends with two backslashes cannot result in two line-splices.

§2.1 phase 4 (§5.1.1.2): Clarified that preprocessing directives do not survive past phase 4.

§2.1 phase 5 (§5.1.1.2): The mapping to the execution character set was clarified: a character not in the execution set must not be mapped to a null character, but different missing characters may be mapped to different execution characters.

§2.1 phase 7 (§5.1.1.2): Added clarification that the results of preprocessing are translated "as a translation unit".

Several examples were changed to include "C++-style" comments.

§16 grammar rules (§6.10): The definition of lparen was tweaked.

§16.3¶2-3 (§6.10.3): Definitions of object-like and function-like macro were moved down, and forward-referenced from here. Constraints were made explicit using "shall". Paragraphs were joined into one.

§16.3.1¶1 (§6.10.3.1): "translation unit" changed to "preprocessing file".

§16.3.3¶2 (§6.10.3.3): Clarified that the special case for parameters in token-pasting applies only in function-like macros.

§16.3.4¶2 (§6.10.3.4): Changed "Further" to "Furthermore".

§16.3.5¶6 (§6.10.3.5): A comment referring (misleadingly) to a previous example was deleted.

§2.1 phase 2 (§5.1.1.2): The description of an escaped new-line was rearranged.

§2.1 phase 3 (§5.1.1.2): Added "in a" in "or in a partial comment."
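The clarification to §16.3.4¶1 listed above, that token-pasting and stringizing precede rescanning, can be illustrated with a minimal sketch. All macro and function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

#define PASTE(a, b) a##b
#define STRINGIZE(x) #x
#define EXPAND_THEN_STRINGIZE(x) STRINGIZE(x)

#define LIMIT_VALUE 255

// The paste happens first, producing the single token LIMIT_VALUE; only
// then is the result rescanned and replaced by 255.
inline int pasted_then_rescanned() { return PASTE(LIMIT_, VALUE); }

// An operand of # is not macro-expanded, so the spelling is kept as written.
inline const char* stringized_as_written() { return STRINGIZE(LIMIT_VALUE); }

// With one extra level, the argument is fully expanded before it reaches #.
inline const char* expanded_then_stringized() { return EXPAND_THEN_STRINGIZE(LIMIT_VALUE); }

inline void check_rescanning() {
    assert(pasted_then_rescanned() == 255);
    assert(std::strcmp(stringized_as_written(), "LIMIT_VALUE") == 0);
    assert(std::strcmp(expanded_then_stringized(), "255") == 0);
}
```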

Changes made to C++

§16.1¶2 (§6.10.1): Added a restriction that only valid tokens may appear in a condition directive.

§16.1¶4 (§6.10.1): Added clarification that (most) keywords are replaced by zero in a condition directive.

§16.3¶8 (§6.10.3): Added clarification that object-like macros are rescanned.

§16.5¶1 (§6.10.5): Added a statement that #error causes a program to be ill-formed.

§2.1 phase 3 (§5.1.1.2): The footnote pointing out the context-dependent nature of tokenization (specifically with respect to header names) was made normative.

§2.1 phase 7 (§5.1.1.2): A note was added clarifying that there need not be a one-to-one correspondence between (for example) source files and external file system files.

§2.3¶1 (§5.2.1.1): Added clarification that trigraphs are recognized before preprocessing.

§2.3¶1 (§5.2.1.1): Added an example using several trigraphs. Deleted the example demonstrating a boundary condition (???/).

§16¶1 (§6.10): Modified to break up a very long sentence.

§16.1¶1 (§6.10.1): Spelled out "0" as "zero".

§16.3¶9 (§6.10.3): "arguments" was replaced with "parameters".

§2.3¶1 (§5.2.1.1): Description of trigraph processing changed from plural (collective) to singular (distributive). Also, trigraph sequences were formatted into a table.