Extensions to the preprocessor for C2Y

Jens Gustedt

2023-12-13

document history

document number   date     comment
n3190             202312   this paper, original proposal

license

CC BY, see https://creativecommons.org/licenses/by/4.0

1 Introduction and overview

The C and C++ preprocessors have recently attracted some attention because they provide a means of textual replacement whose expressivity is balanced: it allows relatively sophisticated compile-time features to be expressed while guaranteeing termination within quite reasonable time frames. The preprocessor is not Turing complete, which has the advantage that processing in general stays bounded. Compiling with relatively complex macro packages (such as boost or P99) is in general several orders of magnitude faster than compiling equivalent code written with templates or constexpr, and has the advantage of producing equivalent intermediate source code that can be inspected.

Several projects are currently under way to extend the preprocessing phases to gain in expressivity. Our goal is to collect the different proposed features here. Most of them are relatively simple to implement, whereas the gain for everyday programming in C may be significant. It will be important to watch that all of this does not incur too much slowdown in compilation times.

There are several angles to preprocessing. In fact, several phases of the C and C++ translation model are commonly subsumed under that name, in particular lexing, evaluation of directives and macro replacement. The extensions we will discuss range from simple (but useful) predefined macros such as __COUNTER__, over new forms of string and character literals such as R"(bäh)", to extensions of existing directives (such as a prefix parameter for #include), new directives such as #bind, and new rules for macro replacement (bounded recursion).

2 predefined macros

A lot of predefined macros have appeared as compiler or library specific extensions that would be better promoted to the C standard. This concerns both object-like and function-like macros.
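
As one example among these, a short sketch of the widely implemented __COUNTER__ extension (gcc, clang, MSVC): each expansion yields the next integer of a sequence starting at 0, which allows forming unique identifiers. The helper macros CAT and UNIQUE are illustrative only, not part of any proposal.

#define CAT_(a, b) a##b
#define CAT(a, b)  CAT_(a, b)          /* expand the arguments before pasting */
#define UNIQUE(name) CAT(name, __COUNTER__)

static int UNIQUE(guard);              /* e.g. expands to: static int guard0; */
static int UNIQUE(guard);              /* e.g. expands to: static int guard1; */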

2.1 predefined object-like macros

2.2 predefined function-like macros

In combination with the #expand prefix and #include the latter two allow to have

3 New literals

C and C++ have the prefixes u8, L, u and U for string and character literals to select a specific execution encoding, namely UTF-8, wchar_t, char16_t and char32_t characters, respectively; literals without a prefix use the multi-byte execution encoding.

Note that these prefixes only concern the execution encoding. The source encoding is unchanged: characters in the input are interpreted as anywhere else, including the interpretation of escape sequences. The prefix only changes the interpretation/realization as an array. So, for example, a given source character such as ö would lead to several array elements in multi-byte encodings (such as UTF-8), to one or two elements in UTF-16, and to one element in UTF-32. The character literals 'x' and u8'x' refer to the same concept (the character x in the source encoding) but result in different semantics: one is a char with the value of the character x in the execution environment, the other is an unsigned char with a portable value, namely 120.
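
A minimal sketch of these differences, using the C23 types from <uchar.h>; the element counts in the comments assume the usual encodings:

#include <stddef.h>   /* wchar_t */
#include <uchar.h>    /* char8_t, char16_t, char32_t (C23) */

char      mb[]  = "ö";     /* multi-byte execution encoding, e.g. 2 elements plus 0 if that encoding is UTF-8 */
wchar_t   ws[]  = L"ö";    /* wchar_t encoding, typically 1 element plus 0 */
char8_t   s8[]  = u8"ö";   /* UTF-8: 2 elements plus 0; char8_t is unsigned char */
char16_t  s16[] = u"ö";    /* UTF-16: 1 element plus 0 */
char32_t  s32[] = U"ö";    /* UTF-32: 1 element plus 0 */

char          c = 'x';     /* value of x in the execution character set */
unsigned char u = u8'x';   /* portable value 120; in C23 the type of u8'x' is char8_t */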

3.1 C++ so-called raw string literals

In contrast to that, C++ adds a specific syntax for a modification of the source encoding of a string. Namely, it adds an R at the end of one of the prefixes (including the empty one) to indicate that the source encoding is “raw”, without interpretation of escape sequences. Examples:

R"(hör)";
uR"(hör)";
R"limit(here"and"there)limit";
R"(\not a \newline)"

3.2 String literals with specific encoding rules

The introduction of UTF-8 character literals has the strange effect of introducing literals with base type unsigned char while imposing a specific interpretation. It was already unfortunate that, for historic reasons, characters and bare bytes have the same type; we are now repeating the same mistake: the type that should be reserved for bytes also represents a particular textual concept, namely UTF-8 characters and strings. While the decision to do so is consistent within C’s restricted framework, it has the inherent danger that in particular UTF-8 strings will be used to encode arbitrary literals of type unsigned char by sprinkling \x escape sequences all over the place. We think that UTF-8 characters and strings should be reserved for properly encoded text and that there should be other features for encoding arbitrary binary data.

We propose the following prefixes as new extensions for C2y.

Form                   type              encoding
x"cont\x00E4\0nts"     unsigned char[]   restricted UTF-8, with escape sequences
x'\xFFFF'              unsigned char     restricted UTF-8, with escape sequences
B"gAf4yu=="            unsigned char[]   base64

Here the x prefix (mnemonic “hex encoded string”) is meant for arrays of base type unsigned char that hold arbitrary data in the range of 0 to UCHAR_MAX, inclusive, and which is encoded in the usual way by means of escape sequences. For portability, the source encoding of these strings should be fixed to one-byte sequences of characters that are representable with UTF-8 in the range between codes 32 and 126, inclusive, that is, ASCII. Byte values outside that range should be specified with octal or hexadecimal escape sequences or with \n or similar. With these specifications, the array then holds portable binary data, as long as the byte values that are presented fit into 8 bits.
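
A sketch of how the proposed x prefix might be used, here with the 8 bytes of the PNG file signature as example data:

unsigned char signature[] = x"\x89PNG\r\n\x1a\n";   /* bytes 0x89 'P' 'N' 'G' 0x0d 0x0a 0x1a 0x0a */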

The B prefix (mnemonic “Base64 Binary” encoding) represents binary data that is packed consecutively into an array of base type unsigned char and where the encoding is base64. It uses the 62 alphanumeric characters of the source character set plus the characters +, / and =, which are present in all currently used encodings, to encode packs of 24 bits of data with 4 characters. Thereby it is relatively efficient, because it only uses about 1.33 times as much space as the raw data, and it is still portable to all modern architectures. This encoding is widely used in the transfer of binary data and it is almost trivial to program.
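
To substantiate that claim, a minimal sketch of a base64 decoder in standard C; the function names b64_value and b64_decode are illustrative, and error handling is reduced to rejecting characters outside the alphabet:

#include <stddef.h>

static int b64_value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;                                /* '=' padding or invalid character */
}

/* Decodes len characters from src into dst; returns the number of bytes
   written, or (size_t)-1 on malformed input. */
size_t b64_decode(char const src[], size_t len, unsigned char dst[]) {
    size_t out = 0;
    unsigned acc = 0;
    int bits = 0;
    for (size_t i = 0; i < len; ++i) {
        if (src[i] == '=') break;             /* padding terminates the data */
        int v = b64_value(src[i]);
        if (v < 0) return (size_t)-1;
        acc = (acc << 6) | (unsigned)v;       /* accumulate 6 bits per character */
        bits += 6;
        if (bits >= 8) {                      /* emit a byte once 8 bits are available */
            bits -= 8;
            dst[out++] = (unsigned char)(acc >> bits);
            acc &= (1u << bits) - 1;          /* keep only the remaining bits */
        }
    }
    return out;
}

Applied to the characters of B"AQIDLS4+U0A=", this yields exactly the 8 byte values 0x01 0x02 0x03 0x2d 0x2e 0x3e 0x53 0x40 used in the #embed example of section 4.1.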

Both concepts (though without a dedicated prefix) are already widely used in practice. In particular, hex, octal or base64 encoded strings are used by some implementations as an intermediate source format for #embed. Such an intermediate format could be forced through an if_empty parameter, see below.

3.3 Reserve syntax and synchronize with C++

Usually, in C, identifiers are not directly followed by strings. But when U prefixed literals were introduced in C, there still were some rare clashes with existing code. This happened where a macro U that expanded to a string was used to add some sort of leading character sequence to a string. Previously, this usage was not sensitive to whether or not there was a space between the two. By introducing the prefix, the two usages (with and without space) became distinct, and code changed its meaning or became invalid. So for this situation space is in fact significant.

Generally, it is often assumed that in C spaces don’t contribute much to the interpretation of program text, but we think that as of C23 this is a simplification that does not really reflect the current situation. Additionally, there is the problem of interfacing with C++, where some of the rules are different.

syntax            meaning, C                                  different meaning, C++
# define X(A)     function like macro, empty
# define X (A)    object macro, expands to (A)
0x4'7'a           hex number with digit separators
0x4 '7'a          number, character literal, and identifier   number, character literal with suffix
0x4 '7' a         number, character literal, and identifier
"%" PRIx64        valid format string for printf
"%"PRIx64         valid format string for printf              string literal with suffix
R "(hör)"         identifier followed by multi-byte string
R"(hör)"          identifier followed by multi-byte string    raw multi-byte string, contains just hör
R "hör"           identifier followed by multi-byte string
R"hör"            identifier followed by multi-byte string    invalid raw string
U "hör"           identifier, followed by multi-byte string
U"hör"            UTF-32 string

We think that it would be in order to coordinate here between C and C++ and, in general, to discourage any use of identifiers that are adjacent to character and string literals. If we want this to be diagnosed, it should happen before phase 4, in particular before macro expansion; best would be if it were diagnosed in phase 3, lexing. We propose:

Change the definitions of character and string literals to include leading and trailing identifiers, and then add constraints for the accepted prefixes (and for C++ suffixes) to phase 5, decoding.

Implementations could start to diagnose such possible collision immediately.

4 Extensions to existing directives

4.1 Adjust #embed resource representation

For #embed we went with a compromise, namely that the output of that directive is as if a comma-separated list of integer values representing the byte values were inserted into the program text. This is not very suitable for implementations that have the option of keeping preprocessed program text for intermediate stages of compilation. Such an intermediate file, with all bytes spelled out as integer literals, loses all the advantages of #embed.

Thus, even today, such implementations already use intermediate formats such as string literals with base64 encoding and wrap them inside some magic builtin. It would be good to generalize that idea, such that programmers have the possibility to specify which intermediate representation to expect. This could be achieved quite simply:

If the if_empty embed parameter specifies a narrow string literal, the encoded resource shall be represented as-if by a string literal of the same kind.

Example:

unsigned char tiger[] = {
#embed "tiger.dat" if_empty(B"")
};

is as if given as

unsigned char tiger[] = {
B"AQIDLS4+U0A="
};

where AQIDLS4+U0A= is the base64 encoding of the contents of the resource, here the 8 byte values \001\002\003\055\056\076\123\100 or 0x01 0x02 0x03 0x2d 0x2e 0x3e 0x53 0x40. So without the proposed convention the equivalent code as of C23 would be as if given as

unsigned char tiger[] = {
0x01, 0x02, 0x03, 0x2d, 0x2e, 0x3e, 0x53, 0x40
};

which uses about 5 encoding characters per encoded byte, about 3.8 times as much as with a B encoding.

4.2 add an offset parameter to #embed

This is more or less obvious to do; the parameter should account for the position in bytes from the start of the resource.
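
A sketch of how this could be combined with the existing limit parameter; offset is the parameter proposed here, and the byte counts are illustrative:

unsigned char sample[] = {
#embed "tiger.dat" offset(128) limit(16)
};

This would embed 16 bytes of the resource, starting 128 bytes after its beginning.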

4.3 add parameters to #include

The same form of parameters as for #embed could be added here, only the semantics should be adapted to this case. Namely, an #include resource should be accounted for in lines instead of bytes; that is, an offset or limit would skip or bound the number of lines to be included.

The prefix and suffix parameters would always add directives before and after the file contents, and these directives are executed in the context of the included file.

  #include "my-main-xcode.c"                       \
     __prefix__(expand bind TOTO WHATDOWEHAVE(35)) \
     __suffix__(include "my-secondary-xcode.c")

Similar, but without #bind:

  #include "my-main-xcode.c"                         \
     __prefix__(expand define TOTO WHATDOWEHAVE(35)) \
     __suffix__(include "my-secondary-xcode.c")      \
     __suffix__(undef TOTO)

4.4 add a directory to the include and embed places

A slash at the end of the input file name adds that directory to the corresponding list of places instead of including a file. Example:

#include </usr/local/include/>

This allows distinguishing additions to all four lists that an implementation has to maintain, namely #include with "/pa/th/" and </pa/th/>, and #embed with "/pa/th/" and </pa/th/>.

This feature is perhaps not the most needed by normal code, but it eases the tuning of the search paths for system headers a lot.

5 new directives

5.1 bind a macro for a specific scope, #bind

Semantically this is really nothing else than an improved version of

#define TOTO bla
...
#undef TOTO

or

#define HOPLA(X) blub for X
...
#undef HOPLA

only that the #undef part is inserted automatically at the end of the scope.

This is simple to implement, because it only uses recursive preprocessor program structure that is already there. It really helps in programming, because it avoids polluting the macro name space with local macros that everybody forgets to #undef, in particular when programming with xcode inclusion. It is particularly useful when combined with the prefix parameter extension for #include and #include_source.
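
A sketch of the intended use, reusing the macro names from above; the exact delimitation of the scope (for example the end of the including or included file) is not fixed in this sketch:

#bind TOTO bla
#bind HOPLA(X) blub for X
/* ... code that uses TOTO and HOPLA ... */
/* at the end of the scope, the effect is as if #undef TOTO and #undef HOPLA appeared here */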

5.2 non-expanding variants, #include_source, #embed_resource and #linenumber directives

I found the sometimes-expand-and-sometimes-not definitions of #include, #embed and #line, in combination with the weird filename strings à la <stdsomething.h>, quite annoying to implement. It adds a lot of complexity for a feature that not many people use (macro expansion on #include lines).

So these three directives do not expand their line and have to receive proper file names, line numbers or limit parameters directly.
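
A sketch of what such directive lines could look like, reusing the file names from the examples in this paper; none of the arguments undergo macro expansion:

#include_source "my-main-xcode.c"
#embed_resource "tiger.dat" limit(8)
#linenumber 42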

5.3 controlled expansion of directives, the #expand prefix

The idea of this prefix is that it allows user-controlled expansion of the line, instead of, as #include currently does, making expansion (or not) depend on some weird syntactic property of the rest of the line. I find

#expand include_source </usr/lib/gcc/x86_64-linux-gnu/__GNUC__/include/>

much clearer than hiding the evaluation and concatenation inside a macro as you would do currently, something like

#include MY_INCLUDE_DIRECTORY(__GNUC__)

This also allows expansion for directives that currently don’t have it:

#expand warning the counter toto TOTO is too large

Derived from that prefix are #xdefine, #xbind, …, which are just shortcuts for adding the #expand prefix.

5.4 Directives for iteration, #do and #foreach

Some of the macro preprocessing libraries allow looping over argument lists or lists of tokens. This can be interesting when defining features for a list of types or when building interfaces that work with enumeration types.
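
For comparison, a sketch of how such iteration over a list of types is commonly emulated today with the X-macro idiom in standard C; the names TYPE_LIST, DECLARE_MAX and maximum_* are illustrative:

#define TYPE_LIST(X) X(flt, float) X(dbl, double) X(ldbl, long double)

/* declare one function per listed type */
#define DECLARE_MAX(tag, T) T maximum_##tag(int n, T const A[]);
TYPE_LIST(DECLARE_MAX)
#undef DECLARE_MAX

A #foreach directive could replace the helper macro and the explicit list plumbing.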

As mentioned above, a combination of the #expand prefix and #include allows defining macros that emulate finite recursion / iteration.

6 Changes to macro expansion

6.1 Recursion

Macro recursion is a dangerous feature, because it easily leads to unbounded depth and introduces the halting problem. Additionally, it has the direct problem that it is not backwards compatible: C23 specifies that a macro that is invoked recursively does not expand. Much C code out there relies on this, for example a macro with the same name as a function: once the macro level is expanded, the second level remains as a plain identifier, which is then carried on into later compilation phases.
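
A minimal sketch of that pattern, with illustrative names: a function-like macro wraps a function of the same name, and the inner occurrence is not expanded again, so it refers to the real function.

#include <stdio.h>

void report(char const *msg);                 /* the real function */
#define report(msg) report("prefix: " msg)    /* inner call is not re-expanded */

void (report)(char const *msg) {              /* parentheses suppress the macro */
  fputs(msg, stderr);
  fputc('\n', stderr);
}

int main(void) {
  report("hello");    /* expands once, to report("prefix: " "hello") */
}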

So recursion has two imperatives:

  1. Recursion depth has to be bounded
  2. Macros must be marked explicitly to have that property.

There are several possible designs for this, some of which have already been discussed in the context of LLVM. In particular:

As mentioned above, a combination of the #expand prefix and #include allows defining macros that emulate finite recursion / iteration.