Extensions to the preprocessor for C2Y

Jens Gustedt

2023-12-13

document history

document number   date     comment
n3190             202312   this paper, original proposal

license

CC BY, see https://creativecommons.org/licenses/by/4.0

1 Introduction and overview

The C and C++ preprocessors have recently attracted some attention because they provide a means of textual replacement whose expressivity is balanced: it allows relatively sophisticated compile-time features to be expressed while guaranteeing termination within quite reasonable time frames. The preprocessor is not Turing complete, which has the advantage that processing in general stays bounded. Compiling with relatively complex macro packages (such as boost or P99) is in general several orders of magnitude faster than compiling equivalent code written with templates or constexpr, and has the advantage of producing equivalent intermediate source code that can be inspected.

Several projects are currently under way to extend the preprocessing phases to gain in expressivity. Our goal is to collect the different proposed features here. Most of them are relatively simple to implement, whereas the gain for everyday programming in C may be significant. It will be important to watch that all of this does not incur too much slowdown in compilation times.

There are several angles to preprocessing. In fact, several phases of the C and C++ translation model are commonly subsumed under that name, in particular lexing, evaluation of directives and macro replacement. The extensions we will discuss range from simple (but useful) predefined macros such as __COUNTER__, over new forms of string and character literals such as R"(bäh)", to extensions of existing directives (such as a prefix parameter for #include), new directives such as #bind, and new rules for macro replacement (bounded recursion).

2 predefined macros

A lot of predefined macros have appeared as compiler or library specific extensions that would be better promoted to the C standard. This concerns both object-like and function-like macros.
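
As one example among these, a short sketch of the widely implemented __COUNTER__ extension (gcc, clang, MSVC): each expansion yields the next integer of a sequence starting at 0, which allows forming unique identifiers. The helper macros CAT and UNIQUE are illustrative only, not part of any proposal.

#define CAT_(a, b) a##b
#define CAT(a, b)  CAT_(a, b)          /* expand the arguments before pasting */
#define UNIQUE(name) CAT(name, __COUNTER__)

static int UNIQUE(guard);              /* e.g. expands to: static int guard0; */
static int UNIQUE(guard);              /* e.g. expands to: static int guard1; */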

2.1 predefined object-like macros

2.2 predefined function-like macros

In combination with the #expand prefix and #include the latter two allow to have

3 New literals

C and C++ have the prefixes u8, L, u and U for string and character literals to select a specific execution encoding, namely UTF-8, wchar_t, char16_t and char32_t characters, respectively; literals without a prefix use the multi-byte execution encoding.

Note that these prefixes only concern the execution encoding. The source encoding is unchanged: characters in the input are interpreted as anywhere else, including the interpretation of escape sequences. The prefix only changes the interpretation/realization as an array. So, for example, a given source character such as ö would lead to several array elements in multi-byte encodings (such as UTF-8), to one or two elements in UTF-16, and to one element in UTF-32. The character literals 'x' and u8'x' refer to the same concept (the character x in the source encoding) but result in different semantics: one is a char with the value of the character x in the execution environment, the other is an unsigned char with a portable value, namely 120.
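
A minimal sketch of these differences, using the C23 types from <uchar.h>; the element counts in the comments assume the usual encodings:

#include <stddef.h>   /* wchar_t */
#include <uchar.h>    /* char8_t, char16_t, char32_t (C23) */

char      mb[]  = "ö";     /* multi-byte execution encoding, e.g. 2 elements plus 0 if that encoding is UTF-8 */
wchar_t   ws[]  = L"ö";    /* wchar_t encoding, typically 1 element plus 0 */
char8_t   s8[]  = u8"ö";   /* UTF-8: 2 elements plus 0; char8_t is unsigned char */
char16_t  s16[] = u"ö";    /* UTF-16: 1 element plus 0 */
char32_t  s32[] = U"ö";    /* UTF-32: 1 element plus 0 */

char          c = 'x';     /* value of x in the execution character set */
unsigned char u = u8'x';   /* portable value 120; in C23 the type of u8'x' is char8_t */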

3.1 C++ so-called raw string literals

In contrast to that, C++ adds a specific syntax for a modification of the source encoding of a string. Namely, it adds an R at the end of one of the prefixes (including the empty one) to indicate that the source encoding is “raw”, without interpretation of escape sequences. Examples:

R"(hör)";
uR"(hör)";
R"limit(here"and"there)limit";
R"(\not a \newline)"

3.2 String literals with specific encoding rules

The introduction of UTF-8 character literals has the strange effect of introducing literals with base type unsigned char while imposing a specific interpretation. It was already unfortunate that, for historic reasons, characters and bare bytes have the same type; we are now repeating the same mistake: the type that should be reserved for bytes also represents a particular textual concept, namely UTF-8 characters and strings. While the decision to do so is consistent within C’s restricted framework, it has the inherent danger that in particular UTF-8 strings will be used to encode arbitrary literals of type unsigned char by sprinkling \x escape sequences all over the place. We think that UTF-8 characters and strings should be reserved for properly encoded text and that there should be other features for encoding arbitrary binary data.

We propose the following prefixes as new extensions for C2y.

Form                   type              encoding
x"cont\x00E4\0nts"     unsigned char[]   restricted UTF-8, with escape sequences
x'\xFFFF'              unsigned char     restricted UTF-8, with escape sequences
B"gAf4yu=="            unsigned char[]   base64

Here the x prefix (mnemonic “hex encoded string”) is meant for arrays of base type unsigned char that hold arbitrary data in the range of 0 to UCHAR_MAX, inclusive, and which is encoded in the usual way by means of escape sequences. For portability, the source encoding of these strings should be fixed to one-byte sequences of characters that are representable with UTF-8 in the range between codes 32 and 126, inclusive, that is, ASCII. Byte values outside that range should be specified with octal or hexadecimal escape sequences or with \n or similar. With these specifications, the array then holds portable binary data, as long as the byte values that are presented fit into 8 bits.
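
A sketch of how the proposed x prefix might be used, here with the 8 bytes of the PNG file signature as example data:

unsigned char signature[] = x"\x89PNG\r\n\x1a\n";   /* bytes 0x89 'P' 'N' 'G' 0x0d 0x0a 0x1a 0x0a */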

The B prefix (mnemonic “Base64 Binary” encoding) represents binary data that is packed consecutively into an array of base type unsigned char and where the encoding is base64. It uses the 62 alphanumeric characters of the source character set plus the characters +, / and =, which are present in all currently used encodings, to encode packs of 24 bits of data with 4 characters. Thereby it is relatively efficient, because it only uses about 1.33 times as much space as the raw data, and it is still portable to all modern architectures. This encoding is widely used in the transfer of binary data and it is almost trivial to program.
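
To substantiate that claim, a minimal sketch of a base64 decoder in standard C; the function names b64_value and b64_decode are illustrative, and error handling is reduced to rejecting characters outside the alphabet:

#include <stddef.h>

static int b64_value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;                                /* '=' padding or invalid character */
}

/* Decodes len characters from src into dst; returns the number of bytes
   written, or (size_t)-1 on malformed input. */
size_t b64_decode(char const src[], size_t len, unsigned char dst[]) {
    size_t out = 0;
    unsigned acc = 0;
    int bits = 0;
    for (size_t i = 0; i < len; ++i) {
        if (src[i] == '=') break;             /* padding terminates the data */
        int v = b64_value(src[i]);
        if (v < 0) return (size_t)-1;
        acc = (acc << 6) | (unsigned)v;       /* accumulate 6 bits per character */
        bits += 6;
        if (bits >= 8) {                      /* emit a byte once 8 bits are available */
            bits -= 8;
            dst[out++] = (unsigned char)(acc >> bits);
            acc &= (1u << bits) - 1;          /* keep only the remaining bits */
        }
    }
    return out;
}

Applied to the characters of B"AQIDLS4+U0A=", this yields exactly the 8 byte values 0x01 0x02 0x03 0x2d 0x2e 0x3e 0x53 0x40 used in the #embed example of section 4.1.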

Both concepts (though without a dedicated prefix) are already widely used in practice. In particular, hex, octal or base64 encoded strings are used by some implementations as an intermediate source format for #embed. Such an intermediate format could be forced through an if_empty parameter, see below.

3.3 Reserve syntax and synchronize with C++

Usually, in C, identifiers are not directly followed by strings. But when U prefixed literals were introduced in C, there still were some rare clashes with existing code. This happened where a macro U that expanded to a string was used to add some sort of leading character sequence to a string. Previously, this usage was not sensitive to whether or not there was a space between the two. By introducing the prefix, the two usages (with and without space) became distinct, and code changed its meaning or became invalid. So for this situation space is in fact significant.

Generally, it is often assumed that in C spaces don’t contribute much to the interpretation of program text, but we think that as of C23 this is a simplification that does not really reflect the current situation. Additionally, there is the problem of interfacing with C++, where some of the rules are different.

syntax            meaning, C                                  different meaning, C++
# define X(A)     function like macro, empty
# define X (A)    object macro, expands to (A)
0x4'7'a           hex number with digit separators
0x4 '7'a          number, character literal, and identifier   number, character literal with suffix
0x4 '7' a         number, character literal, and identifier
"%" PRIx64        valid format string for printf
"%"PRIx64         valid format string for printf              string literal with suffix
R "(hör)"         identifier followed by multi-byte string
R"(hör)"          identifier followed by multi-byte string    raw multi-byte string, contains just hör
R "hör"           identifier followed by multi-byte string
R"hör"            identifier followed by multi-byte string    invalid raw string
U "hör"           identifier, followed by multi-byte string
U"hör"            UTF-32 string

We think that it would be in order to coordinate here between C and C++ and, in general, to discourage any use of identifiers that are adjacent to character and string literals. If we want this to be diagnosed, it should happen before phase 4, in particular before macro expansion; best would be if it were diagnosed in phase 3, lexing. We propose:

Change the definitions of character and string literals to include leading and trailing identifiers, and then add constraints for the accepted prefixes (and for C++ suffixes) to phase 5, decoding.

Implementations could start to diagnose such possible collision immediately.

4 Extensions to existing directives

4.1 Adjust #embed resource representation

For #embed we went with a compromise, namely that the output of that directive is as if a comma-separated list of integer values representing the byte values were inserted into the program text. This is not very suitable for implementations that have the option of keeping preprocessed program text for intermediate stages of compilation. Such an intermediate file, with all bytes spelled out as integer literals, loses all the advantages of #embed.

Thus, even today, such implementations already use intermediate formats such as string literals with base64 encoding and wrap them inside some magic builtin. It would be good to generalize that idea, such that programmers have the possibility to specify which intermediate representation to expect. This could be achieved quite simply:

If the if_empty embed parameter specifies a narrow string literal, the encoded resource shall be represented as-if by a string literal of the same kind.

Example:

unsigned char tiger[] = {
#embed "tiger.dat" if_empty(B"")
};

is as if given as

unsigned char tiger[] = {
B"AQIDLS4+U0A="
};

where AQIDLS4+U0A= is the base64 encoding of the contents of the resource, here the 8 byte values \001\002\003\055\056\076\123\100 or 0x01 0x02 0x03 0x2d 0x2e 0x3e 0x53 0x40. So without the proposed convention the equivalent code as of C23 would be as if given as

unsigned char tiger[] = {
0x01, 0x02, 0x03, 0x2d, 0x2e, 0x3e, 0x53, 0x40
};

which uses about 5 encoding characters per encoded byte, about 3.8 times as much as with a B encoding.

4.2 add an offset parameter to #embed

This is more or less obvious to do; the parameter should account for the position in bytes from the start of the resource.
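
A sketch of how this could be combined with the existing limit parameter; offset is the parameter proposed here, and the byte counts are illustrative:

unsigned char sample[] = {
#embed "tiger.dat" offset(128) limit(16)
};

This would embed 16 bytes of the resource, starting 128 bytes after its beginning.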

4.3 add parameters to #include

The same form of parameters as for #embed could be added here, only the semantics should be adapted to this case. Namely, an #include resource should be accounted for in lines instead of bytes; that is, an offset or limit would skip or bound the number of lines to be included.

The prefix and suffix parameters would always add directives before and after the file contents, and these directives are executed in the context of the included file.

  #include "my-main-xcode.c"                       \
     __prefix__(expand bind TOTO WHATDOWEHAVE(35)) \
     __suffix__(include "my-secondary-xcode.c")

Similar, but without #bind:

  #include "my-main-xcode.c"                         \
     __prefix__(expand define TOTO WHATDOWEHAVE(35)) \
     __suffix__(include "my-secondary-xcode.c")      \
     __suffix__(undef TOTO)

4.4 add a directory to the include and embed places

A slash at the end of the input file name adds that directory to the corresponding list of places instead of including a file. Example:

#include </usr/local/include/>

This allows distinguishing additions to all four lists that an implementation has to maintain, namely #include with "/pa/th/" and </pa/th/>, and #embed with "/pa/th/" and </pa/th/>.

This feature is perhaps not the most needed by normal code, but it eases the tuning of the search paths for system headers a lot.

5 new directives

5.1 bind a macro for a specific scope, #bind

Semantically this is really nothing else than an improved version of

#define TOTO bla
...
#undef TOTO

or

#define HOPLA(X) blub for X
...
#undef HOPLA

only that the #undef part is inserted automatically at the end of the scope.

This is simple to implement, because it only uses recursive preprocessor program structure that is already there. It really helps in programming, because it avoids polluting the macro name space with local macros that everybody forgets to #undef, in particular when programming with xcode inclusion. It is particularly useful when combined with the prefix parameter extension for #include and #include_source.
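
A sketch of the intended use, reusing the macro names from above; the exact delimitation of the scope (for example the end of the including or included file) is not fixed in this sketch:

#bind TOTO bla
#bind HOPLA(X) blub for X
/* ... code that uses TOTO and HOPLA ... */
/* at the end of the scope, the effect is as if #undef TOTO and #undef HOPLA appeared here */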

5.2 non-expanding variants, #include_source, #embed_resource and #linenumber directives

I found the sometimes-expand-and-sometimes-not definitions of #include, #embed and #line, in combination with the weird filename strings à la <stdsomething.h>, quite annoying to implement. It adds a lot of complexity for a feature that not many people use (macro expansion on #include lines).

So these three directives do not expand their line and have to receive proper file names, line numbers or limit parameters directly.
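
A sketch of what such directive lines could look like, reusing the file names from the examples in this paper; none of the arguments undergo macro expansion:

#include_source "my-main-xcode.c"
#embed_resource "tiger.dat" limit(8)
#linenumber 42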

5.3 controlled expansion of directives, the #expand prefix

The idea of this prefix is that it allows user-controlled expansion of the line, instead of, as #include currently does, making expansion (or not) depend on some weird syntactic property of the rest of the line. I find

#expand include_source </usr/lib/gcc/x86_64-linux-gnu/__GNUC__/include/>

much clearer than hiding the evaluation and concatenation inside a macro as you would do currently, something like

#include MY_INCLUDE_DIRECTORY(__GNUC__)

This also allows expansion for directives that currently don’t have it:

#expand warning the counter toto TOTO is too large

Derived from that prefix are #xdefine, #xbind, …, which are just shortcuts for adding the #expand prefix.

5.4 Directives for iteration, #do and #foreach

Some of the macro preprocessing libraries allow looping over argument lists or lists of tokens. This can be interesting when defining features for a list of types or when building interfaces that work with enumeration types.
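
For comparison, a sketch of how such iteration over a list of types is commonly emulated today with the X-macro idiom in standard C; the names TYPE_LIST, DECLARE_MAX and maximum_* are illustrative:

#define TYPE_LIST(X) X(flt, float) X(dbl, double) X(ldbl, long double)

/* declare one function per listed type */
#define DECLARE_MAX(tag, T) T maximum_##tag(int n, T const A[]);
TYPE_LIST(DECLARE_MAX)
#undef DECLARE_MAX

A #foreach directive could replace the helper macro and the explicit list plumbing.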

As mentioned above, a combination of the #expand prefix and #include allows defining macros that emulate finite recursion / iteration.

6 Changes to macro expansion

6.1 Recursion

Macro recursion is a dangerous feature, because it easily leads to unbounded depth and introduces the halting problem. Additionally, it has the direct problem that it is not backwards compatible: C23 specifies that a macro that is invoked recursively does not expand. Much C code out there relies on this, for example a macro with the same name as a function: once the macro level is expanded, the second level remains as a plain identifier, which is then carried on into later compilation phases.
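
A minimal sketch of that pattern, with illustrative names: a function-like macro wraps a function of the same name, and the inner occurrence is not expanded again, so it refers to the real function.

#include <stdio.h>

void report(char const *msg);                 /* the real function */
#define report(msg) report("prefix: " msg)    /* inner call is not re-expanded */

void (report)(char const *msg) {              /* parentheses suppress the macro */
  fputs(msg, stderr);
  fputc('\n', stderr);
}

int main(void) {
  report("hello");    /* expands once, to report("prefix: " "hello") */
}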

So recursion has two imperatives:

  1. Recursion depth has to be bounded
  2. Macros must be marked explicitly to have that property.

There are several possible designs for this, some of which have already been discussed in the context of LLVM. In particular:

As mentioned above, a combination of the #expand prefix and #include allows defining macros that emulate finite recursion / iteration.