N3710 - How to Slay Earthly Demons and UB#90: Invalid #include directives

by David Svoboda

The purpose of this document is twofold: first, to address UB#90 (from Annex J.2), and second, to summarize our various techniques for resolving Undefined Behaviors (UBs).

Introduction

First, let’s examine some terminology, from N3220 (the latest draft for C23):

behavior
external appearance or action

This has some overlap with “observable behavior”, defined in s5.1.2.4p6:

The least requirements on a conforming implementation are:
— Volatile accesses to objects are evaluated strictly according to the rules of the abstract machine.
— At program termination, all data written into files shall be identical to the result that execution
of the program according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take place as specified in 7.23.3.
The intent of these requirements is that unbuffered or line-buffered output appear as soon as
possible, to ensure that prompting messages appear prior to a program waiting for input.
This is the observable behavior of the program.

Still, this implies that there is a category of ‘un-observable behavior’. The committee informally regards “behavior” as including what the platform actually does that cannot (usually) be directly observed, such as non-volatile memory writes and arithmetic calculations. In fact, many platforms can observe such behavior, debuggers being the most common means. So perhaps the term ‘external’ should be removed from the definition of ‘behavior’, since it does not reflect how we use “behavior”, but that is not the purpose of this paper.

Moving on, here is our archetypal earthly demon:

undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which
this document imposes no requirements
Note 1 to entry: Possible undefined behavior ranges from ignoring the situation completely with unpredictable
results, to behaving during translation or program execution in a documented manner characteristic of the
environment (with or without the issuance of a diagnostic message), to terminating a translation or execution
(with the issuance of a diagnostic message).
Note 2 to entry: J.2 gives an overview over properties of C programs that lead to undefined behavior.
Note 3 to entry: Any other behavior during execution of a program is only affected as a direct consequence of
the concrete behavior that occurs when encountering the erroneous or non-portable program construct or data.
In particular, all observable behavior (5.1.2.4) appears as specified in this document when it happens before an
operation with undefined behavior in the execution of the program.
EXAMPLE An example of undefined behavior is the behavior on dereferencing a null pointer.

The standard designates behavior as UB either by saying “the behavior is undefined” under certain conditions, or by providing a “shall” in a Semantics section. The latter carries the implication that if a shall-clause outside a Constraints section is violated, the behavior is undefined.

UBs seem to serve two purposes: errors and extensions. The first (errors) is to indicate an erroneous condition which is a bug in the program, and the second (extensions) is to document behavior lacking a definition in the hope that a definition may be provided in the future, either by the standard or by individual platforms. However, these purposes overlap considerably, and so no one has successfully distinguished between the two types of behavior.

We currently employ three techniques for eliminating UB: elimination, constraint violations, and implementation-defined behavior.

Elimination

This technique simply means that we remove the UB from the standard and argue that it cannot currently occur. Typically this involves arguing that the UB should not have been in the published C23 standard, and perhaps should never have been added in the first place.

Constraint Violation

So what is a constraint violation? C23 doesn’t define this term, but it does define “constraint”:

constraint
restriction, either syntactic or semantic, by which the exposition of language elements is interpreted

We traditionally say that a constraint violation must be diagnosed by a translator (i.e., compiler). Constraints are typically indicated by a requirement (usually phrased with “shall”) in a Constraints section. Eliminating a UB in this manner may be as simple as moving a sentence with a shall-clause from a Semantics section to a nearby Constraints section.

This is an appropriate strategy to employ if the UB can be reliably diagnosed at compile time (or link time).

Implementation-Defined Behavior

Resolving UB may be as simple as replacing “the behavior is undefined” with “the behavior is implementation-defined”.

To understand this and other types of behavior, we first need to understand values:

More Terminology

Values

value
precise meaning of the contents of an object when interpreted as having a specific type

That is, a value is not necessarily a pattern of bits in memory, but it includes the meaning we ascribe to it. A bit-pattern might be 0b01000001, but given the appropriate context, it might have the value 0x41, ‘A’, or 65.
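The distinction can be sketched in C. The following is only an illustration, not text from the standard: the function names `describe_byte` and `bits_as_float` are hypothetical, the character reading assumes an ASCII-compatible execution character set, and the floating-point reading assumes a 32-bit IEEE 754 `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats one bit pattern as a hexadecimal number, a decimal number, and a
   character. Assumption: ASCII-compatible execution character set. */
int describe_byte(unsigned char bits, char *out, size_t out_size)
{
    return snprintf(out, out_size, "0x%X %u %c",
                    (unsigned)bits, (unsigned)bits, bits);
}

/* Reads the same 32 bits as a float instead of as an integer.
   Assumption: float is 32 bits wide (checked) and uses IEEE 754 binary32. */
float bits_as_float(uint32_t bits)
{
    float f;
    static_assert(sizeof(float) == sizeof(uint32_t),
                  "sketch assumes a 32-bit float");
    memcpy(&f, &bits, sizeof f);   /* copy the representation, not the value */
    return f;
}
```

Under those assumptions, `describe_byte(0x41, ...)` yields three readings of one bit pattern, and `bits_as_float(0x3F800000u)` reads as 1.0 a pattern that, taken as a `uint32_t`, is the integer 1065353216.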

non-value representation
an object representation that does not represent a value of the object type

A type is (among other things) a specification for mapping bit-patterns into values. For any type, the bit-patterns with the same size as the type are partitioned into “valid values” and “invalid values”. A “trap representation” is an invalid value that happens to cause a trap. The unsigned char type is unique in that every appropriately-sized bit pattern is a valid value of unsigned char, which can be represented as an integer between 0 and UCHAR_MAX. It has no invalid values. Every other type in C is permitted to have invalid values, even if a particular platform has no invalid values for that type. For example, the signed int type may have trap representations on some platforms, but has none on x86-64 platforms.
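Because unsigned char has no invalid values, any object's representation can be inspected through it. Here is a minimal sketch of that idea; the function name `dump_representation` is hypothetical, and the two-hex-digits-per-byte formatting assumes `CHAR_BIT` is 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes the bytes of an object into out as hexadecimal text, lowest address
   first, and returns the number of bytes read. Reading through unsigned char
   is valid for any object, since every bit pattern is an unsigned char value.
   Assumption: CHAR_BIT == 8, so each byte fits in two hex digits. */
size_t dump_representation(const void *obj, size_t size,
                           char *out, size_t out_size)
{
    const unsigned char *bytes = obj;
    assert(out_size >= 2 * size + 1);   /* two digits per byte plus the NUL */
    for (size_t i = 0; i < size; i++) {
        snprintf(out + 2 * i, 3, "%02X", (unsigned)bytes[i]);
    }
    out[2 * size] = '\0';
    return size;
}
```

Applied to an object of another type, such as an `int` or a `float`, the text produced depends on the platform's byte order and padding, which is exactly the part that the type's mapping from bit-patterns to values hides.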

indeterminate representation
object representation that either represents an unspecified value or is a non-value representation

unspecified value
valid value of the relevant type where this document imposes no requirements on which value is
chosen in any instance

implementation-defined value
unspecified value where each implementation documents how the choice is made 

Behaviors

unspecified behavior
behavior, that results from the use of an unspecified value, or other behavior upon which this
document provides two or more possibilities and imposes no further requirements on which is
chosen in any instance
Note 1 to entry: J.1 gives an overview over properties of C programs that lead to unspecified behavior.
EXAMPLE An example of unspecified behavior is the order in which the arguments to a function are evaluated.

Apparently, a platform is not required to document unspecified behavior, but is required to constrain it to be one of “two or more possibilities”. So we cannot use “unspecified behavior” if we want future platforms to define new behavior.
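The standard's own example, the order of evaluation of function arguments, can be observed with a small sketch. The helper names (`note`, `add`, `argument_order`) are hypothetical. The two calls to `note` are indeterminately sequenced rather than unsequenced, so the program has no UB; it merely has two permitted outcomes.

```c
#include <assert.h>
#include <stddef.h>

static char order_log[3];   /* room for two tags plus the terminating NUL */
static size_t order_len;

/* Records that the argument labelled tag has been evaluated. */
static int note(char tag)
{
    assert(order_len < sizeof order_log - 1);
    order_log[order_len++] = tag;
    return 0;
}

static int add(int a, int b) { return a + b; }

/* Returns "ab" or "ba", depending on which argument of add() this
   implementation chose to evaluate first. Both results are conforming. */
const char *argument_order(void)
{
    order_len = 0;
    order_log[0] = order_log[1] = order_log[2] = '\0';
    (void)add(note('a'), note('b'));
    return order_log;
}
```

An implementation need not document which of the two strings it produces, and it may even choose differently from one call site to the next.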

There is an ambiguity over whether this document must provide possibilities for behavior that results from an unspecified value. But that doesn’t help us in managing UB.

implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
Note 1 to entry: J.3 gives an overview over properties of C programs that lead to implementation-defined behavior.
EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit when a
signed integer is shifted right.

So implementation-defined behavior = unspecified behavior + documentation requirement.
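The right-shift example from the definition can be probed as follows. This is a sketch with hypothetical function names (`shift_right`, `describe_right_shift`); the "logical" branch assumes that int has no padding bits.

```c
#include <assert.h>
#include <limits.h>

/* Shifts value right by count bits. For negative values the result is
   implementation-defined. */
int shift_right(int value, int count)
{
    /* A negative count, or one that reaches the width of int, would be
       undefined rather than implementation-defined, so exclude it.
       Assumption: int has no padding bits. */
    assert(count >= 0 && count < (int)(sizeof(int) * CHAR_BIT));
    return value >> count;
}

/* Names the choice this implementation made for -1 >> 1. */
const char *describe_right_shift(void)
{
    int r = shift_right(-1, 1);
    if (r == -1) return "arithmetic (sign bit propagated)";
    if (r == INT_MAX) return "logical (zero filled)";
    return "other";
}
```

Whichever answer comes back, the implementation's documentation is required to state it; that documentation requirement is the only thing separating this case from unspecified behavior.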

There is an open question of what constitutes the documentation requirement. Many OSS tools say “The source code is the documentation”. Personally, I disagree. I think the documentation should be something distinct from the source code, that acts as an authority. If the code does not behave as the documentation specifies, that is a bug. But this question also does not help us in managing UB.

There is also locale-specific behavior, but this also does not help us in managing UB. Those are all the types of behavior defined in C23.

UB Resolution Conclusion

We can conclude by saying that unspecified behavior may at times be a better remedy for UB than implementation-defined behavior, the main difference between the two being the documentation requirement.

Replacing UB with unspecified behavior or implementation-defined behavior requires that the Standard dictate all the possible ways the platform may behave. More precisely, the platform must behave in one of two or more possible ways that are documented in the Standard. UBs that cannot be so constrained cannot be resolved in this manner.

A Lone Demon: UB#90: Invalid #include directives

Annex J.2 lists UB#90 as:

90. The #include preprocessing directive that results after expansion does not match one of the two header name forms (6.10.3).

N3677 provides the following as a hypothetical example of this UB:

Consider a platform that allows token concatenation in #include directives.

``` c
#define str(s) # s
#define xstr(s) str(s)
#define INCFILE(n) z ## n

#include xstr(INCFILE(3).h
// Undefined Behavior: #include "z3.h"
```

Currently, modern versions of GCC and Clang do not compile this code; they recognize the directive as a syntax error.
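For contrast, here is a sketch of macro-expanded #include directives whose expansions do match one of the two header name forms, and which are therefore well defined. The macro names `ANGLE_HEADER` and `QUOTED_NAME` and the function `header_name_length` are hypothetical, chosen only for this illustration; `str` and `xstr` are defined as in the example above.

```c
#define str(s) # s
#define xstr(s) str(s)

/* Hypothetical macro names for this sketch. */
#define ANGLE_HEADER <stdio.h>
#define QUOTED_NAME string.h

#include <assert.h>
#include ANGLE_HEADER        /* expands to the form  < h-char-sequence >  */
#include xstr(QUOTED_NAME)   /* expands to "string.h", the quoted form */

/* Uses both headers, to show that each directive above took effect. */
size_t header_name_length(void)
{
    char name[16];
    int written = snprintf(name, sizeof name, "z%d.h", 3);
    assert(written > 0 && (size_t)written < sizeof name);
    return strlen(name);
}
```

Here the quoted form falls back to the same search as `<string.h>` when no file named `string.h` is found by the quoted-form search. UB#90 covers only directives whose expansion matches neither form.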

We believe this UB was intended to be in the ‘extensions’ camp; it is designed to indicate a behavior that could be defined by platforms or by a future version of the Standard. It was not in the ‘errors’ camp, as programmers are unlikely to make this mistake. It is more likely that this UB is encountered when porting a program from a compiler that supports certain extensions to one that doesn’t. Such a problem is outside the scope of the Standard.

So how can we resolve this UB?

One could argue that the possibility of extension is overblown and could be eliminated. Certainly ISO C has not supported new include directives (or any new punctuation) since the introduction of trigraphs, which have subsequently been removed. I suspect such an argument would be embraced by a vocal minority of Committee members and rejected by another vocal minority, leading to lots of discussion with little resolution. This paper will not promote such an argument, and so we will assume that the possibility of extending C in this manner should be preserved. That rules out eliminating the UB altogether (option 1).

Option 2 (constraint violation) is tempting, as compilers and preprocessors can easily detect and diagnose such directives. However, making UB#90 into a constraint violation would also eliminate the extension possibility. So constraint violations are, alas, off the table.

Option 3 (implementation-defined behavior) also sounds tempting, as the validity of such programs would be platform-specific. Unfortunately, implementation-defined behavior and unspecified behavior both require the Standard to document all of the possible behaviors that a platform may exhibit. The example above is one such behavior, but the number of possible behaviors is unbounded, and documenting them is beyond our scope.

So our current weapons are ineffective against this earthly demon! Is all hope lost?

A New Weapon

I would argue that to resolve this UB, we need a new weapon; one that is currently not in our arsenal.

There have been some initial attempts to separate the two purposes of UB (errors vs extensions). There was a proposal P2795 (also P2793) to add “erroneous behavior” to the C++ standard. However, for this UB, perhaps we should try the other approach of singling out and constraining behavior that may be defined in the future. So I will propose this term:

extensible behavior
behavior that is intended to be defined by individual platforms or by a future version of the Standard.
A platform shall either provide a definition for this behavior or shall issue a fatal diagnostic upon encountering this behavior.

Thus, UB#90 can be slain by making it “extensible behavior”. To do this, change the second sentence of s6.10.3p4 from:

The directive resulting after all replacements shall match one of the two previous forms. 199

to

If a directive resulting after all replacements does not match one of the two previous forms, the behavior is extensible. 199