Defect With Wording Of restrict Specification

Document Number: N3025

Date: 2022-July-17

Authors: Hubert Tong, Martin Uecker, Bill Homer, Tom MacDonald (redactor)

Acknowledgement: Ralf Jung for raising this issue: https://www.ralfj.de/blog/2022/04/11/provenance-exposed.html

Introduction

An issue with the formal definition of restrict has been identified. Below is a discussion of the problem, followed by a proposed wording change to the restrict specification, and finally some examples that demonstrate why the changes address the problems.

Since C23 is expected to finish before this issue is resolved, this proposal is not intended to be considered for C23. This proposal involves a wording change to the existing formal definition of restrict.

There is another proposal that will be introduced in a separate WG14 document. That proposal introduces a provenance-style solution. Since that involves a complete rewrite of the formal definition of restrict, it is anticipated that it will be more complicated (but possibly a better fit) than this proposal. The recommendation is that WG14 consider the following options:

Defect Description

Some wording changes are needed for the formal definition of restrict to address an identified defect. The wording changes involve the based on concept defined in the current formal definition. The concept of directly based on is also introduced because based on now has a recursive definition, and it seems clearer if the start of the recursion is identified.

Here is the executive summary of the proposed changes:

This proposal attempts to clarify the issues, discuss what needs to change, and explain the new concepts that are intended to correct the issues. Finally, the intent is that these changes avoid creating new complexities for existing compilers. For consistency purposes, a goal of based on is that its definition should not affect compile time transformations like evaluation of conditions, forward substitution, introduction of intermediate variables, or other transformations a compiler may wish to make prior to doing alias analysis. It is still viewed as important that any aliasing information provided by the keyword restrict in a strictly conforming program can be ignored by a compiler, and the program is still strictly conforming.

The following explanations do not always use formal specification wording associated with restrict, but rather use a slightly less formal but (hopefully) a more understandable explanation. The exact wording changes to the current specification are included later on.


  // Example 1:

  int f(int * restrict p, int *q) {
    int *r = p;  // sequence point

    if (r == p) {
      r = q;
    }

    *r = 13; // here let E denote &*r == r
    *p = 42;

    return *r;
  }

On entry to the function, q is not based on p, and the intent is that after its assignment to r, r and therefore E are not based on p. But, if the value of p is modified to point to a 'copy' at the identified sequence point, then the value of E changes, because the assignment r = q would be skipped. Therefore, the unintended result of the current definition is that r is based on p. This means that the stores into *r and *p and the evaluation of *r in the return statement must be done in order, as they would if restrict was not present.

The problem is not that the usage of restrict in this example is ineffective, but that it conflicts with a goal that "based on" semantics should not be affected by compile time transformations compilers may wish to make prior to doing alias analysis. The specification for based on should not get in the compiler's way.

For Example 1, if the compiler evaluates r == p as true and converts the if statement to an unconditional assignment, then in the transformed program r is not based on p. The inconsistency with the original version is the problem.

One way to address this problem is to change the specification of based on to require that the expression must not only change, but change to point into the copy. In Example 1, the value of r would change, but would not refer to the copy, if the address of a copy were stored into p at the identified sequence point. So, with this new criterion, r is not based on p, and we now have the desired consistency for this example.

But the above change, by itself, is not yet enough to guarantee consistency in other cases, as the next example shows.


// Example 2:

int f(int * restrict p, int *q) {
  int *a = &p[p-q];  // If 'p' points to a copy here
                     // then the value of 'p-q' no
                     // longer has well defined behavior

  *a = 13;  // here let E denote &*a == a


  p[1] = 42;

  return *a;
}

extern int printf(const char *, ...);

int z[100];

int main(void) {
  int x = f(z+1, z);
  printf("x=%d\n", x);
}

The intent is that a is based on p because it is assigned a value that differs from p only by an offset within the object to which p refers. But because that offset happens to be expressed as a difference between p and q within the expression that is assigned to a, there is no sequence point at which storing into p the address of a copy of the object to which it refers would cause a to have a well-defined value that also points into that copy.

It follows that the behavior of the call to f() in main() has undefined behavior because *a and p[1] refer to the same object, but a is not based on p.

Now consider changing Example 2 by introducing an intermediate variable to hold the offset and an additional sequence point:


   int k = p-q; int *a = &p[k];

Now the new sequence point can be used to establish that a is based on p and so the call to f() does not have undefined behavior.

Thus, the consistency goal is not met.

The source of the problem is in modifying P to point to a copy by storing the address of the copy into P at a previous sequence point. The proposed change is to consider modifying the value of P, rather than the value in P, and specifically to consider the effect of replacing the values of some but not all references to P in the expression under consideration.

In Example 2, this allows the conclusion that a is directly based on p because replacing the value of only the first instance of p in the expression:


   &p[p-q]

with the address of a copy of the object to which p refers would cause the expression to be well-defined and to point into the copy. This would then be consistent with the version that uses the intermediate variable k.

The following wording changes to the formal definition of the restrict specification (below) are intended to address the based on issues informally described above. The changes are also intended to address the following variations of Example 2:


  a = &p[p-q]; => int * r = p; a = &r[r-q];
  // a is directly based on r and r is directly based on p, so a is based on p.

  a = &p[p-q];  =>  int * r = (X ? p : p+1); a = &r[r-q];
  // see that r is directly based on p requires replacing both instances of p

Modified Formal Definition of restrict

Let D be a declaration of an ordinary identifier that provides a means of designating an object P as a restrict-qualified pointer to type T.

If D appears inside a block and does not have storage class extern, let B denote the block. If D appears in the list of parameter declarations of a function definition, let B denote the associated block. Otherwise, let B denote the block of main (or the block of whatever function is called at program startup in a freestanding environment).

In what follows, a pointer expression E is said to be based on object P if (at some sequence point in the execution of B prior to the evaluation of E) modifying P to point to a copy of the array object into which it formerly pointed would change the value of E.146) Note that "based" is define only for expressions with pointer types. "directly based on" object P if there are any accesses to P in the evaluation of E (including any function calls) and replacing one or more of those references with a pointer to a copy of the object to which P refers, or if said object is an array element, with a pointer to the corresponding element in a copy of that array, would cause the evaluation of E to provide a pointer into (or one past the end of) the copy. E is said to be "based" on P if E is directly based on P or is directly based on another pointer object that holds the value of an expression that is based on P. 146) If the value of an expression based on a restricted pointer P is cast to a non-pointer type, or is stored in an object by any means other than assignment or initialization, then the resulting value (and any value derived from it) ceases to be based on the restricted pointer.

During each execution of B, let L be any lvalue that has &L based on P. If L is used to access the value of the object X (or the value of a bit-field within X) that it designates, and X (or the value of a bit-field within X) is also modified (by any means), then the following requirements apply: T shall not be const-qualified. &L shall not be based on a different restricted pointer object P2 associated with B. Every other lvalue used to access the value of X (or the value of a bit-field within X) shall also have its address based on P. Every access that modifies X (or the value of a bit-field within X) shall be considered also to modify P, for the purposes of this subclause. If P is assigned the value of a pointer expression E that is based on another restricted pointer object P2, associated with block B2, then either the execution of B2 shall begin before the execution of B, or the execution of B2 shall end prior to the assignment. If these requirements are not met, then the behavior is undefined.

Here an execution of B means that portion of the execution of the program that would correspond to the lifetime of an object with scalar type and automatic storage duration associated with B.

A translator is free to ignore any or all aliasing implications of uses of restrict.

146) In other words, E depends on the value of P itself rather than on the value of an object referenced indirectly through P. For example, if identifier p has type (int **restrict), then the pointer expressions p and p+1 are based on the restricted pointer object designated by p, but the pointer expressions *p and p[1] are not.

Revisiting Previous Examples

The changes to the formal definition are intended to eliminate the problem exposed in Example 1 and Example 2.

Re-examination of Example 1

In Example 1 the intended behavior is that r should not be based on p.


  // Example 1:

  int f(int * restrict p, int *q) {
    // #1 sequence point after arguments have been evaluated
    int *r = p;  // #2 sequence point after initialization

    if (r == p) {  // #3 sequence point after if expression evaluation
      r = q;  // #4 sequence point at end of statement
    }  // #5 sequence point at end of statement

    *r = 13; // here let E denote &*r == r
    *p = 42;

    return *r;
  }

To establish that E is not based on p, we examine the effect of modifying p to point to a copy of the object to which it refers at each prior sequence point. Modifying p at:

  • #5 - #3 does not affect the value of E.
  • #2 does affect the value of E because the subsequent assignment to r is skipped, "but does not cause E to point to the copy", as the new wording states.
  • #1 does not affect the value of E because of the subsequent assignment to r.

The quoted text reflects the new wording in the formal definition that ensures r is not based on p in this example. It follows that a call of the form f(&z, &z) has undefined behavior, because the object denoted by z is modified, is referenced using p, and is also referenced using *r which is not based on p.

Re-examination of Example 2

The changes to the formal definition are also intended to address the problem identified in Example 2, where the intent is that the example has well-defined behavior.


// Example 2:

int f(int * restrict p, int *q) {
  int *a = &p[p-q];  // 'a' is directly based on 'p' because replacing the
                     // value of only the first instance of p in &a[p-q]
                     // with the value of a copy causes 'a' to be well-defined

  *a = 13;  // here let E denote &*a == a


  p[1] = 42;

  return *a;
}

extern int printf(const char *, ...);

int z[100];

int main(void) {
  int x = f(z+1, z);
  printf("x=%d\n", x);
}

As the comment in Example 2 states, the reason that a is now directly based on p is because replacing the value of the first instance of p in the expression &a[p-q] with the value of the copy's address causes a to be a well defined pointer into the copy.

Limits On Converting Expressions based on a Restricted Pointer

The limitations placed on pointer expressions allowed to contribute values to a pointer that is based on a restricted pointer is a compromise. There are examples where the burden placed on compilers to determine that converting a pointer value to an integer value and back to back a pointer value (ptr2int2ptr) is manageable. Compilers exist with this capability.


// Example 3

#include <stdint.h>

void f(int * restrict p) {
  *p = 13;
  uintptr_t i = (uintptr_t) p;  // cast p to an integer, i is not based on p
  int *q = (int *) i;           // cast back to a pointer
  *q == 14;                     // q is not based on p

  return *p + *q;
}

In Example 3, pointer q is not based on p because there is an initialization from an integer value that is not based on p. Thus, the behavior is undefined. Essentially, an integer value can never be based on a restricted pointer. Considering that there are compilers capable of eliminating the ptr2int2ptr conversion, and then easily determining that q is based on p, is this a reasonable limitation? It does violate the goal of not getting in the compiler's way.

For now, let's assume that the ptr2int2ptr technique is allowed to create a based on expression. The following, more complicated example, is considered:


// Example 4

#include <stdint.h>

int f(int * restrict p, uintptr_t mask) {
  uintptr_t u1 = mask & (uintptr_t)p;
  uintptr_t u2 = (~mask) & (uintptr_t)p;
  *p =42;
  *(int *)(u1 | u2) = 13; // E is (int *)(u1 | u2)
  return *p;
}

Here the expression *(int *)(u1 | u2) is again not considered to be based on p. The reason is that the formal definition states that a based on expression must be based on a pointer object. The expression *(int *)(u1 | u2) is not based on a pointer object. Therefore, the assignment of the value 13 introduces undefined behavior.

It is possible to change the formal definition to state:

... E is said to be "based" on P if E is directly based on P or is directly based on another object that holds the value of an expression that is based on P. An expression having a type capable of holding a pointer value is based on P if its value when cast to a pointer type is based on P. If an expression semantically equivalent to E is based on P, then E is based on P. 146)


The point here is that not everything that produces the same value as P is allowed. There is now a requirement that there be a semantic equivalence, as in (int *)(uintptr_t)p is equivalent to p when p has type int *restrict.

This change to the formal definition addresses the ptr2int2ptr limitation, but also expands based on to other areas. For instance, use of sprintf/strtoul to decompose and recompose a pointer value, copy a value to a buffer and later copy the value back, write a value to a file and later read it back, are all examples of techniques that could be used to create based on expressions.

Finally, a provenance style formal definition could provide the desired semantics as there are similar issues discussed in the provenance proposal. The current limitations placed on conversions such as ptr2int2ptr is an area that is in need of more debate. The above is an attempt to provide needed examples to help facilitate further discussions.