Doc. No.: WG21/N2176
J16/07-0036
Date: 2007-03-09
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com

Memory Model Rationales

Contents

This paper contains an assortment of short rationales for decisions made in the memory model paper (N2171, and N2052 before that) and in the closely related atomics proposal (N2145).

The contents benefitted significantly from discussions with many other people, including Lawrence Crowl, Peter Dimov, Paul McKenney, Doug Lea, Sarita Adve, Herb Sutter, and Jeremy Manson. Not all of the participants in those discussions agree with all of my conclusions.

This document is essentially the concatenation of five shorter ones that previously appeared on the author's C++ memory model web site:

  1. Why Undefined Semantics for C++ Data Races?
  2. Why atomics have integrated ordering constraints
  3. Should atomic operations be ordered based on dependencies?
  4. Why ordering constraints are never limited to loads or stores
  5. Treatment of out-of-thin-air values

Why Undefined Semantics for C++ Data Races?

Our proposed C++ memory model (or the later version appearing as N2171 in this mailing) gives completely undefined semantics to programs with data races. Many people have questioned whether this is necessary, or whether we can provide some guarantees in even this case. Here we give the arguments for completely undefined semantics.

Some arguments are already given in the introductory "Rationale" section of committee paper N1942. This is a more detailed exploration.

The basic arguments for undefined data race semantics in C++ are:

As an example of the last phenomenon, consider a relatively simple example, which does not include any synchronization code. Assume x is a shared global and everything else is local:
unsigned i = x;

if (i < 2) {
    foo: ...
    switch (i) {
    case 0:
        ...;
        break;
    case 1:
        ...;
        break;
    default:
        ...;
    }
}
Assume that the code at label foo is fairly complex, forcing i to be spilled, and that the switch statement is implemented with a branch table. (If you don't believe the latter is a reasonable assumption here, assume that "2" is replaced by a larger number, and that there are more explicit cases in the switch statement.)

The compiler now performs the following reasonable (and common?) optimizations:
  1. Since i must be spilled across the code at foo anyway, and its value was just loaded from x, the compiler does not store and reload i; it instead rematerializes i by reloading x when the switch expression is evaluated.
  2. Since i < 2 is known to hold at the switch, the branch table is indexed directly, with no bounds check and no reachable default case.
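The resulting code might look roughly like this, in pseudo-C++ (branch_table stands for the compiler-generated jump table; this is a sketch of generated code, not something a programmer would write):

unsigned i = x;                 // original load
if (i < 2) {
    foo: ...                    // i is not kept live across this code
    unsigned tmp = x;           // i rematerialized by reloading the shared variable
    goto *branch_table[tmp];    // jump table indexed without a bounds check
}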

Now consider the case in which, unbeknownst to the compiler, there is actually a race on x, and its value changes to 5 during the execution of the code labelled by foo. The results are:
  1. When i is reloaded for the evaluation of the switch expression, it gets the value 5 instead of its original value.
  2. The branch table is accessed with an out-of-bounds index of 5, resulting in a garbage branch target.
  3. We take a wild branch, say to code that happens to implement rogue-o-matic (the canonical form of C++ "undefined behavior").
Although it appears difficult, for a variety of reasons, to get existing compilers to exhibit this kind of behavior, I think compiler writers will generally refuse to promise that it cannot happen. It can be the result of very standard optimization techniques. As far as I know, precluding this behavior with certainty would require significant optimizer redesign; I do not see how to avoid it without fundamentally restructuring the optimizer. Even if we were to prohibit reloading a local from a global shared variable, we would still have to explain similar behavior for the analogous program that tests x and then uses x itself as the switch expression. That case becomes slightly easier to explain, but it still seems very confusing.

Some such redesign is necessary anyway to avoid speculative writes. But I think that is usually limited to a few transformations which can currently introduce such writes. I suspect the redesign required here would be significantly more pervasive. And I do not see how to justify the cost.
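As a reminder of the kind of transformation at issue, here is the classic example of an introduced speculative write (a sketch; the memory model must prohibit this rewrite for potentially shared variables):

if (cond)
    x = 1;

// may be transformed, to avoid a branch, into:

tmp = x;                  // reads x even if cond is false
x = 1;                    // speculative store
if (!cond) x = tmp;       // undo the store if it should not have happened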

The one benefit of providing better defined race semantics seems to be somewhat easier debuggability of production code that encounters a race. But I'm not sure that outweighs the negative impact on race detection tools.

Why atomics have integrated ordering constraints

The interfaces for atomic objects that we have been considering provide ordering constraints as part of the atomic operations themselves. This is consistent with the Java and C# volatile-based approach, and with our atomic_ops package, but inconsistent with some other atomic operation implementations, such as that in the Linux kernel, which often require the use of explicit memory fences.

Some rationale for this choice is provided as part of N2145 and N2047. Here we provide a somewhat expanded and updated version of that rationale.

Note that N2153 argues that explicit fences are still needed for maximum performance on certain applications and architectures. The arguments here do not preclude providing them as well.

Here we list our reasons for explicitly associating ordering semantics with atomic operations, and correspondingly providing different variants with different ordering constraints:

Should atomic operations be ordered based on dependencies?

Consider the simple program snippet, where x is an "atomic pointer to integer" variable, r1 is a local int * variable, and r2 is a local int variable:
r1 = x.load_relaxed();
r2 = *r1;

Most processors guarantee that, for a naive translation of this code, the second load becomes visible after the first, since there is a data dependency between the two. Indeed Java almost requires such a guarantee in order to allow efficient implementation of final fields.

Such a guarantee is occasionally useful. It would guarantee that, for the above example, if another thread initialized an object, and then stored a pointer to it into a previously null x in a way that ensured the ordering of the two stores, then we could not see a non-null r1 with an uninitialized value for r2.

Note however that all such guarantees become vacuous if only load_acquire and store_release are used. In the most interesting cases, dependency-based ordering allows us to use load_relaxed instead of load_acquire. The resulting performance impact varies from none on X86, SPARC, or PA-RISC to that of a full memory fence on Alpha. PowerPC and Itanium fall in between.
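A minimal sketch of that hand-off idiom (the initial value 17 and the use of a plain int are illustrative only):

// Thread 1 (publisher):
int* p = new int(17);        // initialize the object
x.store_release(p);          // publish; the release orders the initialization before the store to x

// Thread 2 (consumer):
int* r1 = x.load_relaxed();  // relaxed load; relies on dependency ordering, if any
int r2 = 0;
if (r1 != 0)
    r2 = *r1;                // with dependency-based ordering this must see 17;
                             // under N2171/N2145, thread 2 would need load_acquire instead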

Our most recent C++ atomics (N2145) and memory model (N2171) proposals provide no dependency-based ordering guarantees. Java provides no dependence-based guarantees for non-volatile (unordered) accesses, except for final field guarantees, which rely on dependence-based reasoning at most in the implementation. In contrast, N2153 does assume dependency-based ordering.

The point of this note is to discuss the issues that bear on the C++ decision. Much of this relies on an earlier discussion led by Bill Pugh and Jeremy Manson in the context of the Java memory model, some of which is reproduced in the POPL 2005 paper by Manson, Pugh, and Adve. However, the Java discussion is in the context of ordinary variable accesses, and the motivation there relies heavily on not impeding compiler optimizations. Since the C++ rules apply only to "relaxed" (unordered) accesses to atomic variables, which are expected to be infrequent, these arguments are less convincing for C++.

Dependency-based ordering at best makes sense for unordered atomics

The discussion here only makes sense if at least one of the operations involved in a dependency is an unordered atomic operation. If only ordinary operations are involved, ordering is not observable without introducing a data race, which would result in undefined semantics. In fact, if A depends on B, any implied ordering is observable only if either

Static dependencies cannot enforce ordering

Consider the following variant of our original example:
r1 = x.load_relaxed();
r3 = &a + (r1 - r1);
r2 = *r3;
Statically, the value of r3 appears to depend on r1. However, many compilers will optimize this to at least
r1 = x.load_relaxed();
r3 = &a;
r2 = *r3;
removing the dependency in the process. As a result, much common hardware (e.g. PowerPC, ARM, Itanium) will no longer guarantee that the first and last load appear to be executed in order. And it is unclear that even later stages of the compiler should be required to preserve the order; one of the reasons for introducing load_relaxed was to allow compiler reordering around the operation.

As we stated earlier, we expect load_relaxed to be infrequent in practice. Thus it might not be unreasonable to disallow optimizations involving it. But note that the above optimization is performed entirely on the middle statement, which does not mention load_relaxed, and in fact might occur in a function call, whose implementation is in a separate translation unit.

Thus requiring memory ordering based on static dependencies would impede optimization in other translation units, which might not even mention load_relaxed. This is clearly unacceptable.
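To make this concrete, the dependency-removing optimization can hide in a function compiled separately (the function name is invented for illustration):

// Translation unit 1: mentions no atomics.
int* add_nothing(int* p, int* q) {
    return p + (q - q);     // a compiler will happily simplify this to "return p;"
}

// Translation unit 2: the client code from above.
r1 = x.load_relaxed();
r3 = add_nothing(&a, r1);
r2 = *r3;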

Data vs. control dependencies

Some processors enforce ordering constraints only for certain kinds of dependencies, e.g. for data dependencies, or they have even more specific rules about which kinds of dependencies enforce ordering. Although this is reasonable at the hardware level, these notions make little sense at the programming language level, since compilers routinely convert between different types of dependencies.

For example,

r1 = x.load_relaxed();
if (r1 == 0)
  r2 = *r1;
else
  r2 = *(r1 + 1);
might be transformed to
r1 = x.load_relaxed();
if (r1 == 0)
  r3 = r1;
else
  r3 = r1 + 1;
r2 = *r3;
Again, the transformations that potentially break the dependency may happen in a separate compilation unit. Thus restricting them appears impractical.

Dynamic dependencies

Based on a suggestion by Peter Dimov, we have instead pursued a notion of dependency that
  1. Is based purely on observable dynamic behavior, and is therefore not subject to breakage as a result of compiler transformations, and
  2. Does not distinguish between control and data dependencies, for the same reason.
We'll say that a memory operation B depends on another memory operation A if either the fact that B is executed at all, or the location B accesses, or the value it stores, depends on the value produced by A. This has the advantages that

Different instances must be distinct

Unfortunately, this definition is still incomplete for somewhat subtle reasons. The open question is what constitutes "a memory operation". Consider
r1 = x.load_relaxed();
if (r1) {
    r2 = y.a;
} else {
    r2 = y.a;
}
Either of the loads of the ordinary variable y is clearly conditionally executed.

If we identify the two loads of y, on the grounds that they both perform the same action, we end up with what is almost certainly an unusable programming model. It would mean that these loads are not dependent on the initial load_relaxed, and hence are not ordered with respect to it.

To see why this is unusable, consider instead the variant that involves function calls:

r1 = x.load_relaxed();
if (r1) {
    f(&y);
} else {
    g(&y);
}
Assume that the implementations of f() and g() are defined elsewhere and not visible to the author of this code. If we identified the above loads in the original example, and both f() and g() happen to perform the same loads, then presumably we would have to say that the accesses to y by f() and g() are also unordered with respect to the load_relaxed. On the other hand, if they access different fields, the accesses would be ordered.

This leaves us in a position in which the author of the client code cannot tell whether memory accesses are ordered without inspecting the implementation of called functions. Even worse, ordering may be only partially guaranteed because the two functions happen to perform one common memory access. We do not believe this is viable.

By a similar argument, we need to treat two memory operations as distinct if they correspond to the same source statement, but are reached via different call chains. For example, if the functions f() and g() both call h(y), which is implemented as

int h(int *y) {
    return *y;
}
then the load is always performed by the same statement, but nothing substantive has changed.

Why this seriously breaks optimizations

If we reconsider the example from above
r1 = x.load_relaxed();
if (r1) {
    r2 = y.a;
} else {
    r2 = y.a;
}
and treat the two loads of y.a as distinct (as we must), then they depend on, and are ordered with respect to, the load_relaxed. And for a naively compiled version, most hardware would implicitly enforce that.

However, if the above were compiled to

r1 = x.load_relaxed();
r2 = y.a;
as many compilers would, then the dependence has been removed, and the two loads are no longer ordered. Hence such optimizations would have to be disallowed.

It quickly becomes clear that this is intolerable, when we consider that this code may again be distributed among several compilation units. Consider:

int h(int r1) {
    if (r1) {
        return y.a;
    } else {
        return y.a;
    }
}

int f() {
    ...
    r1 = x.load_relaxed();
    r3 = h(r1); 
    ...
}
The dependency, and hence ordering, between the load_relaxed and the y.a loads is broken if h is optimized in the obvious way to just unconditionally return y.a. But h() could easily reside in a separate compilation unit, possibly one that hasn't been recompiled since the addition of atomics.

Our conclusion is that this approach is also not viable.

What might work?

It is possible to construct similar examples in which the dependent operation is a store. Restricting the dependent operation to another atomic operation doesn't improve matters much either; we can construct examples in which the offending optimization occurs somewhere in the middle of the call chain between the two, and hence could still occur in a compilation unit that doesn't mention atomics.
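As one example involving a dependent store (a sketch analogous to the earlier load example):

r1 = x.load_relaxed();
if (r1) {
    y.a = 17;
} else {
    y.a = 17;
}

// A compiler may merge the branches into an unconditional store:

r1 = x.load_relaxed();
y.a = 17;                  // no longer dependent on, or ordered after, the load_relaxed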

It may be feasible to restrict dependency-based ordering to the important special case in which the dependent operation is a load or store to a field of an object pointed to by the result of an initial load_relaxed. But this seems to still run afoul of some, admittedly much more obscure, optimizations, which would otherwise be legal, and could be carried out in separate compilation units. Consider:

r2 = x.load_relaxed();
r3 = r2 -> a;
If we have some reason to believe, e.g. based on profiling, that the value loaded into r2 is usually the same as a value r1 that was loaded earlier, it may well be profitable to transform this to
r2 = x.load_relaxed();
r3 = r1 -> a;
if (r1 != r2) r3 = r2 -> a;
since in the common case (when r1 == r2) the value is obtained through r1, and the dereference no longer depends on the load_relaxed, allowing it to start earlier. But by breaking the dependency, the transformation also breaks the hardware-based ordering guarantee.

Thus even this case doesn't appear very clear-cut.

If we try to extend this to dependencies through array subscripts, things become more dubious still. If we have

r1 = x.load_relaxed();
r2 = a[r1 -> index % a_size];
and assume that some other thread t' initializes an object p, sets the array element a[p->index % a_size], and then performs a store_release to set x to p, are we guaranteed that r2 will contain the value assigned by t' to the array element, instead of an earlier value?

Usually the answer would be "yes". However, if the compiler happens to be able to prove that a has only a single element, e.g. because the program is built for a particularly small configuration, the answer unexpectedly becomes "no".
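Roughly, with a_size known to be 1, the subscript expression folds to a constant and the dependence disappears:

r1 = x.load_relaxed();
r2 = a[r1->index % 1];     // index % 1 is always 0

// may be optimized to

r1 = x.load_relaxed();
r2 = a[0];                 // no longer depends on r1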

We could easily add a templatized library function that loads and dereferences an atomic pointer, guaranteeing only that the value read through the pointer is read after the pointer itself. But that is of somewhat limited utility, and error-prone.
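A sketch of what such a helper might look like (the name and the atomic<T*> spelling are schematic, following the member-function style used elsewhere in this note; the portable fallback shown simply uses load_acquire, while an implementation could instead use load_relaxed on hardware that honors dependency ordering):

template <class T>
T load_and_dereference(atomic<T*>& p) {
    T* ptr = p.load_acquire();  // orders the following read after the pointer load
    return *ptr;
}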

Why ordering constraints are never limited to loads or stores

Many modern architectures provide fence instructions that are specific to only loads or stores. For example, SPARC provides an assortment of such fence instruction variants. Alpha's "write memory barrier" is another well-known example. We have not proposed to provide similar facilities in the C++ memory model or atomic operations library.

Although there are cases in which load-only or store-only ordering is useful, we believe that the correctness constraints are exceedingly subtle, even relative to the other constructs we have been discussing. Hence we believe it makes sense to consider support for these only if there is evidence that such restricted ordering is likely to be much cheaper on a reasonable selection of processors. (This is apparently not the case on PowerPC, ARM, X86, MIPS, or Itanium. I have not seen measurements for SPARC, where it might matter. I think wmb is a bit cheaper than mb on most Alphas, but that's of limited interest.)

To see why read-only and write-only fences are tricky to use correctly, consider the canonical use case for acquire/release ordering constraints. We use an atomic flag variable x_init to hand off data (an ordinary variable x) from one thread to another:

Thread 1:
<Initialize x>
x_init.store_release(true);

Thread 2:
if (x_init.load_acquire())
    <use x>
At first glance, this appears to be the canonical use case that requires only store ordering in thread 1. However, that is actually only safe under very limited conditions. To explain this, we consider two cases, depending on whether the uses of x in thread 2 are entirely read-only or not.

Recipient writes to object

This case is highly problematic.

Clearly, and unsurprisingly, it is unsafe to replace the load_acquire with a version that restricts only load ordering in this case. That would allow the store to x in thread 2 to become visible before the initialization of x by thread 1 is complete, possibly losing the update, or corrupting the state of x during initialization.

More interestingly, it is also generally unsafe to restrict the release ordering constraint in thread 1 to only stores. To see this, consider what happens if the initialization of x also reads x, as in

x.a = 0; x.a++;
x_init.store_write_release(true);
and the code that uses x in thread 2 updates it, with e.g.
if (x_init.load_acquire())
    x.a = 42;
If the release store in thread 1 were restricted to ensure completion only of prior stores (writes), the load of x.a in thread 1 (part of x.a++) could effectively be reordered with the assignment to x_init, and could hence see a value of 42, causing x.a to be initialized to 43.

This admittedly appears rather strange, since the assignment of 43 to x.a would have to become visible before the load on which it depends completes. However, we have argued separately that it does not make sense to enforce ordering based on dependencies, since they cannot be reliably defined at this level, and machine architectures do not even consistently guarantee ordering based on dependencies through memory. Thus this is an unlikely and very surprising result, but not one we could easily (or apparently even reasonably) disallow in the presence of a store-only fence. And it is surprising enough that it is important to disallow it.

One might argue that there are nonetheless interesting cases in which the initial operations are known to involve only stores, and hence none of this applies. There are three problems with this argument:

  1. In most cases any actual object initialization or the like will involve calls that cross abstraction boundaries. Interface specifications however do not normally specify whether a constructor call, for example, performs a load. Hence the programmer typically has no way to know whether or not (s)he is dealing with such a case.
  2. A number of operations, such as bit-field stores, perform loads that are apparent only to the most sophisticated programmers, and can be affected by this; see the sketch after this list.
  3. The memory model otherwise allows transformations to, say, load a common subexpression value from a struct field if it had to be spilled. Thus not all such loads will even be programmer visible. (The offending transformations might of course be applied to a compilation unit that doesn't reference atomic operations, and was compiled by a third party.)
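To illustrate the second point, a bit-field store is typically compiled as a read-modify-write of the containing word, introducing a load the programmer never wrote (the struct below is purely illustrative):

struct S { unsigned flag : 1; unsigned count : 7; };
S s;

// s.flag = 1; typically becomes something like:
//   tmp = <load the word containing s.flag>;    // hidden load
//   tmp = (tmp & ~1u) | 1u;
//   <store tmp back>;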

Recipient only reads x

In this case, it appears to be safe to limit the release constraint to stores and the acquire constraint to loads.

However, the argument here requires not merely that the operation in thread 2 be logically read-only; it must truly not write to the object. Since I believe we generally allow library classes (e.g. containers) to avoid synchronization during construction, and then to lock "under the covers" even for logically read-only operations (e.g. because they update an internal cache), this appears to be difficult to guarantee.

Treatment of out-of-thin-air values

Simple formulations of programming language memory models often allow "out-of-thin-air" values to be produced. To see how this might happen, consider the following canonical example (Example 1), where variables x and y are initially zero and atomic, variables starting with "r" are local ("register") variables, and assignment denotes unordered (raw, relaxed) loads or stores:

Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
x = r2;
The code executed by each thread simply copies one global to another; we've just made the intermediate temporaries explicit to emphasize that both a load and a store are involved.

The question is whether each load can "see" the value written by the other thread's store. If this is possible, then we can get an apparently consistent execution in which each store stores a value of 42, which is then read by the load in the other thread, justifying the stored value.

At some level this appears "obviously absurd". But here are two arguments that, although we probably want to disallow the behavior, it is not as absurd as might appear at first glance:

  1. If the hardware uses naively implemented "load value prediction", does not go out of its way to avoid this scenario, and, e.g. based on past executions, predicts that x and y contain 42, the prediction might be successfully confirmed and actually result in this outcome. (We know of no hardware that actually does this. I believe that even on Alpha, which for other reasons does not enforce memory ordering based on data dependence, this cannot actually happen. This is only a vague plausibility argument.)
  2. If we change the above example slightly, the analogous behavior needs to be allowed. Consider instead Example 2:
    Thread 1:
    r1 = x;
    y = r1;
    
    Thread 2:
    r2 = y;
    x = 42;
    
    Since the assignments represent unordered operations, we would like to allow either the compiler or hardware to reorder the two statements in the second thread. Once that happens, we get the questionable behavior if the first thread executes entirely between the two statements. This corresponds to exactly the same visibility of stores as in the original example. The only difference is the "data dependency" between the two statements in the second thread.
Unfortunately, this already makes it clear that this issue is tied to dealing with dependencies in the memory model, which we show elsewhere is difficult to do. In particular, the approach originally used in N2052 doesn't seem to stand up to close scrutiny, and has thus been dropped from the N2171 revision.

(Adding dependency-based ordering to only the "precedes" relation, as suggested earlier, doesn't help, as becomes apparent below.)

To see why things are so bad, consider another variant (example 3) of the above example:

Thread 1:
r1 = x;
y = r1;

Thread 2:
r2 = y;
if (r2)
  x = 42;
else
  x = 42;
This differs from the preceding version in that each assignment to x is fairly clearly dependent on y, for any reasonable definition of what that means. (Recall that considering only data dependencies doesn't work, since a data-dependent assignment could be transformed to a switch statement with many constant assignments. Somehow looking at "the next assignment to x", no matter where it occurs syntactically, also doesn't work, for the reasons given in the dependency discussion, and because we could easily replace x = 42; by x = 41; x = 42;.)

The problem of course is that example 3 can be easily optimized to example 2, and most people would expect a good compiler to do so. Allowing r1 = r2 = 42 in example 2, but not example 3, seems untenable and, based on the previous dependency discussion, probably infeasible.

As a result, we now have a second example, which has basically the same structure as example 1 and the same dependency pattern. The difference is really just that in one case the dependency can be "optimized away", and in the other case it cannot.

A different proposal

This seems to leave us with several different options. Here I propose a combination of the last two; I suggest doing the latter with words along the following lines:

An unordered atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. The ordering of evaluations in this sequence must be such that

[Note: This requirement disallows "out-of-thin-air", or "speculative" stores of atomics. Since unordered operations are involved, evaluations may appear in this sequence out of thread order. For example, with x and y initially zero,

Thread 1: r1 = y.load_relaxed(); x.store_relaxed(r1);

Thread 2: r2 = x.load_relaxed(); y.store_relaxed(42);
is allowed to produce r1 = r2 = 42. The sequence of evaluations justifying this consists of:
y.store_relaxed(42); r1 = y.load_relaxed(); x.store_relaxed(r1); r2 = x.load_relaxed();
On the other hand,
Thread 1: r1 = y.load_relaxed(); x.store_relaxed(r1);

Thread 2: r2 = x.load_relaxed(); y.store_relaxed(r2);
may not produce r1 = r2 = 42, since there is no sequence of evaluations that results in the computation of 42.]

This has the large advantage that we simplify the core memory model and move the complexity to the section on unordered atomics, where it belongs.

It has the potential disadvantage that my characterization above of what constitutes an "out-of-thin-air" result may well again be wrong. On the other hand, even if we discover that it is, and there is no adequate replacement, I think this rule is sufficiently esoteric that it may be acceptable just to state it as "no speculative unordered atomic stores", and then leave the note.

Jeremy Manson has already pointed out that the above rule fails to preclude

Thread 1:
r1 = x;
if (r1 == 42)
  y = r1;

Thread 2:
r2 = y;
if (r2 == 42)
  x = 42;
from assigning 42 to r1 and r2, in spite of the fact that the initial 42 still comes "out-of-thin-air". This is clearly unacceptable if x and y are ordinary variables, since this code is data-race free, and should thus be sequentially consistent. And we guarantee that r1 = r2 = 0 for ordinary variables.

Whether this is acceptable for relaxed atomics, or whether we should specifically state that relaxed atomics give sequentially consistent results if they do not "race", is unclear to me. My impression is that users will not care, because there is no reason to use relaxed atomics unless there is a race, and implementors will not care, since no reasonable implementation of atomics will exhibit weaker properties for atomics than for ordinary variables.

N2171 incorporates the first half of the change we proposed here, i.e. the last trace of dependency-based ordering was dropped from N2052.