Doc. No.: WG21/N2338
J16/07-0198
Date: 2007-08-04
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com

Concurrency memory model compiler consequences

This paper describes the compiler consequences of adopting the N2334 memory model proposal (a revision of N2300).

The concurrency memory model imposes constraints on the order in which memory operations on shared variables may become visible to other threads. Operations on locations known to be thread-local (e.g. operations on function scope variables whose address is not taken) are unaffected, and may be optimized as in the single-threaded case.
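
For instance (a minimal sketch; the function is ours), a compiler remains free to keep such a variable entirely in a register:

int f(int n) {
    int local = 0;                              // address never taken: known thread-local
    for (int i = 0; i < n; ++i) local += i;     // may be optimized as in single-threaded code
    return local;
}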

Thus the effect of the proposal is limited to three kinds of memory operations:

  1. Normal memory operations that operate on potentially shared locations.
  2. Operations on synchronization objects such as locks.
  3. Atomic operations, as provided in the atomics library.

Clearly the first is the most interesting, since those operations are likely to be the most frequent. The frequency of the last is zero in existing code, and in purely lock-based code it will remain so. Performance of the second is probably determined much more by the runtime implementation than by compiler transformations. Nonetheless, we discuss all three in turn, at least briefly.

Ordinary operations on potentially shared memory

The proposed standard gives undefined semantics to code with any data races. A data race occurs whenever two operations in separate threads can access the same location at the same time in a sequentially consistent execution, and one of them is a store operation. (This is almost, but not quite equivalent to the N2334 definition. See N2335. There is another obscure way to introduce an N2334 race, but it is not relevant to this discussion.)
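
For example (a minimal sketch; the variable names are ours), the following unsynchronized pair of accesses to a shared variable x constitutes a data race, since both can execute at the same time in a sequentially consistent execution and one of them is a store:

Thread 1: x = 1;     // store to a potentially shared location
Thread 2: r1 = x;    // concurrent, unsynchronized load of the same location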

Thus source-to-source transformations on C++ programs may not introduce data races. Doing so would introduce undefined semantics.

Indeed, a weak converse of this statement is generally also true. A transformation that preserves sequential correctness, preserves the state of memory at synchronization operations, and does not introduce data races, preserves correctness. If it failed to preserve correctness, some thread would have to observe the changed sequence of state changes between synchronization operations. It could not do so without performing a concurrent memory operation on a location altered between those synchronization points. This involves a data race, either in the original program, or introduced by the compiler. We excluded the latter case, and in the former case the original program already has undefined semantics, so correctness is trivially preserved.

Thus the crucial question is to identify transformations that may introduce data races. We examine what we expect to be some of the most common examples.

Adjacent field overwrites

Consider an update to the field b in a potentially shared struct declared as
struct { char a; int b:9; int c:7; char d; }
Many existing compilers on conventional 32-bit machines will implement the update by loading the entire one-word structure, replacing the bits corresponding to b, and then rewriting the entire structure. By overwriting other fields, for example a, in the process, this introduces a race with another thread concurrently updating field a. According to the proposed standard, field a is a separate memory location; thus this introduces a race not present in the original source code.

Indeed, it can unexpectedly overwrite such a concurrent update to a, thus resulting in incorrect results.

Generally updates to one memory location, as defined in the standard, may not result in implicit updates to others, even if the original value is written back.
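
As a hedged sketch of the disallowed strategy (the mask and shift names are invented for illustration), a conforming implementation must confine the store to the memory location containing the adjacent bit-fields b and c:

struct S { char a; int b:9; int c:7; char d; };

void update_b(S *s) {
    // Disallowed implementation strategy:
    //     unsigned tmp = *(unsigned *)s;             // also loads a and d
    //     tmp = (tmp & ~B_MASK) | (5u << B_SHIFT);   // replace b's bits
    //     *(unsigned *)s = tmp;                      // rewrites a and d: introduces races
    s->b = 5;   // may write only the location shared by b and c
}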

Speculative code motion involving stores

It is generally not allowable to introduce stores to potentially shared memory locations that might not otherwise have been written between the same synchronization points. Such an introduced store might race with a concurrent store to the same location in another thread, and potentially overwrite the value written by the original store.

(This constraint can be weakened a bit, since acquire and release operations generally allow one-way motion of ordinary memory operations across them. Thus an update to x immediately after a lock release allows a bogus write to x to be inserted just before the lock release. A race with the latter would imply a race with the former. But current compilers rarely perform this sort of transformation, and we expect it to rarely be substantially profitable.)
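
To make the permitted one-way motion concrete (a hedged sketch; the mutex m is illustrative), the sequence
m.unlock(); x = 17;
may legally be transformed to
x = 17; m.unlock();
since the store moves only into the critical section, never out of it, and any thread racing with the moved store would also race with the original one.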

Thus if we have the code:

switch (y) {
    case 0: x = 17; w = 1; break;
    case 1: x = 17; w = 3; break;
    case 2: w = 9; break;
    case 3: x = 17; w = 1; break;
    case 4: x = 17; w = 3; break;
    case 5: x = 17; w = 9; break;
    default: x = 17; w = 42; break;
}
and x is potentially shared, it is not acceptable to reduce code size by transforming this to
tmp = x; x = 17;
switch (y) {
    case 0: w = 1; break;
    case 1: w = 3; break;
    case 2: x = tmp; w = 9; break;
    case 3: w = 1; break;
    case 4: w = 3; break;
    case 5: w = 9; break;
    default: w = 42; break;
}
Doing so would introduce a race if y were 2, and another thread were concurrently updating x. Again the concurrent update might be lost as a result of the spurious store.

Speculative register promotion

A very similar problem occurs, more commonly, when a variable is promoted to a register, unconditionally loaded into that register, and later written back, even though the original variable may not have been written along all control paths.

A typical example is

for (p = q; p != 0; p = p -> next) {
    if (p -> data > 0) ++count;
}
where the variable count is potentially shared. Many compilers currently translate this to something along the lines of
register int r1 = count;
for (p = q; p != 0; p = p -> next) {
    if (p -> data > 0) ++r1;
}
count = r1;
This is unsafe if the list q is known to the programmer to contain no positive data, and there is a concurrent thread updating count. In this rather esoteric case, the transformation introduces a race. Since it potentially hides an update that did not race in the original source, it is disallowed by the proposal.

(This example is designed to explain the compiler consequences of the proposed rules, not to motivate them. For a more motivational example, see Boehm, Threads Cannot be Implemented as a Library, PLDI 2005, particularly section 4.3.)

There are usually alternative ways to optimize such code. The following transformed version is safe, and approximately equally efficient:

register int r1 = 0;
for (p = q; p != 0; p = p -> next) {
    if (p -> data > 0) ++r1;
}
if (r1) count += r1;
This relies on a fairly special-purpose transformation. By the arguments of the next section, the following more general variant is usually also safe in an optimizing compiler context, in spite of the fact that it may introduce a race:
register int r1 = count;
register bool count_modified = false;
for (p = q; p != 0; p = p -> next) {
    if (p -> data > 0) { ++r1; count_modified = true; }
}
if (count_modified) count = r1;
This potentially involves a race, since it introduces a load of count that was not requested by the original source.

Speculative code motion involving loads

A race can be introduced by either a speculative store or a speculative load. A source-to-source C++-to-C++ transformation is invalid if it introduces either, since the proposed standard gives the result undefined semantics.

However, there is a significant difference between the two: A newly introduced store always has the potential to hide the value of a concurrent, now racing, store. Thus such an introduction not only produces a C++ program with formally undefined semantics; the resulting program is also virtually guaranteed to compile to machine code that is an incorrect translation of the original code.

On the other hand, a compiler that introduces a racing load at the machine code level, and then discards the result of the load, generally will not have changed the semantics of the program.

This is a subtle distinction: C++-source-to-C++-source transformations that introduce a racing load are invalid. On the other hand compiler transformations that do so on the way to generating object code for conventional machines may be completely correct, since races in the object code typically do not have undefined semantics. At this level, the semantics are defined by the architecture specification, not the C++ standard.

This may seem like a spurious distinction, but it is not. The author of C++ source code provides an implicit promise to the compiler: there are no data races on ordinary variables. This allows the compiler to safely perform some common optimizations that would otherwise be invalid. (See the relevant section of N2176 for details.) If the compiler itself introduces potentially racing loads in intermediate code, it can, and must, remember that these loads may race and treat them accordingly in subsequent optimizations.

This distinction also allows the relatively easy use of race detection tools at various levels. If we had hypothetical hardware (or a virtual machine) that aborted the program when a race was detected, compilers would not be allowed to introduce potentially racing loads. In such an environment, adding a race, even if the result of the load were unused, would change the semantics of the program. Although such environments are not commonplace, they have certainly been proposed, and may well turn out to be very useful.

This again argues that whether or not it is safe to introduce racing loads should depend on the target architecture and its semantics. If the target is C++ source code, it is disallowed. If the target is a standard processor, for example an X86, it usually is allowed. If the target provides race detection for debugging purposes, it is again disallowed.

Thus if our source program reads

if (x) z = w + 1;
and our target is a conventional hardware platform, it is acceptable to compile this as
r1 = w; if (x) z = r1 + 1;
to better overlap load latencies for x and w. This may introduce a new race if x is false. But the result of the racing load is not used in that case. And the semantics of the resulting object program on a standard architecture is unaffected.

The final transformation of the previous section has essentially the same characteristics.

Synchronization operations

Conventional implementations of synchronization operations like locks rely on a combination of the following to enforce memory ordering:

  1. The compiler treats each synchronization operation as a call to an opaque function that may read or write any potentially shared memory location, and therefore does not move ordinary memory operations across it.
  2. The implementation of the synchronization operation itself contains whatever hardware instructions, such as memory fences or atomic read-modify-write instructions, are needed to prevent reordering by the hardware.

This approach will continue to work, without modification.

The one significant change over some past standards is that the memory model makes it explicit that lock acquisition only has "acquire" semantics, and movement of memory operations past a lock acquisition into the critical section is allowed. This allows the compiler some flexibility that is technically disallowed under the current Posix thread specification. More importantly, it often means that lock acquisition requires one, not two, memory fence instructions. This may speed it up significantly. (This is what motivated the slightly more complicated N2334 definition of a data race.)
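
As a hedged sketch of what acquire-only semantics permit (the mutex type and variables are illustrative):

int x, y, r1, r2;
mutex m;

void f() {
    r1 = y;      // may legally be delayed past m.lock(), into the critical section
    m.lock();    // acquire semantics: a one-way barrier
    r2 = x;      // may not be moved above m.lock(), out of the critical section
    m.unlock();  // release: similarly, r2 = x may not be delayed past this point
}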

However, we believe that this optimization is already performed in many cases, in spite of its dubious standing with respect to existing standards. Thus even this does not appear to imply a real change to existing practice.

Atomic operations

Note that this section takes some liberties with atomics syntax, relative to the current atomics proposal N2381 or its predecessor N2324. Hopefully the intent is clear, and this can be updated as we pin down the real syntax.

We use variable v to represent an atomic in the examples. Other variables are not atomic.

We expect that atomic operations will initially be treated as opaque by the compiler, just like the other synchronization operations from the last section, and thus prevent movement of other memory operations across atomic operations. The only substantial difference from locking primitives is that atomic operations would usually be expanded in-line.

This is safe. And for most applications, we should expect that only small amounts of performance will be left on the table as a result of this simplification. In most cases, the presence or absence of an additional memory fence is likely to have far more impact than compiler optimizations involving reordering of atomics.

However, some more aggressive compiler optimizations on atomics are allowed by the model. For example, as with locks, ordinary memory operations may generally be moved in one direction across acquire and release operations.

Note that the reordering constraints imposed by atomics also have implications on common subexpression elimination or partial redundancy elimination. For example, it is not acceptable to transform
r1 = x + 1; r2 = v.load_acquire(); r3 = x + 1;
to
r1 = x + 1; r2 = v.load_acquire(); r3 = r1;
This would effectively move the second load of x to before the load_acquire(), i.e. in the wrong direction. This optimization would clearly not be performed by the more naive implementation that treats load_acquire() as an opaque function call.
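
By contrast (a hedged observation following from the one-way motion rules above), the analogous elimination across a release store moves the load in the allowed direction, so
r1 = x + 1; v.store_release(w); r3 = x + 1;
may be transformed to
r1 = x + 1; v.store_release(w); r3 = r1;
since this only moves the second load of x to before the store_release().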

On most fence-based architectures, we expect v.store_ordered(x) to be implemented as something like

fence_preventing_load_store_and_store_store_reordering;
v = x;
full_fence;
The full fence is typically needed to prevent reordering with a later atomic load. (Really we only need a fence that prevents store-load reordering. But typically this requires the most expensive fence.) It can be removed if the compiler can see that the next atomic load is preceded by a fence for another reason. Depending on the architecture, this may happen, for example, in the double-checked locking case.

On X86 processors, we expect that the standard implementation of an ordered store will consist of a single xchg instruction, whose result is ignored. This is expected to provide both the necessary fencing and total ordering guarantees.
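
As a hedged sketch (the atomic_int type and exchange spelling are illustrative, not the proposed API):

void store_ordered_to(atomic_int& v, int x) {
    (void) v.exchange(x);   // a single xchg on X86: the store plus an implicit full fence
}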

The store_release operation can usually be implemented without the trailing expensive fence. We expect that on X86 hardware it will be implemented as a plain store instruction, which implicitly provides the required ordering guarantees on that architecture.

On most architectures, both a sequentially consistent load and a load_acquire operation will be implemented with a load instruction, followed by a fence that provides both load-store and load-load ordering. In a few cases, a more expensive fence may be required. On X86, a simple load instruction is expected to suffice.
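
In the same pseudo-fence notation as above (a hedged sketch), the expected expansion of r1 = v.load_acquire() on a fence-based architecture is:

r1 = v;
fence_preventing_load_load_and_load_store_reordering;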

The proposed C++ sequentially consistent atomics are essentially equivalent to Java volatiles. Thus Doug Lea's JSR133 Cookbook also provides some useful implementation advice.