Doc. No.: WG21/N2138
J16/06-0208
Date: 2006-11-03
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com

A Less Formal Explanation of the Proposed C++ Concurrency Memory Model

This is an attempt to informally explain the C++ memory model, as it was proposed in committee paper WG21/N2052=J16/06-0122.

Overview

The meaning of a multithreaded program is effectively given there in three stages:
  1. We clarify the order in which expressions in a single-threaded program must be evaluated. This part is not intended to substantively change the existing standard at all. Its purpose is only to give us a better defined foundation on which we can build the concurrent semantics. As part of this clarification, we now state that A is sequenced before B when we would previously have stated that there is a sequence point between A and B.
  2. We explain when a particular evaluation (effectively load of an object) can yield or "see" the value stored by a particular assignment to the object. This effectively defines what it means to run multiple threads concurrently, if we already understand the behavior of the individual threads. However, as discussed below, it also gives meaning to programs we would rather disallow.
  3. We define when, based on the definition from the preceding step, a program contains a data race (on a particular input). We then explicitly give such a program (on that input) undefined semantics.
If we omitted the last step, we would, for example, guarantee that if we have
long long x = 0;
then the following program:
Thread 1: x = -1;
Thread 2: r1 = x;

could never result in the local variable r1 being assigned a value other than 0 or -1. In fact, it is likely that on a 32-bit machine, the assignment of -1 would require two separate store instructions, and thread 2 might see the intermediate value. And it is often expensive to prevent such outcomes.

The preceding example is only one of many cases in which attempting to fully define the semantics of programs with data races would severely constrain the implementation. By prohibiting conflicting concurrent accesses, we remain consistent with pthread practice. We allow nearly all conventional compiler transformations on synchronization-free code, since we disallow any program that could detect invalid intermediate states introduced in the process.

Disallowing data races also has some more subtle consequences. In particular, when constructing an object with a virtual function table, we do not need to generate code that guards against another thread "seeing" the object with an uninitialized pointer to the table. Doing so would often require inserting expensive memory fence instructions.

Here we do not focus on the treatment of sequential code, since that is assumed to be understood. We instead try to clarify how N2052 assigns meaning to concurrent programs by defining which values can be "seen" by a particular evaluation or object access, and the definition and role of data races.

We will assume that each thread performs a sequence of evaluations in a known order, described by a sequenced before relation. If a and b are performed by the same thread, and a "comes first", we say that a is sequenced before b. (In reality, C++ allows a number of different evaluation orders for each thread, notably as a result of varying argument evaluation order, and this choice may vary each time an expression is evaluated. Here we assume that each thread has already chosen its argument evaluation orders in some way, and we simply define which multithreaded executions are consistent with this choice.)

For the sake of simplicity, it may help to view sequenced before as a relation that orders all evaluations within a thread. (In fact, it does not, even taking the previous parenthetical comment into account. The language specification allows certain unsequenced operations within a thread to themselves be interleaved, as opposed to ordered in one of several ways (indeterminately ordered). But that is not crucial to understanding the rest of this paper.)

Visibility of assignments

Consider a simple sequence of assignments to scalar variables that is sequentially executed, i.e. for every pair of distinct assignments, each one is sequenced before the other. An evaluation b that references x "sees" a previous assignment a to x if a is the last prior assignment to x, i.e. if a is sequenced before b, and there is no other assignment c to x such that a is sequenced before c and c is sequenced before b. We phrased this second, more precise, formulation to also make sense if not all of the assignments are ordered, though in that case b might be able to see the effect of more than one possible a. If we very informally consider the (pseudo-code, not C++) program
x = 1;
x = 2;
in parallel do
	Thread 1: x = 3;
	Thread 2: r1 = x;
The reference to x in thread 2 may "see" either the value 2 or 3, since neither of the corresponding assignments is required to be ordered after the reference, and there is no other intervening assignment to x. Thread 2 may not see the value 1, since the assignment x = 2; intervenes.

Thus our goal is essentially to define the multi-threaded version of the sequenced before relation. We do this by defining an additional relation called happens-before. For reasons of technical convenience, we define happens-before so that it only fully reflects ordering constraints that involve inter-thread communication. An evaluation a will thus have to become visible before an evaluation b if either a happens-before b, or a is sequenced-before b.

A thread T1 normally communicates with a thread T2 by assigning to some shared variable x and then synchronizing with T2. Most commonly, this synchronization would involve T1 acquiring a lock while it updates x, and then T2 acquiring the same lock while it reads x. Certainly any assignment performed prior to releasing a lock should be visible to another thread when acquiring the lock.

We describe this in several stages:

  1. Any evaluation such as the assignment to x performed by a thread before it releases the lock, i.e. any operation sequenced before the lock release, is inter-thread-ordered-before the lock release. It is best to think of the inter-thread-ordered-before relation as that part of the sequenced before relation that is visible to other threads.
  2. The lock release operation synchronizes with the next acquisition of the same lock. The synchronizes with relation expresses the actual ordering constraints imposed by synchronization operations.
  3. The lock acquire operation is again inter-thread-ordered-before evaluations that are sequenced after it, such as the one that reads x.
In general, an evaluation a happens before an evaluation b if they are ordered by a chain of synchronizes with and inter-thread-ordered-before relationships. More formally, happens-before is the transitive closure of the union of the synchronizes with and inter-thread-ordered-before relations.

So far our discussion has been in terms of threads that communicate via lock-protected accesses to shared variables. This should indeed be the common case. But it is not the only case we wish to support.

Atomic variables are another, less common, way to communicate between threads. Experience has shown that such variables are most useful if they have at least the same kind of acquire-release semantics as locks. However, there are occasionally situations in which such ordering is not desired. (Additional ordering properties are also commonly desired. We expect that these will be specified in the atomics library, and are not addressed here.)

If atomic variables have acquire/release properties, then we can ensure that the following code does not result in an assertion failure. (This is of course still not proper C++ syntax, which is still TBD.)

int x = 0;
atomic_int y = 0;
in parallel do
	Thread 1: x = 17; y = 1;
	Thread 2: while (!y); assert(x == 17);

In this case, the assignment to y has release semantics, while the reference to y in the while condition has acquire semantics. The pair behaves essentially like a lock release and acquisition with respect to the memory model. As a result, the assignment x = 17 is inter-thread ordered before the release operation y = 1. The release operation synchronizes with the last evaluation of the while condition, which is an acquire operation and loads the value stored by y = 1. The evaluation of y in the condition is again inter-thread ordered before the evaluation of x in the assertion. Thus the assignment x = 17 happens-before the evaluation of x in the assertion, and hence the assertion cannot fail, since the initialization of x to zero is no longer visible.

Once we have defined our happens-before ordering in this way, we largely define visibility as in the sequential case:

  1. Each read must see a write that does not happen after it, and
  2. there must be no intervening second write between the write and the read.
The first condition is actually expressed in the proposed 1.10p9 and 1.10p10 in a slightly roundabout way. As we will see below this helps to avoid some other undesirable behaviors. We define a precedes relation based on happens-before and visibility of writes. An evaluation a precedes an evaluation b if there is a chain of evaluations from a to b such that each element in the chain either happens-before its successor, or stores a value that is seen by the succeeding reference to the same location. We then require that no evaluation precedes itself, i.e. that such chains never contain any cycles.

If a happens-before b but a sees a value assigned by b, we immediately get a precedes b and b precedes a, and hence, by transitivity, a precedes itself. Below we show that it also precludes more complicated "causal cycles".

(The current version of N2052 misstates this, in that it inadvertently neglects to state that we allow a chain of such relationships. This is an error that will be fixed. We really need to consider the transitive closure of the precedes relation defined in N2052.)

The condition prohibiting intervening assignments is stated directly as part of 1.10p10.

The impact of unordered atomic operations

Our memory model proposal differs from the Java one (see Manson, Pugh, and Adve, The Java Memory Model, or the authors' expanded version) in that we do not just include sequenced-before in the happens-before relation. This is motivated by our desire to support "raw" atomic operations that do not ensure acquire-release ordering.

To understand the impact of such raw atomic operations it is best to look at a sequence of examples. We will write =raw for assignments that involve either an unordered load or store. We will assume that all variables have atomic integer type and initially have zero values.

First observe that in the absence of acquire and release operations, we effectively do not impose any ordering on operations performed within a thread. Consider the (perhaps overused) example:

Thread 1: x =raw 1; r1 =raw y;
Thread 2: y =raw 1; r2 =raw x;

If we view threads as being executed by interleaving the evaluation of statements (or instructions) from each thread, then we should never get r1 = r2 = 0. In reality, this outcome is possible, because both the compiler and the hardware may reorder the two independent operations performed by each thread.

This kind of hardware-based reordering can in fact occur on most multiprocessors, including recent X86 multiprocessors. Preventing it typically involves the insertion of fence instructions, which may cost dozens, or even hundreds, of cycles.

This is already reflected in our definition of happens-before. There are no happens-before relationships between any of the operations performed by this program. (Formally, only the static initializations of x and y happen-before any of the expression evaluations in the program.) Thus neither of the loads of x and y is restricted from seeing either the corresponding store or the static initialization, and r1 = r2 = 0 does not contradict any of our rules.

(Note that in this case there is actually no difference between raw operations and acquire-release operations. The latter might introduce synchronizes with relationships, but since in each thread the store precedes the load, there are no interesting inter-thread ordered before relationships. This will be different in the following examples.)

(So far the rules are also the same for ordinary integer variables as for atomics with unordered ("raw") assignments. The crucial difference is that if we used ordinary variables, all the examples in this section would contain data races, and thus exhibit undefined behavior.)

Next we consider two slightly different examples, the first one of which we already handle correctly. Consider first:

Thread 1: r1 =raw x; y =raw 1;
Thread 2: r2 =raw y; x =raw 1;

Again there are no happens-before relationships, except for those involving static initialization. Hence both r1 and r2 may obtain either a zero or one value. Although r1 = r2 = 1 is less likely in practice than r1 = r2 = 0 in the preceding example, and fewer hardware architectures will allow reordering in this case, it has to be allowed for essentially the same reasons.

But now contrast this with

Thread 1: r1 =raw x; y =raw r1;
Thread 2: r2 =raw y; x =raw r2;

By the rules we have specified so far, it is also permissible for the loads appearing first in each thread to see either the initial zero value, or the value stored by the other thread. This means that r1 = r2 = 1 continues to be possible if both of the initial loads see a value of 1, causing the stores to store 1 into x and y. The happens-before rules are not violated and each thread in isolation executes according to the language rules.

Unfortunately, the same reasoning can be used to justify r1 = r2 = 42 in this last example. The reasoning to justify this is blatantly circular, but nothing prevents this. So far, there are no cyclic precedes relationships, since there is no inter-thread ordered before relationship between the two statements in each thread.

In order to avoid this kind of circularity, and the resulting potential for "out of thin air" results, we added a second clause to the definition of inter-thread ordered before in 1.10p5 to order stores that depend on prior loads. Thus in the actual N2052 model, this example, but not the preceding one, results in a cyclic precedes relation, and r1 = r2 = 1 (or r1 = r2 = 42) is disallowed.

Inter-thread data races

The above definitions tell us which assignments to scalar objects can be seen by particular evaluations. Hence they tell us how threads interact and, together with the single threaded semantics already in the standard, give us a basic multithreaded semantics. This semantics is used in the following two ways:
  1. It helps us to define when the execution of a program encounters a data race (or inter-thread data race).
  2. It defines the semantics of race-free programs.
For reasons illustrated previously, it does not define the semantics of programs with races.

There are several possible definitions of a data race. Probably the most intuitive definition is that it occurs when two ordinary accesses to the same scalar object, at least one of which is a write, are performed simultaneously by different threads. Our definition is actually quite close to this, but varies in two ways:

  1. Instead of restricting simultaneous execution, we ask that conflicting accesses by different threads be ordered by happens-before. This is equivalent in most cases, but fits better with our other definitions. As shown in Boehm, Reordering Constraints for Pthread-Style Locks, it also potentially allows a cheap implementation of locks at the cost of disallowing some really obscure and undesirable coding practices.
  2. We have to accommodate the fact that updates to bit-fields are normally implemented by loading and storing back groups of adjacent bit-fields. Thus a store to a bit-field may conflict with a concurrent store to an adjacent bit-field by another thread. If they overlap, one of the updates may be lost. This is reflected in the definition by defining data races in terms of abstract memory locations which include entire sequences of adjacent bit-fields, instead of just scalar objects.

Simpler rules for programmers

Based on this definition, it becomes a theorem that programs using simple locks to protect shared variables from simultaneous access (other than simultaneous read access) behave as though they were executed by simply interleaving the actions of each thread. Depending on the final semantics for atomics, it may be possible to generalize this to some uses of atomics. We expect that this is the rule that will be taught to most programmers.

Acknowledgements

As mentioned earlier, this description relies heavily on the prior work on the Java memory model. Beyond the Java work, and later discussions with its authors, it has benefited substantially from the various C++ threading discussions, particularly those with Clark Nelson, who wrote most of the words in N2052 and helped to clean up the memory model description, and those with Herb Sutter and Doug Lea.