Doc. no.	N1942=06-0012
Date:	2006-02-26
Reply to:	Hans Boehm <Hans.Boehm@hp.com>

A Memory Model for C++: Strawman Proposal

Rationale for the Overall Approach
Data Races and our Approach to Memory Consistency
Memory Model Examples
Consequences
Constructors
Function local statics
Volatile variables and data members
Thread-local variables and stack locations
Library Changes
Exceptions, signals, cancellation ...

This is an attempt to outline a "memory model" for C++. It addresses the multi-threaded semantics of C and C++, particularly with respect to memory visibility. We concentrate on the question of what values a load of an object (i.e. an l-value to r-value conversion) may observe.

Much of this proposal is still rather informal. It has benefitted from input from many people, including Andrei Alexandrescu, Kevlin Henney, Ben Hutchings, Doug Lea, Jeremy Manson, Bill Pugh, Alexander Terekhov, Nick Maclaren, and others. It is mostly an attempt by Hans Boehm to turn the discussion results into a semi-coherent document. It builds on the earlier papers we have submitted to the committee: N1680=04-0120, N1777=05-0037, and N1876=05-0136, as well as the work cited there.

We use a style in which the core document is presented in the default font and color. We occasionally insert additional more detailed discussion, motivation, and status as bracketed text in a smaller green font. This additional text is not fundamental to understanding the proposal.

A more dynamic, evolving, and probably less consistent, version of this document, along with further background material, can be found here.

Rationale for the overall approach

One possible approach to a memory model would be to essentially copy the Java memory model.

The reasons we are not currently pursuing that route, in spite of the overlap among the participants are:

The Java approach is required to ensure type-safety, and more particularly to ensure that no load of a shared pointer can ever result in a pointer to an object that was constructed with a different type. C++ does not have this property to start with, and hence we are free to continue to violate it.
This would be different from the approach taken by pthreads (and probably implicitly by other thread libraries) which give undefined semantics to any code involving a data race.
The Java approach requires ordinary reads to be atomic for certain data types, including pointers. We expect this would be problematic on some embedded hardware with narrow memory buses.
The Java approach requires a complex model to deal with causality. Our current proposal avoids this for ordinary memory operations. For atomic operations, which we expect to be much less frequently used, we use an alternate formulation, which is probably harder to formalize completely, but appears more easily comprehensible.
The current pthreads approach allows some compiler transformations that are not legal in Java, but probably moderately common in the context of C++. These produce unexpected results in the presence of data races. But given the pthreads approach of outlawing data races, they appear benign, and can improve performance. In particular, assuming we have no intervening synchronization actions, compilers currently may:
1. "Rematerialize" the contents of a spilled register by reloading its value from a global, if single-threaded analysis shows that the global may not have been modified in the interim. This is only detectable by programs that involve a data race. (This means that r1 == r1 may return false if r1 is a local integer loaded via a data race. That's OK since the program has undefined behavior.)
2. Introduce redundant writes to shared globals. For example, if we are optimizing a large synchronization-free switch statement for space, all branches but one contain the assignment x = 1;, and the last one contains x = 2; we can, under suitable additional assumptions, put x = 1; before the switch statement, and reassign x in the one exceptional case.
3. Insert code to do the equivalent of if (x != x) <start rogue-o-matic> at any point for any global x that is already read without intervening synchronization. That's perhaps not very interesting. But it does point out that analysis in current compilers is based on the assumption that there are no concurrent modifications to shared variables. We really have no idea what the generated code will do if that assumption is violated. And the behavior cannot necessarily be understood by modeling reads of globals to return some sort of nondeterministic value.
None of these transformations are legal in Java. Variants of all of them appear potentially profitable and/or useful as part of a debugging facility.
The current pthreads approach does not require memory fences (sometimes call barriers) between the time an object is constructed, and the time it is "published" by storing a pointer to it in a location that is accessible to a different thread. Without such a fence, another thread may see an apparently valid pointer to an object with an invalid vtable pointer. Hence virtual function calls may fail if such a virtual function of such an object is invoked from another thread. This is hard to describe without allowing completely undefined behavior for races. And we conjecture that it is hard to remedy without significant slowdown of constructors. (Note that in Java, a constructor also requires object allocation, and hence more overhead is expected.)

Hence we are currently pursuing an approach which has been less well explored, and is thus probably riskier. We concentrate on precisely defining the notion of a data race (a deficiency of the pthread definition), but then largely stick with the pthreads approach of leaving the semantics of data races undefined.

This is complicated somewhat by the desire to support "atomic operations" which are effectively synchronization operations that can legitimately participate in data races.

[ This is at least our third attempt at a description of the memory model. The first attempt was to define the semantics as sequential consistency for programs that had no data races, where a data race was defined as consecutive execution of conflicting operations in a sequentially consistent execution. That was nice and simple. But it makes it very difficult to define synchronization primitives that allow some reordering around them. These include even a pthread-like locking primitive that has only "acquire" semantics, i.e. allows preceding memory operations to be moved into the critical section. Surprisingly, at least for one of us, pthread_mutex_lock does not have that property, due to the presence of pthread_mutex_trylock. ]

Data Races and our Approach to Memory Consistency

We approach the problem of defining a memory model or multi-threaded semantics for C++ as follows:

We define a memory location to be a scalar variable, data member, or array element, or a maximal sequence of contiguous bit-fields in a struct or class. (This almost corresponds to the meaning of "object" in the current standard.)
We define an action performed by a thread to be a particular program point in the source program, together with the value it produced and/or stored. Each action corresponds to either a call, or a primitive operation, or an access of a memory location. (We view built-in assignments of struct or class objects as sequences of assignments to their fields.) We are primarily interested in the latter, which we classify as either load or store actions. Operations provided by the atomic operations library result in atomic or synchronization actions, as do locking primitives. Each such action may be an atomic load, atomic store, both or neither.
[ There is an argument for introducing alternate syntax for atomic operations, e.g. by describing variables as __async volatile. (It was concluded that recycling the plain volatile keyword introduces a conflict with existing usage, a relatively small, but nonzero, fraction of which is demonstrably legitimate. Whether or not that constitutes sufficient reason to leave the current largely platform-dependent semantics of volatile in place remains controversial.) Introducing some such syntax may make it easier to code idioms like double-checked locking, and to use consistent idioms across programming languages. Without a better understanding of the form the atomic operations library will take, it is unclear whether this argument is valid. The argument against it is simplicity, and elimination of an ugly or reused keyword. ]
Below we define a notion of a consistent execution. An execution will consist of a set of actions, corresponding to the steps performed by each of the threads, together with some relations that describe the order in which each thread performs those actions, and the ways in which the threads interact. For an execution to be consistent, both the actions and the associated relations have to satisfy a number of constraints, described below.
In the absence of atomic operations, which may concurrently operate on the same data for different threads, this notion is equivalent to Lamport's definition of sequential consistency.
We define when a consistent execution has a data race on a particular input. Note that the term is used to denote concurrent conflicting accesses to a location using ordinary assignments. Concurrent accesses through special "atomic operations" do not constitute a data race.
Programs have undefined semantics on executions on which a data race is possible. We effectively view a data race as an error.
If a program cannot encounter a data race on a given input, then it is guaranteed to behave in accordance with one of its consistent executions.

Note that although our semantics is essentially defined in terms of sequential consistency, it is much weaker than sequential consistency. We essentially forbid all programs which depend implicitly on memory ordering. Hence we continue to allow substantial amounts of memory reordering by either the compiler or hardware.

We now define two kinds of order on memory operations (loads, stores, atomic updates) performed by a single thread. The first one is intended to replace the notion of sequence point that is currently used in parts of the standard:

The is-sequenced-before relation

If a memory update or side-effect a is-sequenced-before another memory operation or side-effect b, then informally a must appear to be completely evaluated before b in the sequential execution of a single thread, e.g. all accesses and side effects of a must occur before those of b. This notion does not directly imply anything about the order in which memory updates become visible to other threads.

We will say that a subexpression A of the source program is-sequenced-before another subexpression B of the same source program to indicate that all side-effects and memory operations performed by an execution of A occur-before those performed by the corresponding execution of B, i.e. as part of the same execution of the smallest expression that includes them both.

We propose roughly that wherever the current standard states that there is a sequence point between A and B, we instead state that A is-sequenced-before B. This will constitute the precise definition of is-sequenced-before on subexpressions, and hence on memory actions and side effects.

A very similar change in the specification of intra-thread sequencing of operations is being simultaneously proposed by Clark Nelson in N1944=06-0014, which explores the issue in more detail. Our hope is that a proposal along these lines will be accepted, and can serve as the precise definition of is-sequenced-before.

[ Based on recent email discussions, at this point there appears to be some uncertainty in the interpretation of the current standard, which is complicating matters. The main issue seems to be the precise meaning of the restriction on interleaving of function calls in arguments. It appears important to resolve this even in the single-threaded case. ]

The is-inter-thread-ordered-before relation

An action A is-inter-thread-ordered-before an action B if they are both executed by the same thread, and one of them is an atomic or synchronization operation that guarantees appropriate inter-thread visibility ordering. We specify these ordering constraints with the atomic operations library.

An ordinary memory operation is never inter-thread-ordered-before another ordinary memory operation.

Most atomic operations will specify some combination of acquire and release ordering constraints, which enforce ordering with respect to subsequent and prior memory actions, respectively. These constraints are reflected in the is-inter-thread-ordered-before relation.

Lock acquisition imposes at least an acquire constraint, and lock release will normally impose a release constraint. Whenever an action A has an acquire constraint, and A is-sequenced-before B, then A is-inter-thread-ordered-before B. Whenever A has a release constraint, and B is-sequenced-before A, then B is-inter-thread-ordered-before A.

The depends-on relation

Consider a given execution of a particular thread, i.e. the sequence of actions that may be performed by a particular thread as part of the execution of the containing program. If, as a result of changing only the value read by an atomic load L, a subsequent atomic store S either can no longer occur, or must store a different value, then S depends on L.

Note that the definition of depends-on is relative to a particular execution, and always involves a dependence of an atomic store on an atomic load. Ordinary memory operation do not depend-on each other. We need the depends-on relation only to outlaw certain anomalous executions of atomic operations that informally violate causality, i.e. in which an atomic operation causes itself to be executed in a particular way.

We next discuss relations on actions between threads.

The communicates-with relation

We specify interactions between threads in an execution using a communicates-with relation. Informally, an action A communicates-with another action B if B "sees" the result of A. The definition of each kind of atomic operation will specify the other operations with which it can communicate. A store to an ordinary variable communicates-with a load that retrieves the stored value.

Informally, a lock release communicates-with the next acquisition of the same lock. A barrier (in the pthread_barrier_wait sense or OpenMP sense) communicates-with all corresponding executions of the barrier in other threads. A memory fence (or barrier in the other sense) communicates-with the next execution of the fence, usually by another thread. An atomic store communicates-with all atomic loads that read the value saved by the store, i.e. for this purpose they behave like ordinary loads and stores.

We now have enough machinery to describe the ordering we really use to describe memory visibility.

The happens-before relation

[ Note that this definition of happens-before is a bit different from that used in Lamport's original 1978 paper ("Time, Clocks, and the Ordering of Events in Distributed Systems", CACM 21,7), and eventually in the Java model, but it is essentially just an adaptation of Lamport's definition to a system in which actions within a thread are also not totally ordered. The detailed style of definition grew out of a discussion with Bill Pugh, though we're not yet sure he approves of the result. ]

Given is-sequenced-before, is-inter-thread-ordered-before, depends-on and communicates-with relations on a set of actions, we define the happens-before relation to be the smallest transitively closed relation satisfying the following constraints:

If A and B are ordinary (not atomic!) memory references, and A is-sequenced-before B, then A happens-before B.
If A is-inter-thread-ordered-before B (and hence they are executed by the same thread, at least one of them is an atomic operation with an ordering constraint, and A is-sequenced-before B), then A happens-before B.
If an atomic store B depends-on an earlier atomic load A in the same thread, then A happens before B. In particular, assuming all assignments are atomic operations, there is no happens-before ordering between the load of x and the store of y in r1 = x; y = 1 (r1 not shared between threads), but there is such an ordering between the two halves of r1 = x; y = r1.
If A communicates-with B, then A happens-before B.

Consistent executions

A program execution is a quintuple consisting of

set of thread actions, and corresponding
is-sequenced-before,
is-inter-thread-ordered-before,
depends-on, and
communicates-with relations.

These give rise to a corresponding happens-before relation. We say that an execution is consistent if:

The actions of any particular thread (excluding values read from potentially shared locations), and the corresponding is-sequenced-before relation, is-inter-thread-ordered-before relation, and depends-on relation, are all consistent with the normal sequential semantics as given in the rest of the standard.
The communicates-with relation is such that for every ordinary load L which sees a value stored by another thread, a store S communicates-with L, such that S stores the value seen by L
The communicates-with relation is consistent with the constraints imposed by the definitions of the synchronization primitives. For example, if S is an atomic store which communicates-with an atomic load L, then the loaded and stored values must be the same.
(intra-thread visibility) If a load L sees a store S from the same thread, then L must not occur-before S, and there must be no intervening store S' such that S is-sequenced-before S' and S' is-sequenced-before L.
(inter-thread visibility) Each load L of any shared variable (including synchronization variables) sees a store S, such that L does not happen-before S and such that there is no intervening store S' such that S happens-before S' and S' happens-before L.
The happens-before relation is "acyclic", i.e. no action happens-before itself.
[ This means we view the relation as normally irreflexive. If we normally want it to be reflexive, we can tweak this slightly. ]

Note that if no thread performs any synchronization actions then the happens-before relation requires that the actions of a given thread effectively occur in "is-sequenced-before" order, which is as close as C++ gets to purely sequential execution. This in this case an execution is consistent iff it is sequentially consistent.

If lock/unlock enforce only acquire/release ordering, and there is no other form of synchronization, then it is less apparent that our definition is equivalent to sequential consistency. However, this can be proven if there is no trylock primitive.

[ At least some of us believe that the most plausible interpretation of the existing pthread semantics can be closely approximated by defining the various lock() primitives such that they have both acquire and release semantics. This still leaves issues related to failing pthread calls, etc. We believe that these introduce no fundamental technical challenges, but the details are not currently completely clear. ]

[ The fact that we insist that we require much stronger ordering for ordinary memory accesses than for atomic accesses initially seems out of place here. But, as Bill Pugh points out, simple Java-like happens-before consistency is otherwise insufficient. (For an example, see below.) And the ordering constraints on ordinary memory actions really only affect the definition of a data race; the meaning of data-race-free program is not affected, since this ordering is invisible. ]

Data races

We define a memory location to be a variable, (non-bitfield) data member, array element, or contiguous sequence of bitfields. We define two actions to be conflicting if they access the same memory location, and at least one of them is a store access.

We define an execution to contain an intra-thread race if a thread performs two conflicting actions, and neither is-sequenced-before the other.

We define an execution to contain an inter-thread race if two threads perform two conflicting actions, and neither happens-before the other.

[ I'm not sure this is quite the best way to state this, since the "communicates-with" relation on ordinary memory accesses contributes to happens-before. Thus "conflicting" actions may "communicate-with" each other, and thus "happen-before" each other, and thus no longer conflict. I think this technically doesn't matter because non-conflicting actions on ordinary memory normally only imply happens-before relationships that must exist anyway. And if there is an execution with an initial conflict that is eliminated by the "communicates-with" edge generated by the conflict, then there is an alternate consistent execution in which that edge doesn't exist, and there is a real conflict. But I don't like the fact that a subtle argument is required to demonstrate that this definition is sane.

We do need to include the extra "communicates-with" edges in the requirement that happens-before is acyclic. Otherwise we get nothing like sequential consistency in the data-race-free case. ]

If, for a given program and input, there are consistent executions containing either kind of race, then the program has undefined semantics on that input. Otherwise the program behaves according to one of its consistent executions.

[ Bill Pugh points out that that the notion of "input" here isn't well defined for a program that interacts with its environment. And we don't want to give undefined semantics to a program just because there is some other sequence of interactions with the environment that results in a data race. We probably want something more along the lines of stating that every program behavior either

corresponds to a consistent execution in which loads see stores that happen-before them, or

there is a consistent execution with a data race, such that calls to library IO functions before the data race are consistent with observed behavior.

I think the notion of "before" in the second clause is easily definable, since we can insist that IO operations be included in the effectively total order of ordinary variable accesses.

It is unclear to me whether this is something that needs to be addressed with great precision, since the current standard doesn't appear to, and I think the danger of confusion is minimal. ]

For purposes of the above definitions, object destruction is viewed as a store to all its sub-objects, as is assignment to an object through a char * pointer if the target object was constructed as a different type. Assignment to a component of a particular union member is treated as a store into all components of the other union members. Different threads may not concurrently access different union members.

Discussion and Examples

Unlike the Java memory model, we do not insist on a total order between synchronization actions. Instead we insist only on a communicates-with relation, which must be an irreflexive partial order. (This follows from the fact that it is a subset of happens-before, which is irreflexive and transitive.) This means that synchronization actions such as atomic operations are themselves not guaranteed to be seen in a consistent order by all other threads, and may become visible at other threads in an order different from the intra-thread is-sequenced-before order.

Simple Locks

In the case of simple locks, this is possible only in that a "later" (in is-sequenced-before order) lock acquisition may become visible before an "earlier" unlock action on a different thread. Thus (with hypothetical syntax and lock and unlock primitives that take references and which have only acquire and release semantics, respectively):

lock(a); x = 1; unlock(a); lock(b); y = 1; unlock(b);

may appear to another thread to be executed as

lock(a); lock(b); y = 1; x = 1; unlock(a); unlock(b);

Unlike in Java, a hypothetical observer thread might see the assignments occur out of order. However, so long as x and y are ordinary variables, and the assignments do not reflect atomic operations, this is not observable, since such an observer thread would introduce a race.

Note that although our semantics allows the above reordering in a particular execution, compilers may not in general perform such rearrangements, since that might introduce deadlocks.

We claim that for data-race-free programs using only simple locks and no atomic operations, our memory model is identical to the Java one.

In the case of simple locks, we effectively insist on a total order among synchronization operations, happens-before is an irreflexive partial order, and everything behaves as in the Java memory model.

Simple Atomic Operations

For the remainder of this section, we assume that all variables are initialized to zero, variables whose names begin with r are local, and all other variables are shared, and that all operations are considered to be synchronization operations, and hence may be safely involved in data races, even if they are written using simple assignment syntax. Acquire and release operations will be explicitly specified, however.

Next consider load_acquire and store_release primitives, where there must be an action that communicates-with every load_acquire: either an initialization or a store_release on the same variable. But there are no other restrictions on communicates-with.

Consider the following example:

Thread1 Thread2

x = 1;
store_release(&flag, 1); r1 = load_acquire(&flag);
r2 = x;

Due to the acquire and release ordering constraints on the references to flag, the individual pairs of assignments in each thread are ordered by is-inter-thread-ordered-before.

Only two possible actions can communicate-with the load_acquire in this example: The initialization of the flag variable, or the store_release action. It follows that if we get r1 = 1, then the store_release must have communicated-with the load-acquire. Given the ordering constraints,this implies that the assignment to x happens-before the assignment to r2. Hence r1 = 1 and r2 = 0 is an impossible outcome, as desired.

Next consider the following example, under the same ground rules:

Thread1 Thread2

store_release(&y, 1);
r1 = load_acquire(&x); store_release(&x, 1);
r2 = load_acquire(&y);

There is no is-inter-thread-ordered-before ordering between the statements in each thread. Initialization operations may synchronize with both load_acquire operations. Hence we can get r1 = r2 = 0, as expected.

We can model memory fences as synchronization operations with both acquire and release semantics, which must be totally ordered, and the communicates-with relation must respect the total order, i.e. each fence instance communicates-with the next fence instance in the total order. Consider:

Thread1 Thread2

x = 1;
r1 = z;
fence();
y = 1;
r2 = w; w = 1;
r3 = y;
fence();
z = 1;
r4 = x;

In any given execution either the thread 1 fence communicates-with the thread 2 fence, i.e. the thread 1 fence executes first (a), or vice-versa (b). If r3 = 1, then we must be in the first case. (If thread 2's fence came first, the load of y happens-before the store, and hence cannot see the store.) Hence r4 must also be one. Similarly r1 = 1 implies r2 = 1.

Next consider the following suggestive example, where initially self_destruct_left and self_destruct_right are false:

Thread1 Thread2

while (!self_destruct_left);
self_destruct_right = true;
blow_up_port_side_of_ship(); while (!self_destruct_right);
self_destruct_left = true;
blow_up_starboard_side_of_ship();

We would like to avoid the situation in which each while-loop condition sees the store from the other thread, and the ship spontaneously self-destructs without intervention of a third thread.

Assume there is no other thread which sets either of the self_destruct... variables. Assume the thread 1 loop nonetheless terminates. This is only possible if it saw the store in thread 2. For such an execution to be intra-thread consistent, thread 2's loop must have seen the store from thread1. Thus in each case, the store depends-on the load in the while loop in the same thread, and the store communicates-with the loop in the other thread. The happens-before relation must be consistent with all of these and transitively closed. Hence all actions mentioned are in a cycle, i.e. happen-before themselves. Thus such an execution is not allowed.

If we use ordinary memory operations instead of atomic operations in the above example, then the loads in the while loops are sequenced-before the stores in the same thread. We need the same communicates-with relationships as before, thus happens-before will again be cyclic, and this version of the program is also safe.

[ Unlike earlier versions of this proposal, the presence of the depends-on relation seems to introduce enough of a causality requirement here to prevent the anomalous outcome if we use unordered atomic operations. With ordinary memory operations there is nothing surprising. By adding communicates-with relationships for all matching cross-thread store-load pairs, and insisting that the result not contain cycles, we are essentially insisting on sequential consistency. Needs more examples. ]

Member and Bitfield Assignments

As we stated above, struct and class members are normally considered to be separate memory locations, and thus assignments to distinct fields do not conflict. The only exception to this is that an assignment to a bit-field conflicts with accesses to any other bit-field in the same sequence of contiguous bit-fields. For example, consider the declaration:

struct s {
  char a;
  int  b:9;
  int  c:5;
  char d;
  int  e:1;
};

An assignment to the bit-field b conflicts with an access to c, but no accesses to a different pair of fields conflicts.

Note that in some existing ABIs, fields a,b, c, and d are allocated to the same 32-bit word. With such ABIs, compilers may not implement an assignment to b as a 32-bit load, followed by an in-register bit replacement, followed by a 32-bit store. Such implementations do not produce the correct result if a or d are updated concurrently with b.

[ Note that the above example illustrates the most controversial aspect of this rule that we have found. It will take work to make existing compilers conform. For example, gcc on X86 currently does not. However, the resulting slowdown in object code appears to be very minor. To our knowledge, cases like the above exist, but are rare, in production code. And the necessary overhead amounts to some additional in-register computation, and a second store instruction to the same cache line as the first.

The alternatives would add complexity to the programming task, which some of us believe we want to avoid unless the benefits are much larger than this. Consider the slight variant of the above example:

struct s {
  something_declared_third_party_header a;
  int  b:9;
  int  c:5;
  char d;
  int  e:1;
};

Whether or not accesses to a and b conflict is now hard to predict without implementation knowledge of a's type, even if we understand the ABI conventions.

Probably the only plausible alternative is to allow bit-field accesses to conflict with accesses to any other data member in the same struct or class, and to thus encourage bit-fields to only appear in a nested struct. But this would add a rather obscure rule to the already complicated set that must be understood by threads programmers. ]

As far as the programmer is concerned, padding bytes are not written as part of a bit-field or data member update, i.e. such writes do not need to be considered in determining whether there is a data race. But we are not aware of cases in which a compiler-generated store that includes a padding byte would adversely impact correctness.

Consequences

The preceding rules have some significant non-obvious consequences. Here we list some of these rather informally. Proofs would clearly be useful, where they are missing.

Programs using no synchronization operations other than simple locks, and which allow no data races, behave sequentially consistently. This is critical, in that it programmers a much simpler rule to follow, which is sufficient in perhaps 98% of all cases. An informal proof is here.
Compilers may generally move ordinary memory references, such that both source and target instruction locations appear in the same executions, subject to the usual sequential correctness constraints, provided the reference is not moved across synchronization operations. If such movement were observable, the observer must see it without an intervening release operation in the original thread and corresponding acquire operation in the observing thread, or vice-versa. Thus a load must see a store that does not happen-before it, and the program must have invoked undefined behavior to start with.
By arguments similar to the above, compilers may turn a load from a shared memory location into two adjacent loads from a memory location and treat the two values interchangeably. Using the preceding argument, these reads may then be moved apart. Thus if the contents of a register were generated by reading the value of a shared global, and then spilled, in the same basic block without intervening synchronization operations, it is acceptable to reload the register from the shared global.
By a similar argument, it is also acceptable to duplicate stores. As was pointed out earlier, this may be profitable in code size if a store is repeated in all but one branch of a large switch statement.
Compilers may not generally move operations forward across constructs that may block or in some other way provide mutual exclusion (potentially infinite loops, lock acquisitions), nor backward out of regions that may provide mutual exclusion (lock releases). Doing so might introduce races and cause the operations to occur at a point at which they could not occur in a sequentially consistent execution. Reuse of common subexpressions counts as movement in this sense.
Compilers should generally ensure that acquire and release operations become visible in the specified order, except that a release operation may become visible later than a subsequent acquire operation. They must also ensure that if release operation R in one thread may affect a later acquire operation A in another thread then generally any memory operations preceding R should be visible to code following A. To understand this, consider the first load_acquire/store_release example above. On some architectures (we believe current X86 implementations qualify here) this is only a compiler constraint. Other architectures (e.g. PowerPC, Alpha) will require some sort of memory fence after an acquire operation and before a release operation. Itanium provides some more direct support for acquire and release operations.
Compilers may not generally introduce new stores to potentially shared memory locations, which are not otherwise guaranteed to be written along the same path, since these may race with stores in other threads, and potentially cause those other stores to be lost. This is true even if the write writes a value that was previously read from the same location. The only exception is that a write to a bit-field may involve reading and rewriting of adjacent bit-fields.

Here we list some more detailed implications of the last statement:

Structure or class data member assignments may not be implemented in a way that overwrites data members not assigned to in the source, unless the assigned and overwritten members are adjacent bit-fields.
Some (many) ABIs provide very weak alignment guarantees for bit-fields following, say, a char member. In these cases, stores to the bit-field must be implemented with multiple stores, rather than by overwriting adjacent fields, as in the discussion above.
Compilers may not introduce speculative stores to potentially shared locations, for example as a result of speculative register promotion. To our knowledge, nearly all optimizing compilers currently perform speculative register promotion in ways that would no longer be allowed under this proposal. Such optimizations are not currently thread-safe in any real sense. In most cases there seem to be alternate transformations that achieve similar performance.

Constructors

We make no specific guarantees that initialization performed by a constructor will be seen to have been completed when another thread accesses the object. A correct program must use some sort of synchronization to make a newly created object visible to another thread. Otherwise the store of the object point and the load by the other thread constitute a data race, and the program has undefined semantics.

If proper synchronization is used, there is no need for additional guarantees, since the synchronization will ensure visibility as needed.

This is consistent with current practice, which does not guarantee even vtable visibility in the absence of synchronization, and allows insufficiently synchronized programs to crash, jump to the wrong member function, etc.

Function local statics

If the declaration of a function local static is preceded by the keyword protected, then the access to the implicit flag used to track whether a function local static has been constructed is a synchronization operation. Otherwise it is not, and it is the programmer's responsibility to ensure that neither construction of the object not reference to the implicit flag variable introduces a data race.

[There appeared to be consensus among those attending the Lillehammer C++ standards meeting that both options should be provided to the programmer. Subsequent discussion pointed out that it is more reasonable than some of us had thought to always require thread-safety. In particular, there seem to be no practical cases in which a compiler decision to implement an initialization statically breaks ordering guarantees that would reasonably be expected. The down side is that this imposes some overhead on uses that do not require synchronization. On X86, this overhead can be significant for the initialization, but probably not for later uses. On some other architectures significant overhead may be introduced even for later references. Currently I think this issue is unresolved.

The protected keyword was chosen arbitrarily, and should be considered more carefully. ]

Volatile variables and data members

Accessed to regular volatile variables are not viewed as synchronization operations. Volatile implies only safety in the presence of implicit or unpredictable actions by the same thread.

If the atomic operations library turns out to be insufficiently convenient to provide for lock-free inter-thread communication, we propose that accesses to __async volatile variables and data members are viewed as synchronization operations.

Loads of such variables would have an acquire ordering constraint, and stores would have a release constraint.

[ It seems to make sense to put this on hold until we have a better handle on the atomic operations library, so that we can tell whether that would be a major inconvenience.

The possible reasons to retain this are (1) possibly improved convenience, and (2) possibly better consistency in programming idioms across languages (in this case Java and C#). The argument for discarding it is simplicity.

If we want to retain it, we now have to ask whether there is a total order among volatile accesses.

Current implementations of volatile generally use weaker semantics, which do not prevent hardware reordering of volatiles. This appears to have no use in portable code for threads, since such code cannot take advantage of the fact that operations are reordered "only by the hardware". It is occasionally useful for variables that are either modified after a setjmp, that may be accessible through multiple memory mappings, or the like. ]

There are no atomicity guarantees for accesses to volatile variables. Accesses to __async volatile variables of pointer types, integral types (other than long long variants), bool type, enumeration types, and type bool are atomic. The same applies to the individual data member accesses in e.g. struct assignments, but not to the assignment as a whole. There is no defined order between these individual atomic operations.

[We can't talk about async-signal-safety here. We might suggest that __async volatile int and __async volatile pointers be async-signal-safe where that's possible and meaningful. My concern here is with uniprocessor embedded platforms, which might have to use restartable critical sections to implement atomicity, and might misalign things. }

Thread-local variables and stack locations

This issue is addressed more thoroughly in Lawrence Crowl's proposal (N1874=05-0134). We defer to that discussion.

Library Changes and Clarifications

We need the following kinds of changes to the library specification:

Clarify thread safety

The library specification needs to be clear as to which pieces of the library are thread-safe, and in what sense, and how various calls interact with the memory model. We propose the following basic approach, consistent with the approach used by the SGI STL implementation:

Any data types (e.g. containers) implemented by the library are thread-safe in the same weak sense as an ordinary scalar variable in the base language: The client must ensure that an operation that logically updates a piece of data may not be executed concurrently with another operation that reads or writes the same data. The implementation must protect against accesses to shared data that do not correspond to conflicting accesses at the abstract level, i.e. updates that occur in response to logical "read" operations, or against accesses to a data structure shared by multiple abstract objects. For example, implementations of "read operations" that maintain an internal shared cache will often need internal locks to protect that cache, as will any implementations that maintain other forms of per class, as opposed to per object, data.
A few operations will provide stronger guarantees, e.g. that accesses behave atomically. In these cases, unless otherwise stated, a write operation to a particular piece of shared data communicates-with a read of the corresponding data by another thread.
Normal race-free library calls, i.e. calls which do not guarantee atomicity, introduce no communicates-with relationships. This is true even if the implementation uses internal locks. (Often such internal locks can be replaced by thread-local states, or clever lock-free algorithms, which might no longer guarantee memory ordering.)
Accesses to external data (e.g. files) are treated as though they were accesses to memory data.
[ We need to better understand current conventions here. Presumably a file read is logically a "write" operation, since it updates the current position in the file. But I don't know how much locking is done by current C++ I/O implementations. I suspect it's more than we would like.]
Operations such as allocator<T>::allocate() that return freshly allocated memory are not considered to write shared data. Hence the implementation, not the client, must either guard against concurrent calls, or make them safe, e.g. by using some form of thread-local allocation pools.

We expect that some effort will be required to pin down exactly which operations "logically update" shared state.

[ Paragraph 21.3(5), which deals with basic_string copy-on-write support, will be difficult to support here in any reasonable fashion. I've long advocated stripping it to prohibit copy-on-write since I'm not convinced it makes sense without threads either. Unfortunately, removing it will drastically change the performance characteristics of existing implementations, often for the better, but occasionally for the worse. ]

Add thread-specific library components

At least two kinds of additions will be needed:

The atomic operations library mentioned elsewhere in this document.
Primitives for creating and synchronizing threads.

In the long term, a library containing basic lock-free and scalable data structures is also highly desirable.

All of these are discussed elsewhere.

Exceptions, Signals, and Thread Cancellation

[ It is unclear to what extent this needs to or should be addressed here. I think there is agreement that thread cancellation (though not PTHREAD_CANCEL_ASYNCHRONOUS-style) and exceptions should be unified. But the details are controversial, and that seems to be more of a threads API issue.

Nick Maclaren argues that we need to say something about the state that is visible to an exception handler that was thrown to reflect a synchronous error, such as an arithmetic overflow. Since we are effectively respecifying intra-thread memory visibility, there are strong interactions with threads issues, and the presence of synchronization primitives gives us an opportunity for a meaningful specification that is at least somewhat useful to a programmer, I'm inclined to agree. What follows is an approximate restatement of one of the options he proposed.

This essentially requires that compilers treat operations that may generate exceptions as memory operations, and not move them out of critical sections etc. I would be surprised if existing implementations did so.

This may need further work, even if we go with substantially this statement. In particular, the handler kind of needs to be modelled as a new thread replacing the old one, since it can have an inconsistent view of updates performed by the original thread. But on the other hand, it potentially has access to local variables whose address was not taken, and hence can see otherwise private state of the original thread. ]

If an action A throws an intra-thread out-of-band exception, then all actions that happen-before a synchronization action that happens-before A are visible to the exception handler. Conversely, if A happens-before another synchronization action B, then no action C such that B happens-before C is visible to the exception handler.

For this purpose, there are implicit synchronization actions with both acquire and release semantics (effectively memory fences) at the beginning and end of each thread execution.

[ I'm not sure whether the preceding paragraph really buys us anything. ]

Thread1	Thread2
x = 1; store_release(&flag, 1);	r1 = load_acquire(&flag); r2 = x;

Thread1	Thread2
store_release(&y, 1); r1 = load_acquire(&x);	store_release(&x, 1); r2 = load_acquire(&y);

Thread1	Thread2
x = 1; r1 = z; fence(); y = 1; r2 = w;	w = 1; r3 = y; fence(); z = 1; r4 = x;

Thread1	Thread2
while (!self_destruct_left); self_destruct_right = true; blow_up_port_side_of_ship();	while (!self_destruct_right); self_destruct_left = true; blow_up_starboard_side_of_ship();