Doc. No.:	WG21/N2300 J16/07-160
Date:	2007-06-22
Reply to:	Clark Nelson	Hans-J. Boehm
Phone:	+1-503-712-8433	+1-650-857-3406
Email:	clark.nelson@intel.com	Hans.Boehm@hp.com

Concurrency memory model (revised)

This paper is a follow-on to N2171, which was effectively divided into N2239 and this paper. N2239 is included in the current working paper. This is an update to the remaining changes, which are all related to concurrency.

Changes to the corresponding section of N2171 include:

Various changes to notes, including fixing an editing mistake in 1.10p10, and the addition of some explanatory notes.
Switched to a weaker "synchronizes with" formulation in which a load must see either the value of the store itself, or a derivative obtained by a sequence of RMW operations.
Rephrased the interaction of modification order and visibility in 1.10p10. This version imposes stronger restrictions, and generally disallows "flickering" of values.
Switched to a more conventional happens-before definition, which includes sequenced-before.
Switched to a more conventional formulation in which the "precedes" relation was replaced by a statement that no evaluation can see a store that happens after it, or is sequenced after it.
This avoids some concerns about synchronization elimination. In particular, the old formulation allowed (everything initially zero, atomic):

Thread1                               Thread2
r1 = x.load_relaxed(); // yields 1    r1 = y.load_relaxed(); // yields 1
y.store_relaxed(1);                   x.store_relaxed(1);

but disallowed the corresponding test case if the two statements in each thread were separated by acq_rel read-modify-write operations to dead variables, or a locked region. That interferes with the elimination of locks if the compiler decides to statically combine threads, which is likely to become important for certain programming styles.
Explicitly mentioned the input when referring to a data race. A program may exhibit a data race on some inputs and have well-defined semantics on others.
Added back a proposed paragraph to explicitly address nonterminating loops.
On Beman Dawes' suggestion, added the change to 15.3p9 dealing with uncaught exceptions, and renamed "inter-thread data race" to just "data race".

This version has benefited from feedback from many people, including Sarita Adve, Paul McKenney, Raul Silvera, Lawrence Crowl, and Peter Dimov.

The definition of "memory location"
Multi-threaded executions and data races
Nonterminating loops
Treatment of uncaught exceptions

The definition of "memory location"

New paragraphs inserted as 1.7p3 et seq.:

A memory location is either an object of scalar type, or a maximal sequence of adjacent bit-fields all having non-zero width. Two threads of execution can update and access separate memory locations without interfering with each other.

[Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields, no matter what the sizes of those intervening bit-fields happen to be. —end note ]

[Example: A structure declared as struct {char a; int b:5, c:11, :0, d:8; struct {int ee:8;} e;} contains four separate memory locations: The field a, and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]

Multi-threaded executions and data races

Insert a new section between 1.9 and 1.10, titled "Multi-threaded executions and data races".

1.10p1:

Under a hosted implementation, a C++ program can have more than one thread of execution (a.k.a. thread) running concurrently. Each thread executes a single function according to the rules expressed in this standard. The execution of the entire program consists of an execution of all of its threads. [Note: Usually the execution can be viewed as an interleaving of all its threads. However some kinds of atomic operations, for example, allow executions inconsistent with a simple interleaving, as described below. —end note ] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.

1.10p2:

The execution of each thread proceeds as defined by the remainder of this standard. The value of an object visible to a thread T at a particular point might be the initial value of the object, a value assigned to the object by T, or a value assigned to the object by another thread, according to the rules below. [Note: Much of this section is motivated by the desire to support atomic operations with explicit and detailed visibility constraints. However, it also implicitly supports a simpler view for more restricted programs. See 1.10p11. —end note ]

1.10p3:

Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.

1.10p4:

The library defines a number of operations, such as operations on locks and atomic objects, that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation is either an acquire operation or a release operation, or both, on one or more memory locations. [Note: For example, a call that acquires a lock will perform an acquire operation on the locations comprising the lock. Correspondingly, a call that releases the same lock will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform an acquire operation on A. We do not include "relaxed" atomic operations as "synchronization" operations though, like synchronization operations, they cannot contribute to data races. —end note ]

1.10p5-6, previously containing the definition of "inter-thread ordered before", have been deleted from this revision. 1.10p6 was subsequently replaced by the following paragraph, which was at one point part of 1.10p7. Subsequent paragraphs will be renumbered eventually.

1.10p6:

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. [Note: These are separate orders for each scalar object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different variables in inconsistent orders. —end note ]

1.10p7:

This was weakened since N2171 not to require synchronizes-with for all later reads. Some weakening of the older specs appears to be necessary to preserve efficient cross-platform implementability of low-level atomics. This is probably not the only possible such weakening. But all of them appear to either:

Make the memory model much harder to describe, or
Allow somewhat counterintuitive outcomes for some test cases.

Without the special exemption for read-modify-write operations, we would allow a particularly counterintuitive outcome for one of Peter Dimov's examples (x, y ordinary, v atomic, all initially zero):

Thread1                      Thread2                      Thread3
x = 1;                       y = 1;                       if (load_acquire(&v) == 2)
fetch_add_release(&v, 1);    fetch_add_release(&v, 1);        assert (x + y == 2);

Here the assertion could fail, since only the later fetch_add_release would ensure visibility of the preceding store. The value written by the earlier one might not be seen by Thread3. The special clause for RMW operations prevents the assertion from failing here and in similar examples.

An evaluation A that performs a release operation on an object M synchronizes with an evaluation B that performs an acquire operation on M and reads either the value written by A or, if the following (in modification order) sequence of updates to M are atomic read-modify-write operations or sequentially consistent atomic stores, a value written by one of these read-modify-write operations or sequentially consistent stores. [Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ] [Note: The specifications of the synchronization operations define when one reads the value written by another. For atomic variables, the definition is clear. All operations on a given lock occur in a single total order. Each lock acquisition "reads the value written" by the last lock release. —end note ]

1.10p8:

This has been strengthened since N2171 to include sequenced before in happens before.

An evaluation A happens before an evaluation B if:

A is sequenced before B; or
A synchronizes with B; or
for some evaluation X, A happens before X and X happens before B.

1.10p9 was once proposed to define a "precedes" relation, which is no longer needed. It should eventually be renumbered out of existence. Insisting on an acyclic "precedes" relation potentially interfered with synchronization elimination.

1.10p10:

This paragraph has been revised repeatedly, as we have tried to pin down the interaction with "modification" order, i.e. what's normally known as "cache coherence". Note that directly including modification order in "happens before" is too strong. To see this, consider (everything again initially zero):

Thread1                    Thread2
x.store_relaxed(1);        y.store_relaxed(1);
v.store_relaxed(1);        v.store_relaxed(2);
r1 = y.load_relaxed();     r2 = x.load_relaxed();

If we had a happens-before ordering between the two stores to v, in either direction, we would preclude r1 = r2 = 0, which could usually only be enforced with a fence.

This version was also altered by the removal of the "precedes" relation. Note that the new first clause here may be technically redundant, but I think it is clearer to state it explicitly.

A multi-threaded execution is consistent if

no evaluation happens before itself,
if a side effect W to scalar object M happens before another side effect W' to the same scalar object M, then W must precede W' in M's modification order,
and if for every read access R to scalar object M that observes value a written by side effect W the following conditions hold:

R does not happen before W.
There is no side effect W' to M such that

W happens before W', and
W' happens before R.

If read access R happens before read access R' to the same scalar object M which observes value b, then the corresponding side effect assigning b to M may not precede the side effect W in M's modification order.

[Note: The first condition states essentially that the happens-before relation consistently orders evaluations. We cannot have A happens before B, and B happens before A, since that would imply A happens before A. The second condition states that the modification orders must respect happens before. The third condition implies that a read operation R cannot "see" an assignment W if R happens before W. The fourth condition effectively asserts that later assignments hide earlier ones if there is a well-defined order between them. The fifth condition states that reads of the same object must observe a sequence of changes that is consistent with that object's modification order. This last condition effectively disallows compiler reordering of atomic operations to a single object, even if both operations are "relaxed" loads. By doing so, we effectively make the "cache coherence" guarantee provided by essentially all hardware available to C++ atomic operations. —end note ]

1.10p11:

An execution contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any data race results in undefined behavior. A multi-threaded program that does not allow a data race for the given inputs exhibits the behavior of a consistent execution, as defined in 1.10p10. [Note: It can be shown that programs that correctly use simple locks to prevent all data races, and use no other synchronization operations, behave as though the executions of their constituent threads were simply interleaved, with each observed value of an object being the last value assigned in that interleaving. This is normally referred to as "sequential consistency". However, this applies only to race-free programs, and race-free programs cannot observe most program transformations that do not change single-threaded program semantics. In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation. —end note ]

1.10p12:

[Note: Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. —end note ]

1.10p13:

[Note: Transformations that introduce a speculative read of a shared variable may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. —end note ]

Nonterminating loops

It is generally felt to be important to allow the transformation of potentially nonterminating loops (e.g. by merging two loops that iterate over the same potentially infinite set, or by eliminating a side-effect-free loop), even when the transformation would not otherwise be justified because the first loop might never terminate.

Existing compilers commonly assume that code immediately following a loop is executed if and only if code immediately preceding a loop is executed. This assumption is clearly invalid if the loop fails to terminate. Even if we wanted to prohibit this behavior, it is unclear that all relevant compilers could comply in a reasonable amount of time. The assumption appears both pervasive and hard to test for.

The treatment of nonterminating loops in the current standard is very unclear. We believe that some implementations already eliminate potentially nonterminating, side-effect-free, loops, probably based on 1.9p9, which appears to impose very weak requirements on conforming implementations for nonterminating programs. We had previously arrived at a tentative conclusion that nonterminating loops were already sufficiently weakly specified that no changes were needed. We no longer believe this, for the following reasons:

On closer inspection, it is at best unclear that this reasoning would continue to apply in a world in which the program may terminate even if one of the threads does not.
In the presence of threads, the elimination of certain side-effect-free potentially infinite loops (e.g. while (!please_self_destruct.load_acquire()) {}; self_destruct()) is clearly hazardous, and a bit more clarity seems appropriate.

Hence we propose the following addition:

6.5p5:

A nonterminating loop that

performs no I/O operations, and
does not access or modify volatile objects, and
performs no synchronization or atomic operations
invokes undefined behavior. [Note: This is meant to allow compiler transformations, such as removal of empty loops, even when termination cannot be proven. —end note]

We had previously discussed limiting "undefined" behavior to certain optimizations. But it is unclear how to do that usefully, such that there are any programs that could usefully take advantage of such a statement.

This formulation does have the advantage that it makes it possible to painlessly write nonterminating loops that cannot be eliminated by the compiler, even for single-threaded programs.

Treatment of uncaught exceptions

15.3p9:

[Beman Dawes suggestion, reflecting an earlier discussion:] Change "a program" to "the current thread of execution" in

If no matching handler is found in ~~a program~~ the current thread of execution, the function std::terminate() is called; whether or not the stack is unwound before this call to std::terminate() is implementation-defined (15.5.1).
