Doc. No.: WG14/N1479
Revision of: WG14/N1411
Date: 2010-05-24
Reply to: Hans-J. Boehm
Phone: +1-650-857-3406
Email: Hans.Boehm@hp.com

N1479: Memory Model Rationale

Contents

This paper contains a draft of a rationale for the C concurrency memory model and associated atomics library (N1349), which are very similar to the analogous C++ features. It is based in part on some C++ committee papers, notably N2176: Memory Model Rationales, N2138: A Less Formal Explanation of the Proposed C++ Concurrency Memory Model, and N2427: Atomic Types and Operations.

This clearly requires some further work, especially since some details of the memory model are still under discussion. This preliminary version will hopefully aid such discussion.

There is clearly much more that could be said about this topic. I've attempted to limit the discussion to a brief overview of the points that are likely to be considered both important and controversial. It's unclear to me that I hit exactly the right set. Comments are appreciated.

This relies heavily on prior joint work with many others, including Sarita Adve, Lawrence Crowl, Paul McKenney, Clark Nelson, and Herb Sutter. Several of them also provided helpful feedback on earlier versions of this document.

5.1.2.4: Multi-threaded executions and data races

This section provides a foundation for reasoning about concurrent C programs by precisely defining the behavior of shared objects that are accessed or updated by multiple threads.

Variables shared between threads raise a number of complex issues that have often only been incompletely addressed. The simplest possible interpretation of a multi-threaded program is that of simply interleaving the steps performed by each of the threads. Each use of a shared memory location sees the last assignment to that location in this interleaving. Following Lamport, this is normally referred to as sequential consistency.

Unfortunately, sequential consistency is expensive to guarantee, especially in the presence of imperfect compiler analysis. Both compilers and hardware obtain improved performance by reordering operations in ways that become visible to multithreaded programs.

Perhaps even more troubling is that under pure sequential consistency the meaning of a program becomes dependent on the granularity of the individual steps. A program doesn't have the same meaning if memory operations are performed a byte at a time as it does if they are performed a word at a time. In the first case, the interleaving is performed at a finer granularity; if a thread changes the value of x from 0 to 100000, another thread may observe a "half assigned" value of 34464.
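
To see where 34464 comes from: 100000 is 0x000186A0, so a thread that reads the location after only the low-order 16 bits have been stored observes 0x86A0 = 34464, with the high-order half still zero.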

In order to avoid both implementation cost and granularity issues, many prior approaches (e.g. Ada 83, Posix Threads, Java, and foundational work by Adve and Hill) approximate a model in which sequential consistency is promised only if there are no "data races", i.e. no memory location can ever be concurrently accessed by two or more threads, where one or more of the accesses modifies the object. Data races are defined in terms of sequentially consistent executions.

Data-race-free programs can observe neither the effects of such reordering optimizations nor differences in access granularity. Any thread that attempted to observe such differences would thereby introduce a data race.

C, like C++, follows a variant of this approach, in which additionally:

  1. Programs with data races have completely undefined semantics, and are thus effectively considered erroneous. This preserves a number of existing optimization opportunities, in that compilers may continue to assume that ordinary variables are not asynchronously updated by other means. It also avoids the largely unsolved research problem of specifying reasonable semantics for data races. It allows and encourages implementations to detect data races.
  2. Data races are most commonly prevented using mutexes or similar synchronization mechanisms. (Both this and the atomic alternative in the next item are sketched in code after this list.)
  3. However, it is also possible to introduce atomic objects that may be used for synchronization, i.e. can be concurrently accessed and updated by multiple threads. They are effectively exempt from data races. Unless an explicit memory_order_ argument is specified for operations on these objects, it remains the implementation's responsibility to ensure that programs without data races (on non-atomic objects) exhibit sequentially consistent behavior. On many platforms, such atomic operations will be implemented using memory fence instructions that prevent hardware reordering of memory operations.
  4. Since the memory fences required to implement such atomic operations on current hardware tend to be expensive, the C library additionally provides operations on atomic objects that explicitly specify memory visibility constraints. These operations violate sequential consistency, and provide improved performance at the expense of significantly increased complexity.
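
As a concrete illustration of items 2 and 3, here is a minimal sketch of a shared counter protected in each style (the identifiers hits, hits_a, counter_mtx, and the two functions are invented for illustration; error handling is omitted):

#include <threads.h>
#include <stdatomic.h>

static int hits;              /* ordinary object: data races must be prevented */
static mtx_t counter_mtx;     /* assume mtx_init(&counter_mtx, mtx_plain) was called */

void count_hit_locked(void)
{
    mtx_lock(&counter_mtx);
    ++hits;                   /* no data race: all accesses hold the mutex */
    mtx_unlock(&counter_mtx);
}

static atomic_int hits_a;     /* atomic object: concurrent access is not a data race */

void count_hit_atomic(void)
{
    atomic_fetch_add(&hits_a, 1);  /* sequentially consistent by default */
}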

The presence of atomic objects makes it easy to replace any existing uses of intentional data races. (Such uses typically don't conform to older standards either, but are nonetheless fairly common.) The presence of explicit memory ordering constraints makes it possible to precisely express the requirements on visibility to other threads.

Unfortunately, the need for explicitly ordered atomics also significantly complicates the description of shared object semantics (the "concurrency memory model"). The specification has to cover the issues arising from such atomics, and the simple "sequential consistency for data-race-free programs" becomes a non-obvious theorem that holds in the absence of explicitly ordered atomics, and is only alluded to in a note. (See Boehm, Adve, Foundations of the C++ Concurrency Memory Model, PLDI 08 for more details.) Nonetheless, we expect programmers to program to this DRF model nearly all the time, and only rarely rely on the more intricate specification in the standard.

Since the specification must handle explicitly ordered atomics, it defines executions and data races differently from the informal description we gave here. In the presence of such non-sequentially-consistent behavior, we can no longer describe program executions as interleavings of thread executions. Instead we describe when a set of thread executions, taken together, describes a valid program execution. Based on which modifications are observed by which accesses, we get a set of ordering constraints that give rise to a "happens before" relation, relating actions that must appear to happen in a specific order. A set of thread executions is valid only if the values produced by accesses are actually consistent with this relation.

A happens-before relationship is introduced either because two actions are performed by the same thread in a well-defined order (sequenced before), or because a thread T2 observes a synchronization action performed by another thread T1 (synchronizes with), and hence should also see all prior memory operations performed by T1. For example, if x is declared as int, y is declared as atomic_int, and r1 and r2 are local int variables, all initially zero, then

Thread 1:
  x = 1;
  atomic_store(&y, 1);

Thread 2:
  r1 = atomic_load(&y);
  if (r1) r2 = x;

ensures that r2 must be 1 whenever r1 is. This is true because, assuming r1 = 1, the atomic_store operation synchronizes with the atomic_load operation. Together with the sequencing relationships, this implies the assignment to x in thread 1 happens before the access to x in thread 2.

The same argument continues to apply if the weaker memory_order_acquire and memory_order_release orderings are specified:

Thread 1:
  x = 1;
  atomic_store_explicit(&y, 1, memory_order_release);

Thread 2:
  r1 = atomic_load_explicit(&y, memory_order_acquire);
  if (r1) r2 = x;

But the observation would no longer hold if the weakest memory order specification, memory_order_relaxed, were used in either position.

There is a subtle issue related to when a thread observes a synchronization operation from another thread. What if, instead of the value of y written by thread 1, thread 2 read a later value? Requiring the "synchronizes with" relationship to be maintained in all such cases introduces implementation issues on some platforms. But allowing a third thread to invalidate the relationship by atomically adding zero to y in the middle, even with memory_order_relaxed, seemed surprising, unintuitive, and unnecessary. The notion of a release sequence represents a compromise between these two positions that does not create implementation difficulties, but disallows the most unintuitive behaviors.
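
As a rough illustration, suppose a third thread performs a relaxed read-modify-write of y in the earlier example (a sketch reusing x, y, r1, and r2 from that example):

Thread 1:
  x = 1;
  atomic_store_explicit(&y, 1, memory_order_release);

Thread 3:
  atomic_fetch_add_explicit(&y, 0, memory_order_relaxed);

Thread 2:
  r1 = atomic_load_explicit(&y, memory_order_acquire);
  if (r1) r2 = x;

If thread 2's load returns 1, it read the value written either by thread 1 directly or by thread 3's read-modify-write operation, which is then part of the release sequence headed by thread 1's store; either way the store synchronizes with the load, so r2 must still be 1 whenever r1 is.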

Explicitly ordered atomics still often allow surprising results when multiple objects are involved. For example, assume x and y are atomic_ints which are initially zero, and r1 and r2 are unshared int variables:

Thread 1:
  atomic_store_explicit(&x, 1, memory_order_release);
  r1 = atomic_load_explicit(&y, memory_order_acquire);

Thread 2:
  atomic_store_explicit(&y, 1, memory_order_release);
  r2 = atomic_load_explicit(&x, memory_order_acquire);

It is perfectly possible for r1 and r2 to both be zero after execution. This reflects the fact that, even in the absence of compiler optimization, this outcome can easily occur if hardware stores are buffered and do not immediately become visible to other processor cores, as is almost always the case.

However, it was concluded that this kind of counterintuitive result should not be possible if only a single variable is involved, even if memory_order_relaxed is specified. For example, if an atomic counter x is periodically incremented with atomic_fetch_add_explicit(&x, 1, memory_order_relaxed) and repeatedly read in a different thread with atomic_load_explicit(&x, memory_order_relaxed), we still want to guarantee that the loads see non-decreasing values of x. Most hardware platforms provide such guarantees with respect to individual variables, even in the absence of memory fences. Thus the additional cost is usually limited to inhibited compiler optimizations. It was felt that programmers generally expect this kind of single-variable ordering (often referred to as cache coherence), and enforcing it unconditionally is well worth the small implementation cost.

As a result, modification orders were introduced to require that even memory_order_relaxed operations on a single location appear to be executed in a single global order, thus enforcing the preceding requirement.
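
The guarantee can be illustrated as follows (a sketch; cnt is an invented atomic_int, initially zero):

Thread 1:
  atomic_fetch_add_explicit(&cnt, 1, memory_order_relaxed);  /* repeatedly */

Thread 2:
  r1 = atomic_load_explicit(&cnt, memory_order_relaxed);
  r2 = atomic_load_explicit(&cnt, memory_order_relaxed);

Because all modifications of cnt form a single modification order, and the second load cannot observe an earlier modification than the first, r2 >= r1 must hold; the counter can never appear to decrease.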

Again, it is important to remember that all of these issues can be ignored by the programmer in the absence of explicit memory_order specifications. The real specification for those programs remains "sequential consistency for data-race-free programs".

The desire to provide this simple model for most programmers does impose some burdens on implementations. The more interesting ones are:

No adjacent field overwrites

The semantics, and particularly the definition of "memory location", severely limit the freedom of compilers supporting threads to implement a store to a small object by loading a larger unit of memory, modifying it, and then writing back the larger unit.

For example, a store to the bit-field b in struct {char a; int b:11; int c:5; char d;} may not be implemented by loading the bytes containing a and writing back a's original value, since that might hide a concurrent update to a by another thread, even in a data-race-free program.
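
To make the disallowed strategy explicit, assume a hypothetical layout in which a, b, and c share one 32-bit unit (a sketch; B_MASK and B_SHIFT are invented):

struct S { char a; int b:11; int c:5; char d; } s;

/* Source statement: */
s.b = 5;

/* Disallowed implementation strategy, in pseudocode:
     tmp = 32-bit load of the unit containing a, b, and c;
     tmp = (tmp & ~B_MASK) | (5 << B_SHIFT);
     32-bit store of tmp back to the unit;
   The store writes back a possibly stale value of a, hiding a
   concurrent update of a by another thread. Rewriting c this way
   is permitted, since adjacent bit-fields b and c occupy a single
   memory location. */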

Implementations of older versions of the C standard often overwrote adjacent data, particularly in cases like the preceding one, even in the presence of threads. Unfortunately, such implementations made it unclear when a program really contained a data race. On modern hardware, violating these restrictions rarely yields a substantial performance benefit; gaining one would require rules so relaxed that multi-threaded programs would be nearly impossible to write.

These restrictions do not apply to single-threaded implementations. They can also be avoided by making the read and rewrite of, for example, an adjacent field indivisible through an atomic read-modify-write instruction. This is usually impractically expensive on multiprocessors, but it may be a useful option on uniprocessors that use restartable atomic sequences to implement atomic read-modify-write operations. (See Bershad, Redell, and Ellis, "Fast Mutual Exclusion for Uniprocessors", ASPLOS 1992.)

No other spurious stores

It was also quite common for compilers to introduce stores by temporarily keeping a potentially shared variable in a register for the duration of a loop, and unconditionally storing it back at the end of the loop, even if the loop executed for zero iterations. In the zero-iteration case, the original value would be stored back, but concurrent updates of the variable by another thread could again be hidden. This can again happen even for data-race-free programs. (See Boehm, "Threads Cannot be Implemented as a Library", PLDI 05, for more details.)
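
A sketch of the kind of transformation that is disallowed (count is potentially shared; n, data, and reg are invented names):

/* Source: */
for (i = 0; i < n; ++i)
    if (data[i]) ++count;

/* Disallowed transformation: */
reg = count;
for (i = 0; i < n; ++i)
    if (data[i]) ++reg;
count = reg;   /* executes even when n == 0, silently overwriting any
                  concurrent update of count by another thread */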

Although this can be a profitable optimization, and it rarely causes problems in practice, we know of no reasonable methodology to ensure correctness of multi-threaded programs without restricting it. This optimization can invalidate programs in surprising ways, since it may apply after other transformations, such as function inlining.

Total store order

The implementation must ensure that sequentially consistent (default) atomic operations appear to execute as though the actions of the threads were simply interleaved.

Consider the following example (x and y atomic, initially zero):

Thread 1:
  atomic_store(&x, 1);

Thread 2:
  atomic_store(&y, 1);

Thread 3:
  r1 = atomic_load(&x);
  r2 = atomic_load(&y);

Thread 4:
  r3 = atomic_load(&y);
  r4 = atomic_load(&x);

Threads 3 and 4 cannot see the assignments to x and y in opposite order, i.e. r1 = r3 = 1 and r2 = r4 = 0 must be impossible. This is unpleasantly expensive to enforce on a few architectures. In particular, it is not immediately clear that inserting fence instructions between every pair of adjacent instructions is sufficient. (It can always be implemented if atomics are always implemented using locks. But that is clearly undesirable.)

Although much energy was invested in designing a simple and usable memory model that did not impose this requirement, these attempts were ultimately judged unsuccessful, both in that they were too complex, and also in that they either allowed unintuitive results on more practically important examples, or failed to avoid the implementation difficulties. Thus we continue to require sequential consistency for examples such as this. Some major architectural specifications have since been clarified to make this less expensive to implement.

The Java programming language already imposed similar requirements, though they were not always fully appreciated. Thus this does not impose a new requirement on hardware.

Atomics <stdatomic.h>

C atomic types provide objects that can safely be concurrently accessed and updated by multiple threads. Operations appear to happen indivisibly. In the absence of explicit memory_order specifications, programs using such atomic variables continue to provide an interleaving-based memory semantics.

This functionality is often confused with, but significantly different from, that provided by the volatile type qualifier in C. (Other languages, notably Java, use volatile to mean something much closer to C's atomic. The committee had no control over this unfortunate conflict in terminology.) In particular C volatile variables can be structures of arbitrary size, and there is no promise that accesses will appear indivisible. Historically, implementations of sufficiently small volatile variables have varied in the extent to which they ensure ordering of memory accesses, and hence interleaving-based semantics. We are not aware of any that provide the guarantees we expect for atomic variables on modern multiprocessors.

Although there was initial sentiment in favor of strengthening volatile to include atomic thread synchronization semantics (see Boehm & Maclaren, "Should volatile acquire atomicity and thread visibility semantics"), this was found to be impractical, both due to performance implications on existing code and, even more importantly, due to the inherent conflict between existing large volatile structures that were never meant to be accessed indivisibly and the need to provide indivisible accesses for variables used for thread communication.

Previous implementations of threads with C have generally either not provided features analogous to the atomic types, or have provided them only incompletely. As a result, many systems attempted to provide their own, either to provide more efficient concurrent access to shared counters, flags, or the like, or sometimes to provide for safe access from signal handlers. Aside from the signal handler case, accesses to these shared objects could instead have been protected by a mutex. However, the difference in cost between, for example, testing an atomic initialization flag and acquiring a mutex is often large enough to make the mutex option very unattractive.

C's (and C++'s) atomic types deviate from prior library designs primarily in that they support the following explicit memory_order_ specifications:

memory_order_seq_cst:
This is equivalent to the default. Atomic operations with this specification preserve sequential consistency for data-race-free programs.
memory_order_acquire (used for accesses), memory_order_release (used for updates), memory_order_acq_rel (for read-modify-write operations):
These provide the visibility guarantees required for simple thread-to-thread communication using a shared flag. If we have the following program segment, where flag is initially zero:
Thread 1:
  init_foo();
  atomic_store_explicit(&flag, 1, memory_order_release);

Thread 2:
  r1 = atomic_load_explicit(&flag, memory_order_acquire);
  use_foo();
it is guaranteed that, if r1 is 1, use_foo() will see all memory updates performed by init_foo(), or later ones, and that init_foo() will see no memory updates performed by use_foo().

Informally, a memory_order_release update of an atomic object x makes all prior memory operations visible to any thread that later performs a memory_order_acquire access to x and observes the new value. Both memory_order_release and memory_order_acquire are equivalent to memory_order_relaxed, i.e. they provide no useful memory ordering guarantees, in the absence of a paired operation on the same atomic object performed by another thread.

These are in most ways similar to "release consistency" (e.g. RCpc from Gharachorloo et al.'s "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", Adve and Gharachorloo's "Shared Memory Consistency Models: A Tutorial", or "Lazy Release Consistency" from the work of Keleher, Cox, and Zwaenepoel). Unlike some of these descriptions, our semantics do not promise implicit fence-like properties: an atomic variable accessed by only one thread has no effect on memory visibility. This may allow all overhead associated with an atomic variable to be removed in some cases, e.g. after "inlining" all threads accessing it into a single thread. The detailed semantics were chosen to simplify the description and to ensure implementability on current and anticipated future hardware; the differences between the release consistency variants mentioned here do not affect common programming idioms.

The memory_order_acquire and memory_order_release specifications fail to guarantee sequential consistency for data-race-free programs in two distinct ways:

  1. [Important] They fail to ensure that atomic stores followed by atomic loads become visible to other threads in that order. Traditional algorithms to implement critical sections in terms of loads and stores (e.g. Dekker's algorithm) fail with only memory_order_acquire/memory_order_release, as do many modern lock-free algorithms.
  2. [Relatively obscure] They fail to ensure that independent writes by different threads are observed in a consistent order by all other threads.
The memory_order_acquire/memory_order_release specifications are more difficult to use than memory_order_seq_cst, but they suffice for many common uses of atomic operations. On some, but not all, current (2010) hardware platforms they provide substantial performance benefits over memory_order_seq_cst.
memory_order_relaxed:
This supports a few distinct use cases. First, it allows atomic variables to be accessed with minimal overhead in cases in which no concurrent access is actually possible. Second, for expert designers of lock-free algorithms, it allows concurrent access with no memory ordering guarantees. The latter requires extremely careful and unintuitive reasoning, but is occasionally required for low-level parallel libraries, or when updating existing code containing explicit memory fences. Third, there appear to be a small number of relatively simple idioms in which memory_order_relaxed can be safe even in the presence of concurrent accesses, and the violation of sequential consistency remains very local. Most notably, simple counters that are concurrently incremented but not concurrently read can be safely implemented this way. The result may perform substantially better than the sequentially consistent version, but probably not as well as a more elaborate approach maintaining multiple thread-local counters. (Note that reference counting does rely on the resulting value, and requires either sequentially consistent counter operations or much greater care.)
memory_order_consume:
This is a weaker replacement for memory_order_acquire that allows an expert programmer to take advantage of common hardware guarantees that implicitly ensure memory ordering for dependent memory operations. It typically allows data structures to be atomically updated by copying the structure and replacing a pointer, without incurring substantial reader overhead on any common hardware architecture. This may have substantial performance benefits for data structures that are mostly read and only rarely updated.
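
A typical use of memory_order_consume is publishing a newly initialized structure by storing a pointer to it (a sketch; struct foo, pub, publish, reader, and use are invented names, and error handling is omitted):

#include <stdatomic.h>
#include <stdlib.h>

struct foo { int a; int b; };
extern void use(int, int);

static _Atomic(struct foo *) pub;   /* initially NULL */

/* updating thread: initialize a fresh object, then publish it */
void publish(void)
{
    struct foo *p = malloc(sizeof *p);
    p->a = 1;
    p->b = 2;
    atomic_store_explicit(&pub, p, memory_order_release);
}

/* reading thread: the accesses through q carry a data dependency on
   the load of pub, so memory_order_consume suffices to order them */
void reader(void)
{
    struct foo *q = atomic_load_explicit(&pub, memory_order_consume);
    if (q)
        use(q->a, q->b);
}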

This set of ordering specifications was chosen to allow expert use of atomic variables with minimal unnecessary overhead across all currently common hardware architectures. At the same time, memory_order_seq_cst supports much more straightforward use of such atomic variables, often at only modest cost. A programmer may choose to avoid memory ordering issues entirely and rely on sequential consistency for data-race-free programs simply by never specifying an explicit memory_order_ argument to any atomic operation.

We considered, but rejected:

  1. Ordering constraints weaker than sequential consistency but stronger than acquire/release ordering. Though potentially useful for some less common use cases on some architectures, we found such models exceedingly difficult to specify precisely and understandably, which may explain the lack of precedents in this area.
  2. Ordering constraints specific to only loads or only stores. Relatively few hardware platforms usefully support fences that enforce such constraints. Code using them seems to be quite brittle and error prone. For example, it is tempting to use such a facility with the initialization flag example above, but that would allow init_foo()'s updates to potentially become visible after use_foo()'s.

Although explicit memory fences are provided, we do not rely on them; this enables our default "sequential consistency for data-race-free programs" approach. Many people also feel that ordering constraints attached to individual operations, even the weaker explicit ones, are easier to use than free-standing fences, at least for newly written code. Perhaps even more importantly in the long run, fences tend to needlessly overconstrain ordering in many cases. For example, consider the program segment:

z = ...;
atomic_store_explicit(&z_done, 1, memory_order_release);
atomic_store_explicit(&y, ..., memory_order_relaxed);

This prevents reordering of stores to z and z_done, but does not prevent reordering of the stores to z and y, and hence allows the store to y to proceed without waiting for the store to z to complete. This is commonly the intent for flags used for thread communication, but it is not expressible with fences. Although this does not affect most current architectures, it is important for at least one, and is likely to enable future hardware optimizations.
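
For contrast, the closest fence-based version (a sketch using atomic_thread_fence) necessarily orders the store to z before the store to y as well:

z = ...;
atomic_thread_fence(memory_order_release);
atomic_store_explicit(&z_done, 1, memory_order_relaxed);
atomic_store_explicit(&y, ..., memory_order_relaxed);

The fence orders the store to z before both subsequent atomic stores, so the store to y must now also wait for the store to z to complete.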

Along the same lines, ordering constraints associated with an operation on atomic variable a can be removed if the compiler can determine that a is accessible from only one thread, e.g. due to thread inlining. Ordering constraints associated with fences are almost impossible to remove.

There are, however, a few cases in which explicit memory fences can more accurately express intent and result in faster code on current hardware. For example, they allow a programmer to express that when multiple spin locks are released one after the other, the atomic stores to the spin-lock variables may be reordered with each other, but operations inside the critical sections may not be moved out of them. They also potentially facilitate porting of legacy code written with platform-dependent fences. Thus atomic_thread_fence() is provided to handle such cases; a sketch follows below. The atomic_signal_fence() function provides similar functionality for the case in which the only concurrency is with a signal handler, and hence hardware fence instructions are typically not required; it is intended to replace the "compiler-only" fences found in some other systems.
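
For example, releasing two spin locks at the end of a doubly protected critical section might be sketched as follows (lock1 and lock2 are invented atomic_int lock words, with 0 meaning unlocked):

/* ... critical section protected by both locks ... */
atomic_thread_fence(memory_order_release);
atomic_store_explicit(&lock1, 0, memory_order_relaxed);
atomic_store_explicit(&lock2, 0, memory_order_relaxed);

The critical-section operations may not be moved past the fence, but the two unlocking stores remain free to be reordered with each other.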

Although atomic types are useful primarily when the underlying hardware allows their implementation without the use of locks, they may (with the exception of atomic_flag) be implemented by hashing locations to one of a collection of locks, and acquiring the corresponding lock before operating on a location. The atomic_is_lock_free() function allows client code to query whether such an implementation is in use, in which case the corresponding operations are slower and unusable from asynchronous signal handlers. Note that this property is often not known until the program is actually loaded, since atomic operation support commonly varies between different "minor" variants of the same architecture.
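
For example (a sketch):

atomic_int x;
...
if (atomic_is_lock_free(&x)) {
    /* operations on x use no locks and are safe to use from an
       asynchronous signal handler */
} else {
    /* operations on x may acquire an internal lock; avoid them in
       signal handlers */
}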

Implementations will have to resort to a lock-based implementation if any of the atomic operations corresponding to a type are not supported by the hardware. (But note that compare_exchange on smaller objects can usually be safely emulated by a hardware compare-exchange on larger objects, and compare-exchange always suffices for a lock-free emulation of the other operations.) Hence the set of provided operations is conservatively restricted to those available across nearly all modern platforms.
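
For instance, atomic_fetch_add on an atomic_int x can be emulated with a compare-exchange loop (a sketch):

int old = atomic_load(&x);
while (!atomic_compare_exchange_weak(&x, &old, old + 1)) {
    /* the exchange failed; old now holds x's current value, so retry */
}
/* on exit, x has been atomically incremented and old holds the value
   it had immediately before the increment */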