Linux-Kernel Memory Model

ISO/IEC JTC1 SC22 WG21 N4374 - 2015-02-06

Paul E. McKenney, paulmck@linux.vnet.ibm.com

History

This is a revision of N4322, with updates based on subsequent discussions. This revision adds references to litmus tests in userspace RCU, a paragraph stating goals, and a section discussing the relationship between volatile atomics and loop unrolling.

Introduction

The Linux-kernel memory model is currently defined very informally in the memory-barriers.txt and atomic_ops.txt files in the source tree. Although these two files appear to have been reasonably effective at helping kernel hackers understand what is and is not permitted, they are not necessarily sufficient for deriving the corresponding formal model. This document is a first attempt to bridge this gap.

The hope is that this document will help the C and C++ standard committees understand the existing practice and the constraints from the Linux kernel, and also that it will help the Linux community evaluate which portions of the C11 and C++11 memory models might be useful in the Linux kernel.

  1. Variable Access
  2. Memory Barriers
  3. Locking Operations
  4. Atomic Operations
  5. Control Dependencies
  6. RCU Grace-Period Relationships
  7. Summary

Variable Access

Loads from and stores to shared (but non-atomic) variables should be protected with the ACCESS_ONCE() macro, for example:

r1 = ACCESS_ONCE(x);
ACCESS_ONCE(y) = 1;

An ACCESS_ONCE() access may be modeled as a volatile memory_order_relaxed access. However, please note that ACCESS_ONCE() is defined only for properly aligned machine-word-sized variables. Applying ACCESS_ONCE() to a large array or structure is unlikely to do anything useful.
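As a rough illustration of this modeling, the following C11 sketch wraps a volatile memory_order_relaxed load and store. The names model_access_once_load() and model_access_once_store() are hypothetical and are not Linux-kernel APIs; the kernel's ACCESS_ONCE() is a macro built on a volatile cast rather than on C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of r1 = ACCESS_ONCE(x): a volatile relaxed load. */
static inline int model_access_once_load(volatile atomic_int *p)
{
	return atomic_load_explicit(p, memory_order_relaxed);
}

/* Hypothetical model of ACCESS_ONCE(y) = v: a volatile relaxed store. */
static inline void model_access_once_store(volatile atomic_int *p, int v)
{
	atomic_store_explicit(p, v, memory_order_relaxed);
}

static atomic_int x;	/* Shared variables, zero-initialized. */
static atomic_int y;

/* Mirrors the two-line example above and returns the value loaded from x. */
static int access_once_example(void)
{
	int r1 = model_access_once_load(&x);	/* r1 = ACCESS_ONCE(x); */

	model_access_once_store(&y, 1);		/* ACCESS_ONCE(y) = 1; */
	assert(model_access_once_load(&y) == 1);	/* Single-threaded sanity check. */
	return r1;
}
```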

Note that the volatile is absolutely required: Non-volatile memory_order_relaxed is not sufficient. To see this, consider that ACCESS_ONCE() is used to prevent concurrently modified accesses from being hoisted out of a loop or out of unrolled instances of a loop. For example, given this loop:

	while (tmp = atomic_load_explicit(a, memory_order_relaxed))
		do_something_with(tmp);

The compiler would be permitted to unroll it as follows:

	while (tmp = atomic_load_explicit(a, memory_order_relaxed)) {
		do_something_with(tmp);
		do_something_with(tmp);
		do_something_with(tmp);
		do_something_with(tmp);
	}

This would be unacceptable for real-time applications, which need the value to be reloaded from a on each iteration, unrolled or not. The volatile qualifier prevents this transformation.
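The reload-on-each-iteration requirement can be sketched in C11 by accessing the variable through a volatile-qualified pointer. This is only an illustration: drain(), work_fn, and consume_one() are hypothetical names, and the loop body here modifies the variable itself so that the effect of reloading is visible even in a single thread.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical callback type standing in for do_something_with(). */
typedef void (*work_fn)(volatile atomic_int *a, int tmp);

/*
 * Because *a is volatile, each evaluation of the loop condition must
 * perform a fresh load, so the body always sees the latest value.
 * Returns the number of times the body ran.
 */
static int drain(volatile atomic_int *a, work_fn do_something_with)
{
	int tmp;
	int calls = 0;

	while ((tmp = atomic_load_explicit(a, memory_order_relaxed)) != 0) {
		do_something_with(a, tmp);
		calls++;
	}
	return calls;
}

/* Example body: consume one unit of work by storing tmp - 1. */
static void consume_one(volatile atomic_int *a, int tmp)
{
	assert(tmp > 0);
	atomic_store_explicit(a, tmp - 1, memory_order_relaxed);
}
```

Starting from a value of 3, drain(&a, consume_one) runs the body three times, once per reloaded value, and leaves the variable at zero.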

Note that the Linux kernel will likely be replacing ACCESS_ONCE() with a pair of APIs, one for loading and the other for storing. This will have the benefit of mapping more directly onto C and C++ primitives.

At one time, gcc guaranteed that properly aligned accesses to machine-word-sized variables would be atomic. Although gcc no longer documents this guarantee, there is still code in the Linux kernel that relies on it. These accesses could be modeled as non-volatile memory_order_relaxed accesses.

An smp_store_release() may be modeled as a volatile memory_order_release store. Similarly, an smp_load_acquire() may be modeled as a memory_order_acquire load.

r1 = smp_load_acquire(&x);
smp_store_release(&y, 1);
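A minimal C11 sketch of this release/acquire modeling follows, using a flag to publish a plain data item. The names publish(), try_consume(), payload, and ready are hypothetical, and the flag is declared volatile for both operations purely for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;			/* Plain data, published via 'ready'. */
static volatile atomic_int ready;	/* Flag, initially zero. */

/* Hypothetical model of an smp_store_release() of the flag. */
static void publish(int value)
{
	payload = value;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/*
 * Hypothetical model of an smp_load_acquire() of the flag.  If the flag
 * is seen set, the acquire load guarantees that the earlier payload
 * store is visible.  Returns false if nothing has been published yet.
 */
static bool try_consume(int *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return false;
	*out = payload;
	return true;
}
```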

Members of the rcu_dereference() family can be modeled as memory_order_consume loads. Members of this family include: rcu_dereference(), rcu_dereference_bh(), rcu_dereference_sched(), and srcu_dereference(). However, rcu_dereference() should be representative for litmus-test purposes, at least initially. Similarly, rcu_assign_pointer() can be modeled as a memory_order_release store.
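The publish-subscribe modeling described here might look as follows in C11. This is a sketch under stated assumptions: struct foo, gp, and the model_ functions are hypothetical names, and current compilers typically implement memory_order_consume by promoting it to memory_order_acquire.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
};

static struct foo *_Atomic gp;	/* Shared pointer, initially NULL. */

/* Hypothetical model of rcu_assign_pointer(gp, p). */
static void model_rcu_assign_pointer(struct foo *p)
{
	atomic_store_explicit(&gp, p, memory_order_release);
}

/* Hypothetical model of p = rcu_dereference(gp). */
static struct foo *model_rcu_dereference(void)
{
	return atomic_load_explicit(&gp, memory_order_consume);
}

/* Reader: the access to p->a carries a dependency from the consume load. */
static int reader(void)
{
	struct foo *p = model_rcu_dereference();

	return p ? p->a : -1;	/* -1 means nothing has been published. */
}

/* Writer: initialize the structure fully before publishing it. */
static void writer(struct foo *p, int a)
{
	assert(p != NULL);
	p->a = a;
	model_rcu_assign_pointer(p);
}
```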

The set_mb() function assigns the specified value to the specified variable, then executes a full memory barrier, which is described in the next section. This isn't as strong as a memory_order_seq_cst store because the following code fragment does not guarantee that the stores to x and y will be ordered.

smp_store_release(&x, 1);
set_mb(y, 1);

That said, set_mb() provides exactly the ordering required for manipulating task state, which is the job for which it was created.

Memory Barriers

The Linux kernel has a variety of memory barriers:

  1. barrier(), which can be modeled as an atomic_signal_fence(memory_order_acq_rel) or an atomic_signal_fence(memory_order_seq_cst).
  2. smp_mb(), which does not have a direct C11 or C++11 counterpart. On an ARM, PowerPC, or x86 system, it can be modeled as a full memory-barrier instruction (dmb, sync, and mfence, respectively). On an Itanium system, it can be modeled as an mf instruction, but this relies on gcc emitting an ld.acq for an ACCESS_ONCE() load and an st.rel for an ACCESS_ONCE() store.
  3. smp_rmb(), which can be modeled (overly conservatively) as an atomic_thread_fence(memory_order_acq_rel). One difference is that smp_rmb() need not order prior loads against later stores, or prior stores against later stores. Another difference is that smp_rmb() need not provide any sort of transitivity, having (lack of) transitivity properties similar to ARM's or PowerPC's address/control/data dependencies.
  4. smp_wmb(), which can be modeled (again overly conservatively) as an atomic_thread_fence(memory_order_acq_rel). One difference is that smp_wmb() need not order prior loads against later stores, nor prior loads against later loads. Similar to smp_rmb(), smp_wmb() need not provide any sort of transitivity.
  5. smp_read_barrier_depends(), which is a no-op on all architectures other than Alpha. On Alpha, smp_read_barrier_depends() may be modeled as an atomic_thread_fence(memory_order_acq_rel) or as an atomic_thread_fence(memory_order_seq_cst).
  6. smp_mb__before_atomic(), which provides a full memory barrier before the immediately following non-value-returning atomic operation.
  7. smp_mb__after_atomic(), which provides a full memory barrier after the immediately preceding non-value-returning atomic operation. Both smp_mb__before_atomic() and smp_mb__after_atomic() are described in more detail in the later section on atomic operations.
  8. smp_mb__after_unlock_lock(), which provides a full memory barrier after the immediately preceding lock operation, but only when paired with a preceding unlock operation by this same thread or a preceding unlock operation on the same lock variable. The use of smp_mb__after_unlock_lock() is described in more detail in the section on locking.
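The modelings in the list above for barrier(), smp_rmb(), smp_wmb(), and smp_read_barrier_depends() can be written down as a C11 sketch. The model_ names, the data and flag variables, and the use of the __alpha__ predefined macro to detect Alpha are assumptions made for illustration. As the list notes, the fences for smp_rmb() and smp_wmb() are stronger than the kernel primitives require.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical models; the names are not Linux-kernel APIs. */
static inline void model_barrier(void)
{
	atomic_signal_fence(memory_order_seq_cst);	/* Compiler-only fence. */
}

static inline void model_smp_rmb(void)
{
	atomic_thread_fence(memory_order_acq_rel);	/* Overly conservative. */
}

static inline void model_smp_wmb(void)
{
	atomic_thread_fence(memory_order_acq_rel);	/* Overly conservative. */
}

static inline void model_smp_read_barrier_depends(void)
{
#ifdef __alpha__
	atomic_thread_fence(memory_order_acq_rel);
#endif
	/* No-op on all other architectures. */
}

static volatile atomic_int data;
static volatile atomic_int flag;

/* Writer: order the data store before the flag store. */
static void writer(int v)
{
	assert(v >= 0);		/* -1 is reserved as the reader's not-ready value. */
	atomic_store_explicit(&data, v, memory_order_relaxed);
	model_smp_wmb();
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Reader: order the flag load before the data load.  Returns -1 if not ready. */
static int reader(void)
{
	if (!atomic_load_explicit(&flag, memory_order_relaxed))
		return -1;
	model_smp_rmb();
	return atomic_load_explicit(&data, memory_order_relaxed);
}
```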

There are some additional memory barriers, including mmiowb(). However, these cover interactions with memory-mapped I/O, and so have no counterpart in C11 and C++11 (which is most likely as it should be for the foreseeable future).

Some use cases for these memory barriers may be found here. These are for the userspace RCU library, so drop the leading cmm_ to get the corresponding Linux-kernel primitive. For example, the userspace cmm_smp_mb() primitive translates to the Linux-kernel smp_mb() primitive.

Locking Operations

The Linux kernel features “roach motel” ordering on its locking primitives: Prior operations can be reordered to follow a later acquire, and subsequent operations can be reordered to precede an earlier release. The CPU is permitted to reorder acquire and release operations in this way, but the compiler is not, as compiler-based reordering could result in deadlock.

Note that a release-acquire pair does not necessarily result in a full barrier. To see this, consider the following litmus test, with x and y both initially zero, and locks l1 and l3 both initially held by the threads releasing them:

Thread 1                      Thread 2
--------                      --------
y = 1;                        x = 1;
spin_unlock(&l1);             spin_unlock(&l3);
spin_lock(&l2);               spin_lock(&l4);
r1 = x;                       r2 = y;

assert(r1 != 0 || r2 != 0);

In the above litmus test, the assertion can trigger, meaning that an unlock followed by a lock is not guaranteed to be a full memory barrier. And this is where smp_mb__after_unlock_lock() comes in:

Thread 1                      Thread 2
--------                      --------
y = 1;                        x = 1;
spin_unlock(&l1);             spin_unlock(&l3);
spin_lock(&l2);               spin_lock(&l4);
smp_mb__after_unlock_lock();  smp_mb__after_unlock_lock();
r1 = x;                       r2 = y;

assert(r1 != 0 || r2 != 0);

In contrast, after addition of smp_mb__after_unlock_lock(), the assertion cannot trigger.

The above example showed how smp_mb__after_unlock_lock() can cause an unlock-lock sequence in the same thread to act as a full barrier, but it also applies in cases where one thread unlocks and another thread locks the same lock, as shown below:

Thread 1              Thread 2                        Thread 3
--------              --------                        --------
y = 1;                spin_lock(&l1);                 x = 1;
spin_unlock(&l1);     smp_mb__after_unlock_lock();    smp_mb();
                      r1 = y;                         r3 = y;
                      r2 = x;

assert(r1 == 0 || r2 != 0 || r3 != 0);

Without the smp_mb__after_unlock_lock(), the above assertion can trigger; with it, the assertion cannot. The fact that the assertion can trigger without the barrier might seem strange at first glance, but locks are only guaranteed to give sequentially consistent ordering to their critical sections. If you want an observer thread to see the ordering without holding the lock, you need smp_mb__after_unlock_lock(). (Note that there is some possibility that the Linux kernel's memory model will change such that an unlock followed by a lock forms a full memory barrier even without the smp_mb__after_unlock_lock().)
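The two-thread litmus test earlier in this section has the classic store-buffering shape, which can be approximated in userspace with C11 atomics and POSIX threads. In the sketch below, an atomic_thread_fence(memory_order_seq_cst) stands in for the whole spin_unlock(), spin_lock(), smp_mb__after_unlock_lock() sequence. That substitution is an assumption of this sketch, not a claim of equivalence, and run_litmus() and the thread functions are hypothetical names. With the fences in place, C11 forbids the outcome in which both r1 and r2 are zero.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int r1, r2;	/* Written by the threads, read after pthread_join(). */

static void *thread1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* Stand-in for the full barrier. */
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

static void *thread2(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* Stand-in for the full barrier. */
	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

/*
 * Runs the litmus test 'iterations' times and returns the number of
 * runs in which the forbidden outcome r1 == 0 && r2 == 0 appeared.
 */
static int run_litmus(int iterations)
{
	int forbidden = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t t1, t2;
		int rc;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		rc = pthread_create(&t1, NULL, thread1, NULL);
		assert(rc == 0);
		rc = pthread_create(&t2, NULL, thread2, NULL);
		assert(rc == 0);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		if (r1 == 0 && r2 == 0)
			forbidden++;
	}
	return forbidden;
}
```

A run that never observes the forbidden outcome does not prove anything about a memory model, of course; the sketch only shows how such a litmus test can be expressed.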

The Linux kernel has an embarrassingly large number of locking primitives, but spin_lock() and spin_unlock() should be representative for litmus-test purposes, at least initially.

Atomic Operations

Atomic operations come in three sets: those defined on atomic_t, those defined on atomic_long_t, and those defined on aligned machine-word-sized variables, currently restricted to int and long. However, in the near term, it should be acceptable to focus on a small subset of these operations.

Variables of type atomic_t may be stored to using atomic_set() and variables of type atomic_long_t may be stored to using atomic_long_set(). Similarly, variables of these types may be loaded from using atomic_read() and atomic_long_read(). The historical definition of these primitives has lacked any sort of concurrency-safe semantics, so the user is responsible for ensuring that these primitives are not used concurrently in a conflicting manner.

That said, many architectures treat atomic_read() and atomic_long_read() as volatile memory_order_relaxed loads and a few architectures treat atomic_set() and atomic_long_set() as memory_order_relaxed stores. There is therefore some chance that concurrent conflicting accesses will be allowed at some point in the future, at which point their semantics will be those of volatile memory_order_relaxed accesses.

The remaining atomic operations are divided into those that return a value and those that do not. The atomic operations that do not return a value are similar to C11 atomic memory_order_relaxed operations. However, the Linux-kernel atomic operations that do return a value cannot be implemented in terms of the C11 atomic operations. These operations can instead be modeled as memory_order_relaxed operations that are both preceded and followed by the Linux-kernel smp_mb() full memory barrier, which is implemented using the dmb instruction on ARM and the sync instruction on PowerPC. Note that in the case of the CAS operations atomic_cmpxchg(), atomic_long_cmpxchg(), and cmpxchg(), the full barriers are required in both the success and failure cases. Strong memory ordering can be added to the non-value-returning atomic operations using smp_mb__before_atomic() before and/or smp_mb__after_atomic() after.
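The shape of this modeling can be sketched in C11 as follows. The model_ names are hypothetical. The atomic_thread_fence(memory_order_seq_cst) used for model_smp_mb() is only an approximation chosen so that the sketch compiles in userspace, since smp_mb() has no direct C11 counterpart. The sketch therefore illustrates where the barriers sit, not their exact semantics.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for smp_mb(); an approximation only. */
static inline void model_smp_mb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Non-value-returning: modeled as a relaxed read-modify-write. */
static inline void model_atomic_add(int i, volatile atomic_int *v)
{
	atomic_fetch_add_explicit(v, i, memory_order_relaxed);
}

/* Value-returning: relaxed operation with a full barrier on each side. */
static inline int model_atomic_add_return(int i, volatile atomic_int *v)
{
	int old;

	model_smp_mb();
	old = atomic_fetch_add_explicit(v, i, memory_order_relaxed);
	model_smp_mb();
	return old + i;		/* The new value. */
}

/* CAS: barriers apply on both success and failure.  Returns the prior value. */
static inline int model_atomic_cmpxchg(volatile atomic_int *v, int old, int newval)
{
	int expected = old;

	model_smp_mb();
	atomic_compare_exchange_strong_explicit(v, &expected, newval,
						memory_order_relaxed,
						memory_order_relaxed);
	model_smp_mb();
	return expected;	/* Equals 'old' on success, the current value on failure. */
}
```

A caller detects CAS success by comparing the returned value against the old value it passed in.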

The operations are summarized in the following table. An initial implementation of a tool could start with atomic_add(), atomic_sub(), atomic_xchg(), and atomic_cmpxchg().

Operation Class: Add/Subtract
  int:
    void atomic_add(int i, atomic_t *v)
    void atomic_sub(int i, atomic_t *v)
    void atomic_inc(atomic_t *v)
    void atomic_dec(atomic_t *v)
  long:
    void atomic_long_add(int i, atomic_long_t *v)
    void atomic_long_sub(int i, atomic_long_t *v)
    void atomic_long_inc(atomic_long_t *v)
    void atomic_long_dec(atomic_long_t *v)

Operation Class: Add/Subtract, Value Returning
  int:
    int atomic_inc_return(atomic_t *v)
    int atomic_dec_return(atomic_t *v)
    int atomic_add_return(int i, atomic_t *v)
    int atomic_sub_return(int i, atomic_t *v)
    int atomic_inc_and_test(atomic_t *v)
    int atomic_dec_and_test(atomic_t *v)
    int atomic_sub_and_test(int i, atomic_t *v)
    int atomic_add_negative(int i, atomic_t *v)
  long:
    int atomic_long_inc_return(atomic_long_t *v)
    int atomic_long_dec_return(atomic_long_t *v)
    int atomic_long_add_return(int i, atomic_long_t *v)
    int atomic_long_sub_return(int i, atomic_long_t *v)
    int atomic_long_inc_and_test(atomic_long_t *v)
    int atomic_long_dec_and_test(atomic_long_t *v)
    int atomic_long_sub_and_test(int i, atomic_long_t *v)
    int atomic_long_add_negative(int i, atomic_long_t *v)

Operation Class: Exchange
  int:
    int atomic_xchg(atomic_t *v, int new)
    int atomic_cmpxchg(atomic_t *v, int old, int new)
  long:
    int atomic_long_xchg(atomic_long_t *v, int new)
    int atomic_long_cmpxchg(atomic_long_t *v, int old, int new)

Operation Class: Conditional Add/Subtract
  int:
    int atomic_add_unless(atomic_t *v, int a, int u)
    int atomic_inc_not_zero(atomic_t *v)
  long:
    int atomic_long_add_unless(atomic_long_t *v, int a, int u)
    int atomic_long_inc_not_zero(atomic_long_t *v)

Operation Class: Bit Test/Set/Clear (Generic)
    void set_bit(unsigned long nr, volatile unsigned long *addr)
    void clear_bit(unsigned long nr, volatile unsigned long *addr)
    void change_bit(unsigned long nr, volatile unsigned long *addr)

Operation Class: Bit Test/Set/Clear, Value Returning (Generic)
    int test_and_set_bit(unsigned long nr, volatile unsigned long *addr)
    int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)
    int test_and_change_bit(unsigned long nr, volatile unsigned long *addr)

Operation Class: Lock-Barrier Operations (Generic)
    int test_and_set_bit_lock(unsigned long nr, unsigned long *addr)
    void clear_bit_unlock(unsigned long nr, unsigned long *addr)
    void __clear_bit_unlock(unsigned long nr, unsigned long *addr)
    int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)

Operation Class: Exchange (Generic)
    T *xchg(T *p, v)
    T *cmpxchg(T *ptr, T o, T n)

The rows marked “(Generic)” are type-generic, applying to any aligned machine-word-sized quantity supported by all architectures that the Linux kernel runs on. The set of types is currently those of size int and those of size long. The “Lock-Barrier Operations” have memory_order_acquire semantics for test_and_set_bit_lock() and _atomic_dec_and_lock(), and have memory_order_release semantics for the other primitives. Otherwise, the usual Linux-kernel rule holds: If no value is returned, memory_order_relaxed semantics apply; otherwise, the operations behave as if there were an smp_mb() before and after.
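The acquire and release semantics of the lock-barrier operations can be illustrated with a small bit-spinlock sketch in C11. All names here are hypothetical. As a simplification, the models operate on a single unsigned long word, whereas the kernel primitives accept a bit number that may index into an array of words.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/*
 * Hypothetical model of test_and_set_bit_lock(): atomically set bit nr
 * with acquire semantics and return the bit's previous value (0 or 1).
 */
static inline int model_test_and_set_bit_lock(unsigned long nr,
					      volatile atomic_ulong *addr)
{
	assert(nr < sizeof(unsigned long) * CHAR_BIT);
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or_explicit(addr, mask, memory_order_acquire) & mask) != 0;
}

/*
 * Hypothetical model of clear_bit_unlock(): atomically clear bit nr
 * with release semantics.
 */
static inline void model_clear_bit_unlock(unsigned long nr,
					  volatile atomic_ulong *addr)
{
	assert(nr < sizeof(unsigned long) * CHAR_BIT);
	unsigned long mask = 1UL << nr;

	atomic_fetch_and_explicit(addr, ~mask, memory_order_release);
}

/* A bit spinlock built from the two models above. */
static inline void bit_lock(unsigned long nr, volatile atomic_ulong *addr)
{
	while (model_test_and_set_bit_lock(nr, addr))
		;	/* Spin until the bit was previously clear. */
}

static inline void bit_unlock(unsigned long nr, volatile atomic_ulong *addr)
{
	model_clear_bit_unlock(nr, addr);
}
```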

Control Dependencies

The Linux kernel provides a limited notion of control dependencies, ordering prior loads against control-dependent stores in some cases. Extreme care is required to avoid control-dependency-destroying compiler optimizations. The restrictions applying to control dependencies include the following:

  1. Control dependencies can order prior loads against later dependent stores. However, they do not order prior loads against later dependent loads. (Use memory_order_consume or memory_order_acquire if you require this behavior.)
  2. A load heading up a control dependency must use ACCESS_ONCE(). Similarly, the store at the other end of a control dependency must also use ACCESS_ONCE().
  3. If both legs of a given if or switch statement store the same value to the same variable, then those stores cannot participate in control-dependency ordering.
  4. Control dependencies require at least one run-time conditional that depends on the prior load and that precedes the following store.
  5. The compiler must perceive both the variable loaded from and the variable stored to as being shared variables. For example, the compiler will not perceive an on-stack variable as being shared unless its address has been taken and exported to some other thread (or alias analysis has otherwise been defeated).
  6. Control dependencies are not transitive. In this regard, their behavior is similar to ARM or PowerPC control dependencies.

The C and C++ standards do not guarantee any sort of control dependency. Therefore, this list of restrictions is subject to change as compilers become increasingly clever and aggressive.
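For concreteness, the basic shape of a control dependency that respects the restrictions above might be sketched as follows, with volatile memory_order_relaxed accesses standing in for ACCESS_ONCE(). The names a, b, and forward_flag() are hypothetical. The sketch shows only the code pattern; C11 itself gives no ordering guarantee between the load and the dependent store.

```c
#include <stdatomic.h>

static volatile atomic_int a;	/* Both variables are shared and volatile. */
static volatile atomic_int b;

/*
 * Shape of a control dependency: a volatile relaxed load feeds a
 * run-time conditional that guards a volatile relaxed store.
 * Only one leg of the conditional stores to b.
 * Returns the value loaded from a.
 */
static int forward_flag(void)
{
	int q = atomic_load_explicit(&a, memory_order_relaxed);

	if (q)
		atomic_store_explicit(&b, 1, memory_order_relaxed);
	return q;
}
```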

RCU Grace-Period Relationships

The publish-subscribe portions of RCU are captured by the combination of rcu_assign_pointer(), which can be modeled as a memory_order_release store, and of the rcu_dereference() family of primitives, which can be modeled as memory_order_consume loads, as was noted earlier.

Grace periods can be modeled as described in Appendix D of User-Level Implementations of Read-Copy Update. There are a number of grace-period primitives in the Linux kernel, but rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() are good places to start. The grace-period relationships can be described using the following abstract litmus test:

Thread 1                      Thread 2
--------                      --------
rcu_read_lock();              S2a;
S1a;                          synchronize_rcu();
S1b;                          S2b;
rcu_read_unlock();

If either of S1a or S1b precedes S2a, then both must precede S2b. Conversely, if either of S1a or S1b follows S2b, then both must follow S2a. Additional litmus tests may be found here. Again, these are for the userspace RCU library, so drop the leading cmm_ to get the corresponding Linux-kernel primitives.
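The two grace-period rules can be written as a predicate over candidate outcomes of the abstract litmus test. The sketch below assumes, purely for illustration, that each of the four statements can be assigned a distinct position in a single global order; weakly ordered systems do not in general provide such an order. The struct gp_outcome type and gp_outcome_allowed() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical positions of the four statements in one global order. */
struct gp_outcome {
	int s1a, s1b;	/* Inside the RCU read-side critical section. */
	int s2a, s2b;	/* Before and after synchronize_rcu(). */
};

/* Returns true if the outcome satisfies both grace-period rules. */
static bool gp_outcome_allowed(const struct gp_outcome *o)
{
	bool any_before_s2a, both_before_s2b, any_after_s2b, both_after_s2a;

	assert(o->s1a < o->s1b && o->s2a < o->s2b);	/* Program order. */

	any_before_s2a = o->s1a < o->s2a || o->s1b < o->s2a;
	both_before_s2b = o->s1a < o->s2b && o->s1b < o->s2b;
	any_after_s2b = o->s1a > o->s2b || o->s1b > o->s2b;
	both_after_s2a = o->s1a > o->s2a && o->s1b > o->s2a;

	/* Rule 1: either before S2a implies both before S2b.  Rule 2: converse. */
	return (!any_before_s2a || both_before_s2b) &&
	       (!any_after_s2b || both_after_s2a);
}
```

Under this simplification, the only rejected outcomes are those in which the read-side critical section spans the entire grace period, with S1a before S2a and S1b after S2b.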

Summary

This document makes a first attempt to present a formalizable model of the Linux kernel memory model, including variable access, memory barriers, locking operations, atomic operations, control dependencies, and RCU grace-period relationships. The general approach is to reduce the kernel's memory model to some aspect of memory models that have already been formalized, in particular to those of C11, C++11, ARM, and PowerPC.