Dependency Ordering for C Memory Model

ISO/IEC JTC1 SC22 WG14 N1444 - 2010-03-22

Paul E. McKenney, paulmck@linux.vnet.ibm.com
Clark Nelson, clark.nelson@intel.com
Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org


Introduction
    Problem
    Prior Work
    Alternatives Considered
Solution
    Dependency Root
    Dependency Tree
    Informal Specification
Examples
    Indirection
    Code Removal
    Control-Sensitive Indirection
    Constant Propagation
    Control Elimination
    Control Dependence
    Conditional Subexpression Elimination
    Constant Results
    Selective Dependency Ordering
Implementation
    Promote to Acquire
    Avoid Some Optimizations
    Track Optimizations
    Truncate Data-Dependency Trees
    Annotate Functions
Proposed Wording
    5.1.2.4 Multi-threaded executions and data races
    7.16.1 Introduction
    7.16.2 Order and Consistency
        7.16.2.1 The kill_dependency macro
    7.16.3.1 The atomic_thread_fence function
    7.16.6.1 The atomic_store generic functions

Introduction

The efficiency of data structures that are read frequently and written rarely can substantially affect the scalability of some applications. Based on experience in making the Linux operating system scalable, and on more recent user-application work, we propose addenda to the C memory model and atomics library for inter-thread data-dependency ordering. This proposal is based on a similar proposal for C++ (WG21).

This proposal admits a trivial implementation, limiting significant implementation investment to those compilers and platforms where that investment will be recovered.

Problem

There are two significant use cases where the current working draft (WG21 N2461) does not support scalability near that possible on some existing hardware.

read access to rarely written concurrent data structures
Rarely written concurrent data structures are quite common, both in operating-system kernels and in server-style applications. Examples include data structures representing outside state (such as routing tables), software configuration (modules currently loaded), hardware configuration (storage device currently in use), and security policies (access control permissions, firewall rules). Read-to-write ratios well in excess of a billion to one are quite common.
publish-subscribe semantics for pointer-mediated publication
Much communication between threads is pointer-mediated, in which the producer publishes a pointer through which the consumer can access information. Access to that data is possible without full acquire semantics.

In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering, which includes ARM, PowerPC, and embedded MIPS. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.

On other hardware architectures, including x86, SPARC TSO, and the IBM mainframe, use of inter-thread dependency ordering can permit the compiler to undertake code-movement and value-speculation optimizations that would be prohibited by use of alternatives such as load-acquire.

A simplified example use of inter-thread data-dependency ordering, found within the Linux kernel, looks something like the following:


struct foo {
    int a;
    struct foo *next;
};
struct foo *head;

int b;  /* equivalent to atomic_int with relaxed access on Linux platforms. */

void insert(int a) {
   struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL); /* cannot fail */
   spin_lock(&mylock);
   p->a = a;
   p->next = head;
   rcu_assign_pointer(head, p); /* Can be thought of as a store-release. */
   spin_unlock(&mylock);
}

int getfirstval(void) { /* requires head non-NULL */
   struct foo *q = rcu_dereference(head);  /* see discussion below. */
   int retval = q->a + b;
   return retval;
}

The effect of getfirstval is to return the value at the head of the list with little more (or even no more) overhead than would be required if the list were immutable, but while still allowing updates. The rcu_dereference() API used in getfirstval() can be fully implemented in different ways, optimized for different classes of machines:

strong memory ordering (e.g., TSO)
rcu_dereference() simply prevents the compiler from performing optimizations that would order operations with data dependencies on q before the load from head. In this case, the code relies on the strong ordering to prevent the assignment to retval from seeing the pre-initialized version of the a field because the store to a must precede the store to head->next.
weak memory ordering with enforced data-dependency ordering
rcu_dereference() again prevents the compiler from performing optimizations that would order operations with data dependencies on q before the load from head. However, in this case, the code relies on the machine's enforcement of data-dependency ordering to prevent the assignment to retval from seeing the pre-initialized version of the a field, because q->a depends on q.
weak memory ordering without data-dependency ordering
rcu_dereference() is promoted to a load-acquire operation. Because the acquire prevents all subsequent memory references from being reordered with the load from head, it necessarily prevents any subsequent operations depending on q from being reordered with that load.

The current working document (WG14 N1401) would require that these machines implement rcu_dereference() using either an acquire fence or a load-acquire. In both cases, this prohibits useful classes of compiler optimizations that involve code motion that does not break dependencies on the load from head. In the above example, the compiler would be needlessly prohibited from reordering the fetch of b to precede the rcu_dereference(). Worse yet, this requires emitting expensive memory fences for the second class of machines, which can result in unacceptable performance degradation.

Current practice in the Linux kernel uses a combination of gcc extensions and volatile casts for rcu_dereference() as shown below:


#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

#define rcu_dereference(p) \
        ({ \
         typeof(p) _________p1 = ACCESS_ONCE(p); \
         smp_read_barrier_depends(); \
         (_________p1); \
        })

The “({})” notation allows C-language statements to be used in contexts expecting an expression. ACCESS_ONCE() undertakes the volatile cast, using typeof() to preserve type checking. rcu_dereference() uses ACCESS_ONCE() for the volatile cast, which prevents merging or redoing loads. The smp_read_barrier_depends() function is a no-op on all CPUs other than DEC Alpha, where it supplies the required memory fence. The final “(_________p1);” returns the value of the pointer. Optimizations that might break dependency chains must be suppressed. A user-level implementation of RCU is now available under the LGPL license.

Although this approach has worked for some decades, it would be preferable to have formal support for this usage pattern.

More elaborate examples are described in a presentation at the Oxford 2007 WG21 meeting, describing use cases from the Linux kernel. These use cases begin on slide 37 and include traversal of multiple levels of pointers, indexing arrays, and casts.

Prior Work

WG21 N2171 and WG21 N2176 are the basis for the current memory model. These proposals support a wide range of memory-ordering use cases, but do not support dependency ordering.

WG21 N2153 by Silvera, Wong, McKenney, and Blainey was the first proposal to explicitly address weakly ordered architectures and the issues surrounding dependency ordering. It was succeeded by WG21 N2237. These papers also presented a number of use cases motivating non-SC memory ordering, including dependency ordering.

WG21 N2195 by Peter Dimov proposes an atomic_load_address() template function that protects a single level of indirection. Although this suffices for the very simple example above, it does not handle other examples given in a presentation at the WG21 Oxford 2007 meeting describing use cases from the Linux kernel (beginning on slide 37). In particular, WG21 N2195 does not support data dependencies that traverse multiple levels of indirection, nor those that traverse array accesses.

An alternative proposal in WG21 N2195, introduces the notion of dynamic dependencies. Use of dynamic dependencies would permit the data-dependency trees to be scanned after performing those optimizations that do not break dynamic data-dependency trees. However, this proposal was rejected due to software-engineering concerns, which loom especially large in cases where the compiler is able to perform optimizations that the programmer cannot anticipate. For example, the programmer might be forgiven for assuming that an argument to a given function was variable, but a compiler doing inter-procedural analysis might discover that it was in fact constant, or, worse yet, zero. The compiler is therefore required to propagate dependency trees regardless of optimization.

WG21 N2492, by Paul E. McKenney, Hans-J. Boehm, and Lawrence Crowl, and WG21 N2493, and by Paul E. McKenney and Lawrence Crowl, present an approach combining elaborations to the memory model, atomics API, and annotations. Finally, WG21 N2556 by Paul E. McKenney, Hans-J. Boehm, and Lawrence Crowl incorporated feedback from the Bellevue meeting. This document further includes feedback from the Core Working Group at the Sophia Antipolis meeting.

Alternatives Considered

Although control dependencies are extremely intuitive, there are comparatively few known control-dependency use cases, and ARM CPUs only partially support control dependencies. Furthermore, some of the more troublesome optimization issues with switch statements involve control rather than data dependencies. Therefore, there is no support for control dependencies. If experience indicates that control dependencies also need to be enforced, a separate proposal will be put forward for them.

Prohibiting dependency-breaking optimizations would remove the need for annotations. This faced severe resistance, as a number of people felt that this would prohibit valuable optimizations. Therefore, this proposal requires annotations for function arguments and return values through which data dependencies are required to flow. As inter-compilation-unit analysis becomes more common, it is hoped that tools will appear that check annotations or perhaps even produce them automatically. However, individual implementations are free to avoid the dependency issue entirely by simply refraining from breaking data dependencies, or by emitting compensating memory fences when breaking data dependencies. (Full disclosure: this was in fact the original proposal.)

Simply relying on acquire fences would remove the need for dependency ordering. Although this is a reasonable strategy for many machines, it is inappropriate for weakly ordered machines that support data-dependency ordering.

Solution

We propose explicit program support for inter-thread data-dependency ordering. Programmers will explicitly mark the root of a tree of data-dependent operations, and implementations will respect that ordering.

Dependency Root

To mark the root of an inter-thread data-dependency tree, programmers will use a new variant of the atomic load defined in WG21 N2427. Specifically, this proposal augments WG21 N2427 by adding a memory_order_consume option that may be supplied to operations for which data-dependency semantics are permitted. The memory_order enumeration in WG21 N2427 would then read as follows, keeping the enumeration in rough order of increasing memory-ordering strength:


typedef enum memory_order {
    memory_order_relaxed, memory_order_consume, memory_order_acquire,
    memory_order_release, memory_order_acq_rel, memory_order_seq_cst
} memory_order;

Dependency Tree

Given the loaded value as the root of a data-dependency tree, the tree is loosely defined as including any operation or function (within the same thread) that has a data-dependent argument within the tree or that reads a variable stored by a data-dependent assignment within the tree.

Note that it is possible for a given value to be part of multiple dependency trees. One way that this might happen would be to add a value in one dependency tree to another value in a different dependency tree. The sum would then be in both dependency trees.

The compiler must preserve the dependency tree through all optimizations. In particular, if the compiler is able to optimize a member of a dependency tree to a constant, then the compiler must either produce code that preserves the dependency tree or emit a memory fence appropriate to the target architecture.

A data-dependency tree ends by the death of values within the tree. Since trees extend into called functions and out through return values, these trees may extend until the end of program execution. The section on implementation describes strategies for dealing with this unbounded extent in the normal compilation process.

When normal compilation of an unbounded extent proves too inefficient, the programmer may explicitly prune a data-dependency tree by passing a value through the identity macro kill_dependency. The result is, by definition, not inter-thread data-dependent on the argument, even though the values are identical.

Informal Specification

Informally, we define inter-thread data-dependency ordering in terms of

a 'consume' operation
that is a weaker form of the 'acquire' operation,
a 'carries dependency to' relationship
which is a strict subset of the 'sequenced before' relationship, a subset describing how data dependencies propagate,
a 'dependency-ordered before' relationship
that captures the operations depending on the 'consume' operation, and
an 'inter-thread happens before' relationship
that is a component of the general 'happens before' relationship.

The full details are within the formal wording.

Examples

In WG21 N2176, Hans Boehm lists a number of example optimizations that can break dependency trees, which are discussed in the following subsections.

Indirection

WG21 N2176 example code:


r1 = x.load(memory_order_relaxed);
r2 = *r1;

Recoding to this proposal's API:


r1 = x.load(memory_order_consume);
r2 = *r1;

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered.

Code Removal

WG21 N2176 example code:


r1 = x.load(memory_order_relaxed);
r3 = &a + r1 - r1;
r2 = *r3;

This could legitimately be optimized to the following, breaking the dependency tree:


r1 = x.load(memory_order_relaxed);
r3 = &a;
r2 = *r3;

However, recoding to this proposal's API:


r1 = x.load(memory_order_consume);
r3 = &a + r1 - r1;
r2 = *r3;

Again assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered. Because the dependency trees must be traced prior to optimization, if the optimization is performed, a countervailing memory fence or artificial data dependency must be inserted.

Control-Sensitive Indirection

WG21 N2176 example code, recoding to this proposal's API:


r1 = x.load(memory_order_consume);
if (r1 == 0)
        r2 = *r1;
else
        r2 = *(r1 + 1);

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered.

Constant Propagation

WG21 N2176 example code, as modified during email discussions, where x is known to be either 0 or 1:


if (x.load(memory_order_consume))
	...
else
	...
y = 42 * x / 13;

This might be optimized to the following:


if (x.load(memory_order_consume)) {
	...
	y = 3;
} else {
	...
	y = 0;
}

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the assignment to y, so the dependency is ordered. If the underlying machine preserves control-dependency ordering for writes, this optimization is perfectly legal. If the underlying machine does not preserve control-dependency ordering, then either this optimization must be avoided, a memory fence must be emitted after the load of x, or an artificial data dependency must be manufactured. An example artificial data dependency might be as follows:


if (r1 = x.load(memory_order_consume)) {
	...
	y = 3;
} else {
	...
	y = 0;
}
y = y + r1 - r1;

The compiler would need to decide whether the add and subtract were better than the multiply and divide.

Control Elimination

WG21 N2176 example code:


r1 = x.load(memory_order_consume);
if (r1)
        r2 = y.a;
else
        r2 = y.a;

This might be optimized to the following in order to break dependency trees:


r1 = x.load(memory_order_consume);
r2 = y.a;

This is a control dependency, so falls outside the scope of this proposal.

Control Dependence

WG21 N2176 example code:


r1 = x.load(memory_order_consume);
if (r1)
	f(&y);
else
	g(&y);

Assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. However, there is no data dependency between the load and either of the function calls. There is instead a control dependency, which does not force ordering in this proposal.

If this example were to be modified so that the variable r1 were passed to f() and g() (rather than y as shown above), then the functions would have a data dependency on the load.

Conditional Subexpression Elimination

WG21 N2176 example code:


r2 = x.load(memory_order_consume);
r3 = r2->a;

There might be a temptation to optimize the code as follows, where r1 holds a value that the compiler has reason to believe is usually equal to r2:


r2 = x.load(memory_order_consume);
r3 = r1->a;
if (r1 != r2) r3 = r2->a;

However, assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r2, so the dependency is ordered and the optimization prohibited, at least in absence of a compensating fence or artificially generated data dependency.

Constant Results

WG21 N2176 example code:


r1 = x.load(memory_order_consume);
r2 = a[r1->index % a_size];

If the variable a_size is known to the compiler to have the value one, then there might be a temptation to optimize as follows:


r1 = x.load(memory_order_consume);
r2 = a[0];

However, again assuming that x is an atomic, the x.load(memory_order_consume) will form the root of a dependency tree. The dependency tree extends to the indirection through r1, so the dependency is ordered. Therefore, this optimization is prohibited unless accompanied by a compensating memory fence or artificial data dependency.

Selective Dependency Ordering

In some cases, dependency ordering is important only for some fields of a structure. For example:


r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[r2]);

Indexing a[] with an uninitialized field could be fatal, but once the corresponding array element has been fetched, we might not care about subsequent dependencies. The kill_dependency primitive enables the programmer to tell the compiler that specific dependencies may be broken, for example, as follows:


r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[kill_dependency(r2)]);

This allows the compiler to reorder the call to do_something_with, for example, by performing speculative optimizations that predict the value of a[r2].

Implementation

There are several implementation strategies. The first strategy is acceptable on all machines and compilers. The subsequent strategies are appropriate to subsets thereof.

This proposal is expected to have minimal effect on strongly ordered machines (e.g., x86) and on weakly ordered machines that do not support data-dependency ordering (e.g., Alpha). The major burden of this proposal would fall on weakly ordered machines that do enforce data-dependency ordering, such as MIPS, ARM, and PowerPC, and on their compilers, which must not reorder data-dependent operations. Even for these architectures, a fully conforming compiler could use the same approach as for weakly ordered machines that do not support data-dependency ordering, albeit at a performance penalty.

Promote to Acquire

Simply promoting all memory_order_consume operations to memory_order_acquire will meet the requirements of this proposal.

For weakly ordered machines without data-dependency ordering, this implementation is also necessary. For other machines, it serves as a trivial first implementation.

Avoid Some Optimizations

Compilers can implement memory_order_consume loads as regular loads, so long as the compiler attempts no optimizations that break data dependencies. This strategy will be particularly useful for non-optimizing compilers.

This strategy does not apply to weakly ordered machines without data-dependency ordering, but only to strongly ordered machines or weakly ordered machines with data-dependency ordering.

Track Optimizations

For implementations on strongly ordered machines or weakly ordered machines with data-dependency ordering, compilers can implement memory_order_consume loads as regular loads, so long as the compiler tracks operations within a data-dependency tree and avoids optimizations that break data dependencies of those operations. Note, however, the caveat in the next subsection.

In terms of the implementation burden on compilers, some of the compiler work to implement this strategy is also required to respect the existing memory_order_acquire loads.

This strategy applies primarily to weakly ordered machines with data-dependency ordering, secondarily to strongly ordered machines, and does not apply to weakly ordered machines without data-dependency ordering.

Truncate Data-Dependency Trees

The above strategy implies that the compiler is avoiding optimizations in all functions dynamically called on a data-dependency tree. This implication is unacceptable for compilers that see only a portion of those functions.

However, the compiler does not need to see all functions; it can simply emit an acquire fence on the tree root (which is atomic) before a tree extends into a function call or out of a function return. Given such a convention, the compiler can assume that there are no optimization restrictions at the start of a function. This strategy enables fully-optimized per-function compilation, with run-time performance no worse than, and often much better than, the first strategy.

This strategy becomes more effective when performed after inlining or when considered in inter-procedural optimization.

Annotate Functions

Many uses of data-dependency operations will be in the implementation of data structures. If their (presumably non-inline) access functions must truncate the data-dependency tree on return, much of the potential performance of data-dependency ordering may be lost.

To address this performance opportunity, we propose to annotate function parameters and results to indicate that the compiler should assume that code on the other side of the function will handle dependencies correctly.

As these annotations are not essential to data-dependency ordering, they are covered in a separate proposal, WG21 N2493.

Proposed Wording

This section proposes wording changes to working draft N1401.

5.1.2.4 Multi-threaded executions and data races

Edit paragraph 5 as follows:

The library defines a number of atomic operations (7.16) and operations on locks mutexes (lib - ??? 7.24) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either an acquire operation or a release operation, or both an acquire and release operation, or a consume operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics.

Editing note: Changing "lock" to "mutex" is an editorial change of terminology which should happen globally through 5.1.2.4, for consistency with the terminology used in 7.24. It is indicated herein, only in paragraphs that need to be changed for this technical proposal, solely as an aid to the reader.

Edit paragraph 6 as follows:

NOTE 2 For example, a call that acquires a lock mutex will perform an acquire operation on the locations composing the lock mutex. Correspondingly, a call that releases the same lock mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform an acquire or consume operation on A. We do not include relaxed atomic operations as synchronization operations although, like synchronization operations, they cannot contribute to data races.

Before paragraph 14, add the following paragraphs:

14 An evaluation A carries a dependency to an evaluation B if

15 NOTE The "carries a dependency" relation is a subset of the "sequenced before" relation, and is similarly strictly intra-thread.

16 An evaluation A is dependency-ordered before an evaluation B if

17 NOTE The "dependency-ordered before" relation is analogous to the "synchronizes with" relation, but uses release/consume in place of release/acquire.

Edit the existing paragraph 14 (which should be renumbered to paragraph 18) as follows:

14 18 An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, or A is dependency-ordered before B, or for some evaluation X:

Editing note: In the C++ WD, this paragraph has two levels of bullets. The bullets that appear in the C WD are at the second level, and the conditions expressed in the lead-in are at the first level.

Immediately after the previous paragraph (formerly numbered 14), add a new paragraph (which should be numbered 19):

19 NOTE The "inter-thread happens before" relation describes arbitrary concatenations of "sequenced before", "synchronizes with" and "dependency-ordered before" relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with "dependency-ordered before" followed by "sequenced before". The reason for this limitation is that a consume operation participating in a "dependency-ordered before" relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of "sequenced before". The reasons for this limitation are (1) to permit "inter-thread happens before" to be transitively closed and (2) the "happens before" relation, defined below, provides for relationships consisting entirely of "sequenced before".

7.16.1 Introduction

Edit paragraph 2 as follows:

The macros defined are

ATOMIC_INTEGRAL_LOCK_FREE
ATOMIC_ADDRESS_LOCK_FREE

which indicate the general lock-free property of integer and address atomic types; and

ATOMIC_FLAG_INIT

which expands to an initializer for an object of type atomic_flag. ; and

kill_dependency

which terminates a dependency chain.

7.16.2 Order and Consistency

Edit paragraph 1 as follows:

The enumerated type memory_order specifies the detailed regular (non-atomic) memory synchronization operations as defined in 5.1.2.4 and may provide for operation ordering. Its enumeration constants are as follows:

memory_order_relaxed
memory_order_consume
memory_order_acquire
memory_order_release
memory_order_acq_rel
memory_order_seq_cst

After paragraph 4, insert a new paragraph:

For memory_order_consume, a load operation performs a consume operation on the affected memory location.

7.16.2.1 The kill_dependency macro

Insert a new section with the preceding title and the following content:

Synopsis

1 #include <stdatomic.h>
type kill_dependency(type y);

Description

2 The argument does not carry a dependency to the return value (5.1.2.4).

Returns

3 The value of y.

7.16.3.1 The atomic_thread_fence function

Edit the second bullet of paragraph 2 as follows:

7.16.6.1 The atomic_store generic functions

Edit paragraph 2 as follows:

The order argument shall not be memory_order_consume, memory_order_acquire, nor memory_order_acq_rel. Atomically replace the value pointed to by object with the value of desired. Memory is affected according to the value of order.