C++ Atomic Types and Operations

ISO/IEC JTC1 SC22 WG21 N2427 = 07-0297 - 2007-10-03

Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

This document is a revision of N2393 = 07-0253 - 2007-09-10.


Introduction
    Rationale
    Goals
Discussion of Design
    A Library API
    Atomic Types, Operations, and Volatile
    C and C++ Interoperability
    Memory Ordering
    Consistency
    Fences
    Dependent Memory Order
    Lock-Free Property
    Flag Type and Operations
    Integral and Address Types
    Atomic Operations
    Generic Types
    Implementability
Prior Approaches
Changes to the Standard
    Chapter 17 Library introduction [library]
    Chapter 20 General utilities library [utilities]
    20.1 Concepts [utility.concepts]
    20.1.2 Comparisons [concept.comparison]
    20.1.5 Copy and move [concept.copymove]
    Chapter 29 Atomics library [atomics]
    29.1 Order and Consistency [atomics.order]
    29.2 Lock-Free Property [atomics.lockfree]
    29.3 Flag Type and Operations [atomics.flag]
    29.4 Integral and Address Types [atomics.types]
    29.5 Operations [atomics.operations]
    29.6 Generic Types [atomics.generic]
    29.7 Synopsis [atomics.synopsis]
Implementation
    Notes on the Presentation
    Implementation Files
    C++0x Features
    Memory Order
    Flag Type and Operations
    Implementation Macros
    Lock-Free Macro
    Integral and Address Types
        Boolean
        Address
        Integers
        Integer Typedefs
        Characters
    Template Types
        Fully Generic Type
        Pointer Partial Specialization
        Integral Full Specializations
    C++ Core Functions
    C Core Macros
    C++ Methods
    Implementation Header Cleanup
    Standard Headers
Examples of Use
    Flag
    Lazy Initialization
    Integer
    Event Counter
    List Insert
    Update
    Main

Introduction

We propose standard atomic types and operations. In addition to normative wording for the interface, we present a minimal conforming implementation and example uses of the interface.

Rationale

The traditional shared-memory notion that every store is instantly visible to all threads induces an unacceptable performance loss on modern systems. Therefore programmers must have a mechanism to indicate when stores in one thread should be communicated to another.

Specifically, a program that wishes to communicate the fact that a particular piece of data prepared by one thread is ready to be examined by another thread needs a shared flag variable that both indicates that the data is ready and ensures that all prior writes to the data are visible to the second thread.

Although the second aspect is often glossed over, it is usually not automatic with modern hardware and compilers, and is just as important as the first in ensuring correctness.
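
As a concrete sketch in terms of the interface proposed below, such a flag might be used as follows (the data value and function names are illustrative):

#include <stdatomic.h>

int data;                       // payload prepared by thread 1
atomic_bool ready( false );     // the shared flag

void producer()                 // thread 1
{
    data = 42;                                    // prepare the data
    ready.store( true, memory_order_release );    // publish: also makes the write to data visible
}

void consumer()                 // thread 2
{
    while ( ! ready.load( memory_order_acquire ) )
        continue;                                 // pairs with the release store above
    // here data is guaranteed to be 42
}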

In POSIX, this communication is achieved by calling certain functions, particularly mutex lock and unlock. While mutexes are adequate for many applications, there is a significant need for finer-grained atomic operations:

Lock-free concurrent data structures
Lock-free concurrent data structures are difficult to design and implement correctly. The few programmers capable of doing so need to write portable code, and for that they need standards.
Inter-thread memory synchronization
Occasionally synchronization with locks offers insufficient performance, and other, more specific idioms suffice.

Goals

We had several goals for our design. Each had significant impact on the details.

Discussion of Design

While our proposal is based on existing practice, that practice is somewhat fragmented, so some design choices may not be obvious. In this section, we discuss our choices and their reasoning.

A Library API

We propose to add atomic types and operations purely as a library API. In practice, for C++, this API would have to be implemented largely with either compiler intrinsics or assembly code. (As such, this proposal should be implemented by compiler vendors, not library vendors, much as the exception facilities are implemented by compiler vendors.) For C, a compiler implementation is required for the type-generic macros.

Atomic Types, Operations, and Volatile

We propose atomic types rather than atomic operations on general-purpose types. Doing so enforces a single synchronization protocol for all operations on an object. Using two protocols simultaneously will result in synchronization failure.

We chose to specify atomic types with conventional type names rather than modify the volatile type qualifier for concurrency. The volatile type qualifier has a long history within C and C++, and changing its meaning is both risky and unnecessary. In addition, the existing meaning of volatile, "may be modified by external agents", is somewhat orthogonal to "may be modified concurrently by the program". See N2016 Should volatile Acquire Atomicity and Thread Visibility Semantics? for a more complete discussion. As a consequence, objects of atomic type may also be volatile qualified. Compilers may optimize some non-volatile atomic operations, where they would not be able to optimize volatile operations.

C and C++ Interoperability

The proposal defines the atomic types as C++ standard-layout structs. In C, these would be simply opaque types. Headers common to both C and C++ use exactly the same syntax to declare atomic variables.

Furthermore, the proposal defines the C++ atomic types such that the static initialization syntax can be identical to C aggregate initialization syntax. That is, atomic variables can be initialized in common headers as well.

The core atomic operations are free functions that take pointers to atomic types. Programmers use the same syntax for these operations in both C and C++. That is, a header included from both C and C++ can provide inline functions that operate on atomic types.

The proposal defines the core functions as overloaded functions in C++ and as type-generic macros in C. This approach helps programmers avoid changing code when an atomic type changes size.

Because free functions are occasionally clumsy, the proposal additionally provides member operators and member functions so that C++ programmers may use a more concise syntax.
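
For example, all three styles apply to a single variable (a sketch against the proposed headers):

#include <stdatomic.h>

atomic_int counter = { 0 };     // aggregate initialization, identical in C

void touch()
{
    atomic_store( &counter, 1 );                  // free function, common to C and C++
    counter.store( 2, memory_order_seq_cst );     // C++ member function
    counter = 3;                                  // C++ member operator
}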

Memory Ordering

Synchronization operations in the memory model (N2052, N2138, N2171, N2176, N2177, N2300, N2392) may be either acquire or release operations, or both. These operations govern the communication of non-atomic stores between threads. A release operation ensures that prior memory operations become visible to a thread performing a subsequent acquire operation on the same object.

Rather than have atomic operations implicitly provide both acquire and release semantics, we choose to complicate the interface by adding explicit ordering specifications to various operations. Many comparable packages do not, and instead provide only a single version of each operation, like compare-and-swap, which implicitly includes a full memory fence.

Unfortunately, the extra ordering constraint introduced by the single version is almost never completely necessary. For example, an atomic increment operation may be used simply to count the number of times a function is called, as in a profiler. This requires atomicity, but no ordering constraints. And on many architectures (PowerPC, ARM, Itanium, Alpha, though not X86), the extra ordering constraints are at least moderately expensive.

The proposal defines an enumeration that enables detailed specification of the memory order for every operation.
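
For example, the profiling counter mentioned above needs atomicity but no ordering (a sketch):

#include <stdatomic.h>

atomic_ulong calls = { 0 };

void profiled_function()
{
    // memory_order_relaxed: atomic, but imposes no inter-thread ordering
    calls.fetch_add( 1, memory_order_relaxed );
    /* ... actual work ... */
}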

Consistency

A strict interpretation of acquire and release yields a fairly weak consistency model, which allows threads to have a different notion of the order of writes. For stronger consistency, this proposal distinguishes between an operation with acquire and release semantics and an operation with sequentially consistent semantics. See N2177, Sequential Consistency for Atomics.

The member operators provide no mechanism for specifying a memory order, and so they always have the strongest synchronization. This default is not an undue burden because using weaker synchronization requires considerable thought. Any extra syntactic burden is dwarfed by the semantic burden.

Program auditors can search for uses of the memory order enumeration to identify code that uses synchronization models that are weaker than sequentially consistent.

Fences

It is also unclear how a convention requiring full global memory fences would properly interact with an interface that supported simple atomic loads and stores. Here a full memory fence would generally multiply the cost by a large factor. (The current gcc interface does not seem to support simple atomic loads and stores explicitly, which makes it unclear how to support e.g. lock-based emulation, or architectures on which the relevant loads and stores are not implicitly atomic.)

There are two possible approaches to specifying ordering constraints:

Both approaches appear to have their merits. We chose the latter for reasons described in N2176.

Some architectures provide fences that are limited to loads or stores. We have, so far, not included them, since it seems to be hard to find cases in which both:

However, we have provided per-variable fence operations, which are semantically modeled as read-modify-write operations. Such fences enable programmers to conditionalize memory synchronization, which can substantially improve performance. See N2153 A simple and efficient memory model for weakly-ordered architectures for motivating examples and the Lazy Initialization example for a use of the proposed syntax.
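
A condensed sketch of that idiom follows; widget is a hypothetical type, and a fuller version appears under Examples of Use:

#include <stdatomic.h>

struct widget { /* ... */ };

atomic< widget* > instance( 0 );
atomic_flag instance_lock = ATOMIC_FLAG_INIT;

widget* get_instance()
{
    widget* p = instance.load( memory_order_relaxed );
    if ( p == 0 ) {
        while ( instance_lock.test_and_set( memory_order_acquire ) )
            continue;                             // simple spin lock, illustrative only
        p = instance.load( memory_order_relaxed );
        if ( p == 0 ) {
            p = new widget;
            instance.store( p, memory_order_release );
        }
        instance_lock.clear( memory_order_release );
    }
    else
        instance.fence( memory_order_acquire );   // pay for synchronization only when needed
    return p;
}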

We expect that implementations that have hardware fences will use such operations to implement the language fences. See N2362 Converting Memory Fences to N2324 Form, for a discussion of the issues and approaches to converting from existing practice of global fences to the per-variable fences in this proposal.

Dependent Memory Order

Most architectures provide additional ordering guarantees if one memory operation is dependent on another. In fact, these are critical for efficient implementation of languages like Java.

For C++, there is near-universal agreement that it would be nice to have some such guarantees. The fundamental issues are:

See papers N2359 C++ Dependency Ordering: Atomics, N2360 C++ Dependency Ordering: Memory Model, and N2361 C++ Dependency Ordering: Function Annotation for exploration of these issues.

Lock-Free Property

In some cases, both the decision to use a lock-free algorithm, and sometimes the choice of lock-free algorithm, depends on the availability of underlying hardware primitives. In other cases, e.g. when dealing with asynchronous signals, it may be important to know that operations like compare-and-swap are really lock-free, because a lock-based emulation might result in deadlock.

The proposal defines feature queries to determine whether or not operations are lock-free. We provide two kinds of feature queries: compile-time preprocessor macros and run-time functions. To facilitate optimal storage use, the proposal supplies feature macros that indicate the general lock-free status of the integral and address atomic types. The run-time functions provide per-object information.

The proposal provides run-time lock-free query functions rather than compile-time constants because subsequent implementations of a platform may upgrade locking operations to lock-free operations. It is common for systems to abstract such facilities behind dynamic libraries, and we wish to leave that possibility open. Furthermore, we recommend that implementations without hardware atomic support use that technique.

The proposal provides lock-free query functions on individual objects rather than whole types to permit unavoidably misaligned atomic variables without penalizing the performance of aligned atomic variables.

Because consistent use of operations requires that all operations on a type must use the same protocol, all operations are lock-free or none of them are. That is, the lock-free property applies to whole objects, not individual operations.

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state. While such a definition is beyond the scope of the standard, a clear statement of our intent will enable a portable expression of a class of programs already extant.

Flag Type and Operations

The proposal includes a very simple atomic flag type, providing two primary operations, test-and-set and clear. This type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with the atomic flag, though with less than ideal properties. Few programmers should be using this type.

We considered adding a wait operation in this proposal, but ultimately rejected it because pure busy waiting has pathological performance characteristics on multi-programmed machines and because doing better requires interaction with the operating system scheduler, which seems inappropriate to a processor-level facility.

Integral and Address Types

The proposal includes a full set of integral atomic types and an address atomic type.

This proposal includes atomic integral types smaller than a machine word, even though many architectures do not have such operations. For machines that implement a word-based compare-and-swap operation, the effect of sub-word operations can be achieved by loading the containing word, modifying the sub-word in place, and performing a compare-and-swap on the containing word, as sketched below. In the event that no compare-and-swap is available, the implementation may need to either implement smaller types with larger types or use locking algorithms. Using the same size for the atomic type as its base type eases the effort required to port existing code, e.g. code using __sync.
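
A sketch of the technique for a one-byte fetch-and-add, written against the proposed interface; the function name and the little-endian byte indexing are assumptions of this sketch:

#include <limits.h>
#include <stdatomic.h>

unsigned char byte_fetch_add( volatile atomic_ulong* word,
                              unsigned index, unsigned char operand )
{
    const int shift = index * CHAR_BIT;           // byte position within the word
    const unsigned long mask = 0xFFUL << shift;

    unsigned long expected = word->load( memory_order_relaxed );
    unsigned long desired;
    unsigned char old_byte;
    do {
        old_byte = (unsigned char)( ( expected & mask ) >> shift );
        desired = ( expected & ~mask )
                | ( (unsigned long)(unsigned char)( old_byte + operand ) << shift );
        // On failure, compare_swap refreshes expected with the observed value.
    } while ( ! word->compare_swap( expected, desired ) );
    return old_byte;
}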

Atomic Operations

The simplest atomic operations are load and store.

There are times when one wishes to store a new value, but also load the old value. An atomic load followed by an atomic store is not an atomic sequence, so we provide an atomic swap operation that does a combined load and store atomically.

We also provide the common fetch-and-modify operations. The fetch-and-modify functions return the original stored value. The original stored value is required for fetch-and-or and fetch-and-and because there is no means to compute the original stored value from the new value and the modifying argument. In contrast to the functions, the fetch-and-modify assignment operators return the new value. We do this for consistency with normal assignment operators. Unlike normal C++ assignment operators, though, the atomic assignments return values rather than references, which is like C. The reason is that another thread might intervene between an assignment and a subsequent read. Rather than introduce this classic parallel programming bug, we return a value.
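
For example (a sketch):

atomic_int n = { 5 };

void f()
{
    int a = n.fetch_add( 2 );   // a == 5, the original value; n is now 7
    int b = ( n += 2 );         // b == 9, the new value, as with an ordinary +=
}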

The normal signed integral addition operations have undefined behavior in the event of an overflow. For atomic variables, this undefined behavior introduces significant burden because there is in general no way to pre-test for overflow. Rather than leave overflow undefined, we recognize the de facto behavior of modern systems and define atomic fetch-and-add (-subtract) to use two's-complement arithmetic. We are aware of no implementation of C++ for which this definition is a problem.

The compare-and-swap operation seems to be the minimum general-purpose synchronization primitive. It appears to be both necessary and sufficient for most interesting lock-free algorithms. The compare-and-swap operations may fail spuriously. That is, the stored value may be known to be equal to the expected value, and yet the operation fails to swap. We permit this failure for two reasons. First, it appears unavoidable for load-locked store-conditional implementations. Second, hardware memory switches may prefer to fail some operations to ensure overall machine performance.

The compare-and-swap operations replace the expected value with the found value in the event that the found value does not match the expected value. The intent of the design is to set up a compare-and-swap loop for the next iteration. The reason for this design is that:

Unlike other operations, the compare-and-swap operations have two memory synchronization order parameters. The first is for operations that succeed; the second is for those that fail. The reason for this pair of specifications is that there are several circumstances in which the failure case can have substantially weaker, and substantially cheaper, synchronization. However, for sequential consistency, full synchronization is required.
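
For example, a sketch in which failed attempts need no synchronization:

atomic_int head = { 0 };

void advance()
{
    int expected = head.load( memory_order_relaxed );
    while ( ! head.compare_swap( expected, expected + 1,
                                 memory_order_acq_rel,
                                 memory_order_relaxed ) )
        continue;               // expected now holds the observed value; retry
}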

Generic Types

Generic atomic types provide significant benefits:

The generic atomic type has a partial specialization for pointers (derived from the atomic address type), and full specializations for integral types (derived from the atomic integrals).

The primary problem with a generic atomic template is that effective use of machine operations requires that its parameter types be trivially copy assignable and bitwise equality comparable.

Furthermore, parameter types that are not also statically initializable and trivially destructible may be difficult to use.

In the present language, there is no mechanism to require these properties of a type parameter. Roland Schwarz suggests using a template union to enforce POD type parameters. Unfortunately, that approach also prevents the derivation of specializations of atomic for the types above, which is unacceptable. Furthermore, Lois Goldthwaite proposes generalizing unions to permit non-POD types in N2248 Toward a More Perfect Union. We believe that concepts are a more appropriate mechanism to enforce this restriction, and so we have proposed the concepts and concept maps necessary for safe use of generic atomics. The burden on programmers using generic atomics is to declare that types passed as template parameters are in fact bitwise equality comparable. The proposed mechanisms will infer trivially copy assignable types.

The intent is that vendors will specialize a fully-general locking implementation of a generic atomic template with implementations using hardware primitives when those primitives are applicable. This specialization can be accomplished by defining a base template with the size and alignment of the parameter type as additional template parameters, and then specializing on those two arguments.
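
A sketch of that strategy; all names here are hypothetical, vendor-internal ones:

#include <cstddef>

template< typename T, std::size_t Size, std::size_t Align >
struct atomic_base                      // fully general: protects T with a lock
{ /* ... locking implementation ... */ };

template< typename T >
struct atomic_base< T, 4, 4 >           // e.g. 4-byte, 4-aligned types
{ /* ... native 32-bit atomic instructions ... */ };

template< typename T >
struct atomic : atomic_base< T, sizeof( T ), alignof( T ) >   // dispatch on layout
{ /* ... */ };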

Implementability

We believe that there is ample evidence for implementability.

The Intel/gcc __sync intrinsics provide evidence for compiler implementability of the proposed interface.

(We do not advocate standardizing these intrinsics as is. They provide far less control over memory ordering than we advocated above. For example, they provide no way to atomically increment a counter without imposing unnecessary ordering constraints. The lack of appropriate ordering control appears to already have resulted in implementation shortcuts, e.g. gcc does not implement __sync_synchronize() as a full memory barrier on X86, in spite of the documentation. We believe a number of issues were not fully understood when that design was developed, and it could greatly benefit from another iteration at this stage.)

Other packages, particularly Boehm's atomic_ops package, provide evidence of efficient implementability over a range of architectures.

The remaining implementation issue is the burden on implementors to produce a minimally conforming implementation. The minimum hardware support required is the atomic test-and-set and clear operations that form the basis of the atomic flag type. This proposal includes an example implementation based on that minimal hardware support, and thus shows the vendor work required.

Prior Approaches

This proposal results from experience with many approaches. For brevity, we address only those approaches with formal proposals before the C++ standards committee.

N1875 presented an atomic statement, which would require the compiler to select the appropriate atomic operations. We chose to abandon this approach because:

N2047 presented a template-based C++ library of atomic types layered on top of a C library of atomic operations on plain integral types. We chose to abandon this approach because:

N2145 introduced the basic approach adopted in this proposal.

N2153 proposes a fence-based memory model.

N2195 presents a lower level interface than that of N2145. N2195 deviates from the previous proposals in the following major areas:

N2324 derives from N2145, but was refined in several areas.

N2381 is primarily a set of editorial changes to N2324, but with the following substantive changes.

N2393 is a substantial editorial rewrite of N2381, with primary emphasis on providing normative wording. Changes are generally not substantive, but with the following exceptions.

N2427 is a minor change to N2393 correcting a weakness in the interaction between concepts, booleans, and hardware compare-and-swap.

Changes to the Standard

This section of the proposal provides the formal wording for the standard.

Chapter 17 Library introduction [library]

Add a new paragraph with the following wording:

The atomic components allow more fine-grained concurrent access to shared data than is possible with locks.

Chapter 20 General utilities library [utilities]

This chapter of the working paper has not yet been modified for concepts. The changes listed under this chapter will apply after those of N2322 Concepts for the C++0x Standard Library: Utilities or its adopted successor.

20.1 Concepts [utility.concepts]

In the synopsis, move LessThanComparable immediately before EqualityComparable.

In the synopsis, immediately after EqualityComparable add:

concept AtomicComparable<typename T> see below;
concept_map AtomicComparable<T> see below;

In the synopsis, immediately after CopyAssignable add:

concept TriviallyCopyAssignable<typename T> see below;
template<CopyAssignable T> concept_map TriviallyCopyAssignable<T> see below;

20.1.2 Comparisons [concept.comparison]

Move the paragraphs on LessThanComparable immediately before those on EqualityComparable.

Add a new paragraph after those on EqualityComparable.

Concept AtomicComparable enables a bitwise equality comparison, as with memcmp, in the implementation of compare-and-swap operations. [Note: Such types should not have padding, i.e. the size of the type is the sum of the sizes of its elements. If padding exists, the comparison may provide false negatives, but never false positives. —end note]


concept AtomicComparable<typename T> { }

Add a new paragraph.

All integral types are AtomicComparable.


concept_map AtomicComparable<char> { }
concept_map AtomicComparable<signed char> { }
concept_map AtomicComparable<unsigned char> { }
concept_map AtomicComparable<short> { }
concept_map AtomicComparable<unsigned short> { }
concept_map AtomicComparable<int> { }
concept_map AtomicComparable<unsigned int> { }
concept_map AtomicComparable<long> { }
concept_map AtomicComparable<unsigned long> { }
concept_map AtomicComparable<long long> { }
concept_map AtomicComparable<unsigned long long> { }
concept_map AtomicComparable<wchar_t> { }
concept_map AtomicComparable<char16_t> { }
concept_map AtomicComparable<char32_t> { }
concept_map AtomicComparable<bool> { }

Add a new paragraph.

All pointer types are AtomicComparable.


template< typename T > concept_map AtomicComparable<T*> { }

20.1.5 Copy and move [concept.copymove]

Add a new paragraph after the paragraph on CopyAssignable:

Concept TriviallyCopyAssignable requires the ability to assign to an object with a trivial assignment.


concept TriviallyCopyAssignable<typename T> : CopyAssignable<T> { }

template<CopyAssignable T>
    requires True<has_trivial_assign<T>::value>
concept_map TriviallyCopyAssignable<T> { }

Chapter 29 Atomics library [atomics]

Add a new clause with the following paragraphs.

This clause describes components for fine-grained atomic access. This access is provided via operations on atomic objects. [Footnote: Atomic objects are neither active nor radioactive. —end footnote]

The following subclauses describe atomics requirements, and components for types and operations, as summarized below.

Atomics library summary
Subclause                                     Header(s)
29.1 Order and Consistency [atomics.order]
29.2 Lock-Free Property [atomics.lockfree]
29.3 Flag Type and Operations [atomics.flag]
29.4 Integral and Address Types [atomics.types]
29.5 Operations [atomics.operations]
29.6 Generic Types [atomics.generic]
29.7 Synopsis [atomics.synopsis] <cstdatomic> <stdatomic.h>

29.1 Order and Consistency [atomics.order]

Add a new sub-clause with the following paragraphs.

The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.

memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.

The memory_order_seq_cst operations that load a value are acquire operations on the affected locations. The memory_order_seq_cst operations that store a value are release operations on the affected locations. In addition, in a consistent execution, there must be a single total order S on all memory_order_seq_cst operations, consistent with the happens before order and modification orders for all affected locations, such that each memory_order_seq_cst operation observes either the last preceding modification according to this order S, or the result of an operation that is not memory_order_seq_cst. [Note: Although we do not explicitly require that S include locks, it can always be extended to an order that does include lock and unlock operations, since the ordering between those is already included in the happens before ordering. —end note]
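
For illustration (a sketch, not proposed wording), with x and y initially zero,

Thread 1:
x.store( 1, memory_order_seq_cst );
r1 = y.load( memory_order_seq_cst );
Thread 2:
y.store( 1, memory_order_seq_cst );
r2 = x.load( memory_order_seq_cst );

the total order S must place one of the two stores first, so at least one load observes the other thread's store; the outcome r1 == 0 and r2 == 0 is impossible, although weaker orderings would permit it.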

An atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. [Footnote: Among other implications, atomic variables shall not decay. —end footnote] The ordering of evaluations in this sequence must be such that

  1. If an evaluation B observes a value computed by A in a different thread, then B must not happen before A.
  2. If an evaluation A is included in the sequence, then all evaluations that assign to the same variable and are sequenced before A or happen before A must be included.

[Note: Requirement 2 disallows "out-of-thin-air", or "speculative" stores of atomics when relaxed atomics are used. Since unordered operations are involved, evaluations may appear in this sequence out of thread order. For example, with x and y initially zero,

Thread 1:
r1 = y.load( memory_order_relaxed );
x.store( r1, memory_order_relaxed );
Thread 2:
r2 = x.load( memory_order_relaxed );
y.store( 42, memory_order_relaxed );

is allowed to produce r1 = r2 = 42. The sequence of evaluations justifying this consists of:

y.store( 42, memory_order_relaxed );
r1 = y.load( memory_order_relaxed );
x.store( r1, memory_order_relaxed );
r2 = x.load( memory_order_relaxed );

On the other hand,

Thread 1:
r1 = y.load( memory_order_relaxed );
x.store( r1, memory_order_relaxed );
Thread 2:
r2 = x.load( memory_order_relaxed );
y.store( r2, memory_order_relaxed );

may not produce r1 = r2 = 42, since there is no sequence of evaluations that results in the computation of 42. In the absence of "relaxed" operations and of read-modify-write operations with ordering weaker than memory_order_acq_rel, this requirement has no impact. —end note]

[Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:

Thread 1:
r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
Thread 2:
r2 = y.load( memory_order_relaxed );
if ( r2 == 42 ) x.store( 42, memory_order_relaxed );

Implementations are discouraged from allowing such behavior. —end note]

Implementations shall strive to make atomic stores visible to atomic loads within a reasonable amount of time. They shall never move an atomic operation out of an unbounded loop.

29.2 Lock-Free Property [atomics.lockfree]

Add a new sub-clause with the following paragraphs.

The macros ATOMIC_INTEGRAL_LOCK_FREE and ATOMIC_ADDRESS_LOCK_FREE indicate the general lock-free property of integral and address atomic types. The properties also apply to the corresponding specializations of the atomic template. A value of 0 indicates that the types are never lock-free. A value of 1 indicates that the types are sometimes lock-free. A value of 2 indicates that the types are always lock-free.

The function atomic_is_lock_free indicates whether or not the object is lock-free. The result of a lock-free query on one object cannot be inferred from the result of a lock-free query on another object.

[Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state. This restriction enables communication via memory mapped into a process more than once and memory shared between two processes. —end note]

29.3 Flag Type and Operations [atomics.flag]

Add a new sub-clause with the following paragraphs.

The atomic_flag type provides the classic test-and-set functionality. It has two states, set and clear.

Operations on an atomic_flag must be lock free. [Note: Hence the operations must be address-free. No other type requires lock-free operations, and hence the atomic_flag type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with atomic_flag, though with less than ideal properties. —end note]

The atomic_flag type shall have standard layout. It shall have a trivial default constructor, deleted copy constructor, a deleted copy assignment operator, and a trivial destructor.

The macro ATOMIC_FLAG_INIT shall be used to initialize an atomic_flag to the clear state. Such initialization shall be static for static-duration variables. An atomic_flag shall not be zero initialized. [Example:

atomic_flag guard = ATOMIC_FLAG_INIT;

—end example].
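
A minimal usage sketch (illustrative, not part of the proposed wording):

atomic_flag lock = ATOMIC_FLAG_INIT;

void locked_work()
{
    while ( lock.test_and_set( memory_order_acquire ) )
        continue;                       // spin until the holder clears the flag
    /* ... critical section ... */
    lock.clear( memory_order_release );
}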


bool atomic_flag_test_and_set( volatile atomic_flag* object )
bool atomic_flag_test_and_set_explicit( volatile atomic_flag* object,
        memory_order )
bool atomic_flag::test_and_set(
        memory_order order = memory_order_seq_cst ) volatile
Effects:
Atomically, set the value in object to true. Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in [the new section added by N2334 or successor], and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
Atomically, the value of object immediately before the effects.


void atomic_flag_clear( volatile atomic_flag* object )
void atomic_flag_clear_explicit( volatile atomic_flag* object,
        memory_order )
void atomic_flag::clear(
        memory_order order = memory_order_seq_cst ) volatile
Effects:
Atomically, set the value in object to false. Memory is affected as per order.
Requires:
The order argument shall be neither memory_order_acquire nor memory_order_acq_rel.


void atomic_flag_fence( const volatile atomic_flag* object,
        memory_order order )
void atomic_flag::fence( memory_order order ) const volatile
Effects:
Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in [the new section added by N2334 or successor], and hence such an operation synchronizes with any operation on the same variable.
Requires:
The order argument shall not be memory_order_relaxed.

The static-duration variable atomic_global_fence_compatibility of type atomic_flag may be used to emulate global fences. [Example:

atomic_flag_fence( &atomic_global_fence_compatibility, memory_order_acquire );
atomic_global_fence_compatibility.fence( memory_order_release );

—end example]

29.4 Integral and Address Types [atomics.types]

Add a new sub-clause with the following paragraphs.

The atomic integral and address types are listed below. These types shall have standard layout. They shall have a trivial default constructor, a constexpr explicit value constructor, a deleted copy constructor, a deleted copy assignment operator, and a trivial destructor. These types shall support aggregate initialization syntax.

The set of operations available for these types is defined in 29.7 Synopsis [atomics.synopsis]. The semantics of those operations are defined in 29.5 Operations [atomics.operations].

The atomic_bool type provides an atomic boolean.

The atomic_address type provides atomic void* operations. The unit of addition/subtraction is one byte. [Note: The fetch-and-and, fetch-and-or, and fetch-and-xor operations are not supported as those operations have no semantics for pointers. —end note]

There is a full set of atomic integral types:

atomic_char, atomic_schar, atomic_uchar, atomic_short, atomic_ushort, atomic_int, atomic_uint, atomic_long, atomic_ulong, atomic_llong, atomic_ullong, atomic_char16_t, atomic_char32_t, atomic_wchar_t

In addition to integral types, there are typedefs for atomic types corresponding to the <cstdint> typedefs.

atomic_int_least8_t, atomic_uint_least8_t, atomic_int_least16_t, atomic_uint_least16_t, atomic_int_least32_t, atomic_uint_least32_t, atomic_int_least64_t, atomic_uint_least64_t, atomic_int_fast8_t, atomic_uint_fast8_t, atomic_int_fast16_t, atomic_uint_fast16_t, atomic_int_fast32_t, atomic_uint_fast32_t, atomic_int_fast64_t, atomic_uint_fast64_t, atomic_intptr_t, atomic_uintptr_t, atomic_size_t, atomic_ssize_t, atomic_ptrdiff_t, atomic_intmax_t, atomic_uintmax_t

[Note: The representation of atomic integral and address types need not have the same size as their corresponding regular types. They should have the same size whenever possible, as it eases effort required to port existing code. —end note]

29.5 Operations [atomics.operations]

Add a new sub-clause with the following paragraphs.

There are only a few kinds of operations on atomic types, though there are many instances of those kinds. This section specifies each general kind, and the specific instances are defined in 29.7 Synopsis [atomics.synopsis].

In the following operation definitions:

an A refers to one of the atomic types;
a C refers to the corresponding non-atomic type;
an M refers to the type of the other argument for arithmetic operations; for the integral atomic types, M is C, and for the address atomic type, M is ptrdiff_t.

[Note: Many operations are volatile-qualified. The "volatile as device register" semantics have not changed in the standard. This qualification means that volatility is preserved when applying these operations to volatile objects. It does not mean that operations on non-volatile objects become volatile. Thus, volatile qualified operations on non-volatile objects may be merged under some conditions. —end note]


bool atomic_is_lock_free( const volatile A* object )
bool A::is_lock_free() const volatile
Returns:
True if the object's operations are lock-free, false otherwise.

void atomic_store( volatile A* object, C desired )
void atomic_store_explicit( volatile A* object, C desired,
         memory_order order )
void A::store( C desired,
         memory_order order = memory_order_seq_cst ) volatile
Effects:
Atomically replace the value in object with desired. Memory is affected as per order.
Requires:
The order argument shall be neither memory_order_acquire nor memory_order_acq_rel.


C A::operator =( C desired ) volatile
Effects:
store( desired )
Returns:
desired


C atomic_load( volatile A* object )
C atomic_load_explicit( volatile A* object, memory_order )
C A::load( memory_order order = memory_order_seq_cst ) volatile
Effects:
Memory is affected as per order.
Returns:
Atomically, the value of object.
Requires:
The order argument shall be neither memory_order_release nor memory_order_acq_rel.


C atomic_swap( volatile A* object, C desired )
C atomic_swap_explicit( volatile A* object, C desired, memory_order )
C A::swap( C desired,
         memory_order order = memory_order_seq_cst ) volatile
Effects:
Atomically replace the value in object with desired. Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in [the new section added by N2334 or successor], and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
Atomically, the value of object immediately before the effects.


bool atomic_compare_swap( volatile A* object, C* expected, C desired )
bool atomic_compare_swap_explicit( volatile A* object,
        C* expected, C desired, memory_order success, memory_order failure )
bool A::compare_swap( C& expected, C desired,
        memory_order success, memory_order failure ) volatile
bool A::compare_swap( C& expected, C desired,
        memory_order order = memory_order_seq_cst ) volatile
Effects:
Atomically, compares the value in object for equality with that in expected, and if true, replaces the value in object with desired, and if false, updates the value in expected with the value in object. If true, memory is affected as per success. If false, memory is affected as per failure. When only a single order is specified, success is order and failure is order with memory_order_acq_rel decaying to memory_order_acquire and memory_order_release decaying to memory_order_relaxed. [Footnote: The memory order decays, not the memory itself. —end footnote] These operations are read-modify-write operations in the sense of the "synchronizes with" definition in [the new section added by N2334 or successor], and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
The result of the comparison.
Requires:
The failure argument shall be neither memory_order_release nor memory_order_acq_rel. The failure argument shall be no stronger than the success argument.

[Note: The effect of the compare-and-swap operations is


if ( *object == *expected )
    *object = desired;
else
    *expected = *object;

—end note]

The compare-and-swap operations may fail spuriously, that is return false while leaving the value in expected unchanged. [Note: This spurious failure enables implementation of compare-and-swap on a broader class of machines, e.g. load-locked store-conditional machines. —end note] [Example: A consequence of spurious failure is that nearly all uses of compare-and-swap will be in a loop.


expected = current.load();
do desired = function( expected );
while ( ! current.compare_swap( expected, desired ) );

—end example]


void atomic_fence( const volatile A* object, memory_order order )
void A::fence( memory_order order ) const volatile
Effects:
Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in [the new section added by N2334 or successor], and hence such an operation synchronizes with any operation on the same variable.
Requires:
The order argument shall not be memory_order_relaxed.

The following operations perform arithmetic computations. The key, operator, and computation correspondence is:

key    op    computation
add    +     addition
sub    -     subtraction
or     |     bitwise inclusive or
xor    ^     bitwise exclusive or
and    &     bitwise and

C atomic_fetch_key( volatile A* object, M operand )
C atomic_fetch_key_explicit( volatile A* object, M operand,
        memory_order )
C A::fetch_key( M operand,
        memory_order order = memory_order_seq_cst ) volatile
Effects:
Atomically replace the value in object with result of the computation applied to the value in object and the given operand. Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in [the new section added by N2334 or successor], and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
Atomically, the value of object immediately before the effects.

For signed integral types, arithmetic is defined to use two's-complement representation. There are no undefined results. For address types, the result may be an undefined address, but the operations otherwise have no undefined behavior.
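
For example (a sketch, using SCHAR_MAX from <limits.h>):

atomic_schar c( SCHAR_MAX );

void g()
{
    c.fetch_add( 1 );   // well-defined: the value wraps to SCHAR_MIN
}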


C A::operator op=( M operand ) volatile
Effects:
fetch_key( operand )
Returns:
fetch_key( operand ) op operand

C A::operator ++( int ) volatile
Effects:
fetch_add( 1 )
Returns:
fetch_add( 1 )

C A::operator --( int ) volatile
Effects:
fetch_sub( 1 )
Returns:
fetch_sub( 1 )

C A::operator ++() volatile
Effects:
fetch_add( 1 )
Returns:
fetch_add( 1 ) + 1

C A::operator --() volatile
Effects:
fetch_sub( 1 )
Returns:
fetch_sub( 1 ) - 1

29.6 Generic Types [atomics.generic]

Add a new sub-clause with the following paragraphs.

There is a generic atomic class template. The template requires that its type argument be trivially copy assignable and bitwise equality comparable. [Note: Type arguments that are not also statically initializable and trivially destructible may be difficult to use. —end note]

The atomic template instances shall have a deleted copy constructor and a deleted copy assignment operator. They shall have a constexpr explicit value constructor.

There are pointer partial specializations on the atomic class template. These specializations publicly derive from atomic_address. For atomic pointer partial specializations, the unit of addition/subtraction is the size of the referenced type. These instances shall have trivial default constructors and trivial destructors.

There are full specializations over the integral types on the atomic class template. These specializations publicly derive from the corresponding atomic integral type. These instances shall have trivial default constructors and trivial destructors.
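
For illustration, the differing units of addition (a sketch, not proposed wording):

void h()
{
    int a[10];

    atomic< int* > p( a );
    p.fetch_add( 1 );             // advances by sizeof(int): p now holds a + 1

    atomic_address b( a );
    b.fetch_add( 1 );             // advances by exactly one byte
}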

29.7 Synopsis [atomics.synopsis]

Add a new sub-clause with the following paragraphs.

The atomic types and operations have the following synopsis.


namespace std {

#define ATOMIC_INTEGRAL_LOCK_FREE implementation-defined
#define ATOMIC_ADDRESS_LOCK_FREE implementation-defined

typedef enum memory_order {
       memory_order_relaxed, memory_order_acquire, memory_order_release,
       memory_order_acq_rel, memory_order_seq_cst
} memory_order;

typedef struct atomic_flag
{
    bool test_and_set( memory_order = memory_order_seq_cst ) volatile;
    void clear( memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;

    atomic_flag() = default;
    atomic_flag( const atomic_flag& ) = delete;
    atomic_flag& operator =( const atomic_flag& ) = delete;
} atomic_flag;

#define ATOMIC_FLAG_INIT implementation-defined

extern const atomic_flag atomic_global_fence_compatibility;

bool atomic_flag_test_and_set( volatile atomic_flag* );
bool atomic_flag_test_and_set_explicit( volatile atomic_flag*,
         memory_order );
void atomic_flag_clear( volatile atomic_flag* );
void atomic_flag_clear_explicit( volatile atomic_flag*, memory_order );
void atomic_flag_fence( const volatile atomic_flag*, memory_order );

typedef struct atomic_bool
{
    bool is_lock_free() const volatile;
    void store( bool, memory_order = memory_order_seq_cst ) volatile;
    bool load( memory_order = memory_order_seq_cst ) volatile;
    bool swap( bool, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( bool&, bool, memory_order, memory_order ) volatile;
    bool compare_swap( bool&, bool,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;

    atomic_bool() = default;
    constexpr explicit atomic_bool( bool );
    atomic_bool( const atomic_bool& ) = delete;
    atomic_bool& operator =( const atomic_bool & ) = delete;
    bool operator =( bool ) volatile;
} atomic_bool;

bool atomic_is_lock_free( const volatile atomic_bool* );
void atomic_store( volatile atomic_bool*, bool );
void atomic_store_explicit( volatile atomic_bool*, bool, memory_order );
bool atomic_load( volatile atomic_bool* );
bool atomic_load_explicit( volatile atomic_bool*, memory_order );
bool atomic_swap( volatile atomic_bool*, bool );
bool atomic_swap_explicit( volatile atomic_bool*, bool, memory_order );
bool atomic_compare_swap( volatile atomic_bool*, bool*, bool );
bool atomic_compare_swap_explicit( volatile atomic_bool*, bool*, bool,
                                   memory_order, memory_order );
void atomic_fence( const volatile atomic_bool*, memory_order );

typedef struct atomic_address
{
    bool is_lock_free() const volatile;
    void store( void*, memory_order = memory_order_seq_cst ) volatile;
    void* load( memory_order = memory_order_seq_cst ) volatile;
    void* swap( void*, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( void*&, void*,
                       memory_order, memory_order ) volatile;
    bool compare_swap( void*&, void*,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;
    void* fetch_add( ptrdiff_t,
                     memory_order = memory_order_seq_cst ) volatile;
    void* fetch_sub( ptrdiff_t,
                     memory_order = memory_order_seq_cst ) volatile;

    atomic_address() = default;
    constexpr explicit atomic_address( void* );
    atomic_address( const atomic_address& ) = delete;
    atomic_address& operator =( const atomic_address& ) = delete;
    void* operator =( void* ) volatile;
    void* operator +=( ptrdiff_t ) volatile;
    void* operator -=( ptrdiff_t ) volatile;
} atomic_address;

bool atomic_is_lock_free( const volatile atomic_address* );
void atomic_store( volatile atomic_address*, void* );
void atomic_store_explicit( volatile atomic_address*, void*, memory_order );
void* atomic_load( volatile atomic_address* );
void* atomic_load_explicit( volatile atomic_address*, memory_order );
void* atomic_swap( volatile atomic_address*, void* );
void* atomic_swap_explicit( volatile atomic_address*, void*, memory_order );
bool atomic_compare_swap( volatile atomic_address*, void**, void* );
bool atomic_compare_swap_explicit( volatile atomic_address*, void**, void*,
                                   memory_order, memory_order );
void atomic_fence( const volatile atomic_address*, memory_order );
void* atomic_fetch_add( volatile atomic_address*, ptrdiff_t );
void* atomic_fetch_add_explicit( volatile atomic_address*, ptrdiff_t,
                                 memory_order );
void* atomic_fetch_sub( volatile atomic_address*, ptrdiff_t );
void* atomic_fetch_sub_explicit( volatile atomic_address*, ptrdiff_t,
                                 memory_order );

And for each of the integral (character and integer) types listed above,


typedef struct atomic_integral
{
    bool is_lock_free() const volatile;
    void store( integral, memory_order = memory_order_seq_cst ) volatile;
    integral load( memory_order = memory_order_seq_cst ) volatile;
    integral swap( integral,
                   memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( integral&, integral,
                       memory_order, memory_order ) volatile;
    bool compare_swap( integral&, integral,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;
    integral fetch_add( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_sub( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_and( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_or( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_xor( integral,
                        memory_order = memory_order_seq_cst ) volatile;

    atomic_integral() = default;
    constexpr explicit atomic_integral( integral );
    atomic_integral( const atomic_integral& ) = delete;
    atomic_integral& operator =( const atomic_integral & ) = delete;
    integral operator =( integral ) volatile;
    integral operator ++( int ) volatile;
    integral operator --( int ) volatile;
    integral operator ++() volatile;
    integral operator --() volatile;
    integral operator +=( integral ) volatile;
    integral operator -=( integral ) volatile;
    integral operator &=( integral ) volatile;
    integral operator |=( integral ) volatile;
    integral operator ^=( integral ) volatile;
} atomic_integral;

bool atomic_is_lock_free( const volatile atomic_integral* );
void atomic_store( volatile atomic_integral*, integral );
void atomic_store_explicit( volatile atomic_integral*, integral,
                            memory_order );
integral atomic_load( volatile atomic_integral* );
integral atomic_load_explicit( volatile atomic_integral*, memory_order );
integral atomic_swap( volatile atomic_integral*, integral );
integral atomic_swap_explicit( volatile atomic_integral*, integral,
                               memory_order );
bool atomic_compare_swap( volatile atomic_integral*, integral*, integral );
bool atomic_compare_swap_explicit( volatile atomic_integral*, integral*,
                                   integral, memory_order, memory_order );
void atomic_fence( const volatile atomic_integral*, memory_order );
integral atomic_fetch_add( volatile atomic_integral*, integral );
integral atomic_fetch_add_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_sub( volatile atomic_integral*, integral );
integral atomic_fetch_sub_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_and( volatile atomic_integral*, integral );
integral atomic_fetch_and_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_or( volatile atomic_integral*, integral );
integral atomic_fetch_or_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_xor( volatile atomic_integral*, integral );
integral atomic_fetch_xor_explicit( volatile atomic_integral*, integral,
                                    memory_order );

template< TriviallyCopyAssignable T >
    requires AtomicComparable< T >
struct atomic
{
    bool is_lock_free() const volatile;
    void store( T, memory_order = memory_order_seq_cst ) volatile;
    T load( memory_order = memory_order_seq_cst ) volatile;
    T swap( T, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( T&, T, memory_order, memory_order ) volatile;
    bool compare_swap( T&, T,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;

    atomic() = default;
    constexpr explicit atomic( T );
    atomic( const atomic& ) = delete;
    atomic& operator =( const atomic& ) = delete;
    T operator =( T ) volatile;
};

template<typename T>
struct atomic< T* > : atomic_address
{
    T* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    T* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;

    atomic() = default;
constexpr explicit atomic( T* );
    atomic( const atomic& ) = delete;
    atomic& operator =( const atomic& ) = delete;

    T* operator =( T* ) volatile;
    T* operator ++( int ) volatile;
    T* operator --( int ) volatile;
    T* operator ++() volatile;
    T* operator --() volatile;
    T* operator +=( ptrdiff_t ) volatile;
    T* operator -=( ptrdiff_t ) volatile;
};

template<>
struct atomic< integral > : atomic_integral
{
    atomic() = default;
    constexpr explicit atomic( integral );
    atomic( const atomic& ) = delete;
    atomic& operator =( const atomic& ) = delete;

    integral operator =( integral ) volatile;
};

} // namespace std

The standard provides two headers, cstdatomic and stdatomic.h. The cstdatomic header defines the types and functions in namespace std. The stdatomic.h header declares the types and functions in namespace std and has using declarations for those types and functions in the global namespace.
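
For example, a project might keep its atomic declarations in a single header shared between the two languages (a sketch; the file and variable names are illustrative):

/* shared.h : included from both C and C++ translation units */
#include <stdatomic.h>

extern atomic_int counter;                /* identical declaration in C and C++ */

static inline void count_event( void )
{ atomic_fetch_add( &counter, 1 ); }      /* free functions work in both languages */

One translation unit then defines the variable with the common aggregate syntax: atomic_int counter = { 0 };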

Implementation

This proposal embeds an example, minimally-conforming, implementation for both C and C++. The implementation uses a hash table of flags and does busy waiting on the flags.

Notes on the Presentation

The proposal marks the defined interface with the <code> font tag, which typically renders in a teletype font.

The proposal marks the example implementation with the <var> font tag within the <code> font tag, which typically renders in an italic teletype font. This example implementation is not part of the standard; it is evidence of implementability.

The embedded source is a bash script that generates the C and C++ source files. We have taken this approach because the definitions have a high degree of redundancy, which would otherwise interfere with the readability of the document.

To extract the bash script from the HTML source, use the following sed script. (The bash script will also generate the sed script.)


echo n2427.sed
cat <<EOF >n2427.sed

1,/<code>/        d
/<\/code>/,/<code>/    d
            s|<var>||g
            s|</var>||g
            s|&lt;|<|g
            s|&gt;|>|g
            s|&amp;|\&|g

EOF

To compile the enclosed sources and examples, use the following Makefile. (The bash script will also generate the Makefile.)


echo Makefile
cat <<EOF >Makefile

default : test

n2427.bash : n2427.html
	sed -f n2427.sed n2427.html > n2427.bash

stdatomic.h cstdatomic impatomic.h impatomic.c n2427.c : n2427.bash
	bash n2427.bash

impatomic.o : impatomic.h impatomic.c
	gcc -std=c99 -c impatomic.c

n2427.c.exe : n2427.c stdatomic.h impatomic.o
	gcc -std=c99 -o n2427.c.exe n2427.c impatomic.o

n2427.c++.exe : n2427.c stdatomic.h impatomic.o
	g++ -o n2427.c++.exe n2427.c impatomic.o

test : n2427.c.exe n2427.c++.exe

clean :
	rm -f n2427.bash stdatomic.h cstdatomic impatomic.h impatomic.c
	rm -f impatomic.o n2427.c.exe n2427.c++.exe

EOF

Implementation Files

As is common practice, we place the common portions of the C and C++ standard headers in a separate implementation header.

The implementation header includes standard headers to obtain basic typedefs.


echo impatomic.h includes
cat <<EOF >impatomic.h

#ifdef __cplusplus
#include <cstddef>
namespace std {
#else
#include <stddef.h>
#include <stdbool.h>
#endif

EOF

The corresponding implementation source includes the implementation header and stdint.h.


echo impatomic.c includes
cat <<EOF >impatomic.c

#include <stdint.h>
#include "impatomic.h"

EOF

C++0x Features

Because current compilers do not support the new C++0x features, we have surrounded these with a macro to conditionally remove them.


echo impatomic.h CPP0X
cat <<EOF >>impatomic.h

#define CPP0X( feature )

EOF

Memory Order


echo impatomic.h order
cat <<EOF >>impatomic.h

typedef enum memory_order {
    memory_order_relaxed, memory_order_acquire, memory_order_release,
    memory_order_acq_rel, memory_order_seq_cst
} memory_order;

EOF

Flag Type and Operations

To aid the emulated implementation, the example implementation includes a predefined hash table of locks implemented via flags.


echo impatomic.h flag
cat <<EOF >>impatomic.h

typedef struct atomic_flag
{
#ifdef __cplusplus
    bool test_and_set( memory_order = memory_order_seq_cst ) volatile;
    void clear( memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;

    CPP0X( atomic_flag() = default; )
    CPP0X( atomic_flag( const atomic_flag& ) = delete; )
    atomic_flag& operator =( const atomic_flag& ) CPP0X(=delete);

CPP0X(private:)
#endif
    bool __f__;
} atomic_flag;

#define ATOMIC_FLAG_INIT { false }

#ifdef __cplusplus
extern "C" {
#endif

extern bool atomic_flag_test_and_set( volatile atomic_flag* );
extern bool atomic_flag_test_and_set_explicit
( volatile atomic_flag*, memory_order );
extern void atomic_flag_clear( volatile atomic_flag* );
extern void atomic_flag_clear_explicit
( volatile atomic_flag*, memory_order );
extern void atomic_flag_fence
( const volatile atomic_flag*, memory_order );
extern void __atomic_flag_wait__
( volatile atomic_flag* );
extern void __atomic_flag_wait_explicit__
( volatile atomic_flag*, memory_order );
extern volatile atomic_flag* __atomic_flag_for_address__
( const volatile void* __z__ )
__attribute__((const));

#ifdef __cplusplus
}
#endif

#ifdef __cplusplus

inline bool atomic_flag::test_and_set( memory_order __x__ ) volatile
{ return atomic_flag_test_and_set_explicit( this, __x__ ); }

inline void atomic_flag::clear( memory_order __x__ ) volatile
{ atomic_flag_clear_explicit( this, __x__ ); }

inline void atomic_flag::fence( memory_order __x__ ) const volatile
{ atomic_flag_fence( this, __x__ ); }

#endif

EOF

The wait operation may be implemented with busy-waiting, and hence must be used only with care.

The for_address function returns the address of a flag for the given object address. Multiple argument values may map to a single flag, and the implementation of locks may use these flags, so no operation should attempt to acquire any flag or lock while already holding the flag for an address.
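
As a hypothetical illustration of that restriction (hold_two below is not part of the implementation), the following function never returns when its two arguments happen to hash to the same flag:


void hold_two( const volatile void* p, const volatile void* q )
{
    __atomic_flag_wait__( __atomic_flag_for_address__( p ) );
    __atomic_flag_wait__( __atomic_flag_for_address__( q ) );  /* spins forever on a collision */
    /* ... */
    atomic_flag_clear( __atomic_flag_for_address__( q ) );
    atomic_flag_clear( __atomic_flag_for_address__( p ) );
}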

The prototype implementation of flags uses the __sync built-in functions of the GNU C/C++ compiler, when available, and otherwise falls back on a non-atomic implementation, with the expectation that vendors will replace it. The flag might even be implemented with, for example, pthread_mutex_trylock, in which case the internal flag wait function might simply be pthread_mutex_lock. This would, of course, tend to make atomic_flag larger than necessary.

The prototype flag operations are written in C.


echo impatomic.c flag
cat <<EOF >>impatomic.c

#if defined(__GNUC__)
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 0)
#define USE_SYNC
#endif
#endif

bool atomic_flag_test_and_set_explicit
( volatile atomic_flag* __a__, memory_order __x__ )
{
#ifdef USE_SYNC
    if ( __x__ >= memory_order_acq_rel )
        __sync_synchronize();
    return __sync_lock_test_and_set( &(__a__->__f__), 1 );
#else
    bool result = __a__->__f__;
    __a__->__f__ = true;
    return result;
#endif
}

bool atomic_flag_test_and_set( volatile atomic_flag* __a__ )
{ return atomic_flag_test_and_set_explicit( __a__, memory_order_seq_cst ); }

void atomic_flag_clear_explicit
( volatile atomic_flag* __a__, memory_order __x__ )
{
#ifdef USE_SYNC
    __sync_lock_release( &(__a__->__f__) );
    if ( __x__ >= memory_order_acq_rel )
        __sync_synchronize();
#else
    __a__->__f__ = false;
#endif
} 

void atomic_flag_clear( volatile atomic_flag* __a__ )
{ atomic_flag_clear_explicit( __a__, memory_order_seq_cst ); }

void atomic_flag_fence( const volatile atomic_flag* __a__, memory_order __x__ )
{ 
#ifdef USE_SYNC
    __sync_synchronize();
#endif
} 

Note that the following implementation of wait is almost always wrong: the tight test-and-set loop creates high contention. Some form of exponential backoff prevents the excessive contention; a sketch appears after this code block.


void __atomic_flag_wait__( volatile atomic_flag* __a__ )
{ while ( atomic_flag_test_and_set( __a__ ) ); }

void __atomic_flag_wait_explicit__( volatile atomic_flag* __a__,
                                    memory_order __x__ )
{ while ( atomic_flag_test_and_set_explicit( __a__, __x__ ) ); }

#define LOGSIZE 4

static atomic_flag volatile __atomic_flag_anon_table__[ 1 << LOGSIZE ] =
{
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
};

volatile atomic_flag* __atomic_flag_for_address__( const volatile void* __z__ )
{
    uintptr_t __u__ = (uintptr_t)__z__;
    __u__ += (__u__ >> 2) + (__u__ << 4);
    __u__ += (__u__ >> 7) + (__u__ << 5);
    __u__ += (__u__ >> 17) + (__u__ << 13);
    if ( sizeof(uintptr_t) > 4 ) __u__ += (__u__ >> 31);
    __u__ &= ~((~(uintptr_t)0) << LOGSIZE);
    return __atomic_flag_anon_table__ + __u__;
}

EOF
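
The following sketch shows one possible shape for a less contended wait; it is not part of the example implementation, and the name __atomic_flag_wait_backoff__ and the bound of 1024 spins are illustrative assumptions only.


static void __atomic_flag_wait_backoff__( volatile atomic_flag* __a__ )
{
    unsigned __d__ = 1;
    while ( atomic_flag_test_and_set( __a__ ) ) {
        volatile unsigned __i__;
        for ( __i__ = 0; __i__ < __d__; ++__i__ )
            ;                                /* spin away from the flag */
        if ( __d__ < 1024 )
            __d__ *= 2;                      /* exponential backoff */
    }
}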

Implementation Macros

The remainder of the example implementation uses the following macros. These macros exploit GNU extensions for value-returning blocks (also known as statement expressions) and for __typeof__.

The macros rely on the data fields of the atomic structs being named __f__. The other symbols used are __a__ = atomic, __e__ = expected, __f__ = field, __g__ = flag, __m__ = modified, __o__ = operation, __p__ = pointer to field, __q__ = pointer to expected, __r__ = result, __t__ = temporary, __v__ = value (copied to ensure single evaluation), and __x__ and __y__ = memory orderings.


echo impatomic.h macros implementation
cat <<EOF >>impatomic.h

#define _ATOMIC_LOAD_( __a__, __x__ ) \\
({ volatile __typeof__((__a__)->__f__)* __p__ = &((__a__)->__f__); \\
   volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)->__f__) __r__ = *__p__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })

#define _ATOMIC_STORE_( __a__, __m__, __x__ ) \\
({ volatile __typeof__((__a__)->__f__)* __p__ = &((__a__)->__f__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   *__p__ = __v__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __v__; })

#define _ATOMIC_MODIFY_( __a__, __o__, __m__, __x__ ) \\
({ volatile __typeof__((__a__)->__f__)* __p__ = &((__a__)->__f__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)->__f__) __r__ = *__p__; \\
   *__p__ __o__ __v__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })

#define _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ) \\
({ volatile __typeof__((__a__)->__f__)* __p__ = &((__a__)->__f__); \\
   __typeof__(__e__) __q__ = (__e__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   bool __r__; \\
   volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)->__f__) __t__ = *__p__; \\
   if ( __t__ == *__q__ ) { *__p__ = __v__; __r__ = true; } \\
   else { *__q__ = __t__; __r__ = false; } \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })

#define _ATOMIC_FENCE_( __a__, __x__ ) \\
({ volatile __typeof__((__a__)->__f__)* __p__ = &((__a__)->__f__); \\
   volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ ); \\
   atomic_flag_fence( __g__, __x__ ); \\
   })

EOF
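
As an approximate illustration of how the macros compose (this expansion is not itself part of the header), given a volatile atomic_int* __a__, the call _ATOMIC_MODIFY_( __a__, +=, 1, memory_order_relaxed ) behaves like:


volatile int* __p__ = &((__a__)->__f__);                       /* payload address */
volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ );
__atomic_flag_wait_explicit__( __g__, memory_order_relaxed );  /* acquire guard flag */
int __r__ = *__p__;                                            /* old value */
*__p__ += 1;                                                   /* apply the operation */
atomic_flag_clear_explicit( __g__, memory_order_relaxed );     /* release guard flag */
/* __r__ is the value of the whole statement expression */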

Lock-Free Macro

Because the emulation guards every operation with a hashed flag, no atomic type is ever lock-free, and so both feature macros are defined to 0.

echo impatomic.h lock-free macros
cat <<EOF >>impatomic.h

#define ATOMIC_INTEGRAL_LOCK_FREE 0
#define ATOMIC_ADDRESS_LOCK_FREE 0

EOF

Integral and Address Types

The standard defines atomic types corresponding to booleans, addresses, integers, and for C++, wider characters. These atomic types are defined in terms of a base type.

The base types have two names in this proposal, a short name usually embedded within other identifiers, and a long name for the base type. The mapping between them is as follows.


bool="bool"
address="void*"

INTEGERS="char schar uchar short ushort int uint long ulong llong ullong"
char="char"
schar="signed char"
uchar="unsigned char"
short="short"
ushort="unsigned short"
int="int"
uint="unsigned int"
long="long"
ulong="unsigned long"
llong="long long"
ullong="unsigned long long"

CHARACTERS="wchar_t"
# CHARACTERS="char16_t char32_t wchar_t" // char*_t not yet in compilers
char16_t="char16_t"
char32_t="char32_t"
wchar_t="wchar_t"

In addition to types, some operations also need two names, one for embedding within other identifiers, and one consisting of the operator.


ADR_OPERATIONS="add sub"
INT_OPERATIONS="add sub and or xor"
add="+"
sub="-"
and="&"
or="|"
xor="^"

Boolean


echo impatomic.h type boolean
cat <<EOF >>impatomic.h

typedef struct atomic_bool
{
#ifdef __cplusplus
    bool is_lock_free() const volatile;
    void store( bool, memory_order = memory_order_seq_cst ) volatile;
    bool load( memory_order = memory_order_seq_cst ) volatile;
    bool swap( bool, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap ( bool&, bool, memory_order, memory_order ) volatile;
    bool compare_swap ( bool&, bool,
                        memory_order = memory_order_seq_cst) volatile;
    void fence( memory_order ) const volatile;

    CPP0X( atomic_bool() = default; )
    CPP0X( constexpr explicit atomic_bool( bool __v__ ) : __f__( __v__ ) { } )
    CPP0X( atomic_bool( const atomic_bool& ) = delete; )
    atomic_bool& operator =( const atomic_bool& ) CPP0X(=delete);

    bool operator =( bool __v__ ) volatile
    { store( __v__ ); return __v__; }

    friend void atomic_store_explicit( volatile atomic_bool*, bool,
                                       memory_order );
    friend bool atomic_load_explicit( volatile atomic_bool*, memory_order );
    friend bool atomic_swap_explicit( volatile atomic_bool*, bool,
                                      memory_order );
    friend bool atomic_compare_swap_explicit( volatile atomic_bool*, bool*, bool,
                                              memory_order, memory_order );
    friend void atomic_fence( const volatile atomic_bool*, memory_order );

CPP0X(private:)
#endif
    bool __f__;
} atomic_bool;

EOF

Address


echo impatomic.h type address
cat <<EOF >>impatomic.h

typedef struct atomic_address
{
#ifdef __cplusplus
    bool is_lock_free() const volatile;
    void store( void*, memory_order = memory_order_seq_cst ) volatile;
    void* load( memory_order = memory_order_seq_cst ) volatile;
    void* swap( void*, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( void*&, void*, memory_order, memory_order ) volatile;
    bool compare_swap( void*&, void*,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;
    void* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    void* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;

    CPP0X( atomic_address() = default; )
    CPP0X( constexpr explicit atomic_address( void* __v__ ) : __f__( __v__) { } )
    CPP0X( atomic_address( const atomic_address& ) = delete; )
    atomic_address& operator =( const atomic_address & ) CPP0X(=delete);

    void* operator =( void* __v__ ) volatile
    { store( __v__ ); return __v__; }

    void* operator +=( ptrdiff_t __v__ ) volatile
    { return fetch_add( __v__ ); }

    void* operator -=( ptrdiff_t __v__ ) volatile
    { return fetch_sub( __v__ ); }

    friend void atomic_store_explicit( volatile atomic_address*, void*,
                                       memory_order );
    friend void* atomic_load_explicit( volatile atomic_address*, memory_order );
    friend void* atomic_swap_explicit( volatile atomic_address*, void*,
                                       memory_order );
    friend bool atomic_compare_swap_explicit( volatile atomic_address*,
                              void**, void*, memory_order, memory_order );
    friend void atomic_fence( const volatile atomic_address*, memory_order );
    friend void* atomic_fetch_add_explicit( volatile atomic_address*, ptrdiff_t,
                                            memory_order );
    friend void* atomic_fetch_sub_explicit( volatile atomic_address*, ptrdiff_t,
                                            memory_order );

CPP0X(private:)
#endif
    void* __f__;
} atomic_address;

EOF

Integers


echo impatomic.h type integers
for TYPEKEY in ${INTEGERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h

typedef struct atomic_${TYPEKEY}
{
#ifdef __cplusplus
    bool is_lock_free() const volatile;
    void store( ${TYPENAME},
                memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} load( memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} swap( ${TYPENAME},
                      memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( ${TYPENAME}&, ${TYPENAME},
                       memory_order, memory_order ) volatile;
    bool compare_swap( ${TYPENAME}&, ${TYPENAME},
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;
    ${TYPENAME} fetch_add( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_sub( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_and( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_or( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_xor( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;

    CPP0X( atomic_${TYPEKEY}() = default; )
    CPP0X( constexpr atomic_${TYPEKEY}( ${TYPENAME} __v__ ) : __f__( __v__) { } )
    CPP0X( atomic_${TYPEKEY}( const atomic_${TYPEKEY}& ) = delete; )
    atomic_${TYPEKEY}& operator =( const atomic_${TYPEKEY}& ) CPP0X(=delete);

    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }

    ${TYPENAME} operator ++( int ) volatile
    { return fetch_add( 1 ); }

    ${TYPENAME} operator --( int ) volatile
    { return fetch_sub( 1 ); }

    ${TYPENAME} operator ++() volatile
    { return fetch_add( 1 ) + 1; }

    ${TYPENAME} operator --() volatile
    { return fetch_sub( 1 ) - 1; }

    ${TYPENAME} operator +=( ${TYPENAME} __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }

    ${TYPENAME} operator -=( ${TYPENAME} __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }

    ${TYPENAME} operator &=( ${TYPENAME} __v__ ) volatile
    { return fetch_and( __v__ ) & __v__; }

    ${TYPENAME} operator |=( ${TYPENAME} __v__ ) volatile
    { return fetch_or( __v__ ) | __v__; }

    ${TYPENAME} operator ^=( ${TYPENAME} __v__ ) volatile
    { return fetch_xor( __v__ ) ^ __v__; }

    friend void atomic_store_explicit( volatile atomic_${TYPEKEY}*, ${TYPENAME},
                                       memory_order );
    friend ${TYPENAME} atomic_load_explicit( volatile atomic_${TYPEKEY}*,
                                             memory_order );
    friend ${TYPENAME} atomic_swap_explicit( volatile atomic_${TYPEKEY}*,
                                             ${TYPENAME}, memory_order );
    friend bool atomic_compare_swap_explicit( volatile atomic_${TYPEKEY}*,
                      ${TYPENAME}*, ${TYPENAME}, memory_order, memory_order );
    friend void atomic_fence( const volatile atomic_${TYPEKEY}*, memory_order );
    friend ${TYPENAME} atomic_fetch_add_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_sub_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_and_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_or_explicit(  volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_xor_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );

CPP0X(private:)
#endif
    ${TYPENAME} __f__;
} atomic_${TYPEKEY};

EOF
done

Integer Typedefs

The following typedefs support atomic versions of the cstdint and stdint.h types. (The particular mappings below assume a common ILP32/LP64-style environment; an actual implementation would choose mappings that match its platform.)


echo impatomic.h typedefs integers
cat <<EOF >>impatomic.h

typedef atomic_schar atomic_int_least8_t;
typedef atomic_uchar atomic_uint_least8_t;
typedef atomic_short atomic_int_least16_t;
typedef atomic_ushort atomic_uint_least16_t;
typedef atomic_int atomic_int_least32_t;
typedef atomic_uint atomic_uint_least32_t;
typedef atomic_llong atomic_int_least64_t;
typedef atomic_ullong atomic_uint_least64_t;

typedef atomic_schar atomic_int_fast8_t;
typedef atomic_uchar atomic_uint_fast8_t;
typedef atomic_short atomic_int_fast16_t;
typedef atomic_ushort atomic_uint_fast16_t;
typedef atomic_int atomic_int_fast32_t;
typedef atomic_uint atomic_uint_fast32_t;
typedef atomic_llong atomic_int_fast64_t;
typedef atomic_ullong atomic_uint_fast64_t;

typedef atomic_long atomic_intptr_t;
typedef atomic_ulong atomic_uintptr_t;

typedef atomic_long atomic_ssize_t;
typedef atomic_ulong atomic_size_t;

typedef atomic_long atomic_ptrdiff_t;

typedef atomic_llong atomic_intmax_t;
typedef atomic_ullong atomic_uintmax_t;

EOF

Characters


echo impatomic.h type characters
cat <<EOF >>impatomic.h

#ifdef __cplusplus

EOF

for TYPEKEY in ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h

typedef struct atomic_${TYPEKEY}
{
#ifdef __cplusplus
    bool is_lock_free() const volatile;
    void store( ${TYPENAME}, memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} load( memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} swap( ${TYPENAME},
                      memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( ${TYPENAME}&, ${TYPENAME},
                       memory_order, memory_order ) volatile;
    bool compare_swap( ${TYPENAME}&, ${TYPENAME},
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;
    ${TYPENAME} fetch_add( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_sub( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_and( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_or( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_xor( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;

    CPP0X( atomic_${TYPEKEY}() = default; )
    CPP0X( constexpr atomic_${TYPEKEY}( ${TYPENAME} __v__ ) : __f__( __v__ ) { } )
    CPP0X( atomic_${TYPEKEY}( const atomic_${TYPEKEY}& ) = delete; )
    atomic_${TYPEKEY}& operator =( const atomic_${TYPEKEY}& ) CPP0X(=delete);

    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }

    ${TYPENAME} operator ++( int ) volatile
    { return fetch_add( 1 ); }

    ${TYPENAME} operator --( int ) volatile
    { return fetch_sub( 1 ); }

    ${TYPENAME} operator ++() volatile
    { return fetch_add( 1 ) + 1; }

    ${TYPENAME} operator --() volatile
    { return fetch_sub( 1 ) - 1; }

    ${TYPENAME} operator +=( ${TYPENAME} __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }

    ${TYPENAME} operator -=( ${TYPENAME} __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }

    ${TYPENAME} operator &=( ${TYPENAME} __v__ ) volatile
    { return fetch_and( __v__ ) & __v__; }

    ${TYPENAME} operator |=( ${TYPENAME} __v__ ) volatile
    { return fetch_or( __v__ ) | __v__; }

    ${TYPENAME} operator ^=( ${TYPENAME} __v__ ) volatile
    { return fetch_xor( __v__ ) ^ __v__; }

    friend void atomic_store_explicit( volatile atomic_${TYPEKEY}*, ${TYPENAME},
                                       memory_order );
    friend ${TYPENAME} atomic_load_explicit( volatile atomic_${TYPEKEY}*,
                                             memory_order );
    friend ${TYPENAME} atomic_swap_explicit( volatile atomic_${TYPEKEY}*,
                                             ${TYPENAME}, memory_order );
    friend bool atomic_compare_swap_explicit( volatile atomic_${TYPEKEY}*,
                    ${TYPENAME}*, ${TYPENAME}, memory_order, memory_order );
    friend void atomic_fence( const volatile atomic_${TYPEKEY}*, memory_order );
    friend ${TYPENAME} atomic_fetch_add_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_sub_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_and_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_or_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );
    friend ${TYPENAME} atomic_fetch_xor_explicit( volatile atomic_${TYPEKEY}*,
                                                  ${TYPENAME}, memory_order );

CPP0X(private:)
#endif
    ${TYPENAME} __f__;
} atomic_${TYPEKEY};

EOF
done

cat <<EOF >>impatomic.h

#else

typedef atomic_int_least16_t atomic_char16_t;
typedef atomic_int_least32_t atomic_char32_t;
typedef atomic_int_least32_t atomic_wchar_t;

#endif

EOF

Template Types

This section defines the atomic template and its various specializations.

Fully Generic Type

This minimal implementation does not specialize on size.


echo impatomic.h type generic
cat <<EOF >>impatomic.h

#ifdef __cplusplus

template< typename T >
struct atomic
{
#ifdef __cplusplus

    bool is_lock_free() const volatile;
    void store( T, memory_order = memory_order_seq_cst ) volatile;
    T load( memory_order = memory_order_seq_cst ) volatile;
    T swap( T __v__, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( T&, T, memory_order, memory_order ) volatile;
    bool compare_swap( T&, T, memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) const volatile;

    CPP0X( atomic() = default; )
    CPP0X( constexpr explicit atomic( T __v__ ) : __f__( __v__ ) { } )
    CPP0X( atomic( const atomic& ) = delete; )
    atomic& operator =( const atomic& ) CPP0X(=delete);

    T operator =( T __v__ ) volatile
    { store( __v__ ); return __v__; }

CPP0X(private:)
#endif
    T __f__;
};

#endif
EOF
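
A usage sketch follows (generic_example is illustrative only). A type without its own specialization, such as double, uses the fully generic template:


#include <cstdatomic>

std::atomic< double > ad;       // uses the fully generic template

void generic_example()
{
    ad.store( 3.14 );           // guarded by a hashed flag; not lock-free
    double d = ad.load();
    (void)d;
}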

Pointer Partial Specialization


echo impatomic.h type pointer
cat <<EOF >>impatomic.h

#ifdef __cplusplus

template<typename T> struct atomic< T* > : atomic_address
{
    T* load( memory_order = memory_order_seq_cst ) volatile;
    T* swap( T*, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( T*&, T*, memory_order, memory_order ) volatile;
    bool compare_swap( T*&, T*,
                       memory_order = memory_order_seq_cst ) volatile;
    T* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    T* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;

    CPP0X( atomic() = default; )
    CPP0X( constexpr explicit atomic( T* __v__ ) : atomic_address( __v__ ) { } )
    CPP0X( atomic( const atomic& ) = delete; )
    atomic& operator =( const atomic& ) CPP0X(=delete);

    T* operator =( T* __v__ ) volatile
    { store( __v__ ); return __v__; }

    T* operator ++( int ) volatile
    { return fetch_add( 1 ); }

    T* operator --( int ) volatile
    { return fetch_sub( 1 ); }

    T* operator ++() volatile
    { return fetch_add( 1 ) + 1; }

    T* operator --() volatile
    { return fetch_sub( 1 ) - 1; }

    T* operator +=( ptrdiff_t __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }

    T* operator -=( ptrdiff_t __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }
};

#endif
EOF

Integral Full Specializations

We provide full specializations of the generic atomic for booleans, addresses, integers, and characters. These specializations derive from the corresponding specific atomic types, which enables implicit conversion to references and pointers to those types. The implicitly-declared copy assignment operator of each derived class hides the assignment operators inherited from the base class, and so the value assignment operator must be explicitly redeclared.
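
For example (illustration only; conversion_example below is hypothetical), the derivation permits passing an atomic< int > wherever the atomic_int functions are expected:


#include <cstdatomic>

std::atomic< int > ai;

void conversion_example()
{
    atomic_fetch_add( &ai, 1 );   // &ai converts to volatile atomic_int*
    int x = atomic_load( &ai );   // found by argument-dependent lookup
    ai = x;                       // the explicitly redeclared assignment
}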


echo impatomic.h type specializations
cat <<EOF >>impatomic.h

#ifdef __cplusplus

EOF

for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h

template<> struct atomic< ${TYPENAME} > : atomic_${TYPEKEY}
{
    CPP0X( atomic() = default; )
    CPP0X( constexpr explicit atomic( ${TYPENAME} __v__ )
    : atomic_${TYPEKEY}( __v__ ) { } )
    CPP0X( atomic( const atomic& ) = delete; )
    atomic& operator =( const atomic& ) CPP0X(=delete);

    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }
};

EOF
done

cat <<EOF >>impatomic.h

#endif

EOF

C++ Core Functions

In C++, these operations are implemented as overloaded functions.


cat <<EOF >>impatomic.h

#ifdef __cplusplus

EOF

echo impatomic.h functions ordinary basic
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h

inline bool atomic_is_lock_free( const volatile atomic_${TYPEKEY}* __a__ )
{ return false; }

inline ${TYPENAME} atomic_load_explicit
( volatile atomic_${TYPEKEY}* __a__, memory_order __x__ )
{ return _ATOMIC_LOAD_( __a__, __x__ ); }

inline ${TYPENAME} atomic_load( volatile atomic_${TYPEKEY}* __a__ )
{ return atomic_load_explicit( __a__, memory_order_seq_cst ); }

inline void atomic_store_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
{ _ATOMIC_STORE_( __a__, __m__, __x__ ); }

inline void atomic_store
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
{ atomic_store_explicit( __a__, __m__, memory_order_seq_cst ); }

inline ${TYPENAME} atomic_swap_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
{ return _ATOMIC_MODIFY_( __a__, =, __m__, __x__ ); }

inline ${TYPENAME} atomic_swap
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
{ return atomic_swap_explicit( __a__, __m__, memory_order_seq_cst ); }

inline bool atomic_compare_swap_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME}* __e__, ${TYPENAME} __m__,
  memory_order __x__, memory_order __y__ )
{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ); }

inline bool atomic_compare_swap
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME}* __e__, ${TYPENAME} __m__ )
{ return atomic_compare_swap_explicit( __a__, __e__, __m__,
                 memory_order_seq_cst, memory_order_seq_cst ); }

inline void atomic_fence
( const volatile atomic_${TYPEKEY}* __a__, memory_order __x__ )
{ _ATOMIC_FENCE_( __a__, __x__ ); }

EOF
done

echo impatomic.h functions address fetch
TYPEKEY=address
TYPENAME=${!TYPEKEY}

for FNKEY in ${ADR_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat <<EOF >>impatomic.h

inline ${TYPENAME} atomic_fetch_${FNKEY}_explicit
( volatile atomic_${TYPEKEY}* __a__, ptrdiff_t __m__, memory_order __x__ )
{ ${TYPENAME} volatile* __p__ = &((__a__)->__f__);
  volatile atomic_flag* __g__ = __atomic_flag_for_address__( __p__ );
  __atomic_flag_wait_explicit__( __g__, __x__ );
  ${TYPENAME} __r__ = *__p__;
  *__p__ = (${TYPENAME})((char*)(*__p__) ${OPERATOR} __m__);
  atomic_flag_clear_explicit( __g__, __x__ );
  return __r__; }

inline ${TYPENAME} atomic_fetch_${FNKEY}
( volatile atomic_${TYPEKEY}* __a__, ptrdiff_t __m__ )
{ return atomic_fetch_${FNKEY}_explicit( __a__, __m__, memory_order_seq_cst ); }

EOF
done

echo impatomic.h functions integer fetch
for TYPEKEY in ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}

for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat <<EOF >>impatomic.h

inline ${TYPENAME} atomic_fetch_${FNKEY}_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
{ return _ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, __x__ ); }

inline ${TYPENAME} atomic_fetch_${FNKEY}
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
{ return atomic_fetch_${FNKEY}_explicit( __a__, __m__, memory_order_seq_cst ); }

EOF
done
done

C Core Macros

For C, we need type-generic macros.


cat <<EOF >>impatomic.h

#else

EOF

echo impatomic.h type-generic macros basic
cat <<EOF >>impatomic.h

#define atomic_is_lock_free( __a__ ) \\
false

#define atomic_load( __a__ ) \\
_ATOMIC_LOAD_( __a__, memory_order_seq_cst )

#define atomic_load_explicit( __a__, __x__ ) \\
_ATOMIC_LOAD_( __a__, __x__ )

#define atomic_store( __a__, __m__ ) \\
_ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst )

#define atomic_store_explicit( __a__, __m__, __x__ ) \\
_ATOMIC_STORE_( __a__, __m__, __x__ )

#define atomic_swap( __a__, __m__ ) \\
_ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst )

#define atomic_swap_explicit( __a__, __m__, __x__ ) \\
_ATOMIC_MODIFY_( __a__, =, __m__, __x__ )

#define atomic_compare_swap( __a__, __e__, __m__ ) \\
_ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst )

#define atomic_compare_swap_explicit( __a__, __e__, __m__, __x__, __y__ ) \\
_ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ )

#define atomic_fence( __a__, __x__ ) \\
({ _ATOMIC_FENCE_( __a__, __x__ ); })

EOF

echo impatomic.h type-generic macros fetch
for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat <<EOF >>impatomic.h

#define atomic_fetch_${FNKEY}_explicit( __a__, __m__, __x__ ) \\
_ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, __x__ )

#define atomic_fetch_${FNKEY}( __a__, __m__ ) \\
_ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, memory_order_seq_cst )

EOF
done

cat <<EOF >>impatomic.h

#endif

EOF
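
For illustration only (this fragment is not part of the generated headers), the type-generic macros let the same spelling apply to any atomic type in C:


#include "stdatomic.h"

atomic_int ai;
atomic_ulong aul;

void c_macro_example( void )
{
    int i = atomic_load( &ai );             /* expands to _ATOMIC_LOAD_ */
    unsigned long u = atomic_load( &aul );  /* same macro, different type */
    atomic_fetch_add( &aul, 1 );
    (void)i; (void)u;
}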

C++ Methods

The core free functions are cumbersome to use directly, and so the proposal includes member function definitions that are syntactically lighter. The member operators are defined in the class definitions.
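
For example (illustration only), compare the two spellings of the same relaxed increment:


#include <cstdatomic>

std::atomic_int counter;

void spelling_example()
{
    atomic_fetch_add_explicit( &counter, 1, std::memory_order_relaxed );
    counter.fetch_add( 1, std::memory_order_relaxed );  // equivalent, terser
}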


cat <<EOF >>impatomic.h

#ifdef __cplusplus

EOF

echo impatomic.h methods ordinary basic
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}

cat <<EOF >>impatomic.h

inline bool atomic_${TYPEKEY}::is_lock_free() const volatile
{ return false; }

inline void atomic_${TYPEKEY}::store
( ${TYPENAME} __m__, memory_order __x__ ) volatile
{ atomic_store_explicit( this, __m__, __x__ ); }

inline ${TYPENAME} atomic_${TYPEKEY}::load
( memory_order __x__ ) volatile
{ return atomic_load_explicit( this, __x__ ); }

inline ${TYPENAME} atomic_${TYPEKEY}::swap
( ${TYPENAME} __m__, memory_order __x__ ) volatile
{ return atomic_swap_explicit( this, __m__, __x__ ); }

inline bool atomic_${TYPEKEY}::compare_swap
( ${TYPENAME}& __e__, ${TYPENAME} __m__,
  memory_order __x__, memory_order __y__ ) volatile
{ return atomic_compare_swap_explicit( this, &__e__, __m__, __x__, __y__ ); }

inline bool atomic_${TYPEKEY}::compare_swap
( ${TYPENAME}& __e__, ${TYPENAME} __m__, memory_order __x__ ) volatile
{ return atomic_compare_swap_explicit( this, &__e__, __m__, __x__,
      __x__ == memory_order_acq_rel ? memory_order_acquire :
      __x__ == memory_order_release ? memory_order_relaxed : __x__ ); }

inline void atomic_${TYPEKEY}::fence
( memory_order __x__ ) const volatile
{ return atomic_fence( this, __x__ ); }

EOF
done

echo impatomic.h methods template basic
cat <<EOF >>impatomic.h

template< typename T >
inline bool atomic<T>::is_lock_free() const volatile
{ return false; }

template< typename T >
inline void atomic<T>::store( T __v__, memory_order __x__ ) volatile
{ _ATOMIC_STORE_( this, __v__, __x__ ); }

template< typename T >
inline T atomic<T>::load( memory_order __x__ ) volatile
{ return _ATOMIC_LOAD_( this, __x__ ); }

template< typename T >
inline T atomic<T>::swap( T __v__, memory_order __x__ ) volatile
{ return _ATOMIC_MODIFY_( this, =, __v__, __x__ ); }

template< typename T >
inline bool atomic<T>::compare_swap
( T& __r__, T __v__, memory_order __x__, memory_order __y__ ) volatile
{ return _ATOMIC_CMPSWP_( this, &__r__, __v__, __x__ ); }

template< typename T >
inline bool atomic<T>::compare_swap
( T& __r__, T __v__, memory_order __x__ ) volatile
{ return compare_swap( __r__, __v__, __x__,
      __x__ == memory_order_acq_rel ? memory_order_acquire :
      __x__ == memory_order_release ? memory_order_relaxed : __x__ ); }

EOF

echo impatomic.h methods address fetch
TYPEKEY=address
TYPENAME=${!TYPEKEY}

cat <<EOF >>impatomic.h

inline void* atomic_address::fetch_add
( ptrdiff_t __m__, memory_order __x__ ) volatile
{ return atomic_fetch_add_explicit( this, __m__, __x__ ); }

inline void* atomic_address::fetch_sub
( ptrdiff_t __m__, memory_order __x__ ) volatile
{ return atomic_fetch_sub_explicit( this, __m__, __x__ ); }

EOF

echo impatomic.h methods integer fetch
for TYPEKEY in ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}

for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}

cat <<EOF >>impatomic.h

inline ${TYPENAME} atomic_${TYPEKEY}::fetch_${FNKEY}
( ${TYPENAME} __m__, memory_order __x__ ) volatile
{ return atomic_fetch_${FNKEY}_explicit( this, __m__, __x__ ); }

EOF
done
done

echo impatomic.h methods pointer fetch
cat <<EOF >>impatomic.h

template< typename T >
T* atomic<T*>::load( memory_order __x__ ) volatile
{ return static_cast<T*>( atomic_address::load( __x__ ) ); }

template< typename T >
T* atomic<T*>::swap( T* __v__, memory_order __x__ ) volatile
{ return static_cast<T*>( atomic_address::swap( __v__, __x__ ) ); }

template< typename T >
bool atomic<T*>::compare_swap
( T*& __r__, T* __v__, memory_order __x__, memory_order __y__) volatile
{ return atomic_address::compare_swap( *reinterpret_cast<void**>( &__r__ ),
               static_cast<void*>( __v__ ), __x__, __y__ ); }
//{ return _ATOMIC_CMPSWP_( this, &__r__, __v__, __x__ ); }

template< typename T >
bool atomic<T*>::compare_swap
( T*& __r__, T* __v__, memory_order __x__ ) volatile
{ return compare_swap( __r__, __v__, __x__,
      __x__ == memory_order_acq_rel ? memory_order_acquire :
      __x__ == memory_order_release ? memory_order_relaxed : __x__ ); }

template< typename T >
T* atomic<T*>::fetch_add( ptrdiff_t __v__, memory_order __x__ ) volatile
{ return static_cast<T*>( atomic_fetch_add_explicit( this, sizeof(T) * __v__, __x__ ) ); }

template< typename T >
T* atomic<T*>::fetch_sub( ptrdiff_t __v__, memory_order __x__ ) volatile
{ return static_cast<T*>( atomic_fetch_sub_explicit( this, sizeof(T) * __v__, __x__ ) ); }

EOF

cat <<EOF >>impatomic.h

#endif

EOF

Implementation Header Cleanup


echo impatomic.h close namespace
cat <<EOF >>impatomic.h

#ifdef __cplusplus
} // namespace std
#endif

EOF

Standard Headers

The C standard header.


echo stdatomic.h
cat <<EOF >stdatomic.h

#include "impatomic.h"

#ifdef __cplusplus

EOF

for TYPEKEY in flag bool address ${INTEGERS} ${CHARACTERS}
do
cat <<EOF >>stdatomic.h

using std::atomic_${TYPEKEY};

EOF
done

cat <<EOF >>stdatomic.h

using std::atomic;
using std::memory_order;
using std::memory_order_relaxed;
using std::memory_order_acquire;
using std::memory_order_release;
using std::memory_order_acq_rel;
using std::memory_order_seq_cst;

#endif

EOF

The C++ standard header.


echo cstdatomic
cat <<EOF >cstdatomic

#include "impatomic.h"

EOF

Examples of Use

The following program shows example uses of the atomic types, in both C and C++. These examples also serve as tests for the interface definition.


echo n2427.c include
cat <<EOF >n2427.c

#include "stdatomic.h"

EOF

Flag

We show two styles of use: global functions with explicit memory ordering, and member functions with the default (implicit) memory ordering.


echo n2427.c flag
cat <<EOF >>n2427.c

atomic_flag af = ATOMIC_FLAG_INIT;

void flag_example( void )
{
    if ( ! atomic_flag_test_and_set_explicit( &af, memory_order_acquire ) )
        atomic_flag_clear_explicit( &af, memory_order_release );
#ifdef __cplusplus
    if ( ! af.test_and_set() )
        af.clear();
#endif
}

EOF

Lazy Initialization

For lazy initialization, a thread that does not perform the initialization may need to wait on the thread that does. (Lazy initialization is similar to double-checked locking.) For this example, we busy-wait on a boolean. Busy waiting like this is usually ill-advised, but it suffices for the example. There are three variants of the example: one using strong C++ operators and methods, one using weak C functions, and one using fence-based C++ operators and methods.


echo n2427.c lazy
cat <<EOF >>n2427.c

atomic_bool lazy_ready = { false };
atomic_bool lazy_assigned = { false };
int lazy_value;

#ifdef __cplusplus

int lazy_example_strong_cpp( void )
{
    if ( ! lazy_ready.load() ) {
        /* the value is not yet ready */
        if ( lazy_assigned.swap( true ) ) {
            /* initialization assigned to another thread; wait */
            while ( ! lazy_ready.load() );
        }
        else {
            lazy_value = 42;
            lazy_ready = true;
        }
    }
    return lazy_value;
}

#endif

int lazy_example_weak_c( void )
{
    if ( ! atomic_load_explicit( &lazy_ready, memory_order_acquire ) ) {
        if ( atomic_swap_explicit( &lazy_assigned, true,
                                   memory_order_relaxed ) ) {
            while ( ! atomic_load_explicit( &lazy_ready,
                                            memory_order_acquire ) );
        }
        else {
            lazy_value = 42;
            atomic_store_explicit( &lazy_ready, true, memory_order_release );
        }
    }
    return lazy_value;
}

#ifdef __cplusplus

int lazy_example_fence_cpp( void )
{
    if ( lazy_ready.load( memory_order_relaxed ) )
        lazy_ready.fence( memory_order_acquire );
    else if ( lazy_assigned.swap( true, memory_order_relaxed ) ) {
        while ( ! lazy_ready.load( memory_order_relaxed ) );
        lazy_ready.fence( memory_order_acquire );
    }
    else {
        lazy_value = 42;
        lazy_ready.store( true, memory_order_release );
    }
    return lazy_value;
}

#endif

EOF

Integer


echo n2427.c integer
cat <<EOF >>n2427.c

atomic_ulong volatile aulv = { 0 };
atomic_ulong auln = { 1 };
#ifdef __cplusplus
atomic< unsigned long > taul CPP0X( { 3 } );
#endif

void integer_example( void )
{
    atomic_ulong a = { 3 };
    unsigned long x = atomic_load( &auln );
    atomic_store_explicit( &aulv, x, memory_order_release );
    unsigned long y = atomic_fetch_add_explicit( &aulv, 1,
                                                 memory_order_relaxed );
    unsigned long z = atomic_fetch_xor( &auln, 4 );
#ifdef __cplusplus
    // x = auln; // implicit conversion disallowed
    x = auln.load();
    aulv = x;
    auln += 1;
    aulv ^= 4;
    // auln = aulv; // uses a deleted operator
    aulv -= auln++;
    auln |= --aulv;
    aulv &= 7;
    atomic_store_explicit( &taul, 7, memory_order_release );
    x = taul.load( memory_order_acquire );
    y = atomic_fetch_add_explicit( & taul, 1, memory_order_acquire );
    z = atomic_fetch_xor( & taul, 4 );
    x = taul.load();
    // auln = taul; // uses a deleted operator
    // taul = aulv; // uses a deleted operator
    taul = x;
    taul += 1;
    taul ^= 4;
    taul -= taul++;
    taul |= --taul;
    taul &= 7;
#endif
}

EOF

Note that because taul is not a volatile variable, the compiler would be permitted to merge the last six statements.

Event Counter

An event counter is not itself part of the communication between threads (some other mechanism, such as a thread join, establishes the necessary ordering), and so it can use faster primitives.


echo n2427.c event
cat <<EOF >>n2427.c

#ifdef __cplusplus

struct event_counter
{
    void inc() { au.fetch_add( 1, memory_order_relaxed ); }
    unsigned long get() { return au.load( memory_order_relaxed ); }
    atomic_ulong au;
};
event_counter ec = { 0 };

void generate_events()
{
    ec.inc();
    ec.inc();
    ec.inc();
}

int read_events()
{
    return ec.get();
}

int event_example()
{
    generate_events(); // possibly in multiple threads
    // join all other threads, ensuring that final value is written
    return read_events();
}

#endif

EOF

An important point here is that this is safe, and we are guaranteed to see exactly the final value, because the thread joins force the necessary ordering between the inc calls and the get call.

List Insert

Insertion into a shared linked list can be accomplished in a lock-free manner with compare-and-swap, provided that compare-and-swap is itself lock-free. (Note that adding a correct "remove" operation is harder than it seems: removing a node from the list does not imply that there are no outstanding accesses to it, so any subsequent modification or deallocation of the node might constitute a race condition.)

In both of the following examples, a comparison failure in the compare-and-swap will update candidate->next with the current value of head.


echo n2427.c list
cat <<EOF >>n2427.c

#ifdef __cplusplus

struct data;
struct node
{
    node* next;
    data* value;
};

atomic< node* > head CPP0X( { (node*)0 } );

void list_example_strong( data* item )
{
    node* candidate = new node;
    candidate->value = item;
    candidate->next = head.load();
    while ( ! head.compare_swap( candidate->next, candidate ) );
}

void list_example_weak( data* item )
{
    node* candidate = new node;
    candidate->value = item;
    candidate->next = head.load( memory_order_relaxed );
    while ( ! head.compare_swap( candidate->next, candidate,
                                 memory_order_release, memory_order_relaxed ) );
}

#endif

EOF

Update

The best algorithm for updating a variable may depend on whether or not atomics are lock-free. In the example below, this update can be accomplished in a lock-free manner with compare-and-swap when atomic integrals are lock-free, but may require other mechanisms when atomic integrals are not lock-free. This example uses the feature macro to generate minimal code when the lock-free status is known a priori.


echo n2427.c update
cat <<EOF >>n2427.c

#if ATOMIC_INTEGRAL_LOCK_FREE <= 1
atomic_flag pseudo_mutex = ATOMIC_FLAG_INIT;
unsigned long regular_variable = 1;
#endif
#if ATOMIC_INTEGRAL_LOCK_FREE >= 1
atomic_ulong atomic_variable = { 1 };
#endif

void update()
{
#if ATOMIC_INTEGRAL_LOCK_FREE == 1
    if ( atomic_is_lock_free( &atomic_variable ) ) {
#endif
#if ATOMIC_INTEGRAL_LOCK_FREE > 0
        unsigned long full = atomic_load( &atomic_variable );
        unsigned long half = full / 2;
        while ( ! atomic_compare_swap( &atomic_variable, &full, half ) )
            half = full / 2;
#endif
#if ATOMIC_INTEGRAL_LOCK_FREE == 1
    } else {
#endif
#if ATOMIC_INTEGRAL_LOCK_FREE < 2
        __atomic_flag_wait__( &pseudo_mutex );
        regular_variable /= 2 ;
        atomic_flag_clear( &pseudo_mutex );
#endif
#if ATOMIC_INTEGRAL_LOCK_FREE == 1
    }
#endif
}

EOF

Main


echo n2427.c main
cat <<EOF >>n2427.c

int main()
{
}

EOF