Doc. No.:	WG21/N2047 J16/06-0117
Date:	2006-06-24
Reply to:	Hans-J. Boehm
Phone:	+1-650-857-3406
Email:	Hans.Boehm@hp.com

An Atomic Operations Library for C++

We present a brief design rationale and a proposed interface for a C++ atomic operations library. This design has benefited from discussions at the Berlin meeting.

Unlike N1875, and for reasons discussed below, we propose to add atomic operations purely as a library API. In practice, this API would have to be implemented largely with either compiler intrinsics or assembly code. We believe that in most cases intrinsics or assembler support sufficient for a prototype implementation already exists.

The memory model proposal (N1942) assumes the existence of something along these lines.

A Rationale

Here are some of the arguments for different aspects of the current atomics package design.

Provide atomics purely as a library API

Function call syntax appears to provide a strong and useful hint to the programmer as to which operations are executed atomically. Only the function invocation itself is atomic; argument evaluation is not part of the atomic action. It seems to be harder, not easier, to make this clear with some kind of atomic block syntax.

An atomic block syntax seems far more appropriate to express general atomic memory transactions. It is premature to standardize those at this point.

This part of our approach is consistent with exiting practice, which often provides for somewhat platform-dependent compiler intrinsics or inline assembly functions to express atomic operations.

Allow relaxed ordering specifications

We choose to complicate the interface by adding explicit ordering specifications to various operations. Many comparable packages don't, and instead provide only a single version of operations like compare-and-swap, which implicitly include a full memory fence.

Unfortunately, the extra ordering constraint introduced by the single version is almost never completely necessary. For example, an atomic increment operation may be used simply to count the number of times a function is called, as in a profiler. This requires atomicity, but no ordering constraints. And on many architectures (PowerPC, Itanium, Alpha, though not X86), the extra ordering constraints are at least moderately expensive.

It is also unclear how a convention requiring full memory fences would properly interact with an interface that supported simple atomic loads and stores. Here a full memory fence would generally multiply the cost by a large factor. (The current gcc interface does not seem to support simple atomic loads and stores explicitly, which makes it unclear how to support e.g. lock-based emulation, or architectures on which the relevant loads and stores are not implicitly atomic.)

Include ordering constraints with operations

There are two possible approaches to specifying ordering constraints:

Have the programmer provide explicit memory fences/barriers, perhaps most usefully in a way that's analogous to the SPARC membar instructions.
Associate the ordering semantics with operations. The closest hardware analog for this is probably Itanium, though we carry this through more consistently.

Both approaches appear to have their merits. We chose the latter for several reasons:

On architectures such as X86 and Itanium, it can lead to substantially faster code, at least in the absence of complex compiler analysis. For example, on X86, a lock is often released with a simple store instruction, which is widely believed to effectively have implicit "release" semantics. If this appears in the source as store<release> it is easy to simply map that to a store instruction. If it is expressed as a fence followed by a store, the compiler would have to deduce that the fence is redundant. It is unclear that that can be done under realistic conditions, since the fence prevents operations from moving into the critical region, while a simple store does not do so for loads. On Itanium a similar situation arises with the compare-and-swap operation.
It seems to be marginally more convenient. For example, double-checked locking can be easily written with load_acquire and store_release, with no explicit barriers. (The semantics of the fence version are also unnecessarily stronger, causing unnecessary overhead on Itanium. We are not aware of common examples where the reverse is true for other architectures.)
It gives us an easy way to express that atomic loads and stores "normally" have acquire and release semantics, but that weaker options may exist. It is important to encourage the acquire/release versions, since they behave with respect to dependencies in the way that essentially all programmers expect, while remaining easily definable. The unordered variants can be very counterintuitive.
It makes it harder to ignore ordering issues. Ignoring them is almost always a mistake at this level.

Provide both direct and possibly emulated versions

In some cases, both the decision to use a lock-free algorithm, and sometimes the choice of lock-free algorithm, depends on the availability of underlying hardware primitives. For these cases, we provide both feature-tests and direct access to the underlying hardware primitives as part of the native_atomic template.

In other cases, e.g. when dealing with asynchronous signals, it may be important to know that operations like compare-and-swap are really lock-free, because a lock-based emulation might result in deadlock. Again direct access to the hardware primitives is called for.

In other cases, implementors know that sufficient hardware primitives are available on the architectures important to them, but would like the code to be minimally portable to other architectures without further work. We provide atomic to handle this case.

Provide a minimal set of ordering constraints

Clearly there is a reason to keep the interface to the atomics library as simple as possible. However, there have been repeated calls for adding additional ordering constraints. There are probably cases in which these would result in some limited performance benefit. We have so far not included these for the following reasons:

Some architectures provide fences that are limited to loads or stores. We have, so far, not included them, since it seems to be hard to find cases in which both:

Such limited ordering constraints are useful and not excessively brittle.
Actually result in a performance benefit over the more general constraint.

Most architectures provide additional ordering guarantees if one memory operation is dependent on another. In fact, these are critical for efficient implementation of languages like Java. They are not reflected in our proposal (except for the "depends-on" implications from the memory model).

In this case, there is near-universal agreement that it would be nice to have some such guarantees. The difficulty is that we have not been able to formulate such a guarantee that both makes sense at the C++ source level, and does not interfere with optimization of code in compilation units that do not mention atomics. The fundamental issues are:

Compilers may remove or change data and/or control dependencies.
Detailed guarantees vary across architectures.

Pass ordering constrains as template arguments

This seems to be the best compromise between insisting on ordering constraints to be static and avoiding repeated reimplementation of higher level functions that differ only in ordering constraints.

Provide two separate `atomic_int` and `atomic_ptr` variants.

An earlier version of our proposal passed the underlying atomic operations class as a template parameter to atomic_int, avoiding the introduction of native_atomic_int. This is purely a taste issue, but that level of template parametrization was generally viewed as excessive by the C++ committee subgroup that considered it.

Provide primarily "dynamic" feature tests

Our feature tests generally do not guarantee that they can be evaluated at compile time. This allows the resulting code to directly handle the common situation in which a single executable is generated for several architectural variants which differ in the atomic operations they directly support.

The low level interface does make one exception here, in that it makes it possible to test statically that no variants of a platform provide certain operations. That makes it possible to use a different data structure in cases in which no dynamic test is ever necessary.

The API

We present the proposed API for C++ atomics in header file format, for now:

// The expectation is that this will usually be implemented in terms of
// low_level_atomics.h by checking whether the template argument has
// sufficient size and alignment constraints that it can be safely
// cast to a primitive integer type, and one of the low level primitives
// can be applied.

enum ordering_constraint {raw, acquire, release, ordered};
	// Informally:
	// raw ==> This operation is unordered, and may become
	//          visible to other threads in an order that is
	//          constrained only by ordering constraints on other
	//          operations.
	// release ==> All prior memory operations (including ordinary
	//          assignments) become visible to a an acquire
	//          operation on the same object that sees the resulting
	//          value.
	// acquire ==> See above.
	// ordered ==> Both acquire and release ordering properties.

// This version gives direct access to the hardware primitives, and fails
// if they don't exist.  As a result, it should be OK for inter-process
// and signal-handler communication, though that's beyond the standard.
template <class T>
class native_atomic {
    public:
	static bool basics_supported();
		// Are load/store primitives supported?
	native_atomic(T);
		// No ordering semantics, constructor itself not atomic.
	
	// The following may fail if basics_supported() returns false.
	// A store_release, or an atomic update with a release
	// argument, "synchronizes with" a load_acquire or an atomic
	// update with an acquire argument.  There are no other
	// such relationships.
	template <ordering_constraint c>
	    void store(const T&);
	    // Compile-time error if c is acquire.
	template <ordering_constraint c>
	    T load();
	    // Compile-time error if c is neither none nor acquire.
	
	static bool cas_supported();
		// Is compare_and_swap supported?
		// If so, the various fetch_and_... primitives
		// are also presumed to be supported for numeric
		// types T, since they
		// can be emulated with a CAS loop.
		// Cas is not guaranteed to be
		// wait-free, though it should be if the hardware
		// provides them.
	// There was a suggestion that the above be static, for this
	// version only.  I'm not sure that's right, given that portable
	// code should probably be prepared to adjust dynamically.

	// Compare-and-swap.  Does not fail spuriously.  Not wait-free
	// on ll-sc machines.
	template <ordering_constraint c>
	    bool cas(const T& old, const T& new_val);
	static bool cas_is_wait_free();

	// Compare-and-swap.  May fail spuriously.  Wait-free
	// on ll-sc machines.
	template <ordering_constraint c>
	    bool weak_cas(const T& old, const T& new_val);
	static bool weak_cas_is_wait_free();

	// I'm inclined to restrict double-width operations to
	// the low level interface, if we provide them at all.
	// They're very difficult to use,
	// due to architectural variation, and would mess up this
	// interface.
	
	T operator T() { return load<acquire>(); }
	void operator=(const T& x) { return store<release>(x); }
};

// The following provides the same interface, but the primitives
// always have functional implementations, possibly because they are
// emulated with locks.
// The implementation should avoid emulation whenever the hardware
// provides suitable primitives.
// We expect that the canonical implementation will provide a static
// hash table of locks, and map each address to a location in the hash
// table.
// Since the implementation may be lock-based, this version is NOT useful
// for either signal-handler or inter-process communication.  (Again,
// this is beyond the scope of the standard, but non-normative text
// should make that clear to avoid accidents.)  It is only useful
// for inter-thread communication, which would like to avoid the overhead of
// a full lock in most cases, but needs to run everywhere to some
// extent.  Based on limited experience, we nonetheless believe
// that this is a common case.

template <class T>
class atomic {
    public:
	static bool basics_supported();
		// Always yields true.
	atomic(T);
		// No ordering semantics, constructor not atomic.
	template <ordering_constraint c>
	    void store(const T&);
	    // Compile-time error if c is neither none nor release.
	template <ordering_constraint c>
	    T load();
	    // Compile-time error if c is neither none nor acquire.
	
	static bool cas_supported();
		// Always yields true.
		// Lock free if native_atomic::cas_supported()
		// yields true.

	// Compare-and-swap.  Does not fail spuriously.  Not wait-free
	// on ll-sc machines.
	template <ordering_constraint c>
	    bool cas(const T& old, const T& new_val);
	static bool cas_is_wait_free();

	// Compare-and-swap.  May fail spuriously.  Wait-free
	// on ll-sc machines.
	template <ordering_constraint c>
	    bool weak_cas(const T& old, const T& new_val);
	static bool weak_cas_is_wait_free();

	T operator T() { return load<acquire>(); }
	void operator=(const T& x) { return store<release>(x); }


// Atomic integral data type with fetch_and_... operations.
// Meaningful unly if the argument T is an integral type.  
template <class T=int>
class native_atomic_int : public native_atomic<T> {
    public:
	native_atomic_int(T);
		// No ordering semantics, not atomic.
	// The fetch_op functions may fail if cas_supported() yields false.
	// These are also not guaranteed to be wait-free, and hence
	// can be trivially emulated with cas.  We provide them
	// directly for convenience, and since they may have slightly
	// faster implementations.
	// They return the original value of the atomic.
	template <ordering_constraint c>
	    T fetch_add(T);
	static bool fetch_add_is_wait_free();
		// Is the operation wait-free?
	template <ordering_constraint c>
	    T fetch_and(T);
	static bool fetch_and_is_wait_free();
	template <ordering_constraint c>
	    T fetch_or(T);
	static bool fetch_or_is_wait_free();
};

template <class T=int>
class atomic_int : public atomic<T> {
    public:
	atomic_int(T);
		// No ordering semantics, not atomic.
	// The fetch_op functions always succeed, but are guaranteed
	// to be lock-free only id native_atomic::cas_supported yields
	// true.
	template <ordering_constraint c>
	    T fetch_add(T);
	static bool fetch_add_is_wait_free();
		// Is the operation wait-free?
	template <ordering_constraint c>
	    T fetch_and(T);
	static bool fetch_and_is_wait_free();
	template <ordering_constraint c>
	    T fetch_or(T);
	static bool fetch_or_is_wait_free();
};

// Adds fetch_and_... operations to a pointer type.
// Meaningful unly if T is a pointer type.
template <class T>
class native_atomic_ptr : public native_atomic<T> {
    public:
	native_atomic_ptr(T);
		// No ordering semantics, not atomic.
	// The following may fail if native_atomic<T>::cas_supported()
	// yields false.
	// This atomic pointer addition or
	// subtraction, i.e. a multiple of the object size is added
	// to the address.
	// Also by analogy to basic pointer types, we do not directly
	// provide for tags.
	template <ordering_constraint c> T fetch_and_add(ptr_diff_t);
	static bool fetch_add_is_wait_free();
};

// Adds fetch_and_... operations to a pointer type.
// Meaningful unly if T is a pointer type.
template <class T>
class atomic_ptr : public atomic<T> {
    public:
	atomic_ptr(T);
		// No ordering semantics, not atomic.
	// Always succeeds, since it may be emulated with locks.
	template <ordering_constraint c> T fetch_and_add(ptr_diff_t);
	static bool fetch_add_is_wait_free();
};

The low level API

There appears to be agreement that we also need a lower level C-compatible API for expressing atomic operations. The details are less clear. In particular, based on earlier discussions, there is a choice to be made as to whether C-level atomic operations should operate only on specially declared "atomic data", or should be usable on arbitrary scalars.

The former design potentially provides better compatibility with the higher level C++ API discussed above. But it makes it significantly harder to convert existing code to the new API, since existing interfaces, such as the Intel/gcc __sync_... intrinsics, often operate on scalar data that has not been specially declared, and may thus be used to operate on fields of previously declared structures.

(Although the Intel/gcc primitives are somewhat established, and are in many ways similar to what is proposed here, I do not advocate standardizing them as is. They provide far less control over memory ordering than we advocated above. For example, they provide no way to atomically increment a counter without imposing unnecessary ordering constraints. The lack of appropriate ordering control appears to already have resulted in implementation shortcuts, e.g. gcc does not implement __sync_synchronize() as a full memory barrier on X86, in spite of the documentation. I believe a number of issues were not fully understood when that design needs as developed, and it could could greatly benefit from another iteration at this stage.)

Here we propose an interface similar to what was tentatively discussed at the Berlin meeting. In order to facilitate conversion of existing code, we dropped the previously required volatile qualifiers on arguments.

Again, this is presented in header file format for now.

// This is an attempt at providing low-level C style
// atomics.  This does not follow existing precedent, in that we are
// very explicit about ordering constraints.

// All primitives have corresponding feature tests.
// Each feature test macro is either:
//   - undefined if no instance of the platform defines the feature, or
//   - defined to a possibly non-constant (in the C/C++ sense) expression
//     which evaluates to a nonzero value if the feature is present on the
//     current machine.
// For example, ATOMIC_HAVE_INT_CAS would be undefined on an architecture
// such as PA-RISC, which does not provide a hardware compare-and-swap
// operation, but it would be defined to a runtime expression, which tests
// that the processor is at least a 486 when compiling generic (386 or better)
// X86 code.  On an architecture like Itanium, where all instances provide
// the operation, it would be defined to a nonzero constant such as 1.
//
// Thus most uses of these macros would be of the form
//      # ifndef ATOMIC_HAVE_...
//	#   define ATOMIC_HAVE_... 0
//	# endif
//
// 	if (ATOMIC_HAVE_...)
//	  < code that uses atomic operations >
//      else
//	  < lock-based code >
//
// #ifndef ATOMIC_HAVE_...
// can be used to substitute alternate code on platforms that never provide
// the feature.
//
// This arrangement does not allow a preprocessor test for whether the
// feature is defined on all instances of the processor.  Our guess is that
// this is not particularly useful.  Typically if any instances provide it,
// nearly all future instances will, and thus code should include the
// dynamic tests.
//
// We could provide both static and dynamic feature tests directly, but
// that significantly increases the size of the interface.  Or we could
// eliminate the static test, but that would make it harder to deal with
// cases (such as a linked stack) that are likely to use different 
// data structured depending the availability of primitives.  Thus this
// compromise.

// Some feature test macros
#define ATOMIC_HAVE_CHAR_BASICS <impl. defined>
	// nonzero if ordered & unordered load/store primitives on
	// char and unsigned char are supported.
#define ATOMIC_HAVE_SHORT_BASICS <impl. defined>
#define ATOMIC_HAVE_INT_BASICS <impl. defined>
#define ATOMIC_HAVE_LONG_BASICS <impl. defined>
#define ATOMIC_HAVE_PTR_BASICS <impl. defined>
	// 1 if ordered & unordered load/store primitives on
	// void * are supported.
void atomic_memory_fence(void);     // Guarantees explicit memory ordering
				     // for otherwise unordered atomics,
				     // and for other memory references wrt
				     // atomics.  Not useful for ordering
				     // ordinary memory references, since
				     // those may not race and, if they don't
				     // race, always appear to be ordered.
void atomic_compiler_fence(void);   // Ensures that prior memory operations
				     // appear in the instruction stream
				     // before subsequent ones, i.e. the
				     // compiler is not allowed to reorder
				     // around this.  This really has only
				     // implementation-defined semantics,
				     // but it seems to be useful in
				     // ensuring ordering with respect to
				     // signal handlers and the like.
				     
// I will assume that the following are overloaded for T in (void *)
// and unsigned {char, short, int, long}.
// A signed version can be derived from the unsigned.
// (In a strictly conforming program, I think this requires adding an
// explicit bias.  In practice, it's a non-issue.)
// They may not be applied to user-defined types.
// The semantics of primitives whose corresponding feature test macro is
// not defined are left undefined.  Implementations that fail conspicuously
// are preferred over implementations that occasionally produce unexpected
// outcomes, w.g. by relaxing the atomicity constraint.
void atomic_store(T*, T);  // No ordering guarantees.
void atomic_store_release(T*, T);

T atomic_load(T*);	// No ordering guarantees.
T atomic_load_acquire(T*);

#define ATOMIC_HAVE_T_CAS  <impl. defined>
	// Abbreviates 5 different feature tests for different
	// replacements of T.

#define ATOMIC_HAVE_WAIT_FREE_T_CAS  <impl. defined>
	// Abbreviates 5 different feature tests for different
	// replacements of T.  Indicates whether the below
	// operation is wait-free.  Undefined if it is never wait-free.

T atomic_cas[_order](T* addr, T old_val, T new_val);
	// Order can be any of raw, acquire, release, or ordered.
	// "Raw" implies the operation is unordered.
	// Most architectures provide a way to return the old
	// value.  On those that do not, it can be emulated with
	// an additional load, at the expense of wait-freedom
	// or spurious failure.

#define ATOMIC_HAVE_WEAK_T_CAS  <impl. defined>
T atomic_weak_cas[_order](T* addr, T old_val, T new_val);
	// Similar to the above, but may fail spuriously, and
	// must be wait-free, if provided.

// It is unclear how various flavors of double-wide or two operand
// CAS should be handled.  We omit them here.  That may in fact
// be a reasonable alternative.

#define ATOMIC_HAVE_T_FETCH_ADD <impl. defined>
	// Abbreviates 5 different feature tests for different
	// replacements of T.

T atomic_fetch_add[_order](T* addr, T incr);
	// Order can be any of acquire, release, or ordered.
	// If it is omitted, the operation is unordered.

#define ATOMIC_HAVE_T_FETCH_OR <impl. defined>

T atomic_fetch_or[_order](T* addr, T mask);

#define ATOMIC_HAVE_T_FETCH_AND <impl. defined>

T atomic_fetch_and[_order](T* addr, T mask);

// A simple test-and-set primitive.  We probably don't need a
// feature test macro, since this pretty much has to be supported.
typedef enum {atomic_ts_clear, atomic_ts_set} atomic_ts_val;

typedef <implementation defined> atomic_ts_loc;
	// Needs strange alignments, etc on some architectures.

atomic_ts_val atomic_ts(atomic_ts_loc *addr);

void atomic_ts_clear(atomic_ts_loc *addr);
void atomic_ts_clear_release(atomic_ts_loc *addr);

#define ATOMIC_TS_LOC_INITIALIZER <implementation defined>
	// Initialization expression for cleared atomic_ts_loc.