ISO/IEC JTC1 SC22 WG21 N2324 = 07-0184 - 2007-06-24
 Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
This document is a revision of N2145 = 07-0005 - 2007-01-12.
We present an interface and minimal implementation for standard atomic types and operations.
The standard needs atomic types for two reasons.
The traditional shared-memory notion that every store is instantly visible to all threads induces an unacceptable performance loss on modern systems. Therefore programmers must have a mechanism to indicate when stores in one thread should be communicated to another. This mechanism must necessarily be atomic, and we use the proposed interface to provide that mechanism.
Specifically, a program that wishes to communicate the fact that a particular piece of data prepared by one thread is ready to be examined by another thread, needs a shared variable flag, that both
Although the second aspect is often glossed over, it is usually not automatic with modern hardware and compilers, and is just as important as the first in ensuring correctness.
N1875 presented an atomic statement, which would require the compiler to select the appropriate atomic operations. We chose to abandon this approach because:
N2047 presented a template-based C++ library of atomic types layered on top of a C library of atomic operations on plain integral types. We chose to abandon this approach because:
N2145 is the basis for this proposal and is not fundamentally different. However, we have refined the proposal in several areas.
We propose to add atomic types and operations purely as a library API. In practice, this API would have to be implemented largely with either compiler intrinsics or assembly code. (As such, this proposal should be implemented by compiler vendors, not library vendors, much as the exception facilities are implemented by compiler vendors.)
The proposal defines atomic types and defines all atomic operations on those types, which enables enforcement of a single synchronization protocol for all operations on an object.
The proposal defines the types as standard layout structs and the core functions on pointers to those structs, so that types and core functions are usable from both C and C++. That is, a header included from both C and C++ can declare a variable of an atomic type and provide inline functions that operate on them. The proposal additionally provides member operators and member functions so that C++ programmers may use a more concise syntax.
The proposal defines the core functions as overloaded functions in C++ and as type-generic macros in C. This approach helps programmers avoid changing code when an atomic type changes size.
The proposal defines additional template types to aid type-safe use of atomic pointers and to simplify the wrapping of user-defined types.
The proposal defines feature queries to determine whether or not a type's operations are lock-free. In some cases, both the decision to use a lock-free algorithm, and sometimes the choice of lock-free algorithm, depends on the availability of underlying hardware primitives. In other cases, e.g. when dealing with asynchronous signals, it may be important to know that operations like compare-and-swap are really lock-free, because a lock-based emulation might result in deadlock. We provide two kinds of feature queries, compile-time preprocessor macros and run-time functions. The former provide general characteristics of the implementation. The latter provide per-type information.
To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation may not depend on any per-process state. While such a definition is beyond the scope of the standard, a clear statement of our intent will enable a portable expression of class of a programs already extant.
Synchronization operations in the memory model (N2052, N2138, N2171, N2176, N2177, N2300) may be either acquire or release operations, or both. These operations govern the communication of non-atomic stores between threads. A release operation ensures that prior memory operations become visible to the thread performing subsequent acquire operation on the same object.
Rather than have atomic operations implicitly provide both acquire and release semantics, we choose to complicate the interface by adding explicit ordering specifications to various operations. Many comparable packages do not, and instead provide only a single version of operations, like compare-and-swap, which implicitly include a full memory fence.
Unfortunately, the extra ordering constraint introduced by the single version is almost never completely necessary. For example, an atomic increment operation may be used simply to count the number of times a function is called, as in a profiler. This requires atomicity, but no ordering constraints. And on many architectures (PowerPC, ARM, Itanium, Alpha, though not X86), the extra ordering constraints are at least moderately expensive.
It is also unclear how a convention requiring full memory fences would properly interact with an interface that supported simple atomic loads and stores. Here a full memory fence would generally multiply the cost by a large factor. (The current gcc interface does not seem to support simple atomic loads and stores explicitly, which makes it unclear how to support e.g. lock-based emulation, or architectures on which the relevant loads and stores are not implicitly atomic.)
There are two possible approaches to specifying ordering constraints:
Both approaches appear to have their merits. We chose the latter for several reasons:
Some architectures provide fences that are limited to loads or stores. We have, so far, not included them, since it seems to be hard to find cases in which both:
However, we have provided limited fence operations, which are semantically modeled on read-modify-write operations. We expect that implementations that have hardware fences will use such operations to implement the language fences.
Most architectures provide additional ordering guarantees if one memory operation is dependent on another. In fact, these are critical for efficient implementation of languages like Java.
In this case, there is near-universal agreement that it would be nice to have some such guarantees. The difficulty is that we have not been able to formulate such a guarantee that both makes sense at the C++ source level, and does not interfere with optimization of code in compilation units that do not mention atomics. The fundamental issues are:
A strict interpretation of acquire and release yields a fairly weak consistency model, which allows threads to have a different notion of the order of writes. For stronger consistency, this proposal destinguishes between an operation with acquire and release semantics and an operation with sequentially consistent semantics.
The proposal defines a synchronization enumeration, which enables detailed specification of the memory order for every operation.
The proposal includes a very simple atomic flag type, providing two operations, test-and-set-acquire and clear-release. This type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with the atomic flag, though with less than ideal properties. Few programmers should be using this type.
The proposal includes several standard scalar atomic types: boolean, address, and many integral types.
To support generic data and algorithms, the proposal also includes a generic atomic type, a partial specialization for pointers (based on the atomic address), and full specializations for integral types (based on the atomic scalars).
The primary problem with a generic atomic template is that effective use of machine operations requires three properties of their parameter types:
In the present language, there is no mechanism to enforce the properties parameter type. Roland Schwarz suggests using a template union to enfore POD parameter types. Unfortunately, that approach also prevents the derivation of specializations of atomic for the types above, which is unacceptable. Furthermore, Lois Goldthwaite proposes generalizing unions to permit non-POD types in N2248 Toward a More Perfect Union. We believe that concepts are a more appropriate mechanism to enforce this restriction, but we do not employ that mechanism in this paper.
The intent is that vendors will specialize a fully-general locking implementation of a generic atomic template with implementations using hardware primitives when those primitives are applicable. This specialization can be accomplished by defining a base template with the size and alignment of the parameter type as additional template parameters, and then specializing on those two arguments.
The proposal defines several atomic functions, which serve the requirements of both C and C++ programmers. Because these functions are clumsy, the proposal includes member operator and member function definitions that are syntactically simpler. The member operators provide only the strongest memory synchronization.
The fetch-and-operation functions return the original stored value. This approach is required for fetch-and-or and fetch-and-and because there is no means to compute to the original stored value from the new value and the modifying argument. In contrast to the core functions, the modifying assignment operators return the new value. We do this for consistency with normal assignment operators. Unlike normal assignment operators, though, the atomic assignments return values rather than references. The reason is that another thread might intervene between an assignment and a subsequent read. Rather than introduce this classic parallel programming bug, we return a value.
The functions and operations are defined to work with volatile objects, so that variables that should be volatile can also be atomic. The volatile qualifier, however, is not required for atomicity.
The normal signed integral addition operations have undefined behavior in the event of an overflow. For atomic variables, this undefined behavior introduces significant burden. Rather than leave overflow undefined, we recognize the defacto behavior of modern systems and define atomic fetch-and-add (-subtract) to use two's-complement arithmetic. We are aware of no implementation of C++ for which this definition is a problem.
The fence operations act upon an atomic variable and provide synchronization semantics of a read-modify-write operation.
We believe that there is ample evidence for implementatability.
The Intel/gcc __sync intrinsics provide evidence for compiler implementability of the proposed interface.
(We do not advocate standardizing these intrisics as is. They provide far less control over memory ordering than we advocated above. For example, they provide no way to atomically increment a counter without imposing unnecessary ordering constraints. The lack of appropriate ordering control appears to already have resulted in implementation shortcuts, e.g. gcc does not implement __sync_synchronize() as a full memory barrier on X86, in spite of the documentation. We believe a number of issues were not fully understood when that design was developed, and it could could greatly benefit from another iteration at this stage.)
Other packages, particularly Boehm's atomic_ops package provide evidence of efficient implementability over a range of architectures.
This proposal includes atomic integers smaller than a machine word, even though many architectures do not have such operations. For machines that implement a word-based compare-and-swap operation, the effect of operations can be achieved by loading the containing word, modifying the sub-word in place, and performing a compare-and-swap on the containing word. In the event that no compare-and-swap is available, the implementation may need to either implement smaller types with larger types or use locking algorithms.
The remaining implementation issue is the burden on implementors to produce a minimally conforming implementation. The minimum hardware support required is the atomic test-and-set and clear operations that form the basis of the atomic flag type. This proposal includes an example implementation based on that minimal hardware support, and thus shows the vendor work required.
The proposal does not standardize any flag wait functions, despite using one. This issue as that any function will often be inappropriate. Should the proposal define a set of wait functions for common situations? Example situations include a non-preemptive kernel execution and preemptive space execution.
There are FIX 470 functions and FIX 206 methods defined by this proposal. While the methods generally have trival implementations, the functions do not. Will vendors commit to implementing this large a definition? Note that both the prototype presented here and the atomic_ops package provide moderately compact implementations, in spite of the interface sizes.
The mechanism provided to verify that programmers are not using detailed synchronization control is to encode such control in an enumeration which is searchable. Is this acceptable?
Are non-volatile overloads in template classes neccessary or desireable?
Is a per-type static lock-free query function desireable?
The present implementation does not privatize members. This is a shortcut to avoid all the friend declarations.
This section of the proposal is intended to provide the basis for formal wording in a subsequent proposal.
The standard provides two headers, cstdatomic and stdatomic.h. The cstdatomic header defines the types and operations in namespace std. The stdatomic.h header defines the types and operations in namespace std also exports the types to the global namespace.
The enumeration memory_order specifies the detailed memory synchronization order as defined in N2300. Its enumerated values and their meanings are as follows.
An atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. The ordering of evaluations in this sequence must be such that
y.store( 42, memory_order_relaxed );On the other hand,
r1 = y.load( memory_order_relaxed );
x.store( r1, memory_order_relaxed );
r2 = x.load( memory_order_relaxed );
There are only a few kinds of general operations, though there are many variants on those kinds. The operations may explicitly specify memory order. Implicitly, the memory order is memory_order_seq_cst.
Implementations should stive to make atomic stores visible to atomic loads within a reasonable amount of time. They should never move an atomic operation out of an unbounded loop.
Many operations are volatile qualified. This qualification means that volatility is preserved when applying these operations to volatile objects. It does not mean that operatoins on non-volatile objects become volatile.
The atomic_flag type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with the atomic flag, though with less than ideal properties. The flag type has two primary operations, test-and-set and clear.
The atomic_bool type supports store, load, swap, compare-and-swap, and fence.
The atomic_address type provides atomic void* operations and supports store, load, swap, compare-and-swap, fence, fetch-and-add, and fetch-and-subtract. The unit of addition/subtraction is one byte.
The atomic integral types support store, load, swap, compare-and-swap, fence, fetch-and-add, fetch-and-subtract, fetch-and-and, and fetch-and-or, and fetch-and-xor. The integral types are characters
atomic_char16_t, atomic_char32_t, atomic_wchar_tand integers
atomic_char, atomic_schar, atomic_uchar, atomic_short, atomic_ushort, atomic_int, atomic_uint, atomic_long, atomic_ulong, atomic_llong, atomic_ullongIn addition, there are typedefs for atomic types corresponding to the stdint typedefs.
atomic_int_least8_t, atomic_uint_least8_t, atomic_int_least16_t, atomic_uint_least16_t, atomic_int_least32_t, atomic_uint_least32_t, atomic_int_least64_t, atomic_uint_least64_t, atomic_int_fast8_t, atomic_uint_fast8_t, atomic_int_fast16_t, atomic_uint_fast16_t, atomic_int_fast32_t, atomic_uint_fast32_t, atomic_int_fast64_t, atomic_uint_fast64_t, atomic_intptr_t, atomic_uintptr_t, atomic_size_t, atomic_ssize_t, atomic_ptrdiff_t, atomic_intmax_t, atomic_uintmax_t
Note that in C, the atomic character types are not distinct types, but rather typedefs to other atomic integral types.
There is a generic atomic class template. The template requires that its type argument be
There are pointer partial specializations on the atomic class template. The operations supported are store, load, swap, compare-and-swap, fence, fetch-and-add, and fetch-and-subtract. For atomic pointer partial specializations, the unit of addition/subtraction is the size of the referenced type.
Finally, there are full specializations over the integral types on the atomic class template. The operations supported are store, load, swap, compare-and-swap, fence, fetch-and-add, fetch-and-subtract, fetch-and-and, fetch-and-or, and fetch-and-xor.
Whether or not a particular type supports lock-free operations is important to its use in signal handlers and certain algorithms, so there are queries for the lock-free property. Because consistent use of operations requires that all operations on a type must use the same protocol, all operations are wait-free or none of them are. Therefore, there is a single wait-free query per type. However, the proposal defines operations on the atomic_flag type to be wait-free.
To facilitate optimal storage use, the proposal supplies a feature macro, ATOMIC_SCALAR_LOCK_FREE, that indicates general lock-free status of scalar atomic types. A value of 0 indicates that the scalar types are never lock-free. A value of 1 indicates that the scalar types are sometimes lock-free. A value of 2 indicates that the scalar types are always lock-free.
The result lock-free query on an object cannot be inferred from the result of a lock-free query on another object. [Note: This rule permits misaligned atomic variable when they are unavoidable. —end note ]
The clear and test-and-set operations must be lock-free, and hence address-free.
The atomic types and operations have the following synopsis.
typedef enum memory_order {
       memory_order_relaxed, memory_order_acquire, memory_order_release,
       memory_order_acq_rel, memory_order_seq_cst
} memory_order;
typedef struct atomic_flag
{
    bool test_and_set( memory_order = memory_order_seq_cst ) volatile;
    void clear( memory_order = memory_order_seq_cst ) volatile;
    atomic_flag() = default;
    atomic_flag( const atomic_flag& ) = delete;
    atomic_flag& operator =( const atomic_flag& ) = delete;
}
bool atomic_flag_test_and_set( volatile atomic_flag* );
bool atomic_flag_test_and_set_explicit( volatile atomic_flag*, memory_order );
void atomic_flag_clear( volatile atomic_flag* );
void atomic_flag_clear_explicit( volatile atomic_flag*, memory_order );
typedef struct atomic_bool
{
    bool lock_free();
    void store( bool, memory_order = memory_order_seq_cst ) volatile;
    bool load( memory_order = memory_order_seq_cst ) volatile;
    bool swap( bool, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( bool&, bool,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    atomic_bool() = default;
    constexpr atomic_bool( bool __v__ );
    atomic_bool( const atomic_bool& ) = delete;
    atomic_bool& operator =( const atomic_bool & ) = delete;
    bool operator =( bool ) volatile;
    operator bool() volatile;
} atomic_bool;
bool atomic_lock_free( volatile atomic_bool* );
void atomic_store( volatile atomic_bool*, bool );
void atomic_store_explicit( volatile atomic_bool*, bool, memory_order );
bool atomic_load( volatile atomic_bool* );
bool atomic_load_explicit( volatile atomic_bool*, memory_order );
bool atomic_swap( volatile atomic_bool* );
bool atomic_swap_explicit( volatile atomic_bool*, bool );
bool atomic_compare_swap( volatile atomic_bool*, bool*, bool );
bool atomic_compare_swap_explicit( volatile atomic_bool*, bool*, bool,
                                   memory_order );
void atomic_fence( volatile atomic_bool*, memory_order ) volatile;
typedef struct atomic_address
{
    bool lock_free();
    void store( void*, memory_order = memory_order_seq_cst ) volatile;
    void* load( memory_order = memory_order_seq_cst ) volatile;
    void* swap( void*, memory_order = memory_order_seq_cst ) volatile;
    void* compare_swap( void*&, void*,
                        memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    void* fetch_add( void*, memory_order = memory_order_seq_cst ) volatile;
    void* fetch_sub( void*, memory_order = memory_order_seq_cst ) volatile;
    atomic_address() = default;
    constexpr atomic_address( void* );
    atomic_address( const atomic_address& ) = delete;
    atomic_address& operator =( const atomic_address& ) = delete;
    void* operator =( void* ) volatile;
    operator void*() volatile;
    void* operator +=( void* ) volatile;
    void* operator -=( void* ) volatile;
} atomic_address;
bool atomic_lock_free( volatile atomic_address* );
void atomic_store( volatile atomic_address*, void* );
void atomic_store_explicit( volatile atomic_address*, void*, memory_order );
void* atomic_load( volatile atomic_address* );
void* atomic_load_explicit( volatile atomic_address*, memory_order );
void* atomic_swap( volatile atomic_address* );
void* atomic_swap_explicit( volatile atomic_address*, void* );
void* atomic_compare_swap( volatile atomic_address*, void**, void* );
void* atomic_compare_swap_explicit( volatile atomic_address*, void**, void*,
                                    memory_order );
void atomic_fence( volatile atomic_address*, memory_order ) volatile;
void* atomic_fetch_add( volatile atomic_address*, void* );
void* atomic_fetch_add_explicit( volatile atomic_address*, void*,
                                 memory_order );
void* atomic_fetch_sub( volatile atomic_address*, void* );
void* atomic_fetch_sub_explicit( volatile atomic_address*, void*,
                                 memory_order );
And for each of the integral (character and integer) types listed above,
typedef struct atomic_integral
{
    bool lock_free();
    void store( integral, memory_order = memory_order_seq_cst ) volatile;
    integral load( memory_order = memory_order_seq_cst ) volatile;
    integral swap( integral,
                   memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( integral&, integral,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    integral fetch_add( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_sub( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_and( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_or( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    integral fetch_xor( integral,
                        memory_order = memory_order_seq_cst ) volatile;
    atomic_integral() = default;
    constexpr atomic_integral( integral );
    atomic_integral( const atomic_integral& ) = delete;
    atomic_integral& operator =( const atomic_integral & ) = delete;
    integral operator =( integral ) volatile;
    operator integral() volatile;
    integral operator +=( integral ) volatile;
    integral operator -=( integral ) volatile;
    integral operator &=( integral ) volatile;
    integral operator |=( integral ) volatile;
    integral operator ^=( integral ) volatile;
} atomic_integral;
bool atomic_lock_free( volatile atomic_integral* );
void atomic_store( volatile atomic_integral*, integral );
void atomic_store_explicit( volatile atomic_integral*, integral, memory_order );
integral atomic_load( volatile atomic_integral* );
integral atomic_load_explicit( volatile atomic_integral*, memory_order );
integral atomic_swap( volatile atomic_integral*, integral );
integral atomic_swap_explicit( volatile atomic_integral*, integral,
                               memory_order );
bool atomic_compare_swap( volatile atomic_integral*, integral*, integral );
bool atomic_compare_swap_explicit( volatile atomic_integral*, integral*,
                                   integral, memory_order );
void atomic_fence( volatile atomic_integral*, memory_order ) volatile;
integral atomic_fetch_add( volatile atomic_integral*, integral );
integral atomic_fetch_add_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_sub( volatile atomic_integral*, integral );
integral atomic_fetch_sub_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_and( volatile atomic_integral*, integral );
integral atomic_fetch_and_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_or( volatile atomic_integral*, integral );
integral atomic_fetch_or_explicit( volatile atomic_integral*, integral,
                                    memory_order );
integral atomic_fetch_xor( volatile atomic_integral*, integral );
integral atomic_fetch_xor_explicit( volatile atomic_integral*, integral,
                                    memory_order );
This proposal embeds an example, minimally-conforming, implementation. The implementation uses a hash table of flags and does busy waiting on the flags.
The proposal marks the defined interface with the <code> font tag, which typically renders in a teletype font.
The proposal marks the example implemenation with the <var> font tag within the <code> font tag, which typically renders in an italic teletype font. This example implementation is not part of the standard; it is evidence of implementability.
The embedded source is a bash script that generates the C and C++ source files. We have taken this approach because the definitions have a high degree of redundancy, which would otherwise interfere with the readability of the document.
To extract the bash script from the HTML source, use the following sed script. (The bash script will also generate the sed script.)
echo n2324.sed
cat <<EOF >n2324.sed
1,/<code>/        d
/<\/code>/,/<code>/    d
            s|<var>||g
            s|</var>||g
            s|<|<|g
            s|>|>|g
            s|&|\&|g
EOF
To compile the enclosed sources and examples, use the following Makefile. (The bash script will also generate the Makefile.)
echo Makefile
cat <<EOF >Makefile
default : test
n2324.bash : n2324.html
	sed -f n2324.sed n2324.html > n2324.bash
stdatomic.h cstdatomic impatomic.h impatomic.c n2324.c : n2324.bash
	bash n2324.bash
impatomic.o : impatomic.h impatomic.c
	gcc -std=c99 -c impatomic.c
n2324.c.exe : n2324.c stdatomic.h impatomic.o
	gcc -std=c99 -o n2324.c.exe n2324.c impatomic.o
n2324.c++.exe : n2324.c stdatomic.h impatomic.o
	g++ -o n2324.c++.exe n2324.c impatomic.o
test : n2324.c.exe n2324.c++.exe
clean :
	rm -f n2324.bash stdatomic.h cstdatomic impatomic.h impatomic.c
	rm -f impatomic.o n2324.c.exe n2324.c++.exe
EOF
As is common practice, we place the common portions of the C and C++ standard headers in a separate implementation header.
The implementation header includes standard headers to obtain basic typedefs.
echo impatomic.h includes
cat <<EOF >impatomic.h
#ifdef __cplusplus
#include <cstddef>
namespace std {
#else
#include <stddef.h>
#include <stdbool.h>
#endif
EOF
The corresponding implementation source includes the implementation header and stdint.h.
echo impatomic.c includes
cat <<EOF >impatomic.c
#include <stdint.h>
#include "impatomic.h"
EOF
Because current compilers do not support the new C++0x features, we have surrounded these with a macro to conditionally remove them.
echo impatomic.h CPP0X
cat <<EOF >>impatomic.h
#define CPP0X( feature )
EOF
echo impatomic.h order
cat <<EOF >>impatomic.h
typedef enum memory_order {
    memory_order_relaxed, memory_order_acquire, memory_order_release,
    memory_order_acq_rel, memory_order_seq_cst
} memory_order;
EOF
To aid the emulated implementation, the example implementation includes a predefined hashtable of locks implemented via flags.
echo impatomic.h flag
cat <<EOF >>impatomic.h
typedef struct atomic_flag
{
#ifdef __cplusplus
    bool test_and_set( memory_order = memory_order_seq_cst ) volatile;
    void clear( memory_order = memory_order_seq_cst ) volatile;
    atomic_flag() CPP0X(=default);
    atomic_flag( const atomic_flag& ) CPP0X(=delete);
    atomic_flag& operator =( const atomic_flag& ) CPP0X(=delete);
#endif
    bool __f__;
} atomic_flag;
#define ATOMIC_FLAG_INIT { false }
#ifdef __cplusplus
extern "C" {
#endif
extern bool atomic_flag_test_and_set( atomic_flag volatile* );
extern bool atomic_flag_test_and_set_explicit
( atomic_flag volatile*, memory_order );
extern void atomic_flag_clear( atomic_flag volatile* );
extern void atomic_flag_clear_explicit
( atomic_flag volatile*, memory_order );
extern void __atomic_flag_wait__
( atomic_flag volatile* );
extern void __atomic_flag_wait_explicit__
( atomic_flag volatile*, memory_order );
extern atomic_flag volatile* __atomic_flag_for_address__
( void volatile* __z__ )
__attribute__((const));
#ifdef __cplusplus
}
#endif
#ifdef __cplusplus
inline bool atomic_flag::test_and_set( memory_order __x__ ) volatile
{ return atomic_flag_test_and_set_explicit( this, __x__ ); }
inline void atomic_flag::clear( memory_order __x__ ) volatile
{ atomic_flag_clear_explicit( this, __x__ ); }
#endif
EOF
The wait operation may be implemented with busy-waiting, and hence must be used only with care.
The for_address function returns the address of a flag. Multiple argument values may yield a single flag, and the implementation of locks may use these flags, so no operation should attempt to hold any flag or lock while holding a flag for an address.
The prototype implementation of flags uses the __sync macros from the GNU C/C++ compiler, when available, and otherwise uses a non-atomic implementation with the expectation that vendors will replace it. It might even be implemented with, for example, pthread_mutex_trylock, in which case the internal flag wait function might just be pthread_mutex_lock. This would of course tend to make atomic_flag larger than necessary.
The prototype implementation of flags is implemented in C.
echo impatomic.c flag
cat <<EOF >>impatomic.c
#if defined(__GNUC__)
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 0)
#define USE_SYNC
#endif
#endif
bool atomic_flag_test_and_set( atomic_flag volatile* __a__ )
{ return atomic_flag_test_and_set_explicit( __a__, memory_order_seq_cst ); }
bool atomic_flag_test_and_set_explicit
( atomic_flag volatile* __a__, memory_order __x__ )
{
#ifdef USE_SYNC
    if ( __x__ >= memory_order_acq_rel )
        __sync_synchronize();
    return __sync_lock_test_and_set( &(__a__->__f__), 1 );
#else
    bool result = __a__->__f__;
    __a__->__f__ = true;
    return result;
#endif
}
void atomic_flag_clear( atomic_flag volatile* __a__ )
{ atomic_flag_clear_explicit( __a__, memory_order_seq_cst ); }
void atomic_flag_clear_explicit
( atomic_flag volatile* __a__, memory_order __x__ )
{
#ifdef USE_SYNC
    __sync_lock_release( &(__a__->__f__) );
    if ( __x__ >= memory_order_acq_rel )
        __sync_synchronize();
#else
    __a__->__f__ = false;
#endif
} 
Note that the following implementation of wait is almost always wrong -- it has high contention. Some form of exponential backoff prevents excessive contention.
void __atomic_flag_wait__( atomic_flag volatile* __a__ )
{ while ( atomic_flag_test_and_set( __a__ ) ); }
void __atomic_flag_wait_explicit__( atomic_flag volatile* __a__,
                                    memory_order __x__ )
{ while ( atomic_flag_test_and_set_explicit( __a__, __x__ ) ); }
#define LOGSIZE 4
static atomic_flag volatile __atomic_flag_anon_table__[ 1 << LOGSIZE ] =
{
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
};
atomic_flag volatile* __atomic_flag_for_address__( void volatile* __z__ )
{
    uintptr_t __u__ = (uintptr_t)__z__;
    __u__ += (__u__ >> 2) + (__u__ << 4);
    __u__ += (__u__ >> 7) + (__u__ << 5);
    __u__ += (__u__ >> 17) + (__u__ << 13);
    if ( sizeof(uintptr_t) > 4 ) __u__ += (__u__ >> 31);
    __u__ &= ~((~(uintptr_t)0) << LOGSIZE);
    return __atomic_flag_anon_table__ + __u__;
}
EOF
The remainder of the example implementation uses the following macros. These macros exploit GNU extensions for value-returning blocks (AKA statement expressions) and __typeof__.
The macros rely on data fields of atomic structs being named __f__. Other symbols used are __a__=atomic, __e__=expected, __f__=field, __g__=flag, __m__=modified, __o__=operation, __r__=result, __p__=pointer to field, __v__=value (for single evaluation), and __x__=memory-ordering.
echo impatomic.h macros implementation
cat <<EOF >>impatomic.h
#define _ATOMIC_LOAD_( __a__, __x__ ) \\
({ __typeof__((__a__)->__f__) volatile* __p__ = &((__a__)->__f__); \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)->__f__) __r__ =*__p__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })
#define _ATOMIC_STORE_( __a__, __m__, __x__ ) \\
({ __typeof__((__a__)->__f__) volatile* __p__ = &((__a__)->__f__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   *__p__ = __v__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __v__; })
#define _ATOMIC_MODIFY_( __a__, __o__, __m__, __x__ ) \\
({ __typeof__((__a__)->__f__) volatile* __p__ = &((__a__)->__f__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)->__f__) __r__ = *__p__; \\
   *__p__ __o__ __v__; \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })
#define _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ) \\
({ __typeof__((__a__)->__f__) volatile* __p__ = &((__a__)->__f__); \\
   __typeof__(__e__) __q__ = (__e__); \\
   __typeof__(__m__) __v__ = (__m__); \\
   bool __r__; \\
   atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ ); \\
   __atomic_flag_wait_explicit__( __g__, __x__ ); \\
   __typeof__((__a__)->__f__) __t__ = *__p__; \\
   if ( __t__ == *__q__ ) { *__p__ = __v__; __r__ = true; } \\
   else { *__q__ = __t__; __r__ = false; } \\
   atomic_flag_clear_explicit( __g__, __x__ ); \\
   __r__; })
EOF
echo impatomic.h lock-free macros
cat <<EOF >>impatomic.h
#define ATOMIC_SCALAR_LOCK_FREE 0
EOF
The standard defines atomic types corresponding to booleans, addresses, integers, and for C++, wider characters. These atomic types are defined in terms of a base type.
The base types have two names in this proposal, a short name usually embedded within other identifiers, and a long name for the base type. The mapping between them is as follows.
bool="bool"
address="void*"
INTEGERS="char schar uchar short ushort int uint long ulong llong ullong"
char="char"
schar="signed char"
uchar="unsigned char"
short="short"
ushort="unsigned short"
int="int"
uint="unsigned int"
long="long"
ulong="unsigned long"
llong="long long"
ullong="unsigned long long"
CHARACTERS="wchar_t"
/* CHARACTERS="char16_t char32_t wchar_t" // char*_t not yet in compilers */
char16_t="char16_t"
char32_t="char32_t"
wchar_t="wchar_t"
In addition to types, some operations also need two names, one for embedding within other identifiers, and one consisting of the operator.
ADR_OPERATIONS="add sub"
INT_OPERATIONS="add sub and ior xor"
add="+"
sub="-"
and="&"
ior="|"
xor="^"
echo impatomic.h type boolean
cat <<EOF >>impatomic.h
typedef struct atomic_bool
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( bool, memory_order = memory_order_seq_cst ) volatile;
    bool load( memory_order = memory_order_seq_cst ) volatile;
    bool swap( bool, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap ( bool&, bool,
                        memory_order = memory_order_seq_cst) volatile;
    void fence( memory_order ) volatile;
    atomic_bool() CPP0X(=delete);
    CPP0X(constexpr) atomic_bool( bool __v__ ) : __f__( __v__ ) { }
    atomic_bool( const atomic_bool& ) CPP0X(=delete);
    atomic_bool& operator =( const atomic_bool& ) CPP0X(=delete);
    bool operator =( bool __v__ ) volatile
    { store( __v__ ); return __v__; }
    operator bool() volatile
    { return load(); }
#endif
    bool __f__;
} atomic_bool;
EOF
echo impatomic.h type address
cat <<EOF >>impatomic.h
typedef struct atomic_address
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( void*, memory_order = memory_order_seq_cst ) volatile;
    void* load( memory_order = memory_order_seq_cst ) volatile;
    void* swap( void*, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( void*&, void*,
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    void* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    void* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    atomic_address() CPP0X(=default);
    atomic_address( const atomic_address& ) CPP0X(=delete);
    CPP0X(constexpr) atomic_address( void* __v__ ) : __f__( __v__) { }
    atomic_address& operator =( const atomic_address & ) CPP0X(=delete);
    void* operator =( void* __v__ ) volatile
    { store( __v__ ); return __v__; }
    operator void*() volatile
    { return load(); }
    void* operator +=( ptrdiff_t __v__ ) volatile
    { return fetch_add( __v__ ); }
    void* operator -=( ptrdiff_t __v__ ) volatile
    { return fetch_sub( __v__ ); }
#endif
    void* __f__;
} atomic_address;
EOF
echo impatomic.h type integers
for TYPEKEY in ${INTEGERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h
typedef struct atomic_${TYPEKEY}
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( ${TYPENAME},
                memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} load( memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} swap( ${TYPENAME},
                      memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( ${TYPENAME}&, ${TYPENAME},
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    ${TYPENAME} fetch_add( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_sub( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_and( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_or( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_xor( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    atomic_${TYPEKEY}() CPP0X(=default);
    atomic_${TYPEKEY}( const atomic_${TYPEKEY}& ) CPP0X(=delete);
    CPP0X(constexpr) atomic_${TYPEKEY}( ${TYPENAME} __v__ ) : __f__( __v__) { }
    atomic_${TYPEKEY}& operator =( const atomic_${TYPEKEY}& ) CPP0X(=delete);
    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }
    operator ${TYPENAME}() volatile
    { return load(); }
    ${TYPENAME} operator ++( int ) volatile
    { return fetch_add( 1 ); }
    ${TYPENAME} operator --( int ) volatile
    { return fetch_sub( 1 ); }
    ${TYPENAME} operator ++() volatile
    { return fetch_add( 1 ) + 1; }
    ${TYPENAME} operator --() volatile
    { return fetch_sub( 1 ) - 1; }
    ${TYPENAME} operator +=( ${TYPENAME} __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }
    ${TYPENAME} operator -=( ${TYPENAME} __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }
    ${TYPENAME} operator &=( ${TYPENAME} __v__ ) volatile
    { return fetch_and( __v__ ) & __v__; }
    ${TYPENAME} operator |=( ${TYPENAME} __v__ ) volatile
    { return fetch_or( __v__ ) | __v__; }
    ${TYPENAME} operator ^=( ${TYPENAME} __v__ ) volatile
    { return fetch_xor( __v__ ) ^ __v__; }
#endif
    ${TYPENAME} __f__;
} atomic_${TYPEKEY};
EOF
done
The following typedefs support atomic versons of the cstdint and stdint.h types.
echo impatomic.h typedefs integers
cat <<EOF >>impatomic.h
typedef atomic_schar atomic_int_least8_t;
typedef atomic_uchar atomic_uint_least8_t;
typedef atomic_short atomic_int_least16_t;
typedef atomic_ushort atomic_uint_least16_t;
typedef atomic_int atomic_int_least32_t;
typedef atomic_uint atomic_uint_least32_t;
typedef atomic_llong atomic_int_least64_t;
typedef atomic_ullong atomic_uint_least64_t;
typedef atomic_schar atomic_int_fast8_t;
typedef atomic_uchar atomic_uint_fast8_t;
typedef atomic_short atomic_int_fast16_t;
typedef atomic_ushort atomic_uint_fast16_t;
typedef atomic_int atomic_int_fast32_t;
typedef atomic_uint atomic_uint_fast32_t;
typedef atomic_llong atomic_int_fast64_t;
typedef atomic_ullong atomic_uint_fast64_t;
typedef atomic_long atomic_intptr_t;
typedef atomic_ulong atomic_uintptr_t;
typedef atomic_long atomic_ssize_t;
typedef atomic_ulong atomic_size_t;
typedef atomic_long atomic_ptrdiff_t;
typedef atomic_llong atomic_intmax_t;
typedef atomic_ullong atomic_uintmax_t;
EOF
echo impatomic.h type characters
cat <<EOF >>impatomic.h
#ifdef __cplusplus
EOF
for TYPEKEY in ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h
typedef struct atomic_${TYPEKEY}
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( ${TYPENAME}, memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} load( memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} swap( ${TYPENAME},
                      memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( ${TYPENAME}&, ${TYPENAME},
                       memory_order = memory_order_seq_cst ) volatile;
    void fence( memory_order ) volatile;
    ${TYPENAME} fetch_add( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_sub( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_and( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_or( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    ${TYPENAME} fetch_xor( ${TYPENAME},
                           memory_order = memory_order_seq_cst ) volatile;
    atomic_${TYPENAME}() CPP0X(=default);
    atomic_${TYPENAME}( const atomic_${TYPENAME}& ) CPP0X(=delete);
    CPP0X(constexpr) atomic_${TYPEKEY}( ${TYPENAME} __v__ ) : __f__( __v__) { }
    atomic_${TYPENAME}& operator =( const atomic_${TYPENAME}& ) CPP0X(=delete);
    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }
    operator ${TYPENAME}() volatile
    { return load(); }
    ${TYPENAME} operator ++( int ) volatile
    { return fetch_add( 1 ); }
    ${TYPENAME} operator --( int ) volatile
    { return fetch_sub( 1 ); }
    ${TYPENAME} operator ++() volatile
    { return fetch_add( 1 ) + 1; }
    ${TYPENAME} operator --() volatile
    { return fetch_sub( 1 ) - 1; }
    ${TYPENAME} operator +=( ${TYPENAME} __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }
    ${TYPENAME} operator -=( ${TYPENAME} __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }
    ${TYPENAME} operator &=( ${TYPENAME} __v__ ) volatile
    { return fetch_and( __v__ ) & __v__; }
    ${TYPENAME} operator |=( ${TYPENAME} __v__ ) volatile
    { return fetch_or( __v__ ) | __v__; }
    ${TYPENAME} operator ^=( ${TYPENAME} __v__ ) volatile
    { return fetch_xor( __v__ ) ^ __v__; }
#endif
    ${TYPENAME} __f__;
} atomic_${TYPEKEY};
EOF
done
cat <<EOF >>impatomic.h
#else
typedef atomic_int_least16_t atomic_char16_t;
typedef atomic_int_least32_t atomic_char32_t;
typedef atomic_int_least32_t atomic_wchar_t;
#endif
EOF
This minimal implementation does not specialize on size.
echo impatomic.h type generic
cat <<EOF >>impatomic.h
#ifdef __cplusplus
template< typename T >
struct atomic
{
#ifdef __cplusplus
    bool lock_free() volatile;
    void store( T, memory_order = memory_order_seq_cst ) volatile;
    T load( memory_order = memory_order_seq_cst ) volatile;
    T swap( T __v__, memory_order = memory_order_seq_cst ) volatile;
    bool compare_swap( T&, T, memory_order = memory_order_seq_cst) volatile;
    void fence( memory_order ) volatile;
    atomic() CPP0X(=default);
    CPP0X(constexpr) atomic( T __v__ ) : __f__( __v__ ) { }
    atomic( const atomic& ) CPP0X(=delete);
    atomic& operator =( const atomic& ) CPP0X(=delete);
    T operator =( T __v__ ) volatile
    { store( __v__ ); return __v__; }
    operator T() volatile
    { return load(); }
#endif
    T __f__;
};
#endif
EOF
echo impatomic.h type pointer
cat <<EOF >>impatomic.h
#ifdef __cplusplus
template<typename T> struct atomic< T* > : atomic_address
{
    T* fetch_add( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    T* fetch_sub( ptrdiff_t, memory_order = memory_order_seq_cst ) volatile;
    atomic() CPP0X(=default);
    CPP0X(constexpr) atomic( T __v__ ) : atomic_address( __v__ ) { }
    atomic( const atomic& ) CPP0X(=delete);
    atomic& operator =( const atomic& ) CPP0X(=delete);
    T* operator =( T* __v__ ) volatile
    { store( __v__ ); return __v__; }
    operator T*() volatile
    { return load(); }
    T* operator ++( int ) volatile
    { return fetch_add( 1 ); }
    T* operator --( int ) volatile
    { return fetch_sub( 1 ); }
    T* operator ++() volatile
    { return fetch_add( 1 ) + 1; }
    T* operator --() volatile
    { return fetch_sub( 1 ) - 1; }
    T* operator +=( T* __v__ ) volatile
    { return fetch_add( __v__ ) + __v__; }
    T* operator -=( T* __v__ ) volatile
    { return fetch_sub( __v__ ) - __v__; }
};
#endif
EOF
We provide full specializations of the generic atomic for integers and characters. These specializations derive from the specific atomic types to enable implicit reference conversions. The implicit assignment of the derived class prevents inheriting the base class assignments, and so the assignment must be explicitly redeclared.
echo impatomic.h type specializations
cat <<EOF >>impatomic.h
#ifdef __cplusplus
EOF
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h
template<> struct atomic< ${TYPENAME} > : atomic_${TYPEKEY}
{
    atomic() CPP0X(=default);
    CPP0X(constexpr) atomic( ${TYPENAME} __v__ ) : atomic_${TYPEKEY}( __v__ ) { }
    atomic( const atomic& ) CPP0X(=delete);
    atomic& operator =( const atomic& ) CPP0X(=delete);
    ${TYPENAME} operator =( ${TYPENAME} __v__ ) volatile
    { store( __v__ ); return __v__; }
};
EOF
done
cat <<EOF >>impatomic.h
#endif
EOF
In C++, these operations are implemented as overloaded functions.
cat <<EOF >>impatomic.h
#ifdef __cplusplus
EOF
echo impatomic.h functions ordinary basic
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h
inline bool atomic_lock_free( volatile atomic_${TYPEKEY}* __a__ )
{ return false; }
inline ${TYPENAME} atomic_load( volatile atomic_${TYPEKEY}* __a__ )
{ return _ATOMIC_LOAD_( __a__, memory_order_seq_cst ); }
inline ${TYPENAME} atomic_load_explicit
( volatile atomic_${TYPEKEY}* __a__, memory_order __x__ )
{ return _ATOMIC_LOAD_( __a__, __x__ ); }
inline ${TYPENAME} atomic_store
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
{ return _ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst ); }
inline ${TYPENAME} atomic_store_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
{ return _ATOMIC_STORE_( __a__, __m__, __x__ ); }
inline ${TYPENAME} atomic_swap
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
{ return _ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst ); }
inline ${TYPENAME} atomic_swap_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
{ return _ATOMIC_MODIFY_( __a__, =, __m__, __x__ ); }
inline bool atomic_compare_swap
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME}* __e__, ${TYPENAME} __m__ )
{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst ); }
inline bool atomic_compare_swap_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME}* __e__, ${TYPENAME} __m__,
  memory_order __x__ )
{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ); }
inline void atomic_fence
( volatile atomic_${TYPEKEY}* __a__, memory_order __x__ )
{ ${TYPENAME} __v__ = _ATOMIC_LOAD_( __a__, memory_order_relaxed );
  _ATOMIC_CMPSWP_( __a__, &__v__, __v__, __x__ ); }
EOF
done
echo impatomic.h functions template basic
cat <<EOF >>impatomic.h
template< typename T >
inline bool atomic_lock_free( volatile atomic<T>* __a__ )
{ return false; }
template< typename T >
inline T atomic_store
( volatile atomic<T>* __a__, T __m__ )
{ return _ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst ); }
template< typename T >
inline T atomic_store_explicit
( volatile atomic<T>* __a__, T __m__, memory_order __x__ )
{ return _ATOMIC_STORE_( __a__, __m__, __x__ ); }
template< typename T >
inline T atomic_load( volatile atomic<T>* __a__ )
{ return _ATOMIC_LOAD_( __a__, memory_order_seq_cst ); }
template< typename T >
inline T atomic_load_explicit
( volatile atomic<T>* __a__, memory_order __x__ )
{ return _ATOMIC_LOAD_( __a__, __x__ ); }
template< typename T >
inline T atomic_swap
( volatile atomic<T>* __a__, T __m__ )
{ return _ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst ); }
template< typename T >
inline T atomic_swap_explicit
( volatile atomic<T>* __a__, T __m__, memory_order __x__ )
{ return _ATOMIC_MODIFY_( __a__, =, __m__, __x__ ); }
template< typename T >
inline bool atomic_compare_swap
( volatile atomic<T>* __a__, T* __e__, T __m__ )
{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst ); }
template< typename T >
inline bool atomic_compare_swap_explicit
( volatile atomic<T>* __a__, T* __e__, T __m__, memory_order __x__ )
{ return _ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ ); }
template< typename T >
inline void atomic_fence
( volatile atomic<T>* __a__, memory_order __x__ )
{ T __v__ = _ATOMIC_LOAD_( __a__, memory_order_relaxed );
  _ATOMIC_CMPSWP_( __a__, &__v__, __v__, __x__ ); }
EOF
echo impatomic.h functions address fetch
TYPEKEY=address
TYPENAME=${!TYPEKEY}
for FNKEY in ${ADR_OPERATIONS}
do
OPERATOR=${!FNKEY}
cat <<EOF >>impatomic.h
inline ${TYPENAME} atomic_fetch_${FNKEY}
( volatile atomic_${TYPEKEY}* __a__, ptrdiff_t __m__ )
{ ${TYPENAME} volatile* __p__ = &((__a__)->__f__);
  atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ );
  __atomic_flag_wait__( __g__ );
  ${TYPENAME} __r__ = *__p__;
  *__p__ = (${TYPENAME})((char*)(*__p__) ${OPERATOR} __m__);
  atomic_flag_clear( __g__ );
  return __r__; }
inline ${TYPENAME} atomic_fetch_${FNKEY}_explicit
( atomic_${TYPEKEY} volatile* __a__, ptrdiff_t __m__, memory_order __x__ )
{ ${TYPENAME} volatile* __p__ = &((__a__)->__f__);
  atomic_flag volatile* __g__ = __atomic_flag_for_address__( __p__ );
  __atomic_flag_wait_explicit__( __g__, __x__ );
  ${TYPENAME} __r__ = *__p__;
  *__p__ = (${TYPENAME})((char*)(*__p__) ${OPERATOR} __m__);
  atomic_flag_clear_explicit( __g__, __x__ );
  return __r__; }
EOF
done
echo impatomic.h functions integer fetch
for TYPEKEY in ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}
cat <<EOF >>impatomic.h
inline ${TYPENAME} atomic_fetch_${FNKEY}
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__ )
{ return _ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, memory_order_seq_cst ); }
inline ${TYPENAME} atomic_fetch_${FNKEY}_explicit
( volatile atomic_${TYPEKEY}* __a__, ${TYPENAME} __m__, memory_order __x__ )
{ return _ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, __x__ ); }
EOF
done
done
For C, we need type-generic macros.
cat <<EOF >>impatomic.h
#else
EOF
echo impatomic.h type-generic macros basic
cat <<EOF >>impatomic.h
#define atomic_lock_free( __a__ ) \\
false
#define atomic_load( __a__ ) \\
_ATOMIC_LOAD_( __a__, memory_order_seq_cst )
#define atomic_load_explicit( __a__, __x__ ) \\
_ATOMIC_LOAD_( __a__, __x__ )
#define atomic_store( __a__, __m__ ) \\
_ATOMIC_STORE_( __a__, __m__, memory_order_seq_cst )
#define atomic_store_explicit( __a__, __m__, __x__ ) \\
_ATOMIC_STORE_( __a__, __m__, __x__ )
#define atomic_swap( __a__, __m__ ) \\
_ATOMIC_MODIFY_( __a__, =, __m__, memory_order_seq_cst )
#define atomic_swap_explicit( __a__, __m__, __x__ ) \\
_ATOMIC_MODIFY_( __a__, =, __m__, __x__ )
#define atomic_compare_swap( __a__, __e__, __m__ ) \\
_ATOMIC_CMPSWP_( __a__, __e__, __m__, memory_order_seq_cst )
#define atomic_compare_swap_explicit( __a__, __e__, __m__, __x__ ) \\
_ATOMIC_CMPSWP_( __a__, __e__, __m__, __x__ )
#define atomic_fence( __a__, __x__ ) \\
({ T __v__ = _ATOMIC_LOAD_( __a__, memory_order_relaxed ); \\
   _ATOMIC_CMPSWP_( __a__, &__v__, __v__, __x__ ); })
EOF
echo impatomic.h type-generic macros fetch
for FNKEY in ${INT_OPERATIONS}
do
OPERATOR=${!FNKEY}
cat <<EOF >>impatomic.h
#define atomic_fetch_${FNKEY}( __a__, __m__ ) \\
_ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, memory_order_seq_cst )
#define atomic_fetch_${FNKEY}_explicit( __a__, __m__, __x__ ) \\
_ATOMIC_MODIFY_( __a__, ${OPERATOR}=, __m__, __x__ )
EOF
done
cat <<EOF >>impatomic.h
#endif
EOF
The core functions are difficult to use, and so the proposal includes member function definitions that are syntactically simpler. The member operators are defined in the class definitions.
cat <<EOF >>impatomic.h
#ifdef __cplusplus
EOF
echo impatomic.h methods ordinary basic
for TYPEKEY in bool address ${INTEGERS} ${CHARACTERS}
do
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h
inline bool atomic_${TYPEKEY}::lock_free() volatile
{ return false; }
inline void atomic_${TYPEKEY}::store
( ${TYPENAME} __m__, memory_order __x__ ) volatile
{ atomic_store_explicit( this, __m__, __x__ ); }
inline ${TYPENAME} atomic_${TYPEKEY}::load
( memory_order __x__ ) volatile
{ return atomic_load_explicit( this, __x__ ); }
inline ${TYPENAME} atomic_${TYPEKEY}::swap
( ${TYPENAME} __m__, memory_order __x__ ) volatile
{ return atomic_swap_explicit( this, __m__, __x__ ); }
inline bool atomic_${TYPEKEY}::compare_swap
( ${TYPENAME}& __e__, ${TYPENAME} __m__, memory_order __x__ ) volatile
{ return atomic_compare_swap_explicit( this, &__e__, __m__, __x__ ); }
EOF
done
echo impatomic.h methods template basic
cat <<EOF >>impatomic.h
template< typename T >
inline bool atomic<T>::lock_free() volatile
{ return false; }
template< typename T >
inline void atomic<T>::store( T __v__, memory_order __x__ ) volatile
{ atomic_store_explicit( this, __v__, __x__ ); }
template< typename T >
inline T atomic<T>::load( memory_order __x__ ) volatile
{ return atomic_load_explicit( this, __x__ ); }
template< typename T >
inline T atomic<T>::swap( T __v__, memory_order __x__ ) volatile
{ return atomic_swap_explicit( this, __v__, __x__ ); }
template< typename T >
inline bool atomic<T>::compare_swap
( T& __r__, T __v__, memory_order __x__ ) volatile
{ return atomic_compare_swap_explicit( this, &__r__, __v__, __x__ ); }
EOF
echo impatomic.h methods address fetch
TYPEKEY=address
TYPENAME=${!TYPEKEY}
cat <<EOF >>impatomic.h
inline void* atomic_address::fetch_add
( ptrdiff_t __m__, memory_order __x__ ) volatile
{ return atomic_fetch_add_explicit( this, __m__, __x__ ); }
inline void* atomic_address::fetch_sub
( ptrdiff_t __m__, memory_order __x__ ) volatile
{ return atomic_fetch_sub_explicit( this, __m__, __x__ ); }
EOF
echo impatomic.h methods pointer fetch
cat <<EOF >>impatomic.h
template< typename T >
T* atomic<T*>::fetch_add( ptrdiff_t __v__, memory_order __x__ ) volatile
{ return atomic_fetch_add_explicit( this, sizeof(T) * __v__, __x__ ); }
template< typename T >
T* atomic<T*>::fetch_sub( ptrdiff_t __v__, memory_order __x__ ) volatile
{ return atomic_fetch_sub_explicit( this, sizeof(T) * __v__, __x__ ); }
EOF
cat <<EOF >>impatomic.h
#endif
EOF
echo impatomic.h close namespace
cat <<EOF >>impatomic.h
#ifdef __cplusplus
} // namespace std
#endif
EOF
The C standard header.
echo stdatomic.h
cat <<EOF >stdatomic.h
#include "impatomic.h"
#ifdef __cplusplus
EOF
for TYPEKEY in flag bool address ${INTEGERS} ${CHARACTERS}
do
cat <<EOF >>stdatomic.h
using std::atomic_${TYPEKEY};
EOF
done
cat <<EOF >>stdatomic.h
using std::atomic;
using std::memory_order;
using std::memory_order_relaxed;
using std::memory_order_acquire;
using std::memory_order_release;
using std::memory_order_acq_rel;
using std::memory_order_seq_cst;
#endif
EOF
The C++ standard header.
echo cstdatomic
cat <<EOF >cstdatomic
#include "impatomic.h"
EOF
The following program shows example uses of the atomic types. These examples also serve as tests for the interface definition.
echo n2324.c include
cat <<EOF >n2324.c
#include "stdatomic.h"
EOF
We show two uses, global functions with explicit memory ordering and member functions with implicit memory ordering.
echo n2324.c flag
cat <<EOF >>n2324.c
atomic_flag af = ATOMIC_FLAG_INIT;
void flag_example( void )
{
    if ( ! atomic_flag_test_and_set_explicit( &af, memory_order_acquire ) )
        atomic_flag_clear_explicit( &af, memory_order_release );
#ifdef __cplusplus
    if ( ! af.test_and_set() )
        af.clear();
#endif
}
EOF
For lazy initialization, a thread that does not do initialization may need to wait on the thread that does. (Lazy initialization is similar to double-checked locking.) For this example, we busy wait on a boolean. Busy waiting like this is usually ill-advised, but it sufficies for the example. There are three variants of the example: one using strong C++ operators and methods, one using weak C functions, and one using fence-based C++ operators and methods.
echo n2324.c lazy
cat <<EOF >>n2324.c
atomic_bool lazy_ready = { false };
atomic_bool lazy_assigned = { false };
int lazy_value;
#ifdef __cplusplus
int lazy_example_strong_cpp( void )
{
    if ( ! lazy_ready ) {
        /* the value is not yet ready */
        if ( lazy_assigned.swap( true ) ) {
            /* initialization assigned to another thread; wait */
            while ( ! lazy_ready );
        }
        else {
            lazy_value = 42;
            lazy_ready = true;
        }
    }
    return lazy_value;
}
#endif
int lazy_example_weak_c( void )
{
    if ( ! atomic_load_explicit( &lazy_ready, memory_order_acquire ) ) {
        if ( atomic_swap_explicit( &lazy_assigned, true,
                                   memory_order_relaxed ) ) {
            while ( ! atomic_load_explicit( &lazy_ready,
                                            memory_order_acquire ) );
        }
        else {
            lazy_value = 42;
            atomic_store_explicit( &lazy_ready, true, memory_order_release );
        }
    }
    return lazy_value;
}
#ifdef __cplusplus
int lazy_example_fence_cpp( void )
{
    if ( lazy_ready.load( memory_order_relaxed ) )
        lazy_ready.fence( memory_order_acquire );
    else if ( lazy_assigned.swap( true, memory_order_relaxed ) )
        while ( ! lazy_ready.load( memory_order_relaxed ) );
    else {
        lazy_value = 42;
        lazy_ready.store( true, memory_order_release );
    }
    return lazy_value;
}
#endif
EOF
echo n2324.c integer
cat <<EOF >>n2324.c
atomic_ulong volatile aulv = { 0 };
atomic_ulong auln = { 1 };
#ifdef __cplusplus
atomic< unsigned long > taul CPP0X( { 3 } );
#endif
void integer_example( void )
{
    atomic_ulong a = { 3 };
    unsigned long x = atomic_load_explicit( &auln, memory_order_acquire );
    atomic_store_explicit( &aulv, x, memory_order_release );
    unsigned long y = atomic_fetch_add_explicit( &aulv, 1,
                                                 memory_order_relaxed );
    unsigned long z = atomic_fetch_xor( &auln, 4 );
#ifdef __cplusplus
    x = auln;
    aulv = x;
    auln += 1;
    aulv ^= 4;
    // auln = aulv; // uses a deleted operator
    aulv -= auln++;
    auln |= --aulv;
    aulv &= 7;
    atomic_store_explicit( &taul, 7, memory_order_release );
    x = taul.load( memory_order_acquire);
    y = atomic_fetch_add_explicit( & taul, 1, memory_order_acquire );
    z = atomic_fetch_xor( & taul, 4 );
    x = taul;
    // auln = taul; // uses a deleted operator
    // taul = aulv; // uses a deleted operator
    taul = x;
    taul += 1;
    taul ^= 4;
    taul -= taul++;
    taul |= --taul;
    taul &= 7;
#endif
}
EOF
An event counter is not part of the communication between threads, and so it can use faster primitives.
echo n2324.c event
cat <<EOF >>n2324.c
#ifdef __cplusplus
struct event_counter
{
    void inc() { au.fetch_add( 1, memory_order_relaxed ); }
    unsigned long get() { au.load( memory_order_relaxed ); }
    atomic_ulong au;
};
event_counter ec = { 0 };
void generate_events()
{
    ec.inc();
    ec.inc();
    ec.inc();
}
int read_events()
{
    return ec.get();
}
int event_example()
{
    generate_events(); // possibly in multiple threads
    // join all other threads, ensuring that final value is written
    return read_events();
}
#endif
EOF
An important point here is that this is safe, and we are guaranteed to see exactly the final value, because the thread joins force the necessary ordering between the inc calls and the get call.
The insertion into a shared linked list can be accomplished in a lock-free manner with compare-and-swap, provided that compare-and-swap is lock-free. (Note that adding a correct "remove" operation without a garbage collector is harder than it seems.)
echo n2324.c list
cat <<EOF >>n2324.c
#ifdef __cplusplus
struct data;
struct node
{
    node* next;
    data* value;
};
atomic< struct node* > head CPP0X( { (node*)0 } );
void list_example_strong( data* item )
{
    node* candidate = new node;
    candidate->value = item;
    candidate->next = head;
    while ( ! head.compare_swap( candidate->next, candidate ) );
}
void list_example_weak( struct data* item )
{
    node* candidate = new node;
    candidate->value = item;
    candidate->next = head.load( memory_order_relaxed );
    while ( ! head.compare_swap( candidate->next, candidate,
                                 memory_order_release ) );
}
#endif
EOF
The best algorithm for updating a variable may depend on whether or not atomics are lock-free. In the example below, this update can be accomplished in a lock-free manner with compare-and-swap when atomic scalars are lock-free, but may require other mechanisms when atomic scalars are not lock-free. This example uses the feature macro to generate minimal code when the lock-free status is known a priori.
echo n2324.c update
cat <<EOF >>n2324.c
#if ATOMIC_SCALAR_LOCK_FREE <= 1
atomic_flag pseudo_mutex = ATOMIC_FLAG_INIT;
unsigned long regular_variable = 1;
#endif
#if ATOMIC_SCALAR_LOCK_FREE >= 1
atomic_ulong atomic_variable = { 1 };
#endif
void update()
{
#if ATOMIC_SCALAR_LOCK_FREE == 1
    if ( atomic_lock_free( &atomic_variable ) ) {
#endif
#if ATOMIC_SCALAR_LOCK_FREE > 0
        unsigned long full = atomic_load( atomic_variable );
        unsigned long half = full / 2;
        while ( ! atomic_compare_swap( &atomic_variable, &full, half ) )
            half = full / 2;
#endif
#if ATOMIC_SCALAR_LOCK_FREE == 1
    } else {
#endif
#if ATOMIC_SCALAR_LOCK_FREE < 2
        __atomic_flag_wait__( &pseudo_mutex );
        regular_variable /= 2 ;
        atomic_flag_clear( &pseudo_mutex );
#endif
#if ATOMIC_SCALAR_LOCK_FREE == 1
    }
#endif
}
EOF
echo n2324.c main
cat <<EOF >>n2324.c
int main()
{
}
EOF