N1803: Atomic issues in DR423 and DR431

Author: Blaine Garst
Date: 2014-03-14
Version: 1

Summary

DR423 asks whether type generic macros should differentiate between atomic and the corresponding non-atomic type, and DR431 asks for guidance on 7.17.7.4 atomic_compare_exchange.

1.0 Background.

The memory model and 7.17 atomic library functions were introduced with just a few opaque structures defining atomic objects, as listed in 7.16.6. Constructive atomic types were introduced so in N1485 with a simple declarative syntax for sequentially consistent operations across the entire spectrum of what are and were existing C11 object types in a one to one correspondence. The 7.17 library generic functions were intentionally left unchanged since it was an explicit goal to not introduce new atomic semantics as much as extend their domain from the relatively few integer types named in 7.17.6, to that which could be constructed with the newly introduced _Atomic keyword (in two formulations).

The 7.17 library as originally introduced was intended to provide a bare minimum mechanism to access the newly introduced memory model, with the memory_order type describing in 7.17 the six variations of memory synchronization requirements in exacting detail.

2.0 lock free

Atomics were introduced (and in this respect left unmodified) with the concept that some types could be implemented in a lock-free manner, with a hard requirement 7.17.8p2 that the atomic_flag type be "lock free".

In practice, what this means is that some atomic types, including those named explicitly as initially introduced, such as atomic_ull, might not have direct support by the implementation cpu architecture.

There were two main implementation paths for non-lock-free objects known at the time of atomics introduction. Let us examine both.

2.1 The padding implementation of non-lock-free objects.

This solution co-allocates an extra byte (or more) of padding that would be used by the 7.17.7 atomic generic functions as a guard (as well as via syntax).

As an example, imagine an architecture with only native 8-bit atomic operations, and a 64-bit sized long long. One choice for an implementor would be to internally implement long long thusly:

typedef struct { unsigned lock[8], value[8]; } atomic_ull; unsigned long long atomic_ull_read(atomic_ull *arg, memory_order mo) { unsigned long long result; switch (mo) { memory_order_seq_cst: unsigned char desired = 1, expected = 0; while (!atomic_compare_exchange_strong(&arg.lock[0], &expected, &desired)) ; memcpy(&result, arg->value, 8); desired = 0; expected = 1; atomic_compare_exchange_strong(arg, &expected, &desired); break; } return result; }

where the implementation would synthesize a call to atomic_ull_read wherever a read of some atomic_ull object was encountered. (This technique is colloquially known as a software read barrier).

There would similarly be an atomic_ull_write function used for writes to atomic_ull objects. These types of functions are typically coded directly in the assembly language of the architecture for efficiency, and the C11 code illustrated above is a high level equivalent algorithm.

So, one choice for an implementation of a non-lock-free data structure is with extra padding byte(s) used as a lock in whatever form of an atomic compare and swap primitive the architecture provides.

2.2 Side table approach.

The well-known alternative is to implement a global table of locks keyed by the address of the object. The same read- and write- barrier technique of helper functions is again implemented, but the location of the lock is typically spread across several tables (typically a multiple of the number of cores) to decrease contention. Since this is a lock, a context switch can occur while holding the lock (as it can in the padding approach) but the contention is slightly greater depending on how many threads collide on the exact table entry.

The table could be of size one, meaning a global lock is acquired for a trivial non-performant yet correct implementation. (There can be no recursion)

The many table technique has been used, for example, in Apple's (open sourced) Objective-C runtime as an implementation of its Java-like @synchronize construct on an arbitrary data structure (any Objective-C object) that has no provision for padding.

2.3 Performance Comparison

The padding approach might be thought to lead to greater efficiency due to the co-location of the lock and the data. However, most architectures implement exclusion on a cache-line size basis, so the true memory cost of each padded structure for efficiency requires an equivalent _Alignas(cache-line-size) for each non-lock-free atomic object. Further, the cache synchronization costs when memory barriers are considered (for most memory_order choices) dominate even the memory costs, although with decreased pipeline sizes this cost is coming down.

There is, and was, and to my knowledge no proposed or actual implementation of the padding approach, and this was the case at the time of the introduction of atomics.

2.4 Costs of allowing a possible padding approach.

The requirement that a padding approach be allowable forces C11 atomics into a very unnatural and convoluted position within the standard, with syntactical position as a qualifier yet not the semantics (requiring such phrasing as "atomic, qualified, or unqualified" in many places).

2.4.1 DR423, do type generic macros select between atomic and non-atomic counterparts?

The correct answer has to be "yes" as long as the possibility exists that some implementation at some point somewhere at some point in the future chooses to use a padding implementation of atomics. Otherwise any such macro cannot be used portably to such a system.

2.4.2 DR431, atomic_compare_exchange of structures

First,

Suggested Technical Corrigendum

To 7.17.7.4 add a new note immediately after Note 1
Note 2: as comparison of non-atomic structures and unions is undefined behavior, so is comparison of atomic structures or unions.

There is strong sentiment to provide a somewhat trivial library implementation of, say, the atomic_compare_exchange generic function, such that memcmp and memcpy be used wherever possible. The idea is that, where possible, native instructions would implement the lock-free data types, and a single (secret) library implementation function would be used in a type generic way, simply by passing the appropriate addresses and size of the operands.

The padding approach would require that all padding bytes be kept in a known state lest a stray pad byte provide an incorrect comparison result. Implementations that have trap representations for a non-lock-free integer type would also need to be aware of this, if even possible.

So regardless of whether atomic_compare_exchange applies to optionally supported atomic structures (it can't unless regular structures can compare), there are stringent requirements further imposed on implementations that would choose to use padding bytes. This goes beyond the original design and intent of an easily implemented library and easily implemented compiler support for atomics.

3.0 What should we do?

There are two main choices. The simplest one is to say "yes" to DR423 and beware of padding bytes used for non-lock-free objects in DR431.

The better choice, as has always been my opinion, is to remove the possibility of a padding byte implementation of atomics. This realigns atomic to be no more and no less than another type qualifier, fitting naturally into the existing structure of the standard. The change is along the lines of

Suggested Technical Corrigendum
Replace 6.25 paragraph 27 with:
Further, there is the _Atomic qualifier. The presence of the _Atomic qualifier designates an atomic qualified type.
Further, remove "atomic" from the phrases "atomic, qualified or unqualified" throughout the standard, and replace all uses of "atomic type" with "atomic qualified type".

A more comprehensive list of edits would be required, but the notion that loads of atomic object already lose the "atomic" status, just as do "volatile", and it should lead to great simplification.

The answer to DR423 changes to "no", type generic macros would not select for the atomic qualified types, just as they do not for volatile qualified types. This fits one's intuition given the _Atomic keyword as a qualifier.

The answer for DR431 remains the same, where memcmp and memcpy can be used for all non-trap representations.

At the time of introduction, there was concern that a C++11 implementation that might want to use padding bytes would be constrained to not do so for interoperability with its often co-implemented C11 implementation. But as I have hopefully shown, even in the unlikely event that a padding implementation is chosen, a correct C11 co-implementation could simply be to always use the padded size, so a simple solution is available even in this unlikely scenario.

4.0 Conclusion

The 6.25p27 constraint that atomic types may have differing size and alignment is an entirely needless complication to the standard. It should be removed since it has not and is highly unlikely to be used. Side table approaches are superior, since the side table can be sized depending on a runtime check of the cache size.

There is likely to be a new DR regarding the treatment of atomic floating point types, and it may be the case that bitwise comparison using memcmp and memcyp is inappropriate, and if so, the correct answer would be to implement the type generic atomic library implementation internally using the new type generic macros which can easily select among more specialized "helper" implementations for floating point types. These helper functions would use floating point instructions directly after acquiring a side table based lock.