Document number: N2674=08-0184
Programming Language C++, Library Subgroup
 
Peter Dimov, <pdimov@pdimov.com>
Beman Dawes, <bdawes@acm.org>
 
2008-06-12

Shared_ptr atomic access, revision 1

Changes in revision 1

I. Overview

The level of thread safety offered by shared_ptr is the default for the standard library: const operations on the same instance can be performed concurrently by multiple threads, but updates require exclusive access. Some scenarios require stronger guarantees; there is considerable demand (on the Boost mailing lists and elsewhere) for an atomic shared_ptr, one that can withstand concurrent updates from two threads without synchronization. This is often motivated by the desire to port an existing scalable pattern that relies on atomic operations on raw pointers in combination with garbage collection.

We propose the addition of the following functions to <memory> to address this use case:

template<class T> bool atomic_is_lock_free( shared_ptr<T> const * p );

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p );
template<class T> shared_ptr<T> atomic_load_explicit( shared_ptr<T> const * p, memory_order mo );

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> void atomic_store_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> shared_ptr<T> atomic_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w );
template<class T> bool atomic_compare_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w, memory_order success, memory_order failure );

A typical scenario that takes advantage of this functionality — a "lock-free" reader/writer pattern — is outlined below:

shared_ptr<State> ps;

void reader()
{
    shared_ptr<State const> p = atomic_load( &ps );
    // ... use *p ...
}

// single writer case
void writer()
{
    shared_ptr<State> p( new State( *ps ) );
    // ... update *p to reflect the new information ...
    atomic_store( &ps, move( p ) );
}

// or, multiple writers case
void writer()
{
    shared_ptr<State> p = atomic_load( &ps );

    // q is declared outside the loop so that it is in scope in the loop
    // condition; on failure, atomic_compare_exchange loads the current
    // value of ps into p, and the loop retries with fresh data
    shared_ptr<State> q;

    do
    {
        q.reset( new State( *p ) );
        // ... update *q to reflect the new information ...
    }
    while( !atomic_compare_exchange( &ps, &p, move( q ) ) );
}

This feature extends the interface of shared_ptr in a backward-compatible way. We believe that it is a strong candidate for addition to the C++0x standard. It introduces no source- or binary compatibility issues.

The proposed additions have been implemented in Boost (with a slightly different syntax) and will be part of Boost 1.36. Performance tests have identified reader/writer scenarios where the above pattern achieves close to linear scalability, outperforming a read/write lock.

II. Rationale

Can the implementation be made lock-free? If not, is it still worth having?

There are several possible approaches that lead to a lock-free implementation: hazard pointers, k-CAS, software transactional memory. The challenge is to consistently outperform the spinlock baseline implementation, and this part is not easy; that is why it makes sense to provide the functionality as part of the standard library.

Yes, the spinlock implementation is still worth having. For certain tasks and workloads, it still scales better than mutexes or read/write locks. See the performance section.

Can this be moved to TR2? Is it essential to have it in the standard?

A spinlock-based implementation can be provided non-intrusively in a TR, but a lock-free implementation cannot, as it inherently requires access to shared_ptr internals. We do not want to rule out lock-free implementations, since they can be expected to scale better. The area is a subject of active research, and having a stable interface as a target will, we hope, encourage such implementations to emerge.

We do not know whether a lock-free implementation will need to change shared_ptr's internal details, but the conservative assumption is that it might, which could potentially require an ABI change unsuitable for a technical report.

Is the interface sufficient? Should shared_ptr maintain a version counter?

We believe that the interface is sufficient.

The need for a version counter is motivated by the read-upgrade-write pattern:

void maybe_writer()
{
    obtain a read lock;
    use object;

    if( object needs updating )
    {
        upgrade read lock to write lock;
        update object;
    }
}

which in shared_ptr terms looks like:

void maybe_writer()
{
    shared_ptr<State> p = atomic_load( &ps );

    shared_ptr<State> q;

    do
    {
        // ... use *p ...

        if( !needs_updating( *p ) ) break; // needs_updating is a placeholder

        q.reset( new State( *p ) );
        // ... update *q ...
    }
    while( !atomic_compare_exchange( &ps, &p, move( q ) ) );
}

The other possible use for a version counter is to avoid the ABA problem that is common for CAS-based algorithms. ABA cannot occur in our case because the storage for the object referenced by p cannot be reused while p is still alive.
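To make this concrete, here is a short sketch (our illustration, reusing ps and State from the overview):

void no_aba()
{
    shared_ptr<State> p = atomic_load( &ps ); // p pins the current State

    // From this point on, other threads may replace ps any number of times,
    // but the State that p references cannot be destroyed while p is alive,
    // so its storage cannot be recycled for a new State at the same address.
    // A subsequent successful atomic_compare_exchange( &ps, &p, ... ) thus
    // cannot be fooled by a different object at a reused address.
}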

Should an atomic pointer be provided as a separate type? As an std::atomic specialization?

An std::atomic< shared_ptr<> > specialization is not an appropriate way to provide this functionality. The template parameter of std::atomic<T> is required to have a trivial copy constructor and a trivial assignment operator, for good reason: calling user code from within the atomic operations is a recipe for disaster. Since the copy constructor and the assignment operator of shared_ptr are not trivial, it is not acceptable to instantiate std::atomic on it.
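To illustrate the hazard, consider what a lock-based fallback for such a specialization would have to do. The sketch below is hypothetical (locked_atomic is our name, not a proposed component):

#include <mutex>   // assuming the C++0x <mutex> facilities
#include <utility> // for std::move

template<class T> class locked_atomic // hypothetical, for illustration only
{
    T value_;
    std::mutex m_;

public:

    void store( T desired )
    {
        std::lock_guard<std::mutex> lk( m_ );

        // For a non-trivial T such as shared_ptr, the assignment below runs
        // user-visible code while the lock is held; releasing the old value
        // may invoke an arbitrary deleter, and if that code calls back into
        // this same object, the result is deadlock.
        value_ = std::move( desired );
    }
};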

Providing a separate type atomic_shared_ptr<T> is a legitimate alternative. We have instead chosen to propose additions to the existing shared_ptr interface for the following reasons:

Why do the functions take some of their parameters by value, rather than by const reference?

The implementation requires that the argument value is private to the function so that it can fiddle with its internal state without some other thread getting inconsistent results. A const reference implies that other threads are free to read the value.

If the argument is passed by const reference, the functions need to make their own private copies anyway. This is inefficient when the argument is a temporary, because it introduces an extra copy that is not needed; passing by value automatically elides that copy when the argument is an rvalue.
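For example (our illustration, with ps as in the overview), storing a freshly created State incurs no extra copy:

// The temporary is an rvalue; it is moved into atomic_store's by-value
// parameter, so no shared_ptr copy (and no reference count increment)
// takes place.
atomic_store( &ps, shared_ptr<State>( new State ) );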

III. Proposed Text

Chapter 20 edits

Add to the synopsis of <memory> in [memory] the following:

// shared_ptr atomic access:

template<class T> bool atomic_is_lock_free( shared_ptr<T> const * p );

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p );
template<class T> shared_ptr<T> atomic_load_explicit( shared_ptr<T> const * p, memory_order mo );

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> void atomic_store_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r );
template<class T> shared_ptr<T> atomic_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w );
template<class T> bool atomic_compare_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w, memory_order success, memory_order failure );

Add a new section [util.smartptr.shared.atomic]:

Concurrent access to a shared_ptr instance from multiple threads does not introduce a data race if the access is done exclusively via the functions in this section and the instance is passed as their first argument.

The meaning of the arguments of type memory_order is explained in 29.1/1 [atomics.order] and 29.1/2.

template<class T> bool atomic_is_lock_free( shared_ptr<T> const * p );

Returns: true if atomic access to *p is lock-free, false otherwise.

Throws: nothing.

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p );

Returns: atomic_load_explicit( p, memory_order_seq_cst ).

template<class T> shared_ptr<T> atomic_load_explicit( shared_ptr<T> const * p, memory_order mo );

Requires: mo shall not be memory_order_release or memory_order_acq_rel.

Returns: *p.

Throws: nothing.

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r );

Effects: atomic_store_explicit( p, r, memory_order_seq_cst ).

template<class T> void atomic_store_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

Requires: mo shall not be memory_order_acquire or memory_order_acq_rel.

Effects: p->swap( r ).

Throws: nothing.

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r );

Returns: atomic_exchange_explicit( p, r, memory_order_seq_cst ).

template<class T> shared_ptr<T> atomic_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> r, memory_order mo );

Effects: p->swap( r ).

Returns: the previous value of *p.

Throws: nothing.

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w );

Returns: atomic_compare_exchange_explicit( p, v, w, memory_order_seq_cst, memory_order_seq_cst ).

template<class T> bool atomic_compare_exchange_explicit( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w, memory_order success, memory_order failure );

Requires: failure shall not be memory_order_release, memory_order_acq_rel, or stronger than success.

Effects: If *p is equivalent to *v, assigns w to *p and has synchronization semantics corresponding to the value of success, otherwise assigns *p to *v and has synchronization semantics corresponding to the value of failure.

Returns: true if *p was equivalent to *v, false otherwise.

Throws: nothing.

Remarks: two shared_ptr instances are equivalent if they store the same pointer value and share ownership.
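For illustration only (this example is ours, not part of the proposed wording):

shared_ptr<int> a( new int( 1 ) );
shared_ptr<int> b( a );            // equivalent to a: same stored pointer, same ownership
shared_ptr<int> c( new int( 1 ) ); // not equivalent to a: different pointer value
shared_ptr<int> d( c, a.get() );   // not equivalent to a: stores the same pointer as a,
                                   //   but shares ownership with c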

Chapter 29 edits

Rename the swap member function to exchange throughout chapter 29.

Rename the compare_swap member function to compare_exchange throughout chapter 29.

Rename the atomic_swap function to atomic_exchange throughout chapter 29.

Rename the atomic_swap_explicit function to atomic_exchange_explicit throughout chapter 29.

Rename the atomic_compare_swap function to atomic_compare_exchange throughout chapter 29.

Rename the atomic_compare_swap_explicit function to atomic_compare_exchange_explicit throughout chapter 29.

IV. Implementability

A straightforward implementation would use a per-instance spinlock, obtained from a spinlock pool keyed on a hash of the address of the instance. In pseudocode, the functions would be implemented as follows:

template<class T> shared_ptr<T> atomic_load( shared_ptr<T> const * p )
{
    lock the spinlock for *p;
    shared_ptr<T> r( *p );
    unlock the spinlock for *p;
    return r;
}

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r )
{
    lock the spinlock for *p;
    p->swap( r );
    unlock the spinlock for *p;
}

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r )
{
    lock the spinlock for *p;
    p->swap( r );
    unlock the spinlock for *p;
    return r;
}

template<class T> bool atomic_compare_exchange( shared_ptr<T> * p, shared_ptr<T> * v, shared_ptr<T> w )
{
    lock the spinlock for *p;

    if( *p is equivalent to *v )
    {
        p->swap( w );
        unlock the spinlock for *p;
        return true;
    }
    else
    {
        shared_ptr<T> tmp( *p );
        unlock the spinlock for *p;
        tmp.swap( *v );
        return false;
    }
}

Note that the code carefully avoids destroying a non-empty shared_ptr instance while holding the spinlock, as this may lead to user code being called and present an opportunity for deadlock. It also limits the spinlock scope to the minimum necessary, keeping contention low.
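As an illustration of the spinlock pool mentioned above, here is a minimal sketch in C++0x terms (the identifiers are ours; the actual Boost implementation differs in detail):

#include <atomic>  // assuming the C++0x atomic facilities
#include <cstddef>
#include <memory>

class spinlock
{
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:

    void lock()
    {
        while( flag_.test_and_set( std::memory_order_acquire ) )
        {
            // spin; a production implementation would pause or yield here
        }
    }

    void unlock()
    {
        flag_.clear( std::memory_order_release );
    }
};

// Fixed-size pool of spinlocks, indexed by a hash of the instance address.
// Distinct shared_ptr instances may hash to the same lock; this costs only
// performance, never correctness.
inline spinlock & spinlock_for( void const * pv )
{
    static spinlock pool[ 41 ];
    return pool[ reinterpret_cast< std::size_t >( pv ) % 41 ];
}

template<class T> std::shared_ptr<T> atomic_load_sketch( std::shared_ptr<T> const * p )
{
    spinlock & sp = spinlock_for( p );

    sp.lock();
    std::shared_ptr<T> r( *p ); // the copy is made under the lock...
    sp.unlock();

    return r; // ...and returned (and possibly later destroyed) outside of it
}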

An implementation that closely follows this outline has already been committed to the Boost SVN and will be part of Boost release 1.36. The relevant files can be browsed online at:

V. Performance

We used sp_atomic_mt2_test.cpp to evaluate the performance of the reader-writer pattern from the overview. For comparison purposes we used synchronization with a boost::detail::lightweight_mutex, equivalent to CRITICAL_SECTION under Windows, and boost::shared_mutex, a reader/writer lock that allows multiple concurrent read locks and a single exclusive write lock.

The test is intended to emulate a server process that fulfills client read or write requests from a number of worker threads.

We picked the following test parameters:

The results, obtained on a non-hyperthreaded dual core Pentium D (two hardware threads) under Windows XP64, are shown in the following tables.

Time in seconds:

Primitive        1 thread    2 threads    4 threads
mutex               8.312       20.921       42.812
rwlock              8.437       23.390       46.468
atomic access       8.515        9.421       18.781

Operations per millisecond:

Primitive        1 thread    2 threads    4 threads
mutex                 120           96           93
rwlock                119           86           86
atomic access         117          212          213

It is clear that for this combination of parameters, the atomic access reader/writer pattern offers the best scalability.

Note that we make no claim that the atomic access approach unconditionally outperforms the others, merely that it may be best for certain scenarios that can be encountered in practice.


Thanks to Hans Boehm and Lawrence Crowl for reviewing this paper.

--end