ISO/IEC JTC1 SC22 WG21 N3425 = 12-0115 - 2012-09-20

Arch D. Robison, Intel Corp., (arch.robison@intel.com)

Anton Malakhov, Intel Corp., (anton.malakhov@intel.com)

Artur Laksberg, Microsoft Corp., (arturl@microsoft.com)

Concurrent Unordered Associative Containers for C++

Introduction
Interfaces
    Why Concurrent Erasure Is Not Supported
        Deferred Destruction
        Alternative: GC-Based Deferred Destruction
        Alternative: Copy-Based Interface
    Bucket Interface
        Alternative: Concurrency-Safe Bucket Interface
Synopsis
    Header <concurrent_unordered_map> synopsis
    Header <concurrent_unordered_set> synopsis
Concurrency Guarantees
    Concurrency-Safe Operations
    Non-Blocking
    Partial Linearizability
References

Introduction

We propose adding concurrent variants of the unordered associative containers that guarantee concurrency safety of insertion operations and lookup operations. By concurrency safety, we mean that concurrent execution of the operations does not introduce a data race. Though the existing specifications of STL unordered associative containers permit concurrency safety, they do not require it. We also propose allowing concurrency-safe erasure as an option.

Currently a programmer must serialize insertions with respect to themselves and other operations, even though often there is no logical conflict. For example, consider multiple threads that need to insert keys into a set. Logically, the resulting set should be independent of whether the insertions are concurrent or not. The proposed containers enable the insertions to proceed concurrently.
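To make the serialization burden concrete, here is a minimal sketch of what current code must do, using std::unordered_set guarded by a mutex. With the proposed concurrent_unordered_set, the lock would simply disappear:

```cpp
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

// Status quo: every insert must hold a lock, even though the resulting
// set is logically independent of how the insertions interleave.
std::unordered_set<int> insert_from_threads(int nthreads, int keys_per_thread) {
    std::unordered_set<int> s;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&s, &m, t, keys_per_thread] {
            for (int k = 0; k < keys_per_thread; ++k) {
                std::lock_guard<std::mutex> lock(m);  // serializes all inserts
                s.insert(t * keys_per_thread + k);
            }
        });
    }
    for (auto& w : workers) w.join();
    return s;
}
```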

The concurrent variants can be efficiently implemented using split-ordered lists [SplitOrder], an established technique for implementing lock-free hash tables. Versions of these concurrent variants have been shipping with Intel Threading Building Blocks and the Microsoft Parallel Patterns Library.

Interfaces

The interfaces of the concurrent variants are close to those of the existing unordered associative containers. Insertion, lookup, and traversal work just as they do for the existing containers, except that they are now permitted to operate concurrently. Traversal might skip some items that are inserted concurrently with the traversal. A later section on partial linearizability describes guarantees on how the concurrent operations interact.

The significant syntactic difference is that the prefix unsafe_ is added to some of the methods that are unsafe to invoke concurrently with mutating operations. These methods are:

	unsafe_erase
	unsafe_bucket_count, unsafe_max_bucket_count, unsafe_bucket_size, unsafe_bucket
	unsafe_begin, unsafe_end, unsafe_cbegin, unsafe_cend (the per-bucket forms)

The prefix serves as a reminder that these are not concurrency-safe.

There are a few unprefixed operations that are not concurrency safe:

	operator= (including assignment from an initializer_list)
	clear
	swap
	rehash, reserve
	max_load_factor(float) (the setter)

Prefixing these names seemed too far a departure from existing practice, and the wholesale nature of these operations made it easy to remember that they were not concurrency safe. A later section describes the concurrency guarantees in more detail.

Why Concurrent Erasure Is Not Supported

Concurrent erasure is not supported in the base proposal because of the difficulty (and overhead) of resolving what to do if one thread has looked up an item (and is holding a reference to it) while another thread is concurrently erasing the item.

There are several solutions to this problem, each with inherent costs. If the committee wishes to support concurrent erasure, we recommend the ``deferred destruction'' approach outlined in the next section, since it is non-blocking and offers the fewest surprises. However, it has a high performance cost, and therefore we recommend that it be an optional feature enabled by a template parameter to the container, to preserve ``pay as you go'' for cases not needing concurrent erasure.

Deferred Destruction

Deferred destruction enables safe concurrent erasure with non-blocking semantics. When an item is erased, it is unlinked from the container so that subsequent lookups cannot find it, but destruction of the item is deferred until no container iterator points to it.

A minor drawback of deferred destruction is that it permits a thread to continue operating on an item that is no longer in the table. However, the alternative of blocking erasure seems worse, since it can lead to deadlock. Clients wanting blocking erasure can add the necessary synchronization themselves, such as by embedding a mutex in each item. Note that the converse, providing a blocking container and making clients who need non-blocking semantics subtract synchronization, is not viable.
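The client-side approach mentioned above can be sketched as follows. The struct and helper names are purely illustrative, not part of the proposal; the point is that the synchronization lives in the item, so the container itself stays non-blocking:

```cpp
#include <mutex>
#include <string>

// An item type whose clients want blocking semantics: readers and the
// eraser synchronize through a mutex embedded in the item itself.
struct Item {
    std::string payload;
    std::mutex lock;
};

// A reader copies the payload while holding the item's lock.
std::string read_payload(Item& item) {
    std::lock_guard<std::mutex> guard(item.lock);
    return item.payload;
}

// Before erasing the item from the container, the client invalidates the
// payload under the same lock, so no reader observes a partially
// torn-down item.
void retire_payload(Item& item) {
    std::lock_guard<std::mutex> guard(item.lock);
    item.payload.clear();
}
```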

Any support for concurrency-safe erasure is not cheap, because it requires thread synchronization to avoid races. Deferred destruction requires, at a minimum, two atomic operations for the acquisition and release of an item. These operations can generate much coherence traffic since they require modifying a cache line. The cost is not pay-as-you-go since these modifications are required if concurrent erasure might happen, even if it never happens. Hence our recommendation is that the feature be enabled or disabled on a per-instantiation basis.
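The two atomic operations per use can be sketched like this (an illustrative reference-counting scheme, not the proposal's specification):

```cpp
#include <atomic>

// Deferred destruction via reference counting: each use of an item costs
// one atomic increment (acquire) and one atomic decrement (release), and
// these writes to the node's cache line generate coherence traffic even
// when no erasure ever happens.
struct Node {
    int value = 0;
    std::atomic<int> refs{1};  // the container's list holds one reference
};

void acquire(Node& n) {
    n.refs.fetch_add(1, std::memory_order_relaxed);
}

// Returns true when the caller dropped the last reference, i.e. the node
// may now actually be destroyed.
bool release(Node& n) {
    return n.refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```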

Alternative: GC-Based Deferred Destruction

In principle, deferred destruction could also be implemented using garbage collection instead of reference counting, and thus avoid the expensive atomic operations for acquisition and release of an item. Iterators pointing to items can be freely copied, and so the garbage collector would have to track all these copies.

If implementations relying on garbage collection are to be permitted, the specification of erase and consequent deferred destruction need to permit an item to remain alive after the last reference to it is removed, until the garbage collector can reclaim it.

Alternative: Copy-Based Interface

Another solution to the concurrent erasure problem is to depart from STL conventions and provide a copy-access interface. A lookup/insert operation would return a copy of an item instead of an abstract pointer (iterator) to it. A copy-based interface would require a traits parameter for specifying how to copy an item in a concurrency-safe way, since copy construction is not necessarily thread-safe for types of interest (e.g. instances of shared_ptr).
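As a rough sketch of what such an interface might look like (hypothetical names; a mutex stands in for the lock-free internals, and the default traits simply assume the value's copy constructor is safe to call under the internal synchronization):

```cpp
#include <mutex>
#include <optional>
#include <unordered_map>

// Default copy policy: plain copy construction. A client would substitute
// a concurrency-safe copy for types whose copy constructor is not
// thread-safe under the container's internal synchronization.
template <class T>
struct DefaultCopyTraits {
    static T copy(const T& x) { return x; }
};

// Hypothetical copy-access map: lookups return a copy of the mapped
// value rather than an iterator, so no reference can outlive a
// concurrent erase.
template <class Key, class T, class CopyTraits = DefaultCopyTraits<T>>
class CopyAccessMap {
    std::unordered_map<Key, T> rep_;
    std::mutex m_;  // stand-in for a lock-free implementation
public:
    std::optional<T> find(const Key& k) {
        std::lock_guard<std::mutex> g(m_);
        auto it = rep_.find(k);
        if (it == rep_.end()) return std::nullopt;
        return CopyTraits::copy(it->second);  // hand back a copy, not an iterator
    }
    bool insert(const Key& k, const T& v) {
        std::lock_guard<std::mutex> g(m_);
        return rep_.emplace(k, v).second;
    }
};
```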

The big advantage of a copy-access interface is that there is no need for expensive atomic operations to track acquisition/release of an item as in the deferred destruction approach. However, since the underlying data structure still involves links to things that might be concurrently erased, full avoidance of mutating atomic operations in the internal implementation would have to rely on advanced software techniques for safe memory reclamation (e.g. hazard pointers) or hardware support such as transactional memory.

Copy-based interfaces can also have significant performance advantages because the underlying implementation is free to repack items on the fly and store them densely, which is particularly helpful for single-word items.

The big drawback of a copy-based interface is that it is a significant departure from existing practice in the standard library. Thus we are proposing a non-copying interface.

Bucket Interface

Our bucket interface is weakly concurrency safe in the sense that concurrent execution of bucket operations with insert operations will not corrupt the container. We nonetheless mark these operations as unsafe_ because concurrent insertion can cause semantic surprises. For example, consider the following routine for dumping a concurrent unordered set.
template<typename Set>
void DumpSet(const Set& s) {
    using namespace std;
    for(typename Set::size_type i=0; i!=s.unsafe_bucket_count(); ++i) {
        cout << "bucket " << i << endl;  
        for( auto j=s.unsafe_begin(i); j!=s.unsafe_end(i); ++j)
            cout << *j << endl;
    }
}
If a concurrent insert happens while the dump is running, the dump may list some of the elements twice.

To see how this can happen, suppose that the inner loop has dumped each item in bucket 0 and is about to evaluate j!=s.unsafe_end(0). If a concurrent insert causes the number of buckets to double, from some value N to 2N, then bucket 0 splits into two buckets: bucket 0 and bucket N. The dumping routine will eventually dump bucket N, re-outputting items that were in bucket 0.

If the outer loop takes a snapshot of unsafe_bucket_count at the start and does not re-evaluate it, then the problem of duplication is replaced by the problem of omission, since a concurrent insertion might move items to buckets beyond those known at the point of the snapshot.

Alternative: Concurrency-Safe Bucket Interface

Split-order lists make it possible to provide a bucket inspection interface with stronger guarantees, though the interface must necessarily depart from the serial equivalent. Each method that looks up a bucket must know how many buckets there were originally. For example, suppose that bucket_count() returns N and afterwards insertions cause the number of buckets to increase. The original bucket boundaries are still embedded in the split-order list, but now there are new bucket boundaries within the original bucket. Knowing the original N suffices to find them.
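The underlying arithmetic can be illustrated with power-of-two bucket counts, as used by split-ordered lists. Growing from c to 2c buckets splits each bucket b into buckets b and b+c, and items never move anywhere else, so a reader that remembers the original count c can still enumerate exactly the original buckets:

```cpp
#include <cstddef>

// Bucket selection for a power-of-two bucket count: the low bits of the
// hash pick the bucket. Doubling the count adds one more low bit, so an
// item in bucket b either stays in b or moves to b + old_count.
std::size_t bucket_index(std::size_t hash, std::size_t bucket_count) {
    return hash & (bucket_count - 1);  // bucket_count must be a power of two
}
```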

Here is what the alternative interface would look like:

    // bucket interface
    size_type bucket_count() const noexcept;
    size_type max_bucket_count() const noexcept;
    size_type bucket_size(size_type n, size_type c);
    size_type bucket(const key_type& key, size_type c) const;

    local_iterator begin(size_type n, size_type c);
    const_local_iterator begin(size_type n, size_type c) const;
    local_iterator end(size_type n, size_type c);
    const_local_iterator end(size_type n, size_type c) const;
    const_local_iterator cbegin(size_type n, size_type c) const;
    const_local_iterator cend(size_type n, size_type c) const;

With this kind of interface, the following code does a reasonable dump of a concurrent unordered set in the presence of concurrent insert operations.
template<typename Set>
void DumpSet(const Set& s) {
    using namespace std;
    auto c = s.bucket_count();
    for(typename Set::size_type i=0; i!=c; ++i) {
        cout << "bucket " << i << endl;  
        for( auto j=s.begin(i,c); j!=s.end(i,c); ++j)
            cout << *j << endl;
    }
}
It is reasonable in the sense that all items already in the table are output exactly once. Any concurrently inserted items might or might not be output.

We seek the committee's opinion on whether the concurrency-safety gains of this alternative bucket interface over our baseline proposal are worth the cost, particularly since we believe the interface is used primarily for debugging. The costs are:

Synopsis

This section specifies the interfaces in synopsis form. Syntactic differences with the corresponding non-concurrent classes are underlined. To avoid tedium, only the synopses for concurrent_unordered_map and concurrent_unordered_set are shown explicitly.

The Intel implementations have some additional methods for recursively divisible ranges, which are useful for parallel traversal of the containers. These are omitted from the proposal since they make sense only within a larger framework for recursively divisible ranges.

Header <concurrent_unordered_map> synopsis

namespace std {

  template <class Key, 
            class T, 
            class Hash = hash<Key>, 
            class Pred = std::equal_to<Key>, 
            class Allocator = std::allocator<std::pair<const Key, T> > >
  class concurrent_unordered_map 
  {
public:
    // types
    typedef Key                                         key_type;
    typedef std::pair<const Key, T>                     value_type;
    typedef T                                           mapped_type;
    typedef Hash                                        hasher;
    typedef Pred                                        key_equal;
    typedef Allocator                                   allocator_type;
    typedef typename allocator_type::pointer            pointer;
    typedef typename allocator_type::const_pointer      const_pointer;
    typedef typename allocator_type::reference          reference;
    typedef typename allocator_type::const_reference    const_reference;
    typedef implementation-defined                      size_type;
    typedef implementation-defined                      difference_type;

    typedef implementation-defined                      iterator;
    typedef implementation-defined                      const_iterator;
    typedef implementation-defined                      local_iterator;
    typedef implementation-defined                      const_local_iterator;

    // construct/destroy/copy
    explicit concurrent_unordered_map(size_type n = see below, 
                                      const hasher& hf = hasher(),
                                      const key_equal& eql = key_equal(), 
                                      const allocator_type& a = allocator_type());

    template <typename Iterator>
      concurrent_unordered_map(Iterator first, Iterator last, 
                               size_type n = see below,
                               const hasher& hf = hasher(),
                               const key_equal& eql = key_equal(), 
                               const allocator_type& a = allocator_type());
    concurrent_unordered_map(const concurrent_unordered_map&);
    concurrent_unordered_map(concurrent_unordered_map&&);
    explicit concurrent_unordered_map(const Allocator&);
    concurrent_unordered_map(const concurrent_unordered_map&, const Allocator&);
    concurrent_unordered_map(concurrent_unordered_map&& table, const Allocator&);
    concurrent_unordered_map(initializer_list<value_type>,
      size_type = see below,
      const hasher& hf = hasher(),
      const key_equal& eql = key_equal(),
      const allocator_type& a = allocator_type());
    ~concurrent_unordered_map();
    concurrent_unordered_map& operator=(const concurrent_unordered_map&);
    concurrent_unordered_map& operator=(concurrent_unordered_map&&);
    concurrent_unordered_map& operator=(initializer_list<value_type>);
    allocator_type get_allocator() const noexcept;
    
    // size and capacity
    bool empty() const noexcept;
    size_type size() const noexcept;
    size_type max_size() const noexcept;
    
    // iterators
    iterator begin() noexcept;
    const_iterator begin() const noexcept;
    iterator end() noexcept;
    const_iterator end() const noexcept;
    const_iterator cbegin() const noexcept;
    const_iterator cend() const noexcept;

    // modifiers
    template <class... Args> pair<iterator, bool> emplace(Args&&... args);
    template <class... Args> iterator emplace_hint(const_iterator position, Args&&... args);
    pair<iterator, bool> insert(const value_type& obj);
    template <class P> pair<iterator, bool> insert(P&& obj);
    iterator insert(const_iterator hint, const value_type& obj);
    template <class P> iterator insert(const_iterator hint, P&& obj);
    template <class InputIterator> void insert(InputIterator first, InputIterator last);
    void insert(initializer_list<value_type>);
    
    iterator unsafe_erase(const_iterator position);
    size_type unsafe_erase(const key_type& key);
    iterator unsafe_erase(const_iterator first, const_iterator last);
    void clear() noexcept;

    void swap(concurrent_unordered_map&);

    // Observers
    hasher hash_function() const;
    key_equal key_eq() const;

    // lookup
    iterator find(const key_type& key);
    const_iterator find(const key_type& key) const;
    size_type count(const key_type& key) const;
    std::pair<iterator, iterator> equal_range(const key_type& key);
    std::pair<const_iterator, const_iterator> equal_range(const key_type& key) const;

    mapped_type& operator[](const key_type& key);
    mapped_type& operator[](key_type&& key);
    mapped_type& at(const key_type& key);
    const mapped_type& at(const key_type& key) const;

    // bucket interface 
    size_type unsafe_bucket_count() const noexcept;
    size_type unsafe_max_bucket_count() const noexcept;
    size_type unsafe_bucket_size(size_type n);
    size_type unsafe_bucket(const key_type& key) const;

    local_iterator unsafe_begin(size_type n);
    const_local_iterator unsafe_begin(size_type n) const;
    local_iterator unsafe_end(size_type n);
    const_local_iterator unsafe_end(size_type n) const;
    const_local_iterator unsafe_cbegin(size_type n) const;
    const_local_iterator unsafe_cend(size_type n) const;

    // hash policy
    float load_factor() const noexcept;
    float max_load_factor() const noexcept;
    void max_load_factor(float newmax);
    void rehash(size_type buckets);
    void reserve(size_type n);

  };

  template <class Key, class T, class Hash, class Pred, class Alloc>
  void swap(concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& x,
  concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& y);
  template <class Key, class T, class Hash, class Pred, class Alloc>
  bool operator==(const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& a,
  const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& b);
  template <class Key, class T, class Hash, class Pred, class Alloc>
  bool operator!=(const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& a,
  const concurrent_unordered_map<Key, T, Hash, Pred, Alloc>& b);
} 

The synopsis for concurrent_unordered_multimap is analogous, and omitted for brevity.

Header <concurrent_unordered_set> synopsis

namespace std {

  template <class Key, 
            class Hash = hash<Key>, 
            class Pred = std::equal_to<Key>, 
            class Allocator = std::allocator<Key> >
  class concurrent_unordered_set 
  {
public:
    // types
    typedef Key                                         key_type;
    typedef Key                                         value_type;
    typedef Hash                                        hasher;
    typedef Pred                                        key_equal;
    typedef Allocator                                   allocator_type;
    typedef typename allocator_type::pointer            pointer;
    typedef typename allocator_type::const_pointer      const_pointer;
    typedef typename allocator_type::reference          reference;
    typedef typename allocator_type::const_reference    const_reference;
    typedef implementation-defined                      size_type;
    typedef implementation-defined                      difference_type;

    typedef implementation-defined                      iterator;
    typedef implementation-defined                      const_iterator;
    typedef implementation-defined                      local_iterator;
    typedef implementation-defined                      const_local_iterator;

    // construct/destroy/copy
    explicit concurrent_unordered_set(size_type n = see below, 
                                      const hasher& hf = hasher(),
                                      const key_equal& eql = key_equal(), 
                                      const allocator_type& a = allocator_type());

    template <typename Iterator>
      concurrent_unordered_set(Iterator first, Iterator last, 
                               size_type n = see below,
                               const hasher& hf = hasher(),
                               const key_equal& eql = key_equal(), 
                               const allocator_type& a = allocator_type());
    concurrent_unordered_set(const concurrent_unordered_set&);
    concurrent_unordered_set(concurrent_unordered_set&&);
    explicit concurrent_unordered_set(const Allocator&);
    concurrent_unordered_set(const concurrent_unordered_set&, const Allocator&);
    concurrent_unordered_set(concurrent_unordered_set&& table, const Allocator&);
    concurrent_unordered_set(initializer_list<value_type>,
      size_type = see below,
      const hasher& hf = hasher(),
      const key_equal& eql = key_equal(),
      const allocator_type& a = allocator_type());
    ~concurrent_unordered_set();
    concurrent_unordered_set& operator=(const concurrent_unordered_set&);
    concurrent_unordered_set& operator=(concurrent_unordered_set&&);
    concurrent_unordered_set& operator=(initializer_list<value_type>);
    allocator_type get_allocator() const noexcept;
    
    // size and capacity
    bool empty() const noexcept;
    size_type size() const noexcept;
    size_type max_size() const noexcept;
    
    // iterators
    iterator begin() noexcept;
    const_iterator begin() const noexcept;
    iterator end() noexcept;
    const_iterator end() const noexcept;
    const_iterator cbegin() const noexcept;
    const_iterator cend() const noexcept;

    // modifiers
    template <class... Args> pair<iterator, bool> emplace(Args&&... args);
    template <class... Args> iterator emplace_hint(const_iterator position, Args&&... args);
    pair<iterator, bool> insert(const value_type& obj);
    template <class P> pair<iterator, bool> insert(P&& obj);
    iterator insert(const_iterator hint, const value_type& obj);
    template <class P> iterator insert(const_iterator hint, P&& obj);
    template <class InputIterator> void insert(InputIterator first, InputIterator last);
    void insert(initializer_list<value_type>);
    
    iterator unsafe_erase(const_iterator position);
    size_type unsafe_erase(const key_type& key);
    iterator unsafe_erase(const_iterator first, const_iterator last);
    void clear() noexcept;

    void swap(concurrent_unordered_set&);

    // Observers
    hasher hash_function() const;
    key_equal key_eq() const;

    // lookup
    iterator find(const key_type& key);
    const_iterator find(const key_type& key) const;
    size_type count(const key_type& key) const;
    std::pair<iterator, iterator> equal_range(const key_type& key);
    std::pair<const_iterator, const_iterator> equal_range(const key_type& key) const;

    // bucket interface 
    size_type unsafe_bucket_count() const noexcept;
    size_type unsafe_max_bucket_count() const noexcept;
    size_type unsafe_bucket_size(size_type n);
    size_type unsafe_bucket(const key_type& key) const;

    local_iterator unsafe_begin(size_type n);
    const_local_iterator unsafe_begin(size_type n) const;
    local_iterator unsafe_end(size_type n);
    const_local_iterator unsafe_end(size_type n) const;
    const_local_iterator unsafe_cbegin(size_type n) const;
    const_local_iterator unsafe_cend(size_type n) const;

    // hash policy
    float load_factor() const noexcept;
    float max_load_factor() const noexcept;
    void max_load_factor(float newmax);
    void rehash(size_type buckets);
    void reserve(size_type n);

  };

  template <class Key, class Hash, class Pred, class Alloc>
  void swap(concurrent_unordered_set<Key, Hash, Pred, Alloc>& x,
  concurrent_unordered_set<Key, Hash, Pred, Alloc>& y);
  template <class Key, class Hash, class Pred, class Alloc>
  bool operator==(const concurrent_unordered_set<Key, Hash, Pred, Alloc>& a,
  const concurrent_unordered_set<Key, Hash, Pred, Alloc>& b);
  template <class Key, class Hash, class Pred, class Alloc>
  bool operator!=(const concurrent_unordered_set<Key, Hash, Pred, Alloc>& a,
  const concurrent_unordered_set<Key, Hash, Pred, Alloc>& b);
} 

The synopsis for concurrent_unordered_multiset is analogous, and omitted for brevity.

Concurrency Guarantees

The following subsections describe concurrency guarantees for our concurrent unordered associative containers.

Concurrency-Safe Operations

For serialized execution, the operations behave the same as their current STL counterparts. The only change is allowance for concurrency. Executing any of the following operations concurrently on a concurrent unordered container does not introduce a data race:
	get_allocator
	empty, size, max_size
	begin, end, cbegin, cend
	insert
	find, count, equal_range, operator[], at
	load_factor
	max_load_factor() 
	operator==, operator!= 
assuming that the requisite operations on the key type (and mapped_type if applicable) are concurrency safe.

Non-Blocking

No concurrency-safe operation shall cause another concurrency-safe operation to block.

Partial Linearizability

Linearizability[Linearizability] is an important property that simplifies reasoning about concurrent objects. It means that each operation appears to take effect at some instant between its invocation and return. See the reference for a formal exposition. A fully linearizable implementation of our containers seems impractical, because it requires that both the content and size of the container appear to be updated in the same instant.

Hence we recommend a weaker guarantee where the container is split into two abstract parts, and require that each part be a linearizable object. The two parts are:

	the content, i.e. the collection of items held by the container
	the size, i.e. the item count reported by size()

An operation on the container behaves as if it operates on the content part first and then operates on the size part. For example, insert first attempts insertion of an item into the content part, and then updates the size part if the item was inserted.
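The ordering between the two parts can be sketched as follows. This is a single-lock model, not the proposed lock-free implementation; it only illustrates that each part is individually linearizable and that insert touches the content part before the size part:

```cpp
#include <atomic>
#include <mutex>
#include <unordered_set>

// Two-part model: content and size are each linearizable objects, and
// every operation acts on content first, then size.
class TwoPartSet {
    std::unordered_set<int> content_;  // stands in for the content part
    std::mutex content_mutex_;         // makes operations on content_ linearizable
    std::atomic<std::size_t> size_{0}; // the linearizable size part
public:
    bool insert(int key) {
        bool inserted;
        {
            std::lock_guard<std::mutex> lock(content_mutex_);
            inserted = content_.insert(key).second;  // content part first
        }
        if (inserted) size_.fetch_add(1);            // size part second
        return inserted;
    }
    std::size_t size() const { return size_.load(); }
    std::size_t count(int key) {
        std::lock_guard<std::mutex> lock(content_mutex_);
        return content_.count(key);
    }
};
```

Because the size part is updated last, any thread that observes the new size is guaranteed to find the corresponding item in the content part.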

The order of operations on the parts guarantees, in the absence of concurrent erasure, that if a thread inspects the size of the container, then that size is a lower bound on the number of items that it can find in the container. For example, let a be a freshly constructed concurrent associative container, and consider the following execution:

Thread 1                                          Thread 2
a.insert(t); // modify "content"; modify "size"   s = a.size();   // inspect "size"
                                                  c = a.count(t); // inspect "content"

Disallowed outcome: s==1 && c==0

Our requirement for partial linearizability minimizes surprises and delineates where they might occur. Each abstract part behaves in an intuitive manner, and surprises can exist only for properties relating both parts as a whole.

Because linearizability of the entire container is not guaranteed, it is possible for a thread to find an item and subsequently retrieve a count that does not yet include that item. For example, the following execution has four possible outcomes.

Thread 1                                          Thread 2
a.insert(t); // modify "content"; modify "size"   c = a.count(t); // inspect "content"
                                                  s = a.size();   // inspect "size"

Allowed outcomes: c∈{0,1}, s∈{0,1}
The outcome c==1 && s==0 is allowed here. It would be prohibited by a fully linearizable implementation of container a because it would not correspond to an outcome allowed by any serial interleaving of the operations.

In the absence of concurrent erasure, the comparison operations operator== and operator!= on concurrent containers are linearizable. Though these comparison operations were omitted from the TBB/PPL products, they should be straightforward to implement by walking the underlying split-ordered lists. Indeed these walks are probably faster than the bucket-by-bucket comparison needed for their non-concurrent counterparts.

References

[SplitOrder]
Ori Shalev and Nir Shavit. 2006. ``Split-ordered lists: Lock-free extensible hash tables.'' J. ACM 53, 3 (May 2006), 379-405. DOI=10.1145/1147954.1147958.
[Linearizability]
Maurice P. Herlihy and Jeannette M. Wing. 1990. ``Linearizability: a correctness condition for concurrent objects.'' ACM Trans. Program. Lang. Syst. 12, 3 (July 1990), 463-492. DOI=10.1145/78969.78972