C++ Distributed Counters

ISO/IEC JTC1 SC22 WG21 N3706 - 2013-09-01

Lawrence Crowl, Lawrence@Crowl.org

Introduction
Solution
    General Methods
    Simplex Counters
    Counter Buffers
    Duplex Counters
    Weak Duplex Counters
    Buffering Brokers
    Counter Arrays
    Atomicity
    Guidelines For Use
Open Issues
Implementation
Synopsis
    counter::atomicity
    counter::bumper
    counter::simplex
    counter::buffer
    counter::strong_duplex
    counter::strong_broker
    counter::weak_duplex
    counter::weak_broker
    counter::bumper_array
    counter::simplex_array
    counter::buffer_array
    counter::strong_duplex_array
    counter::strong_broker_array
    counter::weak_duplex_array
    counter::weak_broker_array
Revision History
References

Introduction

Long-running multithreaded server programs may need to maintain many counters to aid in program diagnosis. As these counters are much more commonly incremented than read, we desire an implementation of counters that minimizes the cost of incrementing the counter, accepting increased costs to obtain the count.

Cilk adder-reducers are somewhat similar, except that the reduction is control-dependent and hence not suitable for external inspection of the count.

This proposal restricts itself to precise counters. Statistical counters, where the read count may be slightly different from the count operations, can produce better performance [Dice-Lev-Moir-2013].

Solution

We propose a set of coordinated counter types that collectively provide for distributed counting. The essential component of the design is a trade-off between the cost of incrementing the counter, the cost of loading (reading) the counter value, and the "lag" between an increment and its appearance in a "read" value. That is, the read of a counter value may not reflect the most recent increments. However, no count will be lost.

These counters are parameterized by the base integer type that maintains the count. Avoid situations that overflow the integer, as that may have undefined behavior. This constraint implies that counters must be sized to their use. (For dynamic mitigation, see the exchange operation below.)

General Methods

The general counter methods are as follows.

constructor( integer )

The parameter is the initial counter value.

constructor()

Equivalent to an initial value of zero.

void operator +=( integer )
void operator -=( integer )

Add/subtract a value to/from the counter. There is no default value. Note that these operations specifically do not return the count value, which would defeat the purpose of optimizing for increments over loads.

void operator ++() // prefix
void operator ++(int) // postfix
void operator --() // prefix
void operator --(int) // postfix

Increment or decrement the counter.

integer load()

Returns the value of the counter.

integer exchange( integer )

Replaces the existing count by the count in the parameter and returns the previous count, which enables safe concurrent count extraction. This operation has particular utility in long-running programs in occasionally draining the counter to prevent integer overflow.

There are no copy or assignment operations.

Simplex Counters

The simplex counters provide low-latency counting. They implement all the general operations.

counter::simplex<int> red_count;

void count_red( Bag bag ) {
    for ( auto i: bag )
        if ( i.is_red() )
            ++red_count;
}

Note that this code may have significant cost because of the repeated global atomic increments. Using a temporary int to maintain the count within the loop runs the risk of loosing some counts in the event of an exception.

Counter Buffers

The cost of incrementing the counter is reduced by using a buffer as a proxy for the counter. The counter is a reference parameter to the buffer. This buffer is typically accessed by a single thread, and so its cost can be substantially lower.

counter::simplex<int> red_count;

void count_red( Bag bag ) {
    counter::buffer<int> local_red( red_count );
    for ( auto i: bag )
        if ( i.is_red() )
            ++local_red;
}

The buffer will automatically transfer its count to the main counter on destruction. If this latency is too great, use the push method to transfer the count early.

void count_red( Bag bag1, Bag bag2 ) {
    counter::buffer<int> local_red( red_count );
    for ( auto i: bag1 )
        if ( i.is_red() )
            ++local_red;
    local_red.push();
    for ( auto i: bag2 )
        if ( i.is_red() )
            ++local_red;
}

Any increments on buffers since the last push will not be reflected in the value reported by a load of the counter. The destructor does an implicit push. The lifetime of the counter must be strictly larger than the lifetimes of any buffers attached to it.

Duplex Counters

The push model of buffers sometimes yields an unacceptable lag in the observed value of the count. To avoid this lag, there are duplex counters. A duplex counter is paired with a broker, the counter can query the broker for counts, thus maintaining a somewhat current view of the count.

counter::strong_duplex<int> red_count;

void count_red( Bag bag )
    counter::strong_broker broker( red_count );
    for ( auto i: bag )
        if ( i.is_red() )
            ++broker;
}

Another thread may call red_count.load() and get the current count. That operation will poll each broker for its count and return the sum. Naturally, any increments done to a broker after it is polled will be missed, but no counts will be lost.

The primary use case for duplex counters is to enable fast thread-local increments while still maintaining a decent global count.

weak_duplex<int> red_count;
thread_local counter::weak_broker<int> thread_red( red_count );

void count_red( Bag bag )
    for ( auto i: bag )
        if ( i.is_red() )
            ++thread_red;
}

Weak Duplex Counters

The exchange operation works by atomically transfering broker counts to the main counter, and then exchanging out of the main counter. Consequently, every count will be extracted by one and only one exchange operation.

However, that exchange operation can be expensive because it and the broker increment operations require write atomicity to the same broker object. To reduce that concurrency cost, the weak duplex counter and its weak broker do not provide the exchange operation. This difference means that the polling is a read-only operation and requires less synchronization. Use this counter when you do not intend to exchange values.

Buffering Brokers

Thread-local increments, while cheaper than shared global increments, are still more expensive than function-local increments. To mitigate that cost, buffers work with brokers as well.

counter::weak_duplex<int> red_count;
thread_local counter::weak_broker<int> thread_red( red_count );

void count_red( Bag bag )
    counter::buffer<int> local_red( thread_red );
    for ( auto i: bag )
        if ( i.is_red() )
            ++local_red;
}

As with buffers in general, the count transfers only on destruction a push operation. Consequently, red_count.load() will not reflect any counts in buffers. Those counts will not be lost, though.

Counter Arrays

Counter arrays provide a means to handle many counters with one name. The size of the counter arrays is fixed at construction time. Counter arrays have the same structure as single-value counters, with the following exceptions.

Increment a counter by indexing into its array and then applying one of the operations. E.g. ++ctr_ary[i].

The load and exchange operations take an additional index parameter.

Atomicity

In the course of program evolution, debugging and tuning, a counter may desire an implementation with weaker concurrency requirements. That is accomplished by explicitly specifing the atomicity. For example, suppose it is discovered that a counter

counter::simplex<int> red_count;

is only ever read and written from a single thread. We can avoid the cost of atomic operations by making the counter serial.

counter::simplex<int, counter::atomicity::none> red_count;

This approach preserves the programming interface.

There are several flavors of atomicity. These are specified as a (usually defaulted) template parameter.

none

Only one thread may access the counter. counter::atomicity::semi

semi

Multiple threads may load (read) the counter, but only one thread may adjust the counter. There is now also the issue of whether or not an adjustment has any memory ordering. The three potential orderings are relaxed, acquire/release, and sequentially consistent. (See 1.10 [intro.multithread] of the standard.) Of these, the safe default is sequentially consistent, though it has significant performance implications. There is some doubt as to whether sequentially consistent is required. (Note that acquire/release and sequentially consistent cannot apply to statistical counters [Dice-Lev-Moir-2013] as they work by avoiding atomic writes.) counter::atomicity::semi_relaxed counter::atomicity::semi_acq_rel counter::atomicity::semi_seq_cst

full

Multiple threads may both load and adjust the counter. The issue of memory ordering also applies. Further, there is the issue of contention. Under low contention, a simple atomic variable is sufficient. Under high contention, other implementation will perform better [Dice-Lev-Moir-2013]. Adaptive implementations may also be desireable. It is unclear which is best as a default, but most counters will likely have low contention, and hence that is a reasonable strawman default. counter::atomicity::full_relaxed_low counter::atomicity::full_relaxed_high counter::atomicity::full_relaxed_adapt counter::atomicity::full_acq_rel_low counter::atomicity::full_acq_rel_high counter::atomicity::full_acq_rel_adapt counter::atomicity::full_seq_cst_low counter::atomicity::full_seq_cst_high counter::atomicity::full_seq_cst_adapt

Buffers have two template parameters for atomicity, one for the atomicity of the counter it is modifying, and one for atomicity of the buffer itself. By exlicitly specifying this atomicity, one can build unusual configurations of buffers for unusual situations. For example, suppose increments of red_count tend to cluster in tasks with high processor affinity. By separating those counts with a global intermediate buffer, counting can exploit processor locality and avoid substantial inter-processor communication. For example,

counter::simplex<int> red_count;
counter::buffer<int, counter::atomicity::full_seq_cst_low,
                        counter::atomicity::full_seq_cst_low>
    red_squares( red_count );
counter::buffer<int, counter::atomicity::full_seq_cst_low,
                        counter::atomicity::full_seq_cst_low>
    red_circles( red_count );

void count_red_squares( Bag bag ) {
    counter::buffer<int> local_red( red_squares );
    for ( auto i: bag )
        if ( i.is_red() )
            ++local_red;
}

The red_squares variable is global, and will automatically transfer its count to red_count only on global variable destruction This transfer is likely to be too late for many purposes. One solution is explicit programming to call red_squares.push() at appropriate times. Another solution is to use duplex counters.

Guidelines For Use

Use a simplex counter when you have a low rate of updates or a high read rate or little tolerance for latency. Use a strong duplex counter and broker when your update rate is significantly higher than the load rate, you can tolerate latency in counting, and you need the exchange operation. Use a weak duplex counter and broker when your update rate is significantly higher than the load rate, you can tolerate latency in counting, but you do not need the exchange operation. Use buffers to collect short-term bursts of counts.

The operations of the counters, brokers, and buffers have the following rough costs.

simplex <default>strong duplexweak duplex
update atomic rmwatomic rmwatomic rmw
load atomic readmutex + n * atomic read mutex + n * atomic read
exchange atomic rmwmutex + n * atomic rmwn/a
construction trivialstd::setstd::set
destruction trivialstd::mutex + std::set std::mutex + std::set
buffer <default>strong brokerweak broker
update serial read & writeatomic rmw atomic read & write
construction pointer assignmutex + set.insert mutex + set.insert
destruction pointer assignmutex + set.remove mutex + set.remove

Open Issues

Do we want to change the names?

The meanings of the names are not readily apparent, though no concrete alternatives have been proposed.

What set of atomicity combinations are worth supporting?

It is not clear that the sequentially consistent forms are technically needed. It is not clear that the acquire/release forms are practically needed.

Do we want to initialize a counter array with an initializer list?
Do we want to return a dynarray for the load operation?
Do we want to pass and return a dynarray for the exchange operation?
There seems no pressing need, and adding this facility later should be binary compatible.

Implementation

An implementation is available from the Google Concurrency Library http://code.google.com/p/google-concurrency-library/ at .../source/browse/include/counter.h. At present, it only includes implementations for relaxed memory ordering and low contention.

The lowest implementation layer level of the organization is a bumper. Bumpers provide the interface of simplex counter, but only the increment and decrement interface is public. The rest are protected. Buffer constructors require a reference to a bumper. Simplex counters, buffers, duplex counters, and buffers are all derived from a bumper, which enables buffers to connect to all of them.

Synopsis

The following synopsis also includes some implementation detail. The base classes and class member variables are listed to provide an indication of implementation organization and cost.

The following definitions are within namespace counter within namespace std..

counter::atomicity

enum class atomicity
{
    none,
    semi_relaxed,
    semi_acq_rel,
    semi_seq_cst,
    full_relaxed_low,
    full_relaxed_high,
    full_relaxed_adapt,
    full_acq_rel_low,
    full_acq_rel_high,
    full_acq_rel_adapt,
    full_seq_cst_low,
    full_seq_cst_high,
    full_seq_cst_adapt
};

counter::bumper

The synopsis includes definitions for the bumper, which could be considered an implementation detail could be considered a visible part of the interface. The bumper is implemented as set of partial specializations, though this detail is ommitted for clarity in this paper.

template< typename Integral, atomicity Atomicity >
class bumper;
{
    bumper( const bumper& );
    bumper& operator=( const bumper& );
public:
    void operator +=( Integral by );
    void operator -=( Integral by );
    void operator ++();
    void operator ++(int);
    void operator --();
    void operator --(int);
protected:
    constexpr bumper( Integral in );
    constexpr bumper();
    Integral load();
    Integral exchange( Integral to );
    Integral value_;
    template< typename, atomicity >
    friend class bumper_array;
    template< typename, atomicity, atomicity >
    friend class buffer_array;
    friend class std::dynarray< bumper >;
};

counter::simplex

template< typename Integral,
          atomicity Atomicity = atomicity::full_seq_cst_low >
class simplex
: public bumper< Integral, Atomicity >
{
    typedef bumper< Integral, Atomicity > base_type;
public:
    constexpr simplex();
    constexpr simplex( Integral in );
    simplex( const simplex& );
    simplex& operator=( const simplex& );
    Integral load();
    Integral exchange( Integral to );
};

counter::buffer

template< typename Integral,
          atomicity PrimeAtomicity = atomicity::full_seq_cst_low,
          atomicity BufferAtomicity = atomicity::none >
class buffer
: public bumper< Integral, BufferAtomicity >
{
    typedef bumper< Integral, PrimeAtomicity > prime_type;
    typedef bumper< Integral, BufferAtomicity > base_type;
public:
    buffer();
    buffer( prime_type& p );
    buffer( const buffer& );
    buffer& operator=( const buffer& );
    void push();
    ~buffer();
private:
    prime_type& prime_;
};

counter::strong_duplex

template< typename Integral > class strong_duplex
: public bumper< Integral, atomicity::full_seq_cst_low >
{
    typedef bumper< Integral, atomicity::full_seq_cst_low > base_type;
    typedef strong_broker< Integral > broker_type;
    friend class strong_broker< Integral >;
public:
    strong_duplex();
    strong_duplex( Integral in );
    Integral load();
    Integral exchange( Integral to );
    ~strong_duplex();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};

counter::strong_broker

template< typename Integral > class strong_broker
: public bumper< Integral, atomicity::full_seq_cst_low >
{
    typedef bumper< Integral, atomicity::full_seq_cst_low > base_type;
    typedef strong_duplex< Integral > duplex_type;
    friend class strong_duplex< Integral >;
public:
    strong_broker( duplex_type& p );
    strong_broker();
    strong_broker( const strong_broker& );
    strong_broker& operator=( const strong_broker& );
    ~strong_broker();
private:
    Integral poll();
    Integral drain();
    duplex_type& prime_;
};

counter::weak_duplex

template< typename Integral > class weak_duplex
: public bumper< Integral, atomicity::full_seq_cst_low >
{
    typedef bumper< Integral, atomicity::full_seq_cst_low > base_type;
    typedef weak_broker< Integral > broker_type;
    friend class weak_broker< Integral >;
public:
    weak_duplex();
    weak_duplex( Integral in );
    weak_duplex( const weak_duplex& );
    weak_duplex& operator=( const weak_duplex& );
    Integral load();
    ~weak_duplex();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};

counter::weak_broker

template< typename Integral > class weak_broker
: public bumper< Integral, atomicity::semi >
{
    typedef bumper< Integral, atomicity::semi > base_type;
    typedef weak_duplex< Integral > duplex_type;
    friend class weak_duplex< Integral >;
public:
    weak_broker( duplex_type& p );
    weak_broker();
    weak_broker( const weak_broker& );
    weak_broker& operator=( const weak_broker& );
    ~weak_broker();
private:
    Integral poll();
    duplex_type& prime_;
};

counter::bumper_array

template< typename Integral,
          atomicity Atomicity = atomicity::full_seq_cst_low >
class bumper_array
{
public:
    typedef bumper< Integral, Atomicity > value_type;
private:
    typedef std::dynarray< value_type > storage_type;
public:
    typedef typename storage_type::size_type size_type;
    bumper_array();
    bumper_array( size_type size );
    bumper_array( const bumper_array& );
    bumper_array& operator=( const bumper_array& );
    value_type& operator[]( size_type idx );
    size_type size();
protected:
    Integral load( size_type idx );
    Integral exchange( size_type idx, Integral value );
private:
    storage_type storage;
};

counter::simplex_array

template< typename Integral,
          atomicity Atomicity = atomicity::full_seq_cst_low >
class simplex_array
: public bumper_array< Integral, Atomicity >
{
    typedef bumper_array< Integral, Atomicity > base_type;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    simplex_array();
    constexpr simplex_array( size_type size );
    simplex_array( const simplex_array& );
    simplex_array& operator=( const simplex_array& );
    Integral load( size_type idx );
    Integral exchange( size_type idx, Integral value );
    value_type& operator[]( int idx );
    size_type size();
};

counter::buffer_array

template< typename Integral,
          atomicity PrimeAtomicity = atomicity::full_seq_cst_low,
          atomicity BufferAtomicity = atomicity::full_seq_cst_low >
class buffer_array
: public bumper_array< Integral, BufferAtomicity >
{
    typedef bumper_array< Integral, BufferAtomicity > base_type;
    typedef bumper_array< Integral, PrimeAtomicity > prime_type;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    buffer_array();
    buffer_array( prime_type& p );
    buffer_array( const buffer_array& );
    buffer_array& operator=( const buffer_array& );
    void push( size_type idx );
    void push();
    size_type size();
    ~buffer_array();
private:
    prime_type& prime_;
};

counter::strong_duplex_array

template< typename Integral > class strong_duplex_array
: public bumper_array< Integral, atomicity::full_seq_cst_low >
{
    typedef bumper_array< Integral, atomicity::full_seq_cst_low > base_type;
    typedef strong_broker_array< Integral > broker_type;
    friend class strong_broker_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    strong_duplex_array();
    constexpr strong_duplex_array( size_type size )
     ;
    strong_duplex_array( const strong_duplex_array& );
    strong_duplex_array& operator=( const strong_duplex_array& );
    value_type& operator[]( int idx );
    size_type size();
    ~strong_duplex_array();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};

counter::strong_broker_array

template< typename Integral > class strong_broker_array
: public bumper_array< Integral, atomicity::semi >
{
    typedef bumper_array< Integral, atomicity::semi > base_type;
    typedef strong_duplex_array< Integral > duplex_type;
    friend class strong_duplex_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    strong_broker_array();
    strong_broker_array( duplex_type& p );
    strong_broker_array( const strong_broker_array& );
    strong_broker_array& operator=( const strong_broker_array& );
    size_type size();
    ~strong_broker_array();
private:
    Integral poll( size_type idx );
    Integral drain( size_type idx );
    duplex_type& prime_;
};

counter::weak_duplex_array

template< typename Integral > class weak_duplex_array
: public bumper_array< Integral, atomicity::full_seq_cst_low >
{
    typedef bumper_array< Integral, atomicity::full_seq_cst_low > base_type;
    typedef weak_broker_array< Integral > broker_type;
    friend class weak_broker_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    weak_duplex_array();
    constexpr weak_duplex_array( size_type size );
    weak_duplex_array( const weak_duplex_array& );
    weak_duplex_array& operator=( const weak_duplex_array& );
    value_type& operator[]( int idx );
    size_type size();
    ~weak_duplex_array();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};

counter::weak_broker_array

template< typename Integral > class weak_broker_array
: public bumper_array< Integral, atomicity::semi >
{
    typedef bumper_array< Integral, atomicity::semi > base_type;
    typedef weak_duplex_array< Integral > duplex_type;
    friend class weak_duplex_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    weak_broker_array();
    weak_broker_array( duplex_type& p );
    weak_broker_array( const weak_broker_array& );
    weak_broker_array& operator=( const weak_broker_array& );
    size_type size();
    ~weak_broker_array();
private:
    Integral poll( size_type idx );
    duplex_type& prime_;
};

Revision History

This paper revises ISO/IEC JTC1 SC22 WG21 N3355 = 12-0045 - 2012-01-14. Its changes are as follows:

References

[Cilk-Reducer]
"Reducers", http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/cref_cls/common/cilk_bk_reducer_intro.htm; "Intel Cilk Plus", http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/cref_cls/common/cilk_bk_using_cilk.htm; "Intel C++ Compiler XE 12.1 User and Reference Guides", http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/main/main_cover_title.htm
[Dice-Lev-Moir-2013]
"Scalable Statstics Counters"; Dave Dice, Yossi Lev, Mark Moir; SPAA'13, June 23-25, 2013, Montrèal Quèbec, Canada.
[Lev-Moir-2011]
"Lightweight Parallel Accumulators Using C++ Templates"; Yossi Lev, Mark Moir; ICSE'11, May 21-28, 2011, Waikiki, Honolulu, HI, USA