ISO/IEC JTC1 SC22 WG21 N3706 - 2013-09-01
Lawrence Crowl, Lawrence@Crowl.org
Introduction
Solution
General Methods
Simplex Counters
Counter Buffers
Duplex Counters
Weak Duplex Counters
Buffering Brokers
Counter Arrays
Atomicity
Guidelines For Use
Open Issues
Implementation
Synopsis
counter::atomicity
counter::bumper
counter::simplex
counter::buffer
counter::strong_duplex
counter::strong_broker
counter::weak_duplex
counter::weak_broker
counter::bumper_array
counter::simplex_array
counter::buffer_array
counter::strong_duplex_array
counter::strong_broker_array
counter::weak_duplex_array
counter::weak_broker_array
Revision History
References
Long-running multithreaded server programs may need to maintain many counters to aid in program diagnosis. As these counters are much more commonly incremented than read, we desire an implementation of counters that minimizes the cost of incrementing the counter, accepting increased costs to obtain the count.
Cilk adder-reducers are somewhat similar, except that the reduction is control-dependent and hence not suitable for external inspection of the count.
This proposal restricts itself to precise counters. Statistical counters, where the value read may differ slightly from the number of counting operations performed, can provide better performance [Dice-Lev-Moir-2013].
We propose a set of coordinated counter types that collectively provide for distributed counting. The essential component of the design is a trade-off between the cost of incrementing the counter, the cost of loading (reading) the counter value, and the "lag" between an increment and its appearance in a "read" value. That is, the read of a counter value may not reflect the most recent increments. However, no count will be lost.
These counters are parameterized by the base integer type
that maintains the count.
Avoid situations that overflow the integer,
as that may have undefined behavior.
This constraint implies that counters must be sized to their use.
(For dynamic mitigation, see the exchange
operation below.)
The general counter methods are as follows.
constructor( integer )
The parameter is the initial counter value.
constructor()
Equivalent to an initial value of zero.
void operator +=( integer )
void operator -=( integer )
Add/subtract a value to/from the counter. There is no default value. Note that these operations specifically do not return the count value, which would defeat the purpose of optimizing for increments over loads.
void operator ++() // prefix
void operator ++(int) // postfix
void operator --() // prefix
void operator --(int) // postfix
Increment or decrement the counter.
integer load()
Returns the value of the counter.
integer exchange( integer )
Replaces the existing count by the count in the parameter and returns the previous count, which enables safe concurrent count extraction. This operation has particular utility in long-running programs in occasionally draining the counter to prevent integer overflow.
There are no copy or assignment operations.
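As an illustration (not proposed wording), the following sketch assumes the counter::simplex type described below and a hypothetical log_value reporting function; a long-running program can drain the counter periodically with exchange to keep it well away from overflow.

counter::simplex<long> request_count;   // default-constructed, initial value zero

void on_request() {
    ++request_count;                    // increment; no value is returned
}

void periodic_report() {
    // exchange atomically takes the accumulated count and resets it to zero,
    // so each count is reported exactly once
    long batch = request_count.exchange( 0 );
    log_value( batch );                 // hypothetical reporting function
}

void spot_check() {
    log_value( request_count.load() );  // read the current count
}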
The simplex counters provide low-latency counting. They implement all the general operations.
counter::simplex<int> red_count;

void count_red( Bag bag ) {
    for ( auto i: bag )
        if ( i.is_red() )
            ++red_count;
}
Note that this code may have significant cost because of the repeated atomic increments to the global counter. Using a temporary int to maintain the count within the loop runs the risk of losing some counts in the event of an exception.
The cost of incrementing the counter is reduced by using a buffer as a proxy for the counter. The counter is passed by reference to the buffer's constructor. This buffer is typically accessed by a single thread, so its cost can be substantially lower.
counter::simplex<int> red_count;

void count_red( Bag bag ) {
    counter::buffer<int> local_red( red_count );
    for ( auto i: bag )
        if ( i.is_red() )
            ++local_red;
}
The buffer will automatically transfer its count to the main counter on destruction. If this latency is too great, use the push method to transfer the count early.
void count_red( Bag bag1, Bag bag2 ) {
    counter::buffer<int> local_red( red_count );
    for ( auto i: bag1 )
        if ( i.is_red() )
            ++local_red;
    local_red.push();
    for ( auto i: bag2 )
        if ( i.is_red() )
            ++local_red;
}
Any increments on buffers since the last push will not be reflected in the value reported by a load of the counter. The destructor does an implicit push. The lifetime of the counter must be strictly longer than the lifetimes of any buffers attached to it.
The push model of buffers sometimes yields an unacceptable lag in the observed value of the count. To avoid this lag, there are duplex counters. A duplex counter is paired with brokers; the counter can query each broker for its count, thus maintaining a somewhat current view of the count.
counter::strong_duplex<int> red_count;

void count_red( Bag bag ) {
    counter::strong_broker<int> broker( red_count );
    for ( auto i: bag )
        if ( i.is_red() )
            ++broker;
}
Another thread may call red_count.load()
and get the current count.
That operation will poll each broker for its count and return the sum.
Naturally, any increments done to a broker after it is polled will be missed,
but no counts will be lost.
The primary use case for duplex counters is to enable fast thread-local increments while still maintaining a decent global count.
counter::weak_duplex<int> red_count;
thread_local counter::weak_broker<int> thread_red( red_count );

void count_red( Bag bag ) {
    for ( auto i: bag )
        if ( i.is_red() )
            ++thread_red;
}
The exchange operation works by atomically transferring broker counts to the main counter, and then exchanging out of the main counter. Consequently, every count will be extracted by one and only one exchange operation.
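For illustration (not proposed wording), a sketch of such draining, assuming a strong duplex counter total_count with attached brokers and a hypothetical log_value reporting function.

counter::strong_duplex<long> total_count;

void drain_totals() {
    // atomically pulls the brokers' accumulated counts into the main
    // counter, then takes the whole total and leaves zero behind
    long batch = total_count.exchange( 0 );
    log_value( batch );   // hypothetical reporting function
}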
However, that exchange operation can be expensive because it and the broker increment operations require write atomicity to the same broker object. To reduce that concurrency cost, the weak duplex counter and its weak broker do not provide the exchange operation. This difference means that the polling is a read-only operation and requires less synchronization. Use this counter when you do not intend to exchange values.
Thread-local increments, while cheaper than shared global increments, are still more expensive than function-local increments. To mitigate that cost, buffers work with brokers as well.
counter::weak_duplex<int> red_count;
thread_local counter::weak_broker<int> thread_red( red_count );

void count_red( Bag bag ) {
    counter::buffer<int> local_red( thread_red );
    for ( auto i: bag )
        if ( i.is_red() )
            ++local_red;
}
As with buffers in general,
the count transfers only on destruction or a push operation.
Consequently, red_count.load()
will not reflect any counts in buffers.
Those counts will not be lost, though.
Counter arrays provide a means to handle many counters with one name. The size of the counter arrays is fixed at construction time. Counter arrays have the same structure as single-value counters, with the following exceptions.
Increment a counter by indexing into its array and then applying one of the operations, e.g. ++ctr_ary[i]. The load and exchange operations take an additional index parameter.
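For illustration (not proposed wording), a sketch of counter-array use assuming the simplex_array type from the synopsis and a hypothetical zero-based color_index accessor.

counter::simplex_array<int> color_counts( 3 );   // three counters under one name

void count_colors( Bag bag ) {
    for ( auto i: bag )
        ++color_counts[ i.color_index() ];       // index, then apply an operation
}

int red_total() {
    return color_counts.load( 0 );               // load takes the element index
}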
In the course of program evolution, debugging, and tuning, a counter may need an implementation with weaker concurrency requirements. That is accomplished by explicitly specifying the atomicity. For example, suppose it is discovered that a counter
counter::simplex<int> red_count;
is only ever read and written from a single thread. We can avoid the cost of atomic operations by making the counter serial.
counter::simplex<int, counter::atomicity::none> red_count;
This approach preserves the programming interface.
There are several flavors of atomicity. These are specified as a (usually defaulted) template parameter.
counter::atomicity::none
Only one thread may access the counter.
counter::atomicity::semi
Multiple threads may load (read) the counter,
but only one thread may adjust the counter.
There is now also the issue of whether or not
an adjustment has any memory ordering.
The three potential orderings are relaxed,
acquire/release, and sequentially consistent.
(See 1.10 [intro.multithread] of the standard.)
Of these, the safe default is sequentially consistent,
though it has significant performance implications.
There is some doubt as to whether sequentially consistent is required.
(Note that acquire/release and sequentially consistent
cannot apply to statistical counters
[Dice-Lev-Moir-2013]
as they work by avoiding atomic writes.)
counter::atomicity::semi_relaxed
counter::atomicity::semi_acq_rel
counter::atomicity::semi_seq_cst
counter::atomicity::full
Multiple threads may both load and adjust the counter.
The issue of memory ordering also applies.
Further, there is the issue of contention.
Under low contention, a simple atomic variable is sufficient.
Under high contention,
other implementations will perform better
[Dice-Lev-Moir-2013].
Adaptive implementations may also be desirable.
It is unclear which is best as a default,
but most counters will likely have low contention,
and hence that is a reasonable strawman default.
counter::atomicity::full_relaxed_low
counter::atomicity::full_relaxed_high
counter::atomicity::full_relaxed_adapt
counter::atomicity::full_acq_rel_low
counter::atomicity::full_acq_rel_high
counter::atomicity::full_acq_rel_adapt
counter::atomicity::full_seq_cst_low
counter::atomicity::full_seq_cst_high
counter::atomicity::full_seq_cst_adapt
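For illustration (not proposed wording), a counter that only one thread adjusts but several threads read could specify semi atomicity with relaxed ordering, keeping the same interface as the default.

counter::simplex<long, counter::atomicity::semi_relaxed> bytes_sent;

void record_send( long n ) {   // called only from the single adjusting thread
    bytes_sent += n;
}

long bytes_so_far() {          // may be called from any reading thread
    return bytes_sent.load();
}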
Buffers have two template parameters for atomicity, one for the atomicity of the counter being modified and one for the atomicity of the buffer itself. By explicitly specifying these atomicities, one can build unusual configurations of buffers for unusual situations. Suppose, for example, that increments of red_count tend to cluster in tasks with high processor affinity. By separating those counts with global intermediate buffers, counting can exploit processor locality and avoid substantial inter-processor communication. For example,
counter::simplex<int> red_count;
counter::buffer<int, counter::atomicity::full_seq_cst_low,
                     counter::atomicity::full_seq_cst_low>
    red_squares( red_count );
counter::buffer<int, counter::atomicity::full_seq_cst_low,
                     counter::atomicity::full_seq_cst_low>
    red_circles( red_count );

void count_red_squares( Bag bag ) {
    counter::buffer<int> local_red( red_squares );
    for ( auto i: bag )
        if ( i.is_red() )
            ++local_red;
}
The red_squares variable is global, and will automatically transfer its count to red_count only on global variable destruction. This transfer is likely to be too late for many purposes.
One solution is explicit programming to call red_squares.push()
at appropriate times.
Another solution is to use duplex counters.
Use a simplex counter when you have a low rate of updates, a high read rate, or little tolerance for latency.
Use a strong duplex counter and broker when your update rate is significantly higher than the load rate, you can tolerate latency in counting, and you need the exchange operation.
Use a weak duplex counter and broker when your update rate is significantly higher than the load rate, you can tolerate latency in counting, but you do not need the exchange operation.
Use buffers to collect short-term bursts of counts.
The operations of the counters, brokers, and buffers have the following rough costs.
|              | simplex <default>   | strong duplex           | weak duplex             |
|--------------|---------------------|-------------------------|-------------------------|
| update       | atomic rmw          | atomic rmw              | atomic rmw              |
| load         | atomic read         | mutex + n * atomic read | mutex + n * atomic read |
| exchange     | atomic rmw          | mutex + n * atomic rmw  | n/a                     |
| construction | trivial             | std::set                | std::set                |
| destruction  | trivial             | std::mutex + std::set   | std::mutex + std::set   |

|              | buffer <default>    | strong broker           | weak broker             |
|--------------|---------------------|-------------------------|-------------------------|
| update       | serial read & write | atomic rmw              | atomic read & write     |
| construction | pointer assign      | mutex + set.insert      | mutex + set.insert      |
| destruction  | pointer assign      | mutex + set.remove      | mutex + set.remove      |
The meanings of the names are not readily apparent, though no concrete alternatives have been proposed.
It is not clear that the sequentially consistent forms are technically needed. It is not clear that the acquire/release forms are practically needed.
An implementation is available from the Google Concurrency Library http://code.google.com/p/google-concurrency-library/ at .../source/browse/include/counter.h. At present, it only includes implementations for relaxed memory ordering and low contention.
The lowest level of the implementation's organization is a bumper. Bumpers provide the interface of a simplex counter, but only the increment and decrement operations are public; the rest are protected. Buffer constructors require a reference to a bumper. Simplex counters, buffers, duplex counters, and brokers are all derived from a bumper, which enables buffers to connect to all of them.
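For illustration (not proposed wording), a sketch of that layering using the types from the synopsis below; a buffer can attach to anything derived from a bumper, whether a top-level counter or a broker.

counter::simplex<int> direct_count;                   // a counter is a bumper
counter::strong_duplex<int> polled_count;
counter::strong_broker<int> broker( polled_count );   // a broker is also a bumper

void work() {
    counter::buffer<int> a( direct_count );   // buffer feeding a counter
    counter::buffer<int> b( broker );         // buffer feeding a broker
    ++a;
    ++b;
}   // both buffers push their counts on destruction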
The following synopsis also includes some implementation detail. The base classes and class member variables are listed to provide an indication of implementation organization and cost.
The following definitions are within namespace counter within namespace std.
counter::atomicity
enum class atomicity {
    none,
    semi_relaxed, semi_acq_rel, semi_seq_cst,
    full_relaxed_low, full_relaxed_high, full_relaxed_adapt,
    full_acq_rel_low, full_acq_rel_high, full_acq_rel_adapt,
    full_seq_cst_low, full_seq_cst_high, full_seq_cst_adapt
};
counter::bumper
The synopsis includes definitions for the bumper, which could be considered either an implementation detail or a visible part of the interface. The bumper is implemented as a set of partial specializations, though this detail is omitted for clarity in this paper.
template< typename Integral, atomicity Atomicity >
class bumper {
    bumper( const bumper& );
    bumper& operator=( const bumper& );
public:
    void operator +=( Integral by );
    void operator -=( Integral by );
    void operator ++();
    void operator ++(int);
    void operator --();
    void operator --(int);
protected:
    constexpr bumper( Integral in );
    constexpr bumper();
    Integral load();
    Integral exchange( Integral to );
    Integral value_;
    template< typename, atomicity > friend class bumper_array;
    template< typename, atomicity, atomicity > friend class buffer_array;
    friend class std::dynarray< bumper >;
};
counter::simplex
template< typename Integral,
          atomicity Atomicity = atomicity::full_seq_cst_low >
class simplex : public bumper< Integral, Atomicity > {
    typedef bumper< Integral, Atomicity > base_type;
public:
    constexpr simplex();
    constexpr simplex( Integral in );
    simplex( const simplex& );
    simplex& operator=( const simplex& );
    Integral load();
    Integral exchange( Integral to );
};
counter::buffer
template< typename Integral,
          atomicity PrimeAtomicity = atomicity::full_seq_cst_low,
          atomicity BufferAtomicity = atomicity::none >
class buffer : public bumper< Integral, BufferAtomicity > {
    typedef bumper< Integral, PrimeAtomicity > prime_type;
    typedef bumper< Integral, BufferAtomicity > base_type;
public:
    buffer();
    buffer( prime_type& p );
    buffer( const buffer& );
    buffer& operator=( const buffer& );
    void push();
    ~buffer();
private:
    prime_type& prime_;
};
counter::strong_duplex
template< typename Integral >
class strong_duplex : public bumper< Integral, atomicity::full_seq_cst_low > {
    typedef bumper< Integral, atomicity::full_seq_cst_low > base_type;
    typedef strong_broker< Integral > broker_type;
    friend class strong_broker< Integral >;
public:
    strong_duplex();
    strong_duplex( Integral in );
    Integral load();
    Integral exchange( Integral to );
    ~strong_duplex();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};
counter::strong_broker
template< typename Integral >
class strong_broker : public bumper< Integral, atomicity::full_seq_cst_low > {
    typedef bumper< Integral, atomicity::full_seq_cst_low > base_type;
    typedef strong_duplex< Integral > duplex_type;
    friend class strong_duplex< Integral >;
public:
    strong_broker( duplex_type& p );
    strong_broker();
    strong_broker( const strong_broker& );
    strong_broker& operator=( const strong_broker& );
    ~strong_broker();
private:
    Integral poll();
    Integral drain();
    duplex_type& prime_;
};
counter::weak_duplex
template< typename Integral >
class weak_duplex : public bumper< Integral, atomicity::full_seq_cst_low > {
    typedef bumper< Integral, atomicity::full_seq_cst_low > base_type;
    typedef weak_broker< Integral > broker_type;
    friend class weak_broker< Integral >;
public:
    weak_duplex();
    weak_duplex( Integral in );
    weak_duplex( const weak_duplex& );
    weak_duplex& operator=( const weak_duplex& );
    Integral load();
    ~weak_duplex();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};
counter::weak_broker
template< typename Integral >
class weak_broker : public bumper< Integral, atomicity::semi > {
    typedef bumper< Integral, atomicity::semi > base_type;
    typedef weak_duplex< Integral > duplex_type;
    friend class weak_duplex< Integral >;
public:
    weak_broker( duplex_type& p );
    weak_broker();
    weak_broker( const weak_broker& );
    weak_broker& operator=( const weak_broker& );
    ~weak_broker();
private:
    Integral poll();
    duplex_type& prime_;
};
counter::bumper_array
template< typename Integral,
          atomicity Atomicity = atomicity::full_seq_cst_low >
class bumper_array {
public:
    typedef bumper< Integral, Atomicity > value_type;
private:
    typedef std::dynarray< value_type > storage_type;
public:
    typedef typename storage_type::size_type size_type;
    bumper_array();
    bumper_array( size_type size );
    bumper_array( const bumper_array& );
    bumper_array& operator=( const bumper_array& );
    value_type& operator[]( size_type idx );
    size_type size();
protected:
    Integral load( size_type idx );
    Integral exchange( size_type idx, Integral value );
private:
    storage_type storage;
};
counter::simplex_array
template< typename Integral,
          atomicity Atomicity = atomicity::full_seq_cst_low >
class simplex_array : public bumper_array< Integral, Atomicity > {
    typedef bumper_array< Integral, Atomicity > base_type;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    simplex_array();
    constexpr simplex_array( size_type size );
    simplex_array( const simplex_array& );
    simplex_array& operator=( const simplex_array& );
    Integral load( size_type idx );
    Integral exchange( size_type idx, Integral value );
    value_type& operator[]( int idx );
    size_type size();
};
counter::buffer_array
template< typename Integral,
          atomicity PrimeAtomicity = atomicity::full_seq_cst_low,
          atomicity BufferAtomicity = atomicity::full_seq_cst_low >
class buffer_array : public bumper_array< Integral, BufferAtomicity > {
    typedef bumper_array< Integral, BufferAtomicity > base_type;
    typedef bumper_array< Integral, PrimeAtomicity > prime_type;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    buffer_array();
    buffer_array( prime_type& p );
    buffer_array( const buffer_array& );
    buffer_array& operator=( const buffer_array& );
    void push( size_type idx );
    void push();
    size_type size();
    ~buffer_array();
private:
    prime_type& prime_;
};
counter::strong_duplex_array
template< typename Integral >
class strong_duplex_array : public bumper_array< Integral, atomicity::full_seq_cst_low > {
    typedef bumper_array< Integral, atomicity::full_seq_cst_low > base_type;
    typedef strong_broker_array< Integral > broker_type;
    friend class strong_broker_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    strong_duplex_array();
    constexpr strong_duplex_array( size_type size );
    strong_duplex_array( const strong_duplex_array& );
    strong_duplex_array& operator=( const strong_duplex_array& );
    value_type& operator[]( int idx );
    size_type size();
    ~strong_duplex_array();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};
counter::strong_broker_array
template< typename Integral >
class strong_broker_array : public bumper_array< Integral, atomicity::semi > {
    typedef bumper_array< Integral, atomicity::semi > base_type;
    typedef strong_duplex_array< Integral > duplex_type;
    friend class strong_duplex_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    strong_broker_array();
    strong_broker_array( duplex_type& p );
    strong_broker_array( const strong_broker_array& );
    strong_broker_array& operator=( const strong_broker_array& );
    size_type size();
    ~strong_broker_array();
private:
    Integral poll( size_type idx );
    Integral drain( size_type idx );
    duplex_type& prime_;
};
counter::weak_duplex_array
template< typename Integral >
class weak_duplex_array : public bumper_array< Integral, atomicity::full_seq_cst_low > {
    typedef bumper_array< Integral, atomicity::full_seq_cst_low > base_type;
    typedef weak_broker_array< Integral > broker_type;
    friend class weak_broker_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    weak_duplex_array();
    constexpr weak_duplex_array( size_type size );
    weak_duplex_array( const weak_duplex_array& );
    weak_duplex_array& operator=( const weak_duplex_array& );
    value_type& operator[]( int idx );
    size_type size();
    ~weak_duplex_array();
private:
    void insert( broker_type* child );
    void erase( broker_type* child, Integral by );
    mutex serializer_;
    typedef std::unordered_set< broker_type* > set_type;
    set_type children_;
};
counter::weak_broker_array
template< typename Integral >
class weak_broker_array : public bumper_array< Integral, atomicity::semi > {
    typedef bumper_array< Integral, atomicity::semi > base_type;
    typedef weak_duplex_array< Integral > duplex_type;
    friend class weak_duplex_array< Integral >;
public:
    typedef typename base_type::value_type value_type;
    typedef typename base_type::size_type size_type;
    weak_broker_array();
    weak_broker_array( duplex_type& p );
    weak_broker_array( const weak_broker_array& );
    weak_broker_array& operator=( const weak_broker_array& );
    size_type size();
    ~weak_broker_array();
private:
    Integral poll( size_type idx );
    duplex_type& prime_;
};
This paper revises ISO/IEC JTC1 SC22 WG21 N3355 = 12-0045 - 2012-01-14. Its changes are as follows:
Add more context to the introduction.
Change named inc and dec functions to operators.
Place all declarations within namespace counter. The namespace serves to avoid redundancy in names.
Rename types for increased clarity.
Replace separate serial and atomic counter types with a single type templatized on the atomicity.
Add issues of memory order and contention to the atomicity parameter.
Add counter arrays.
Add open issues section.
Add implementation section.
Add revision history section.
Add references section.