Audience: LEWG, SG14
Document number: P0447R0
Date: 2016-10-16
Project: Introduction of std::colony to the standard library
Reply-to: Matthew Bentley <mattreecebentley@gmail.com>

Introduction of std::colony to the standard library

I. Introduction

This is an initial draft based on a positive response to functionality and performance of the reference implementation. Colony has at this point been receiving community feedback for the past 1.5 years from Boost, SG14 and individual developers, and on the basis of support from SG14, I am seeking indication of levels of interest for inclusion in the standard. At this stage I would also like to ask for feedback on the existing design/interface, and suggestions prior to proceeding to wording.

Sometimes while programming we come across situations where order is unimportant, but where data is heavily interlinked, iterated over frequently, and changing often. An example would be a central 'entity' class in most video game engines. These are 'has a'-style objects rather than 'is a'-style objects, which reference other shared resources such as sprites, sound fx and so on. Those resources are typically located in separate containers. The 'entities' themselves are in turn referenced by other structures such as quadtrees/octrees (for collision detection), level structures, and the like. When iterating over entities, one has to take into account the fact that they may be removed from the game at any time (for example, a wall gets destroyed and therefore no longer requires processing) and new entities inserted (for example, if a new enemy arrives during a level). While this is happening, the inter-linkages between entities, resource collections and superstructures like levels and quadtrees, must stay valid. The order of the entities and resources themselves within the containers is, typically, unimportant.

There are obviously many different ways of allowing objects to reference each other under C++, but the fastest are pointers, closely followed by indexes (for contiguous containers). Unfortunately the container which has the best iteration performance in the standard library, vector[1], also loses pointer validity to elements within it upon insertion, and also pointer/index validity upon erasure. This leads to sophisticated and restrictive workarounds when developers attempt to utilize vector under the above circumstances.

Colony is a specifically unordered-but-sortable templated data container which allows for fast iteration and direct pointer linkage, without losing pointer/iterator validity to non-erased/non-end() elements when insertions or erasures occur. It is it's own abstract data type, but is most similar to a "bag", "bucket-array" or "multiset" ADT, the central difference being the absence of key values. On the basis of a fully-developed reference implementation the following performance characteristics have been established:

Specifically, a colony has better overall performance than any standard library container when:

  1. Insertion order is unimportant
  2. Insertions and erasures to the container are occurring frequently in performance-critical code, and/or
  3. Pointers/iterators to non-erased/non-end() container elements must not be invalidated by insertion or erasure.

While the benchmarks[1] referenced in the appendix are a better tool for understanding the performance characteristics, the comparative speed characteristics for most-commonly-used operations in the above scenario as described are:

As explored in the benchmarks[1] there are some vector/deque modifications/workarounds which can outperform colony for iteration while maintaining pointer validity, but which have slower insertion and erasure speeds, and also typically have a cost to usability or memory requirements. When the ratio of insertions/erasures to iterations is greater than 1% insertions/erasures per 3600 full iterations over data, colony will outperform these workarounds. Colony's other advantages are the freeing and recycling of unused memory on-the-fly, and the guaranteed validity of pointers/references to non-erased elements regardless of insertion and erasure (which makes programming with containers of inter-related data structures much simpler). Iterators which do not point to end() or erased elements are also guaranteed to remain valid.

II. Motivation and Scope

While the conditions below are common across multiple domains, for the benefit of those unfamiliar with any specific scenarios, I will present the motivation with examples from game development. When working on game engines we are predominantly dealing with collections of data where:

  1. Elements within data collections refer to elements within other data collections (through a variety of methods - indices, pointers, etc). An example is a game entity referring to both a texture object and collision blocks, as well as sound data. These references must stay valid throughout the course of the game/level. For this reason, any container (or use of a container) which causes pointer or index invalidation can create difficulties or necessitate workarounds.
  2. Order is unimportant for the most part. The majority of data collections are simply iterated over, transformed, referred to and utilized with no regard to order.
  3. Erasing or otherwise removing or deactivating objects occurs frequently in-game and in realtime (though often erasures will be implemented to occur at the end of a frame due to multithreading concerns). An example could be destroying a wall, or a game enemy. For this reason methods of erasure which create strong performance penalties are avoided.
  4. Creating new objects and adding them into the gameworld on-the-fly is also common - for example, a tree which drops leaves every so often, or a quadtree.
  5. We don't always know in advance how many elements there will be in a container at the beginning of development, or even at the beginning of a level during playback. Genericized game engines in particular have to adapt to considerably different user requirements and scopes. For this reason extensible containers which can expand and contract in realtime are usually necessary.
  6. For performance reasons, memory storage which is more-or-less contiguous is preferred. Lists, vectors of pointers to dynamically-allocated objects, and maps as implemented in the standard library are unusable.
  7. Memory wastage is avoided, and in particular, any container which allocates upon initialisation tends to be avoided as this can incur purposeless memory and performance costs.

std::vector, in it's default state, does not meet these requirements due to:

  1. Poor (non-fill) singular insertion performance (regardless of position) due to the need for reallocation upon reaching capacity
  2. Poor erasure performance for large amounts of data (even with remove_if duration is highly variable)
  3. Insert invalidates pointers/iterators to all elements
  4. Erase invalidates pointers/iterators/indexes to all elements afer the erased element
  5. Requires single contiguous memory block (typically larger than cache lines) which increases chance of allocation failures for large amounts of data on insert

To meet these requirements, game developers tend to either (a) develop their own custom containers for given scenarios or (b) develop workarounds for the problems of vector. These workarounds are many and varied, but the most common are probably:

  1. Using a boolean flag (or similar) to indicate the inactivity of an object (as opposed to actually erasing from the vector). When erasing, one simply adjusts the boolean flag, and when iterating, items with the adjusted boolean flag are skipped. External elements refer to elements within the container via indexes rather than pointers (which can be invalidated upon insertion).

    Advantages: Fast erasure.
    Disadvantages: Slow to iterate due to branching.
  2. Utilizing a vector of data with a secondary vector of indexes. When erasing, the erasure occurs in the vector of indexes, not the vector of data, and when iterating, one iterates over the vector of indexes, then accessing the data from the vector of data via the index.

    Advantages: Faster iteration.
    Disadvantages: Erasure still incurs some reallocation cost, can increase jitter.
  3. Combining a swap-and-pop mechanism with some form of dereferenced lookup system to enable contiguous element iteration (sometimes known as a 'packed array', various other names: http://bitsquid.blogspot.ca/2011/09/managing-decoupling-part-4-id-lookup.html). In this case when erasing we swap the back element of the vector with the element being erased, then pop the (swapped) back element. When iterating over the data we simply iterate through the vector of elements. To maintain valid external references to elements, which may be swapped at any time, we must also maintain a vector of indexes (or similar) and update the index numbers which correspond with the element being erased and the element being moved. In addition external objects referring to elements within the container must store a pointer to the index for the element in question. There are many varied alternatives to this method. The method is more useful when the container's data is mostly being iterated over, with fewer external references to the individual elements.

    Advantages: Iteration is at standard vector speed.
    Disadvantages: Erase could be slow if objects are large and/or non-trivially copyable, making swap cost thereby large. All references to elements incur additional costs due to the dereferencing system.

All three techniques have the disadvantage of slow singular insertions, and the first two will also continually expand memory usage when erasing and inserting over periods of time. The third deals better with this scenario as it swaps from the back rather than leaving gaps in the elements vector, however it will not free memory if the number of elements expands then contracts. It will also suffer in performance if elements within the container are heavily referred to by external objects/elements in performance-critical situations, or if the elements are large and/or swap/copy is non-trivial in other ways.

Colony is an attempt to bring a more generic solution to this situation. It has the advantage of good iteration speed, a similar erasure speed to the boolean technique described above, and a singular insertion speed which is much faster than a vector's, similar to a good deque implementation's. It never causes pointer invalidation to non-erased elements during erasure or insertion and the memory locations of erased elements are either reused by subsequent insertions or released on-the-fly when the encapsulating memory block becomes empty of elements.

III. Impact On the Standard

This is a pure library addition, no changes are necessary to the standard asides from the introduction of the colony container.
The reference implementation of colony is available here as plf::colony: http://www.plflib.org/colony.htm

IV. Design Decisions

The key technical features of a colony are as follows:

The full list of abstract requirements to support these features are as follows:

To summarise and simplify we can say there are three necessary aspects to colony to make it function as it does, and which define any implementation:

  1. A multiple-memory-block based allocation pattern which allows for the fast removal of memory blocks when they become empty of elements (regardless of their location within the series of blocks), without element pointer invalidation.
  2. A skipfield to enable the skipping over of erased elements during iteration, which should not necessitate the use of branching code during iteration for performance reasons.
  3. A mechanism for reusing erased element locations upon subsequent insertions.

In the case of the reference implementation I utilized a chained-group allocation pattern (http://www.plflib.org/chained_group_allocation_pattern.htm) for the memory blocks; which is a doubly-linked intrusive list of nodes containing (a) memory blocks, (b) memory block metadata and (c) skipfields. The metadata in question includes information necessary for an iterator to iterate over colony elements, such as the last insertion point within the memory block, and other information useful to specific functions such as the total number of active elements in the node. This 'linked-list'-style pattern removes the possibility of non-O(1) operations when freeing empty memory blocks from the colony structure; as compared to a vector of pointers to memory blocks. A vector of pointers to memory blocks may however enable faster insertions while increasing implementation complexity. Comparitive benchmarks would be necessary to establish the overall optimal approach.

For the skipfield a boolean skipfield cannot be used:

Instead the reference implementation utilizes the jump-counting skipfield pattern (full reference paper available here: http://www.plflib.org/the_jump_counting_skipfield_pattern.pdf), a numeric pattern I developed which allows for O(1) time complexity iterator operations and performs better during iteration due to a lack of branching code. It accomplishes this by storing and modifying (during insertion and erasure) numbers corresponding to the number of elements in a run of erased elements, and simply adding them to the current iterator position during iteration instead of checking their state. This avoids the looping branching code necessary for iteration with, for example, a boolean skipfield.

Iteration performance of vector with boolean skipfield vs colony with boolean skipfield vs colony with jump-counting skipfield (GCC5):
test result graph

For the erased location re-use mechanism, the reference implementation uses a stripped-down custom internal stack class based on plf::stack (http://www.plflib.org/stack.htm - outperforms all other std:: containers in a stack context and across compilers). plf::stack uses the same chained-group memory allocation pattern as plf::colony ie. it is made up of a linked chain of 'groups', structs containing a memory block and metadata, with a growth factor of 2 for the memory blocks. When capacity of the current memory block is reached, a new group is created and linked to the current memory block.

Time to push all elements + read-and-pop all elements, plf::stack vs std::stack (std::deque) vs std::vector (GCC5):
test result graph

An alternative route of using a "free list" (explanation: http://bitsquid.blogspot.ca/2011/09/managing-decoupling-part-4-id-lookup.html) of erased elements for the re-use mechanism has been explored and the following issues identified:

  1. A colony element could be smaller in size than a pointer and thus a union with such would dramatically increase the amount of wasted space in circumstances with low numbers of erasures - moreso if the pointer type supplied by the allocator happens to be non-trivial.
  2. Given that a free list will jump between colony groups a lot, consolidating a free list after removing empty groups from a colony would typically result in a very slow operation filled with cache misses. In the context of jitter-sensitive domains such as game development this is unacceptable.

V. Technical Specifications

Colony meets the requirements of the C++ Container, AllocatorAwareContainer, and ReversibleContainer concepts.

For the most part the syntax and semantics of colony functions are very similar to all std:: c++ libraries. Formal description is as follows:

template <class T, class Allocator = std::allocator<T>, typename Skipfield_Type = unsigned short> class colony

T - the element type. In general T must meet the requirements of Erasable, CopyAssignable and CopyConstructible.
However, if emplace is utilized to insert elements into the colony, and no functions which involve copying or moving are utilized, T is only required to meet the requirements of Erasable. If move-insert is utilized instead of emplace, T must also meet the requirements of MoveConstructible .

Allocator - an allocator that is used to acquire memory to store the elements. The type must meet the requirements of Allocator. The behavior is undefined if Allocator::value_type is not the same as T.

Skipfield_Type - an unsigned integer type. This type is used to create the skipfield and the maximum size of element memory blocks is constrained by it's bit-depth due to the nature of a jump-counting skipfield. For example, unsigned short on most platforms is 16-bit and therefore constrains the size of individual memory blocks to a maximum of 65535 elements. unsigned short has been found to be the optimal type for performance based on benchmarking. However there may be some memory-constrained situations where element block allocations of more than 255 elements at a time would not be desirable. In these situations, unsigned char may be used for the skipfield instead, resulting in some additional memory usage saving for the skipfield itself. It is unlikely for there to be any circumstances which benefit from a skipfield bit-depth greater than unsigned short. If Skipfield_Type is not an unsigned integer type, behaviour is undefined.

Basic example of usage

#include <iostream>
#include "plf_colony.h"

int main(int argc, char **argv)
{
  plf::colony<int> i_colony;

  // Insert 100 ints:
  for (int i = 0; i != 100; ++i)
  {
    i_colony.insert(i);
  }

  // Erase half of them:
  for (plf::colony<int>::iterator it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    it = i_colony.erase(it);
  }

  // Total the remaining ints:
  int total = 0;

  for (plf::colony<int>::iterator it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    total += *it;
  }

  std::cout << "Total: " << total << std::endl;
  std::cin.get();
  return 0;
} 

Example demonstrating pointer stability

#include <iostream>
#include "plf_colony.h"

int main(int argc, char **argv)
{
  plf::colony<int> i_colony;
  plf::colony<int>::iterator it;
  plf::colony<int *> p_colony;
  plf::colony<int *>::iterator p_it;

  // Insert 100 ints to i_colony and pointers to those ints to p_colony:
  for (int i = 0; i != 100; ++i)
  {
    it = i_colony.insert(i);
    p_colony.insert(&(*it));
  }

  // Erase half of the ints:
  for (it = i_colony.begin(); it != i_colony.end(); ++it)
  {
    it = i_colony.erase(it);
  }

  // Erase half of the int pointers:
  for (p_it = p_colony.begin(); p_it != p_colony.end(); ++p_it)
  {
    p_it = p_colony.erase(p_it);
  }

  // Total the remaining ints via the pointer colony (pointers will still be valid even after insertions and erasures):
  int total = 0;

  for (p_it = p_colony.begin(); p_it != p_colony.end(); ++p_it)
  {
    total += *(*p_it);
  }

  std::cout << "Total: " << total << std::endl;
  
  if (total == 2500)
  {
    std::cout << "Pointers still valid!" << std::endl;
  }
  
  std::cin.get();
  return 0;
} 

Iterator Invalidation

All read-only operations, swap, std::swap Never
clear, reinitialize, operator = Always
reserve, shrink_to_fit Only if capacity is changed
change_group_sizes, change_minimum_group_size, change_maximum_group_size Only if supplied minimum group size is larger than smallest group in colony, or supplied maximum group size is smaller than largest group in colony.
erase Only for the erased element
insert, emplace If an iterator is == end() it may be invalidated by a subsequent insert/emplace in some cases. Otherwise no.

Member types

Member type Definition
value_type T
allocator_type Allocator
size_type Allocator::size_type (pre-c++11)
std::allocator_traits<Allocator>::size_type (post-c++11)
difference_type Allocator::difference_type (pre-c++11)
std::allocator_traits<Allocator>::difference_type (post-c++11)
reference Allocator::reference (pre-c++11)
value_type & (post-c++11)
const_reference Allocator::const_reference (pre-c++11)
const value_type & (post-c++11)
pointer Allocator::pointer (pre-c++11)
value_type & (post-c++11)
const_pointer Allocator::const_pointer (pre-c++11)
std::allocator_traits<Allocator>::const_pointer (post-c++11)
iterator BidirectionalIterator
const_iterator Constant BidirectionalIterator
reverse_iterator BidirectionalIterator
const_reverse_iterator Constant BidirectionalIterator

Iterators for a colony cannot be random access because any +, -, += and -= operators would be non-O(1), due to the need to iterate over a skipfield. However member overloads for the standard library functions advance(), next(), prev() and distance() are available in the reference implementation. These are significantly faster than O(n) in most cases.

Constructors

default explicit colony(const allocator_type &alloc = allocator_type())
fill explicit colony(const size_type n, const unsigned short min_group_size = 8, const unsigned short max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
explicit colony(const size_type n, const value_type &element, const unsigned short min_group_size = 8, const unsigned short max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
range template<typename InputIterator> colony(const InputIterator &first, const InputIterator &last, const unsigned short min_group_size = 8, const unsigned short max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())
copy colony(const colony &source)
move colony(colony &&source) noexcept (C++11 and upwards)
initializer list colony(const std::initializer_list<value_type> &element_list, const unsigned short min_group_size = 8, const unsigned short max_group_size = std::numeric_limits<Skipfield_type>::max(), const allocator_type &alloc = allocator_type())

Member functions

Insert

single element iterator insert (const value_type &val)
fill iterator insert (const size_type n, const value_type &val)
range template <class InputIterator> iterator insert (const InputIterator &first, const InputIterator &last)
move iterator insert (value_type&& val) (C++11 and upwards)
initializer list iterator insert (const std::initializer_list<value_type> &il)

Erase

single element iterator erase(const iterator &it)
range void erase(const iterator &first, const iterator &last)

Other functions

The full implementation details for the main three aspects of the reference implementation are too large to be included in this document, but are available externally below:

Details on the chained-group allocation pattern are here: http://www.plflib.org/chained_group_allocation_pattern.htm.
The 8900-word paper on the jump-counting skipfield pattern is available here: http://www.plflib.org/the_jump_counting_skipfield_pattern.pdf, and benchmarks versus boolean skipfields are presented here: http://www.plflib.org/skipfield_comparison.htm.
Lastly, plf::stack (on which the reference implementation's internal reduced stack implementation is based) is explained in detail here: http://www.plflib.org/stack.htm.

Further details on colony's reference implementation can be found here: http://www.plflib.org/colony.htm, including a FAQ.
The reference implementation of colony is available to download here: http://www.plflib.org/colony.htm#download or via the github repository: https://github.com/mattreecebentley/plf_colony.

VI. Acknowledgements

Thanks to Glen Fernandes and Ion Gaztanaga for restructuring advice, Robert Ramey for documentation advice, various Boost and SG14 members for support, suggestions and bug reports including Sean Middleditch, Patrice Roy and Guy Davidson. And that jerk from Lionhead for annoying me enough to force me to finally implement the jump-counting skipfield pattern. Lastly, thanks to Jon Blow for initial advice and Mike Acton for some positive influence.

VII. References

plf::colony benchmarks

As the benchmarks currently need to be updated prior to Issaquah to reflect the performance updates to the reference implementation over the past six months, only links can be provided at this stage:
Benchmarks for GCC can be found here: http://www.plflib.org/colony.htm#benchmarks, while benchmarks for MSVC can be found here: http://www.plflib.org/colony_benchmark_msvc_results.htm

Cppcon 2016 Talk

An explanatory talk about Colony and the various approaches involved was given at Cppcon 2016 in Bellevue, Seattle:

Note that the benchmarks presented in this lecture do not reflect current colony performance, which has improved considerably since.

The slides for the talk are available here.