Recent papers have shown interest in improving STL allocator performance. For some applications the STL allocator interface has some performance drawbacks and recent efforts like N1953 propose an alternative to solve one of the biggest inefficiencies of current STL allocators: in-place expansion. However, N1953 interface can be further optimized, to achieve better memory use, locking reduction, and more performance. This paper adopts the same versioning system as N1953 and considers that versioning system a correct approach to allow version 1 and version 2 allocator usage in STL containers. For more information see N1953. However should other language features be added which can better perform this functionality (e.g. concepts), this proposal is of course open to that route. The focus of this proposal is on higher performance allocators, and not on the details of any given versioning system.
As N1953, says, we need to add several pieces to achieve optimal performance with STL allocators:
As we know, due to alignment or internal fragmentation issues, memory allocation algorithms usually allocate some extra bytes that can't be used with current allocator interface. Sometimes, this wasted space is important for small allocations.
Dynamic strings like std::string can request 24 characters and it's not uncommon (depending on the memory allocation algorithm) to round that request to 32 or even 64 characters. In some fast allocation algorithms like the Kingsley algorithm, the memory waste can be big. If we knew the real capacity of the allocated buffer, we could avoid expensive new allocations + copies.
Conclusion #1: We need a way to obtain the real size of the allocated memory buffer to avoid memory waste.
Apart from the unusable memory waste, there is another big memory waste with current allocators. This problem is frequent in resource constricted environments, and it's also known as the sandwich effect:
Suppose a fixed size memory buffer (for example a big memory chunk taken from the operating system, like a memory page) where we construct a vector. Let's calculate the biggest use factor of the vector with current allocator approach, supposing a k growth factor:
----------------------------------- | current_memory | k*current_memory | = B bytes -----------------------------------
This figure shows the optimal position of a vector (in the beginning of the segment). The biggest successful reallocation would be to take the rest of the segment, so if current_memory has N bytes:
N + k*N = B
The use factor (maximum vector capacity/segment size) is:
kN/B -> kN/(N + kN) -> k/(1+k)
So with k=1.5, the use factor is 0.6. This means that with a vector using current allocator interface we can only use the 60% of a segment's capacity, due to missing expansion functions. And this is only the best case since the vector is placed in the beginning of the segment and we suppose that growing factor matches exactly the whole segment. If the vector is not in the beginning of the segment, the use factor is lower:
------------------------------------ | B/2 | current_memory | | = B bytes ------------------------------------
If the vector is placed in the middle of the segment, the maximum size is when the reallocation takes the lower half of the memory so the use factor would be 0.5. And this is an optimistic supposition, since we are supposing that growth factor will lead exactly to the maximum free space. The consequence is that we must allocate more segments than necessary because we don't have a way to expand current_memory.
Conclusion #2: We need an interface that allows forward and backwards expansion of memory, to achieve a 100% use factor.
When the capacity of a vector is equal to the vector's size, and we want to insert new N objects, we have to allocate a new memory buffer and copy the data. This is an expensive operation and to avoid too many allocations, normally a growth factor is applied to the current capacity.
In resource constricted environments a common growth factor (for example 1.5), can be too big to succeed. On the other hand, maybe there is memory for N new objects (usually N is smaller than the current capacity). The same would happen if we had expansion functions in the allocator interface. We can try to guess a correct expansion size, but the most efficient approach is to request a minimum size and a preferred size. The allocator will try to expand to preferred size but if it's not possible it can try to expand it between minimum size and preferred size. Normally we would prefer an expansion between minimum size and preferred size than a new allocation. So with the current allocator's interface we are mixing three concepts:
The best approach is to request both minimum_size and preferred_size sizes in the same function and obtain the actual size in received_size in return, so that the allocator can use just one memory lookup (and mutex locking in multithreaded environments). The algorithm tries to find a buffer of size preferred size, if it can't, it can return the biggest block found if its size is at least minimum size. When expanding, it will try to expand it first to preferred size. If this fails, it will expand it as much as it can, as long as the new size is bigger than minimum_size.
Conclusion #3: We need an atomic handshake with the memory allocator that understands minimum size, preferred size and received size concepts both in new buffer allocation and buffer expansion.
The adoption of move semantics will offer an impressive performance boost to containers. std::vector's and other containers' reallocation will be cheaper, since object movement during reallocations can be compared with POD copying. Some might argue that move semantics are enough to achieve good vector performance in reallocations, so that memory expansion functions are not necessary.
However, even moving an object normally has cost proportional to the object's sizeof. Some optimizations for copy operations, actually increase the cost of move operations. For example, it's common to apply small string optimization technique to std::string. The internal buffer is commonly between 8 and 32 bytes. Moving std::string objects, with this optimization activated and no dynamic allocated memory, is cheap but not free and proportional to the internal buffer size. An even more extreme example is vector<fstream>. On gcc the sizeof(fstream) is 680 bytes, indicating a relatively expensive move operation. When a vector of objects grows, the move operation is not so cheap if the contained objects' size is not small. On the other hand, forward memory expansion is almost free.
Conclusion #4: Move semantics and memory expansion are complementary. Move semantics are not enough to obtain optimal performance.
The current allocator interface has a limited object construction function. The construct member is defined taking a copy of the object to construct:
void construct(pointer p, const value_type &v);
Some STL container implementations (for example Dinkumware and Freescale) try to construct contained objects using the construct function. The standard allows supposing allocator::pointer is equal to value_type *, and some implementation choose placement new to construct objects. But to allow advanced pointer types (for example, relative pointers for shared memory) these popular implementations choose construct when initializing raw memory.
This causes an overhead in node containers like list or map: since the actual allocated object is a node containing the value instead of the value alone, the construction function needs a copy of the node, not a copy of the value:
void construct(node_pointer p, const node_type &node);
That would require creating a temporary node from the user's value to initialize the new node in the allocated node memory. Since this is highly inefficient, these implementations choose another approach: they store several allocators and they choose to construct the node using several partial constructions. For example, std::map needs in Dinkumware implementation 3 allocators: one to allocate the node, one to construct the value_type part of the node, and one to construct the pointers stored in the node (used to build the tree). For unordered containers this approach would need 4 allocators, since we have to allocate and construct also the bucked array. With empty allocators, the space overhead is not much, but with stateful allocators (for example, containing only a pointer to a private heap), this would grow the size of an empty map typically from 2 words to 5 words. This also complicates the implementation, that has to manage many objects in constructors, swappings, etc...
Actually, in node containers like std::list and std::map we only need 1 allocator, the rest of the allocators are the result of a limited construct function.
Conclusion #5: We need to improve construct function to achieve the same functionality as a placement new, including many constructor arguments and perfect forwarding. This will allow in-place construction of objects, and a reduction of allocators in node containers. Variadic templates plus perfect forwarding are a perfect combination to achieve proposed full placement new functionality in allocators.
C++ has always taken in care the performance improvement that can be achieved when we use different ways to allocate one instance of an object or an array of N objects. This is why we have new and new[] functions. Single instance allocation does not need to store the number of allocated objects to know how many destructors it must call.
Currently, STL allocators mix both concepts. Current node allocators use node pooling when allocate(size_type size, pointer hint = 0) specifies size == 1, and those node containers use normal array allocation when the size != 1. This works because of the absence of expansion functions. A common node allocator can't easily know the size of the allocated object with just a pointer to an object. After all, node pooling technique erases all size information to achieve optimal memory use, just like operator new.
This would disallow upgrading node allocators to version 2 allocators. If a node allocator uses a pooling technique (for example, segregated storage) when 1 object is requested and forwards request to operator new when N objects are requested, how can it implement a safe memory expansion or memory size obtaining function?
This would lead to a limitation: node allocators can only be version 1 allocators. After all, std::map and std::list are pure node containers (they only allocate nodes), so we might think this is an optimal choice. However, we can't forget hybrid node containers (node containers that use auxiliary array data), for example, unordered_map. If we limit node allocators to version 1 allocators, unordered_map couldn't use expansion functions for its bucket array, obtaining a suboptimal rehash function.
Conclusion #6: If we want maximum performance, we must separate node allocation and array allocation functions, just like new and new[]. Otherwise, we can't add expansion functionality to node allocators and we can have suboptimal memory management in containers like unordered_map or any container that mixes nodes and arrays.
The N1953 proposal is a good proposal to solve some of the problems of the current STL allocator interface. It offers expansion functions (including minimum size and preferred size concepts), offers the possibility to obtain the actual size of the buffer and it has shrinking features. However, it doesn't solve some problems of the current allocator interface:
As mentioned, this paper proposes the possibility of backwards expansion of memory combined with usual forward expansion. There are several reasons to propose it:
However, backwards expansion can't be applied with vector operations that require strong exception guarantee if the values have throwing constructor/assignment. It can be used with move-capable values, if we require non-throwing move-capable values in STL containers. This requires an optional backwards expansion possibility that should be activated on demand by containers.
Checking for expansion is usually a fast operation: check if the adjacent block is free, and if so, merge blocks. In multithreaded systems, however, there is need to apply locks to protect memory operations (when using non lock-free algorithms). In some systems (like multiprocessor machines) locking is an expensive operation due to memory synchronization, so checking for expansion is cheaper than the locking mechanism. If the expansion fails, we usually try to allocate a new buffer, and we need to use locking again.
We can merge both locking needs into one locking, and improve performance. To achieve this, we need an interface that can offer an atomic expand or allocate feature.
The "version 2" allocator interface is a strict superset of the current (version 1) allocator interface. This allows version 2 allocators to be used with containers (and any other objects) which only expect the version 1 interface. This fact also allows the presentation of the version 2 interface without repeating the current interface.
namespace std {
enum allocation_type
{
    //Bitwise OR (|) combinable values
    allocate_new        = ...,
    expand_fwd          = ...,
    expand_bwd          = ...,
    shrink_in_place     = ...,
    nothrow_allocation  = ...
};
template <class T>
class allocator
{
public:
    ...
    // as today
    ...
    // plus:
    //Version identifier, see N1953
    typedef version_type<allocator, 2> version;
    //Returns the size of an array
    size_type size(pointer p) const throw();
    //An array allocation handshake function
    std::pair<pointer, bool>
        allocation_command(std::allocation_type command, size_type limit_size
                          ,size_type preferred_size,     size_type& received_size
                          ,pointer reuse = 0);
    //Node allocation function
    pointer allocate_one()
    //Node deallocation function
    void    deallocate_one() throw();
    //Default construction function
    void construct(pointer p);
    //General purpose in-place construction function
    template<... Args>
    void construct(pointer p, Args &&... args);
};
}  //namespace std {
The size function allows the client to ask how many elements a previously allocated array pointer actually points to. The amount returned may be greater than the number which was requested via a previous allocate or allocation_command. Clients such as string or vector can use this information to add to their capacity at no cost.
The allocation_command function is an advanced handshake function to achieve allocation or expansion of memory. It can be also used to request a shrink command. The user specifies the allocation methods that wants to try in the command parameter: new allocation, forward expansion, backwards expansion or shrinking. The commands can be combined using a bitwise OR operation, obtaining atomic allocation features. It's also possible to disable exception throwing and obtain failures via null pointer. The user also specifies the limit_size to succeed (the minimum size in case of an expansion/allocation, the maximum size in case of a shrinking), the preferred_size of the operation and the pointer of the memory to be expanded/shrunk if an expansion or a shrinking has been requested. The allocator always returns the actual size of the memory in received_size parameter. When success, the function returns the new address in the first member of the returned pair and returns true in the second member if an expansion has occurred. On failure, it throws an exception if the user hasn't specified std::nothrow_allocation in the command parameter, otherwise returns 0 in the first member of the pair. On failure, the function also returns (in the received_size parameter) a suggested limit_size that might succeed in the future.
The allocate_one member function is a function requesting the allocation of just one node. The pointer returned by this function can't be used for expansion or size guessing. It's the equivalent of operator new for allocators. Throws an exception on failure. The returned pointer can only be deallocated using deallocate_one.
The deallocate_one member function is a function requesting the deallocation of just one node has been allocated using allocate_one. It's the equivalent of operator delete for allocators.
The construct(pointer p) member function will construct a default constructed element in the address pointed by p. It will throw if the default constructor throws.
The construct(pointer p, Args &&... args) member function will be equivalent to a placement new construction:
new((void*)p) value_type(std::forward<Args>(args)...);
It will throw if the equivalent constructor throws.
Like N1953, this paper proposes the specification of the versioning system in terms of std::tr1::integral_constant (part of the type traits library) as if integral_constant was already part of the working draft. This dependency can be easily removed.
This paper proposes an advanced construct function using rvalue reference (N1952), and variadic templates (N1704) as if they were already part of the working draft. The paper shows a clear non-metaprogramming common use case for variadic templates: perfect forwarding with in-place construction. Currently there is no generic way to implement a generic in-place construction in C++. Rvalue reference plus variadic templates is an elegant solution to achieve this.
If variadic templates are not available, the following limited function can be used:
template<class ConvertibleToValue> void construct(ConvertibleToValue &&convertible_to_value);
If rvalue reference is not available, the following more limiting function is proposed:
template<class ConvertibleToValue> void construct(const ConvertibleToValue &convertible_to_value);
The proposed interface has been implemented in Boost.Interprocess library for shared memory and memory mapped capable allocators and containers. The time needed to implement the proposed interface was 50 man-hours, including modifications of the shared memory allocation algorithm, STL interface allocator and the vector container.
An implementation of C realloc function can be easily changed to implement the allocation algorithm of allocation_command. The time to implement the needed allocation algorithm based on realloc code is estimated in less than 40 man-hours.
The code to handle backwards + forward expansion in vector is a more generalized case of an insertion in a deque container. Library vendors might need less than 30 man-hours to implement it.
Overall, size and speed improvement is big. The performance gain is impressive in initialization patterns, when a vector can be filled with repeated push_back calls, without a single reallocation, since with no other thread requesting memory, all expanding calls succeed. In resource constrained memories, like shared memory, the size savings are huge since the allocator can avoid the sandwich effect and the vector can use all the adjacent existing memory.
This wording is a modified wording from N1953, so the implementation is required to provide versioning infrastructure, but is not required to provide containers that recognize version allocators or a std::allocator which meets the version 2 specification. This wording only specifies the syntax and behavior for these optimizations.
Modify the <utility> synopsis in 20.2:
namespace std {
  //  lib.operators, operators:
  namespace rel_ops {
    template<class T> bool operator!=(const T&, const T&);
    template<class T> bool operator> (const T&, const T&);
    template<class T> bool operator<=(const T&, const T&);
    template<class T> bool operator>=(const T&, const T&);
  }
  
  //  lib.pairs, pairs:
  template <class T1, class T2> struct pair;
  template <class T1, class T2>
    bool operator==(const pair<T1,T2>&, const pair<T1,T2>&);
  template <class T1, class T2>
    bool operator< (const pair<T1,T2>&, const pair<T1,T2>&);
  template <class T1, class T2>
    bool operator!=(const pair<T1,T2>&, const pair<T1,T2>&);
  template <class T1, class T2>
    bool operator> (const pair<T1,T2>&, const pair<T1,T2>&);
  template <class T1, class T2>
    bool operator>=(const pair<T1,T2>&, const pair<T1,T2>&);
  template <class T1, class T2>
    bool operator<=(const pair<T1,T2>&, const pair<T1,T2>&);
  template <class T1, class T2> pair<T1,T2> make_pair(T1, T2);
  //  lib.versioning, version:
  template <class T, unsigned V> struct version_type;
  template <class T> struct version;
}
Add new section: 20.2.3:
-1- A standardized infrastructure for giving class types an easily extracted version number is defined. Existing class types which lack explicit version numbers are automatically assigned a version number of 1 by this system. This system allows classes to communicate versioned interfaces to each other.
-1- The library provides a template which can be used to explicitly declare a version number for an user-defined class or struct.
template <class T, unsigned V>
struct version_type
    : public integral_constant<unsigned, V>
{
    typedef T type;
    version_type(const version_type<T, 0>&);
};
-2- It is unspecified whether the version_type constructor has a definition.
template <class T> struct version;
-1- version publicly derives from integral_constant<unsigned, 1> for all types T except for the case that T is a struct or class and meets all of the following requirements:
For types T meeting the above requirements, version<T> will publicly derive from integral_constant<unsigned, T::version::value>.
-2- In the case that T is a reference type, version will drop the reference and instead operate on the referenced type.
-3- In the case that T is a cv-qualified type, version will ensure that the same value will be returned as for a non-cv-qualified T.
-4- [Example:
struct A {};
const unsigned vA = std::version<A>::value;  // vA == 1
struct B {
    typedef std::version_type<B, 2> version;
};
const unsigned vB  = std::version<      B >::value; // vB  == 2
const unsigned vBr = std::version<const B&>::value; // vBr == 2
struct C : public B {};
const unsigned vC = std::version<C>::value;  // vC == 1
-- end example]
Add new section 20.1.6.1 - Optional Allocator requirements:
-1- Allocators may choose to implement an extended API in addition to those requirements defined in lib.allocator.requirements. If such an allocator declares its intention to conform to this extended interface, it must implement all of it.
-2- An allocator (named Alloc for example) declares its intent to conform to this optional interface by having the following nested declaration:
class Alloc
{
public:
    typedef std::version_type<Alloc, 2> version;
    /* ... */
};
-3- Allocators having this nested version type will define the following additional member functions:
size_type size(pointer p) const throw();
-4- Requires: p is non-null and has been returned by a previous call to allocate or allocation_command, and has not yet been deallocated by deallocate.
-5- Returns: The returned value indicates that the client can store valid values of T to the range [ptr, ptr + returned-value) without fear of corrupting the heap.
-6- Throws: Nothing.
std::pair<pointer, bool>
   allocation_command(std::allocation_type command, size_type limit_size
                     ,size_type preferred_size,     size_type& received_size
                     ,pointer reuse = 0);
-7- Requires: If the parameter command contains the value std::shrink_in_place it can't contain any of these values: std::expand_fwd, std::expand_bwd. If the parameter command contains std::expand_fwd or std::expand_bwd, the parameter reuse must be non-null and returned by a previous call to allocate or allocation_command, and not yet been deallocated by deallocate. If the parameter command contains the value std::shrink_in_place, the parameter limit_size must be equal or greater than the parameter preferred_size. If the parameter command contains any of these values: std::expand_fwd or std::expand_bwd, the parameter limit_size must be equal or less than the parameter preferred_size.
-8- Effects: If the parameter command contains the value std::shrink_in_place, the allocator will try to reduce the size of the memory block referenced by pointer reuse to the value preferred_size moving only the end of the block. If it's not possible, it will try to reduce the size of the memory block as much as possible as long as this results in size(p) <= limit_size. Success is reported only if this results in preferred_size <= size(p) and size(p) <= limit_size.
If the parameter command only contains the value std::expand_fwd (with optional additional std::nothrow_allocation), the allocator will try to increase the size of the memory block referenced by pointer reuse moving only the end of the block to the value preferred_size. If it's not possible, it will try to increase the size of the memory block as much as possible as long as this results in size(p) >= limit_size. Success is reported only if this results in limit_size <= size(p).
If the parameter command only contains the value std::expand_bwd (with optional additional std::nothrow_allocation), the allocator will try to increase the size of the memory block referenced by pointer reuse only moving the start of the block to a returned new position new_ptr. If it's not possible, it will try to move the start of the block as much as possible as long as this results in size(new_ptr) >= limit_size. Success is reported only if this results in limit_size <= size(new_ptr).
If the parameter command only contains the value std::allocate_new (with optional additional std::nothrow_allocation), the allocator will try to allocate memory for preferred_size objects. If it's not possible it will try to allocate memory for at least limit_size objects.
If the parameter command only contains a combination of std::expand_fwd and std::allocate_new, (with optional additional std::nothrow_allocation) the allocator will try first the forward expansion. If this fails, it would try a new allocation.
If the parameter command only contains a combination of std::expand_bwd and std::allocate_new (with optional additional std::nothrow_allocation), the allocator will try first to obtain preferred_size objects using both methods if necessary. If this fails, it will try to obtain limit_size objects using both methods if necessary.
If the parameter command only contains a combination of std::expand_fwd and std::expand_bwd (with optional additional std::nothrow_allocation), the allocator will try first forward expansion. If this fails it will try to obtain preferred_size objects using backwards expansion or a combination of forward and backwards expansion. If this fails, it will try to obtain limit_size objects using both methods if necessary.
If the parameter command only contains a combination of std::allocation_new, std::expand_fwd and std::expand_bwd, (with optional additional std::nothrow_allocation) the allocator will try first forward expansion. If this fails it will try to obtain preferred_size objects using new allocation, backwards expansion or a combination of forward and backwards expansion. If this fails, it will try to obtain limit_size objects using the same methods.
The allocator always writes the size or the expanded/allocated/shrunk memory block in received_size. On failure the allocator writes in received_size a possibly successful limit_size parameter for a new call.
-9- Throws: Throws an exception if two conditions are met:
-10- Returns: The address of the allocated memory or the new address of the expanded memory as the first member of the pair. If the parameter command contains std::nothrow_allocation the first member will be 0 if the allocation/expansion fails or there is an error in preconditions. The second member of the pair will be false if the memory has been allocated, true if the memory has been expanded. If the first member is 0, the second member has an undefined value.
pointer allocate_one();
-11- Effects: Returns memory just for one object. [Note: The allocator does not need to store information about the size of the memory block. The memory returned by this function can't be used with size, allocation_command, allocate and deallocate deallocate functions. Otherwise the behavior is undefined-- end note]
-12- Throws: An exception if memory can't be allocated.
void deallocate_one(pointer p) throw();
-13- Requires: p is non-null and has been returned by a previous call to allocate_one, and has not yet been deallocated by deallocate_one.
-14- Effects: Deallocates memory allocated with allocate_one. [Note: If this function is used to deallocate memory obtained with other allocation functions, behavior is undefined.-- end note]
-15- Throws: Nothing.
void construct(pointer p);
-16- Requires: p has been returned by a previous allocation function or is contained in an array allocated by previous allocation function and the memory hasn't been initialized.
-17- Effects: Default constructs an object in the address pointed by p.
-18- Throws: If the default construction of the value throws, throws the exception thrown by the constructor.
template <...Args> void construct(pointer p, Args &&... args);
-19- Requires: p has been returned by a previous allocation function or is contained in an array allocated by previous allocation function and the memory hasn't been initialized.
-20- Effects: Equivalent to the following placement new if p is a raw pointer:
new((void*)p) value_type(std::forward<Args>(args)...);
-21- Throws: If the equivalent constructor of the value throws, throws the exception thrown by the constructor.
Thanks to: