A Proposal to Add Hash Tables to the Standard Library

Hashed associative containers—hash tables—are one of the most frequently requested additions to the standard C++ library. Although hash tables have poorer worst-case performance than containers based on balanced trees, their performance is better in many real-world applications.

Hash tables are appropriate for this TR because they plug an obvious hole in the existing standard library. They are not intended for any one specific problem domain, style of programming, or community of programmers. I expect them to be used by a wide range of programmers.

There is extensive experience with hash tables implemented in C++ in the style of standard containers. Hash tables were proposed for the C++ standard in 1995; the proposal was rejected for reasons of timing. Three independently written libraries, SGI, Dinkumware, and Metrowerks, now provide hashed associative containers as an extension. (The GNU C++ library includes hash tables derived from SGI's.)

The three shipping hash table implementations are similar, but not identical; this proposal is not identical to any of them. Some of the differences will be discussed in section III. An implementation of this proposal exists, but it is not yet in widespread use.

II. Impact On the Standard

This proposal is a pure extension. It proposes a minor change to an existing header (a new function object in <functional>), but it does not require changes to any standard classes or functions and it does not require changes to any of the standard requirement tables. It does not require any changes in the core language, and it has been implemented in standard C++.

This proposal does not depend on any other library extensions. The initial implementation does use some nonstandard components, but they're part of the implementation rather than the interface and they aren't part of this proposal.

III. Design Decisions

A. Packaging Issues

The three implementations in current use, as well as the Barreiro/Fraley/Musser proposal, all used the names hash_set, hash_map, hash_multiset, hash_multimap. Existing practice suggests that these names should be retained.

The distinction between the four hashed containers is the same as the distinction between the four standard associative containers. All four hashed containers allow lookup of elements by key. In hash_set and hash_multiset the elements are the keys; modification of elements is not allowed. In hash_map and hash_multimap the elements are of type pair<const Key, Value>. The key part can't be modified, but the value part can. In hash_set and hash_map, no two elements may have the same key; in hash_multiset and hash_multimap there may be any number of elements with the same key.

Similarly, because of existing practice, this proposal defines the hash_set and hash_multiset classes within a new <hash_set> header, and hash_map and hash_multimap within <hash_map>. It defines a default hash function, hash<>, within the standard header <functional>; see III.D for a discussion of the decision to define a hash<> within <functional>.

This proposal has a nasty backward compatibility problem, precisely because hash tables are so frequently requested and because there is such extensive existing practice. A number of vendors already provide hashed associative containers as an extension. Some of them have chosen to define hash table classes in namespace std. If this proposal is accepted, those vendors will have to come up with a transition strategy to move users from the nonstandard hashed associative containers to the standard ones. Might a clever naming strategy make that transition easier?

One suggestion has been that we deliberately avoid the names used by existing implementations: we might, for example, use the names hashset, hashmap, hashmultiset, hashmultimap, and, similarly, we might use the header names <hashset> and <hashmap> instead of <hash_set> and <hash_map>. I've chosen not to follow that suggestion, because I think that having two similar-but-not-identical headers, with confusingly similar names, has the potential to make the transition harder instead of easier. However, I have included this suggestion in the list of unresolved issues (section V).

This proposal defines a set of Hashed Associative Container requirements, and then separately describes four class that conform to those requirements. In that respect this proposal follows the lead of the standard and of the Barreiro/Fraley/Musser proposal. However, Barreiro, Fraley, and Musser proposed more extensive changes to the container requirements. They proposed two new requirements tables, not one: Sorted Associative Container, satisfied by the existing standard associative containers, and Hashed Associative Container. They then modified the existing Associative Container requirements (table 69 in the C++ standard) so that both sorted associative containers and hashed associative containers would satisfy the new Associative Container requirements. The difference is shown in Figure 1.

I believe that the Barreiro/Fraley/Musser taxonomy is better: the generality of the name "Associative Container", and the specificity of table 69, aren't a good match. However, that proposal was made before the C++ standard was finalized. The superiority of the Barreiro/ Fraley/Musser taxonomy isn't so great as to justify changing an existing requirements table that users may be relying on.

The three hash table implementations in current use are not identical, but they are similar enough that for simple uses they are interchangeable. This proposal attempts to maintain a similar level of compatibility.

B. Chaining Versus Open Addressing

Knuth (section 6.4 of The Art of Computer Programming) distinguishes between two kinds of hash tables: "chaining", where a hash code is associated with the head of a linked list, and "open addressing", where a hash code is associated with an index into an array.

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

All further discussion will assume chaining. Each linked list within a hash table is called a "bucket". The average number of elements per bucket is called the "load factor", or z.

C. Lookup Within a Bucket

When looking up an item in a hash table by key k, the general strategy is to find the bucket that corresponds to k and then to perform a linear search within that bucket. The first step uses the hash function; the second step must use something else.

The most obvious technique is to use std::find() or the equivalent: look for an item whose key is equal to k. Naturally, it would be wrong for operator== to be hard-wired in; it should be possible for the user to provide a function object with semantics of equality. As an example where some predicate other than operator== is useful, suppose the user is storing C-style strings, i.e. pointers to null-terminated arrays of characters. In this case equality of keys k1 and k2 shouldn't mean pointer comparison; it should mean testing that strcmp(k1, k2) == 0.

This proposal takes such an approach. Hashed associative containers are parameterized by two function objects, a hash function and an equality function. Both have defaults.

An alternative technique is possible: instead of testing for equality, sort each bucket in ascending order by key. Linear search for a key k would mean searching for a key k' such that k < k' and k' < k are both false. Again, this shouldn't be taken to mean that operator< would literally be hard-wired in; users could provide their own comparison function, so long as that function has less-than semantics.

The performance characteristics of the two techniques are slightly different. The following table shows the average number of comparisons required for a search through a bucket of n elements:

The difference for a failed search is because with less-than you can tell that a search has failed as soon as you see a key that's larger than k; with equal-to you have to get to the end of the bucket. The difference for a successful search is because with equal-to you can tell that a search has succeeded as soon as you see a key that's equal to k; with less-than all you know when you find a key that's not less than k is that the search has terminated, and you need an extra comparison to tell whether it has terminated in success or failure.

I do not see a clear-cut performance advantage from either technique. Which technique is faster depends on usage pattern: the load factor, and the relative frequency of failed and successful searches. There are also performance implications for insertion, but I expect those differences to be smaller because in most cases I expect insertion to be dominated by the cost of memory allocation or element construction, not by the cost of lookup.

For users, it's sometimes important for a container to present its elements as a sorted range; sorting elements by inserting them into an std::set, for example, is a common idiom. However, I see no value (other than the performance issues discussed above) in sorting elements within a single bucket. If the hash function is well chosen, after all, elements will be distributed between buckets in a seemingly random way. I believe it is more helpful to tell users that they should use the existing associative containers (set, map, multiset, multimap) when they need useful guarantees on element ordering.

From the point of view of user convenience, there isn't a huge difference between the two alternatives of equal-to and less-than. I view equality as slightly more convenient, since it's common to define data types that have equality operations but not less-than operations, and rather less common to do the reverse. There are some types where less-than is not a natural operation, and users would have to define a somewhat arbitrary less-than operation for no reason other than to put the objects in a hash table. One obvious example is std::complex<>.

Existing implementations differ. The SGI and Metrowerks implementations use equal-to, and the Dinkumware implementation uses less-than.

An aside: in principle, linear search isn't strictly necessary. A bucket doesn't have to be structured as a linked list; it could be structured as a binary tree, or as some other data structure. This proposal assumes linked lists, partly for reasons of existing practice (all of the C++ hash table implementations in widespread use are implemented in terms of linked lists) and partly because I believe that in practice a tree structure would hurt performance more often than it would help. Balanced trees have large per-node space overhead, and binary tree lookup is faster than linear search only when the number of elements is large. If the hash table's load factor is small and the hash function well chosen, trees have no advantage over linear lists.

D. The Hash Function Interface

Abstractly, a hash function is a function f(k) that takes an argument of type Key and returns an integer in the range [0, B), where B is the number of buckets in the hash table. A hash function must have the property that f(k1) == f(k2) when k1 and k2 are the same. A good hash function should also have the property that f(k1) and f(k2) are unlikely to be the same when k1 and k2 are different.

It is impossible to write a fully general hash function that's valid for all types. (You can't just convert an object to raw memory and hash the bytes; among other reasons, that idea fails because of padding.) Because of that, and also because a good hash function is only good in the context of a specific usage pattern, it's essential to allow users to provide their own hash functions.

There can be a default hash function for a selected set of types; ideally, it should include the most commonly used types.

In principle there are two possible answers to the first question. First, the hash function could take two arguments instead of one, where the second argument is B. The hash function would have the responsibility of returning some number in the range [0, B). Second, the hash function could return a number in a very large range, say [0, std::numeric_limits<std::size_t>::max()). The hash table class would be responsible for converting the hash code (the value returned by the hash function) into a bucket index in the range [0, B).

If the hash table should be packaged along with other aspects of hash policy, what should those aspects be? There are two obvious candidates. First, it could be packaged along with the the function object that tests for key equality, or perhaps, even more generally, with a function object that specifies a policy for linear search within a bucket. (See section III.C.) Second, it could be packaged along with the parameters that govern changing the bucket count. (See section III.E.)

This proposal uses a standalone hash function, rather than a hash function that's part of a policy package. This is mostly a consequence of other design decisions. First, bucket resizing is determined by floating-point parameters that can be changed at runtime (see III.E), so there is no advantage in putting them in a policy class. Second, linear search within a bucket uses equality (see III.C), and equality is such a common operation that in most cases I expect that a user-supplied equality predicate will have been written for some other purpose, and will be reused as a hash table template argument. Making equality part of a larger policy class would make such reuse harder.

This proposal includes a function object hash<>, with an operator() that takes a single argument and returns an std::size_t. The hash<> template is an incomplete type; it is specialized, and declared as a complete type, for a few common types. I've chosen all of the built-in integer types, all floating-point types, all pointer types, and std::basic_string<charT, traits, Allocator>. I believe that std::basic_string is especially important, because hash tables are often used for strings. Beyond that, the list is fairly arbitrary.

For character pointers the default hash function looks at the character array being pointed to; for other kinds of pointers it looks only at the address.

Why should we include both char* and const char* when the keys have to be unmodifiable anyway? Simple convenience. Lots of code, especially legacy code, uses pointers of type char* to point to constant arrays of characters, even though const char* would do just as well. This is arguably sloppy, but it's sufficiently common that I believe it ought to be possible to instantiate such types as hash_set<char*> and hash_map<char*, int>. This requires making it possible to instantiate hash<char*>.

The hash<> function object is defined in the <functional> header. Another sensible alternative would have been to declare it in both <hash_set> and <hash_map>. Implementers would have to arrange for there to be only a single definition when both headers are used, but that's straightforward. The main reason I chose to put it in <functional> is that authors of user-defined types may need to specialize hash<> (which means they need the declaration) even if they have no need to use any of the hashed containers.

Trivial as it may seem, hash function packaging may be the most contentious part of this proposal. Existing implementations differ. The SGI and Metrowerks implementations use hash functions that aren't bundled with anything else, but the Dinkumware implementation uses a more general hash policy class. Since this is an interface issue, a decision is necessary.

E. Control of Hash Resizing

The time required for looking up an element by key k is c₁ + c₂ n, where c₁ and c₂ are constants, and where n is the number of elements in the bucket indexed by k's hash code. If the hash function is well chosen, and elements are evenly distributed between buckets, this is approximately c₁ + c₂ N/B, where N is the number of elements in the container and B is the bucket count. If the bucket count is taken as a constant, then the asymptotic complexity for element lookup is O(N).

To maintain average case complexity O(1) for lookup, the bucket count must grow as elements are added to the hash table; on average the bucket count must be proportional to N. Another way of putting this is that the load factor, N/B, must be approximately constant.

First is traditional rehashing. When the load factor becomes too large, choose a new and larger bucket count, B'. Then go through every element in the hash table, computing a new bucket index based on B'. This is an expensive operation. Since we want the amortized complexity of element insertion to be constant, we must use exponential growth; that is, B' = γ B, where the growth factor, γ, is larger than 1. In general this proportionality can only be approximate, since many hashing schemes require B to have special numerical properties—primality, for example.

Second is a newer technique, incremental hashing. (See [Plauger 1998].) Incremental rehashing structures the hash table in such a way that it is possible to add a single bucket at a time. When adding a bucket it is only necessary to examine the elements of a single old bucket, distributing some of them to the new one.

The advantage of incremental hashing is that insertion time becomes more predictable: there's no longer a large time difference between insertions that do trigger rehashing and insertions that don't. The disadvantage is that incremental hashing makes lookup slightly slower. The slowdown is for two reasons. First, the logic to determine a bucket index from a hash code is slightly more complicated: it requires one extra test. Second, incremental hashing results in a somewhat less uniform distribution of elements within buckets. It relies on a construction where there are conceptually B buckets, of which U are in current use; B is a power of 2, and U > B/2. We first find a bucket index i in the range [0, B), and then find a bucket index j in the range [0, U) by subtracting U from i if necessary. If the original hash codes are evenly distributed, a bucket in the range [0, B-U) will on average have twice the number of elements as a bucket in the range [B-U, U).

Because of this tradeoff, there is not a clear choice between incremental hashing and traditional rehashing; both are legitimate implementation techniques. A standard, of course, need not and should not dictate implementation technique. The goal of this proposal is to allow both.

From a user's perspective, all of this is invisible in normal use. It's visible when users want to do one of these things:

What are those parameters? The most obvious is the maximum load factor, since that's what triggers an automatic rehash. There's also a second parameter, which can be thought of in two different ways: as a growth factor (the constant of proportionality by which the bucket count grows in a rehash) or as a minimum load factor.

Letting users control that second parameter is more complicated than it seems at first.

I don't know how to specify invariants that are precise enough to be meaningful and normative, but loose enough to accommodate traditional and incremental hashing, empty hash tables, hash tables with bucket counts that are restricted to prime numbers, manual rehashing, and empty hash tables. We could include a member function for setting the growth factor (or minimum load factor) but not say exactly how that number is used. However, I see very little value in such a vacuous requirement. This proposal provides user control of the maximum load factor, but not of growth factor or minimum load factor.

One implication is that this proposal says what happens as the number of elements in a hash table increase, but doesn't say what happens as the number decreases. This is unfortunate. An unnecessarily low load factor wastes space (the magnitude of the load factor is a time/space tradeoff), and can lead to unnecessarily slow iteration.

There's still one last question about the maximum load factor: how should the user specify it? An integer is an unnecessarily restrictive choice, since fractional values (especially, ones in the range 0 < z_max < 1) are sensible. There are three reasonable options: as a rational number (perhaps an ad hoc rational number, where the user provides the numerator and denominator separately), as an enum (the user may select one of a small number of predetermined values, such as 1/4, 1/2, 1, 3/2, 2), or as a floating-point number.

This proposal provides a member function that allows the user to set the maximum load factor at runtime, using a floating-point number. The reasons are:

The Dinkumware hash table implementation uses a compile-time integer constant (part of a hash traits class) to control the maximum load factor, and the Metrowerks implementation uses a runtime floating-point parameter. The SGI implementation does not provide any mechanism for controlling the maximum load factor.

F. Iterators

There is one basic decision to be made about hash table iterators, which can be expressed either from the implementer's or the user's point of view. From the implementer's point of view: are the buckets singly linked lists, or doubly linked lists? From the user's point of view: are the iterators forward iterators, or bidirectional iterators?

From the implementer's point of view, there's no question that doubly linked lists are much easier to work with. One advantage is that you don't have to maintain a separate list for each bucket. You can keep a single long list, taking care that elements within a bucket remain adjacent; a bucket is then just a pair of pointers into the list. This is nice for the implementer, because a hash table iterator can just be a recycled std::list<>::iterator. It's nice for the user, because iteration is fast. I don't know of a way to make the single-long-list technique work for singly linked lists: some operations that ought to be constant time would require a linear search through buckets. (The sticking point turns out to be that erasing a node in a singly linked list requires access to the node before it. It's possible to get around this problem, but every technique I know of ends up introducing linear time behavior somewhere else.)

From the user's point of view, the choice is a tradeoff. Singly linked lists have slower iterators, because an iterator first steps within a bucket and then, upon reaching the end of a bucket, steps to the next. Additionally, users may sometimes want to apply algorithms that require bidirectional iterators. If a hash table supplies bidirectional iterators, it's easier for users to switch between (say) hash_set<> and std::set<>. (But "easier" still doesn't mean easy. Some applications use the standard associative containers because of those containers' ordering guarantees, which can't possibly be preserved by hash tables.)

For the user, the disadvantage of bidirectional iterators is greater space overhead. The space overhead for singly linked lists is N + B words, where N is the number of elements and B is the bucket count, and the space overhead for doubly linked lists is 2N + 2B. This is an important consideration, because the main reason for using hashed associative containers is performance.

The SGI and Metrowerks implementations provide forward iterators. The Dinkumware implementation provides bidirectional iterators.

This proposal allows both choices. It requires hashed associative containers to provide forward iterators. An implementation that provides bidirectional iterators is conforming, because bidirectional iterators are forward iterators.

G. Bucket Interface

Like all standard containers, each of the hashed containers has member function begin() and end(). The range [c.begin(), c.end()) contains all of the elements in the container, presented as a flat range. Elements within a bucket are adjacent, but the iterator interface presents no information about where one bucket ends and the next begins.

It's also useful to expose the bucket structure, for two reasons. First, it lets users investigate how well their hash function performs: it lets them test how evenly elements are distributed within buckets, and to look at the elements within a bucket to see if they have any common properties. Second, if the iterators have an underlying segmented structure (as they do in existing singly linked list implementations), algorithms that exploit that structure, with an explicit nested loop, can be more efficient than algorithms that view the elements as a flat range.

The most important part of the bucket interface is an overloading of begin() and end(). If n is an integer, [begin(n), end(n)) is a range of iterators pointing to the elements in the nth bucket. These member functions return iterators, of course, but not of type X::iterator or X::const_iterator. Instead they return iterators of type X::local_iterator or X::const_local_iterator. A local iterator is able to iterate within a bucket, but not necessarily between buckets; in some implementations it's possible for X::local_iterator to be a simpler data structure than X::iterator. X::iterator and X::local_iterator are permitted to be the same type; implementations that use doubly linked lists will probably take advantage of that freedom.

This bucket interface is not provided by the SGI, Dinkumware, or Metrowerks implementations. It is inspired partly by the Metrowerks collision-detection interface, and partly by earlier work (see [Austern 1998]) on algorithms for segmented containers.

H. Exception guarantees

The C++ Standard gives a minimum set of exceptions guarantees for library components. (Roughly: exceptions don't corrupt data structures or cause memory leaks.) There are two important questions we have to answer. First: which operations on hash tables, if any, provide a stronger guarantee? Second: what restrictions, if any, do we need to impose on the user-defined function objects, the hash function and the equality function, used to instantiate hash tables.

In practice, I believe there are only two interesting operations: erase and insert. Erase is an interesting operation because in general it must invoke both of these function objects and may therefore throw exceptions. We have to say something about the circumstances in which it may throw exceptions (answer: only when they're thrown from one of these function objects), and we need to say that clear may not throw exceptions even though it's defined in terms of erase.

In the simple case (no rehash is necessary), the strong guarantee is easy: we can invoke the hash code and find the appropriate bucket before performing any allocations. After that point, there isn't any need to modify any list pointers until all comparisons have been performed and the insertion point is known. The trouble comes if a rehash is necessary, and if the user-provided hash function throws an exception during the rehash. At that point it's likely that the data structures will have been modified in unrecoverable ways (the only way to recover would involve invoking the hash function again), and the only way to ensure integrity of the data structures is to lose some or all elements.

What we can say is that single-element insert provides the strong guarantee if the hash function is guaranteed not to throw exceptions. Note that this is true for the default hash functions.

I. Stored hash codes

There is an interesting space/time tradeoff for hash table implementers: along with an element, should one store the element's hash code? This can improve speed in two ways. First, it makes rehashes faster, because there's no need to recompute the hash code of every element. Second, it may make searches faster: when searching through a bucket the implementation can compare hash codes before doing a full element comparison. This is two tests instead of one, but integer comparisons are inexpensive and full element comparisons may sometimes (for strings, for example) be expensive.

Again, my goal is neither to require nor to forbid stored hash codes. I don't know of an implementation that currently stores hash codes, but I also don't know of anything in this proposal that would forbid it.

One might imagine trying to achieve greater flexibility: allowing users to control whether or not hash codes are stored and used for searches, so that they're only stored in cases where the user believes that this would be a performance benefit. (One might imagine using a policy class, for example.) I haven't tried to provide that kind of flexibility, because I don't think the extra gain would be justified by the increased complexity of the interface.

IV. Proposed Text

A. Requirements

1. To be added as a separate requirements section, following clause 23.1.2

Hashed associative containers provide an ability for fast retrieval of data based on keys. The worst-case complexity for most operations is linear, but the average case is much faster. The library provides four basic kinds of hashed associative containers: hash_set, hash_map, hash_multiset, and hash_multimap.

Each hashed associative container is parameterized by Key, by a function object Hash that acts as a hash function for values of type Key, and on a binary predicate Pred that induces an equivalence relation on values of type Key. Additionally, hash_map and hash_multimap associate an arbitrary mapped type T with the Key.

A hash function is a function object that takes a single argument of type Key and returns a value of type std::size_t in the range [0, std::numeric_limits<std::size_t>::max()).

Two values k1 and k2 of type Key are considered equal if the container's equality function object returns true when passed those values. If k1 and k2 are equal, the hash function must return the same value for both.

A hashed associative container supports unique keys if it may contain at most one element for each key. Otherwise, it supports equivalent keys. hash_set and hash_map support unique keys. hash_multiset and hash_multimap support equivalent keys. In containers that support equivalent keys, elements with equivalent keys are adjacent to each other.

For hash_set and hash_multiset the value type is the same as the key type. For hash_map and hash_multimap it is equal to std::pair<const Key, T>.

The elements of a hashed associative container are organized into buckets. Keys with the same hash code appear in the same bucket. The number of buckets is automatically increased as elements are added to a hashed associative container, so that the average number of elements per bucket is kept below a bound. Rehashing invalidates iterators, changes ordering between elements, and changes which buckets elements appear in, but does not invalidate pointers or references to elements.

In the following table, X is a hashed associative container class, a is an object of type X, b is a possibly const object of type X, a_uniq is an object of type X when X supports unique keys, a_eq is an object of type X when X supports equivalent keys, i and j are input iterators that refer to value_type, [i, j) is a valid range, p and q2 are valid iterators to a, q and q1 are valid dereferenceable iterators to a, [q1, q2) is a valid range in a, r and r1 are valid dereferenceable const iterators to a, r2 is a valid const iterator to a, [r1, r2) is a valid range in a, t is a value of type X::value_type, k is a value of type key_type, hf is a possibly const value of type hasher, eq is a possibly const value of type key_equal, n is a value of type size_type, and z is a value of type double.

Hashed associative container requirements (in addition to container)
Expression	Return type	assertion/note pre/post-condition	complexity
`X::key_type`	`Key`	`Key` is `Assignable` and `CopyConstructible`	compile time
`X::hasher`	`Hash`	`Hash` is a unary function object that take an argument of type `Key` and returns a value of type `std::size_t`.	compile time
`X::key_equal`	`Pred`	`Pred` is a binary predicate that takes two arguments of type `Key`. `Pred` is an equivalence relation.	compile time
`X::local_iterator`	An iterator type whose category, value type, difference type, and pointer and reference types are the same as `X::iterator`'s.	A `local_iterator` object may be used to iterate through a single bucket, but may not be used to iterated across buckets.	compile time
`X::const_local_iterator`	An iterator type whose category, value type, difference type, and pointer and reference types are the same as `X::const_iterator`'s.	A `const_local_iterator` object may be used to iterate through a single bucket, but may not be used to iterated across buckets.	compile time
`X(n, hf, eq) X a(n, hf, eq)`	X	Constructs an empty container with at least `n` buckets, using `hf` as the hash function and `eq` as the key equality predicate.	O(n)
`X(n, hf) X a(n, hf)`	X	Constructs an empty container with at least `n` buckets, using `hf` as the hash function and `key_equal()` as the key equality predicate.	O(n)
`X(n) X a(n)`	X	Constructs an empty container with at least `n` buckets, using `hasher()` as the hash function and `key_equal()` as the key equality predicate.	O(n)
`X() X a`	X	Constructs an empty container with an unspecified number of buckets, using `hasher()` as the hash function and `key_equal` as the key equality predicate.	constant
`X(i, j, n, hf, eq) X a(i, j, n, hf, eq)`	X	Constructs an empty container with at least `n` buckets, using `hf` as the hash function and `eq` as the key equality predicate, and inserts elements from `[i, j)` into it.	Average case O(N) (N is `std::distance(i, j)`), worst case O(N²)
`X(i, j, n, hf) X a(i, j, n, hf)`	X	Constructs an empty container with at least `n` buckets, using `hf` as the hash function and `key_equal()` as the key equality predicate, and inserts elements from `[i, j)` into it.	Average case O(N) (N is `std::distance(i, j)`), worst case O(N²)
`X(i, j, n) X a(i, j, n)`	X	Constructs an empty container with at least `n` buckets, using `hasher()` as the hash function and `key_equal()` as the key equality predicate, and inserts elements from `[i, j)` into it.	Average case O(N) (N is `std::distance(i, j)`), worst case O(N²)
`X(i, j) X a(i, j)`	X	Constructs an empty container with an unspecified number of buckets, using `hasher()` as the hash function and `key_equal` as the key equality predicate, and inserts elements from `[i, j)` into it.	Average case O(N) (N is `std::distance(i, j)`), worst case O(N²)
`X(b) X a(b)`	`X`	Copy constructor. In addition to the contained elements, the hash function, predicate, and maximum load factor are copied.	Average case linear in `b.size()`, worst case quadratic.
`a = b`	`X`	Copy assignment operator. In addition to the contained elements, the hash function, predicate, and maximum load factor are copied.	Average case linear in `b.size()`, worst case quadratic.
`b.hash_function()`	`hasher`	Returns the hash function out of which `a` was constructed.	constant
`b.key_eq()`	`key_equal`	Returns the key equality function out of which `a` was constructed.	constant
`a_uniq.insert(t)`	`std::pair<iterator, bool>`	Inserts `t` if and only if there is no element in the container with key equivalent to the key of `t`. The `bool` component of the returned pair indicates whether the insertion takes place, and the `iterator` component points to the element with key equivalent to the key of `t`.	Average case O(1), worst case O(`a_uniq.size()`).
`a_eq.insert(t)`	`iterator`	Inserts `t`, and returns an iterator pointing to the newly inserted element.	Average case O(1), worst case O(`a_uniq.size()`).
`a.insert(r, t)`	`iterator`	Equivalent to a.insert(t). Return value is an iterator pointing to the element with the key equivalent to that of `t`. The const iterator `r` is a hint pointing to where the search should start. Implementations are permitted to ignore the hint.	Average case O(1), worst case O(`a_uniq.size()`).
`a.insert(i, j)`	`void`	Pre: `i` and `j` are not iterators in `a`. Equivalent to `a.insert(t)` for each element in `[i,j)`.	Average case O(N), where N is `std::distance(i, j)`. Worst case O(N * `a.size()`).
`a.erase(k)`	`size_type`	Erases all elements with key equivalent to `k`. Returns the number of elements erased.	Average case O(`a.count(k)`). Worst case O(`a.size())`.
`a.erase(r)`	`void`	Erases the element pointed to by `r`.	Average case O(1), worst case O(`a.size()`).
`a.erase(r1, r2)`	`void`	Erases all elements in the range `[r1, t2)`.	Average case O(`std::distance(r1, r2)`), worst case O(`a.size()`).
`a.clear()`	`void`	Erases all elements in the container. Post: `a.size() == 0`	Linear.
`b.find(k)`	`iterator`; `const_iterator` for const `a`.	Returns an iterator pointing to an element with key equivalent to `k`, or `a.end()` if no such element exists.	Average case O(1), worst case O(`a.size()`).
`b.count(k)`	`size_type`	Returns the number of elements with key equivalent to `k`.	Average case O(1), worst case O(`a.size()`).
`b.equal_range(k)`	`std::pair<iterator, iterator>`; `std::pair<const_iterator, const_iterator>` for const `b`.	Returns a range containing all elements with keys equivalent to `k`. Returns `std::make_pair(a.end(), a.end())` if no such elements exist.	Average case O(`a.count(k)`). Worst case O(`a.size())`.
`b.bucket_count()`	`size_type`	Returns the number of buckets that `b` contains.	Constant
`b.max_bucket_count()`	`size_type`	Returns an upper bound on the number of buckets that `b` might ever contain.	Constant
`b.bucket(k)`	`size_type`	Returns the index of the bucket in which elements with keys equivalent to `k` would be found, if any such element existed. Post: the return value is in the range `[0, b.bucket_count())`.	Constant
`b.bucket_size(n)`	`size_type`	Pre: `n` is in the range `[0, b.bucket_count())`. Returns the number of elements in the `n`^th bucket.	O(`a.bucket_size(n)`)
`b.begin(n)`	`local_iterator`; `const_local_iterator` for const `b`	Pre: `n` is in the range `[0, b.bucket_count())`. Note: `[b.begin(n), b.end(n))` is a valid range containing all of the elements in the `n`^th bucket.	Constant
`b.end(n)`	`local_iterator`; `const_local_iterator` for const `b`	Pre: `n` is in the range `[0, b.bucket_count())`.	Constant
`b.load_factor()`	`double`	Returns the average number of elements per bucket.	Constant
`b.max_load_factor()`	`double`	Returns a number that the container attempts to keep the load factor less than or equal to. The container automatically increases the number of buckets as necessary to keep the load factor below this number. Post: return value is positive.	Constant
`a.max_load_factor(z)`	`void`	Pre: `z` is positive. Changes the container's maximum load load factor. Post: `a.max_load_factor() == z`	Constant
`a.rehash(n)`	`void`	Pre: `n > a.size() / a.max_load_factor()`. Changes the number of buckets so that it is at least `n`.	Average case linear in `a.size()`, worst case quadratic.

The iterator types iterator and const_iterator of a hashed associative container are of at least the forward iterator category. For hashed associative containers where the key type and value type are the same, both iterator and const_iterator are const iterators.

The insert members shall not affect the validity of references to container elements, but may invalidate all iterators to the container. The erase members shall invalidate only iterators and references to the erased elements.

2. Exception safety guarantees

Add the following bullet items to the list of exception safety guarantees in clause 23.1, paragraph 10:

B. Hash Function

1. To be added to the <functional> synopsis

2. Class template hash

The function object hash is used as the default hash function by the hashed associative containers. This class template is only required to be instantiable for integer types (3.9.1), floating point types (3.9.1), pointer types (8.3.1), and (for any valid set of charT, traits, and Alloc) std::basic_string<charT, traits, Alloc>.

The return value of operator() is unspecified, except that equal arguments yield the same result.

C. Hashed Associative Containers

1. Header <hash_set> synopsis

2. Header <hash_map> synopsis

3. Class template hash_set

A hash_set is a kind of hashed associative container that supports unique keys (a hash_set contains at most one of each key value) and in which the elements' keys are the elements themselves.

A hash_set satisfies all of the requirements of a container and of a hashed associative container. It provides the operations described in the preceding requirements table for unique keys; that is, a hash_set supports the a_uniq operations in that table, not the a_eq operations. For a hash_set<Value> the key type and the value type are both Value. The iterator and const_iterator types are both const iterator types. It is unspecified whether or not they are the same type.

This section only describes operations on hash_set that are not described in one of the requirement tables, or for which there is additional semantic information.

Effects: Constructs an empty hash_set using the specified hash function, key equality function, and allocator, and using at least n buckets. If n is not provided, the number of buckets is implementation defined. max_load_factor() is 1.0.

Effects: Constructs an empty hash_set using the specified hash function, key equality function, and allocator, and using at least n buckets. (If n is not provided, the number of buckets is implementation defined.) Then inserts elements from the range [first, last). max_load_factor() is 1.0.

4. Class template hash_map

A hash_map is a kind of hashed associative container that supports unique keys (a hash_map contains at most one of each key value) and that associates values of another type mapped_type with the keys.

A hash_map satisfies all of the requirements of a container and of a hashed associative container. It provides the operations described in the preceding requirements table for unique keys; that is, a hash_map supports the a_uniq operations in that table, not the a_eq operations. For a hash_map<Key, T> the key type is Key, the mapped type is T, and the value type is std::pair<const Key, T>.

This section only describes operations on hash_map that are not described in one of the requirement tables, or for which there is additional semantic information.

Effects: Constructs an empty hash_map using the specified hash function, key equality function, and allocator, and using at least n buckets. If n is not provided, the number of buckets is implementation defined. max_load_factor() is 1.0.

Effects: Constructs an empty hash_map using the specified hash function, key equality function, and allocator, and using at least n buckets. (If n is not provided, the number of buckets is implementation defined.) Then inserts elements from the range [first, last). max_load_factor() is 1.0.

Effects: If the hash_map does not already contain an element whose key is equivalent to k, inserts std::pair<const key_type, mapped_type>(k, mapped_type()).

Returns: A reference to x.second, where x is the (unique) element whose key is equivalent to k.

5. Class template hash_multiset

A hash_multiset is a kind of hashed associative container that supports equivalent keys (a hash_multiset may contain multiple copies of the same key value) and in which the elements' keys are the elements themselves.

A hash_multiset satisfies all of the requirements of a container and of a hashed associative container. It provides the operations described in the preceding requirements table for equivalent keys; that is, a hash_multiset supports the a_eq operations in that table, not the a_uniq operations. For a hash_multiset<Value> the key type and the value type are both Value. The iterator and const_iterator types are both const iterator types. It is unspecified whether or not they are the same type.

This section only describes operations on hash_multiset that are not described in one of the requirement tables, or for which there is additional semantic information.

Effects: Constructs an empty hash_multiset using the specified hash function, key equality function, and allocator, and using at least n buckets. If n is not provided, the number of buckets is implementation defined. max_load_factor() is 1.0.

Effects: Constructs an empty hash_multiset using the specified hash function, key equality function, and allocator, and using at least n buckets. (If n is not provided, the number of buckets is implementation defined.) Then inserts elements from the range [first, last). max_load_factor() is 1.0.

6. Class template hash_multimap

A hash_multimap is a kind of hashed associative container that supports equivalent keys (a hash_multimap may contain multiple copies of each key value) and that associates values of another type mapped_type with the keys.

A hash_multimap satisfies all of the requirements of a container and of a hashed associative container. It provides the operations described in the preceding requirements table for equivalent keys; that is, a hash_multimap supports the a_eq operations in that table, not the a_uniq operations. For a hash_multimap<Key, T> the key type is Key, the mapped type is T, and the value type is std::pair<const Key, T>.

This section only describes operations on hash_multimap that are not described in one of the requirement tables, or for which there is additional semantic information.

Effects: Constructs an empty hash_multimap using the specified hash function, key equality function, and allocator, and using at least n buckets. If n is not provided, the number of buckets is implementation defined. max_load_factor() is 1.0.

Effects: Constructs an empty hash_multimap using the specified hash function, key equality function, and allocator, and using at least n buckets. (If n is not provided, the number of buckets is implementation defined.) Then inserts elements from the range [first, last). max_load_factor() is 1.0.

V. Unresolved issues

The following issues have been raised and not addressed, or have been addressed in a way that some people may consider inadequate.

1. Naming: should the container classes be hash_set, hash_map, hash_multiset, and hash_multimap as proposed, or should the names be changed to something that hasn't been used in existing libraries?

2. Naming: should the new headers be <hash_set> and <hash_map>, or should the names be changed to remove the underscore?

3. Naming: Howard proposes changing the name rehash to an overload of bucket_count. Should we do that?

4. Naming: what should be the name for hash_function's return type? This proposal, following the original Barreiro/Fraley/Musser proposal, chooses "hasher". That sounds funny. Do we care? If so, is there a better choice?

5. Hash table equality. From the container requirements, we know that two hash tables x and y are equal if and only if the expression std::equal(x.begin(), x.end(), y.begin()) returns true. This is very stringent! Two hash_set objects may contain exactly the same elements and still not be equal. (If the elements were added in a different order, or if they have different bucket counts.) Arguably operator== isn't very useful. But if we do believe it's useful, then should we guarantee that copy construction and copy assignment preserve equality in the usual way, e.g. that the postcondition for x = y is that x == y? This guarantee imposes a significant burden on implementers.

6. Interface: should we have policy classes (or some other mechanism) to affect (a) whether hash codes are stored; and/or (b) whether the hash table uses forward iterators or bidirectional iterators?

7. Iterator complexity. The container requirements specify that x.begin() is O(1). Implementations can do this, but it's a burden. Is it worth requiring them to do that? (Note that we're implicitly making that requirement just by saying that a hash table is a container.) This is an annoying problem: on the one hand, we don't want to impose a requirement that may be widely ignored. On the other hand, we don't want to do something so drastic as changing the container requirements.

8. Pairs and combining. Should we define a general hash combiner, that takes two hash codes and gives a hash code for the combination? Should we define a default hash function for std::pair<T,U>? (A 'yes' answer to the latter question essentially implies a 'yes' answer to the former.)

9. Default hash function. What, if anything, should the generic hash<T> do? In this proposal it's left undefined.

10. Bucket interface. Should it be kept as is, or should it be changed to a more container-like interface? (e.g. bucket(n) might have a return type const bucket_type&, where bucket_type is an implementation defined container type.

	Using equality	Using less-than
Failed search	n	n/2
Successful search	n/2	n/2 + 1

A Proposal to Add Hash Tables to the Standard Library (revision 3)

I. Motivation

II. Impact On the Standard

III. Design Decisions

A. Packaging Issues

B. Chaining Versus Open Addressing

C. Lookup Within a Bucket

D. The Hash Function Interface

E. Control of Hash Resizing

F. Iterators

G. Bucket Interface

H. Exception guarantees

I. Stored hash codes

IV. Proposed Text

A. Requirements

1. To be added as a separate requirements section, following clause 23.1.2

2. Exception safety guarantees

B. Hash Function

1. To be added to the <functional> synopsis

2. Class template `hash`

C. Hashed Associative Containers

1. Header <hash_set> synopsis

2. Header <hash_map> synopsis

3. Class template `hash_set`

4. Class template `hash_map`

5. Class template `hash_multiset`

6. Class template `hash_multimap`

V. Unresolved issues

VI. Revision history

VII. References

A Proposal to Add Hash Tables to the Standard Library (revision 3)

I. Motivation

II. Impact On the Standard

III. Design Decisions

A. Packaging Issues

B. Chaining Versus Open Addressing

C. Lookup Within a Bucket

D. The Hash Function Interface

E. Control of Hash Resizing

F. Iterators

G. Bucket Interface

H. Exception guarantees

I. Stored hash codes

IV. Proposed Text

A. Requirements

1. To be added as a separate requirements section, following clause 23.1.2

2. Exception safety guarantees

B. Hash Function

1. To be added to the <functional> synopsis

2. Class template hash

C. Hashed Associative Containers

1. Header <hash_set> synopsis

2. Header <hash_map> synopsis

3. Class template hash_set

4. Class template hash_map

5. Class template hash_multiset

6. Class template hash_multimap

V. Unresolved issues

VI. Revision history

VII. References

2. Class template `hash`

3. Class template `hash_set`

4. Class template `hash_map`

5. Class template `hash_multiset`

6. Class template `hash_multimap`