Hashing User-Defined Types in C++1y

Background

C++11 defined a set of standard hashing containers, which rely on a specialization of std::hash<KeyType> to exist in order to hash their keys. However, we provided no help to users trying to implement this hash function. We also required them to specialize a template in namespace std as opposed to the namespace their code actually lives in. This leads to confused users and weak hash functions. For example:

namespace my_namespace {
  struct OtherType; // Defined elsewhere.
  struct MyType {
    OtherType field1;
    int field2;
    char field3[500];
  };
}
namespace std {
  template<> struct hash<::my_namespace::MyType> {
    size_t operator()(const MyType& val) {
      return (hash<::my_namespace::OtherType>()(field1) ^
              // Wow, that's verbose, and the xor makes it weak.
              hash<int>()(field2) ^
              // Oh noes, a copy:
              hash<std::string>()(field3, field3 + 500));
    }
  };
}

This paper proposes an improvement to the situation. But first a digression on the purpose of "hash" functions:

Hashing vs Fingerprinting

There are several uses for this thing we refer to as hashing.

The use in the standard library is to look things up in a hash table that is local to the current process.
We may want to build a similar table that's distributed across several machines running either several copies of the same binary, or several binaries.
We may also want to save the table to disk and keep it for several years.
We might hash some strings, and then compare just the hash values, trusting probability to protect us from collisions.
We might wish to use the hash as part of a security-sensitive system. This generally requires a cryptographic hash.

No single algorithm is optimal for all of the above use cases. If we try to make std::hash<> fill all of those roles, it will be the wrong choice for all of them instead.

This paper proposes that std::hash<> be designed for only the first use in the above list, and focuses on making it as simple as possible for the average user of standard library hashing containers. A later paper is likely to explore the more sophisticated interface needed to make user-defined types visible to other "fingerprinting" algorithms.

Hash tables only require that std::hash<T>()(object) return a stable value within a single program execution. In theory this allows the standard library to improve the hash implementation over time. However, we found at Google that without very clear documentation, we upgrade hash implementations rarely enough that developers save the values to disk anyway, which causes upgrades to break code. Therefore, we propose that hash functions return a different value in different processes, not just in different implementations, so that programs fail fast when they're incorrect. This has the secondary benefit that it frustrates pre-calculated collision tables used for denial-of-service attacks.

Defining hash functions easily and in a sensible namespace

At Google, we find many users who are confused about what kinds of things they're allowed to put in namespace std. Requiring them to specialize std::hash<> encourages them to define new things in there, which causes undefined behavior.

Even more users dislike the verbosity caused by the need to close out their current namespace, open the std namespace, specialize hash, close the std namespace, and then re-open their original namespace.

To fix this, we propose finding hash_value() by argument-dependent lookup, like swap. std::hash<T>::operator()(t) is redefined to forward to hash_value(t).

This leaves the problem of how to hash built-in types. In c++std-lib-31719, Peter Dimov observed that a "practical problem with the hash_value design is that types implicitly convertible to bool such as shared_ptr have a hash_value [of] 0 or 1 unless overridden." C++11's std::shared_ptr only has an explicit conversion to bool, so this objection doesn't apply there, but it would be poor form to lay such traps for user-defined types that aren't using new C++11 features. Instead, define:

namespace std {
  template<typename Bool>
  typename enable_if<is_same<Bool, bool>::value, size_t>::type
  hash_value(Bool b) { return ...; }
}

struct BoolConvertible { operator bool() { return true; } };
void f() {
  BoolConvertible bc;
  using std::hash_value;
  hash_value(bc);  // error
}

This templating should be done for every hash_value(primitive) overload to avoid similar conversion mistakes. Other library, extension, or language techniques to prevent implicit conversions to the argument types should also be allowed.

In order to keep compatibility with existing specializations of std::hash, the standard library should continue calling std::hash<T>()(value) where it needs a hash value, but user code that doesn't need such compatibility can move to the more natural hash_value(value).

Templated user code needs to use the more verbose

using std::hash_value;
hash_value(value);

in order to pick up the definitions for primitive types. Perhaps the standard library should also include an adl_hash_value() with that definition to make this a bit less verbose. On the other hand, only hash table implementations are likely to call hash_value, so adl_hash_value may not be worth it.

Helping users implement their `hash_value()`

If users are asked to implement a hash function for their own types with no guidance, they generally write bad hash functions. Instead, we should provide a simple function to pass hash-relevant member variables into, in order to define a decent hash function:

namespace my_namespace {
  struct OtherType { ... };
  struct MyType {
    OtherType field1;
    int field2;
    char field3[500];
    float field4;  // Does not contribute to equality.
  };
  std::hash_code hash_value(const MyType& val) {
    return std::hash_combine(val.field1,
                             val.field2,
                             val.field3);
  }
}

std::hash_code will be explained and defined later.

`hash_combine()`

hash_combine() is declared as:

namespace std {
  template<typename T, typename... U>
  hash_code hash_combine(const T& arg1, const U& ...args);
}

and has an implementation-defined definition satisfying the following properties:

hash_combine(args...) interacts with user-defined types by calling unqualified hash_value() on them.
Within the same process, if all_of((args1 == args2)...) then hash_combine(args1...) == hash_combine(args2...).
If two calls to hash_combine() aren't defined to return equal values, then their return values must be completely different with high probability See http://en.wikipedia.org/wiki/Avalanche_effect for a possible (but not the only possible) meaning of "completely different with high probability". Situations that don't require equal hash_combine results include:
1. changing a single bit in any of hash_combine()'s arguments. (This includes changing hash_combine(false) to hash_combine(true).)
2. calling hash_combine in a different execution of the same binary.
3. Replacing hash_combine(arg1, arg2, arg3) with hash_combine(hash_combine(arg1, arg2), arg3)

A downside to this approach is that modern hash functions contain initialization and finalization phases that will be re-run for each call to hash_combine, where they could be skipped if the hash algorithm knew it was only dealing with an intermediate result. The authors of CityHash (Geoff Pike) and MurmurHash (Austin Appleby) don't think this inefficiency will be a significant problem in practice.

Different processes have different hash values

Ensuring that the result of hash_combine is different in different processes requires picking a seed at process start time. I know of three ways to do this:

Use the technique of <iostream>: Define a static object in the header that defines hash_combine(), and have it initialize the global seed if it's the first such object to be constructed.

Implement hash_combine like:

atomic<seed_t> global_seed;
hash_combine(...) {
  seed_t seed = global_seed.load(memory_order_relaxed);
  if (seed == 0) seed = initialize_global_seed_to_nonzero();
  ...
}

Implement a get_seed() function as:
```
seed_t get_seed() {
  static const seed_t seed = initialize_global_seed();
  return seed;
}
```
The implementation of this is likely to be similar to, but not quite as efficient as, option 2.

Implementations should provide a way for users to set the seed explicitly, primarily to track down bugs that result from assuming a stable hash value.

This is actually stricter than the current requirements on the Hash concept and std::hash, which only require that k1==k2 imply h(k1)==h(k2) for a single hash instance 'h'. We're tightening this to require that two std::hash<T> instances in the same process return the same value for equal inputs.

Hashing ranges

It's possible to hash ranges of values using code like:

hash_code hash_value(const MyType& val) {
  hash_code result = 0;
  for (const auto& v : val) {
    result = hash_combine(result, c);
  }
  return result;
}

However, this code is likely to be slower than it needs to be because the hash algorithm needs to initialize and finalize for each call to hash_combine. Instead, we define:

namespace std {
  template<typename InputIterator>
  hash_code hash_combine_range(InputIterator first,
                               InputIterator last);
}

which combines hashes for each element in the range. If we had a std::range<Iterator> object, we could define hash_value(std::range) directly in terms of this. The semantics of the combining for both hash_combine and hash_combine_range are the same, leading to the following invariants:

std::vector<int> v = {1, 2, 3};
assert(std::hash_combine(1, 2, 3) ==
       std::hash_combine_range(v.begin(), v.end()));
high_probability(std::hash_combine(1, 2, 3) !=
                 std::hash_combine(
                   std::hash_combine_range(v.begin(), v.end())));

// And likely following from the above:
assert(std::hash_combine(1, 2, 3) != std::hash_combine(v));

Note that this requires that ranges of equal values hash equally, even if they have different types. This could require more implementation complexity in implementations that want to hash ranges of characters very efficiently, but it appears doable.

Hashing objects as byte arrays

Users may also want to improve efficiency by hashing several sequential standard-layout and trivially-copyable fields as a single byte array, rather than passing them each individually to hash_combine. Unfortunately, those requirements aren't enough: the fields also have to be free of padding. To handle this, we propose the following type trait and a third hashing function.

namespace std {
  template<typename T> is_contiguous_layout;
  template<typename T>
  hash_code hash_as_bytes(const T& obj);
}

std::is_contiguous_layout<T> shall be a UnaryTypeTrait whose BaseCharacteristic is: std::true_type if all bits of the object representation participate in the value representation, or std::false_type otherwise. In other words, if there are any padding bits in the bytes making up an object of type T, then is_contiguous_layout<T> inherits from false_type. is_contiguous_layout would also be useful in optimizing uses of compare_exchange.

hash_as_bytes(obj) is ill-formed (a diagnostic is required) if the type T of obj does not satisfy

is_trivially_copyable<T>::value &&
is_standard_layout<T>::value &&
is_contiguous_layout<T>::value

hash_as_bytes(args...) is implemented as:

unsigned char* addr = reinterpret_cast<unsigned char*>(&obj);
return hash_range(addr, addr + sizeof(obj));

If a user-defined type has a subset of fields that satisfy the requirements for hash_as_bytes(), its author can place those fields in a sub-struct.

Note: We considered allowing users to pass multiple objects to hash_as_bytes, and having it treat them as a single long byte array, optimizing cases where objects are adjacent. However, because fast hash functions operate block-at-a-time, this would be difficult to implement without either copying data or changing the resulting hash value.

It would be possible to perform the above optimization automatically inside some calls to hash_combine() where arguments are adjacent and appropriate sizes, but Austin Appleby believed this was likely to surprise people, so we have a way for users to request it explicitly.

The `hash_code` type

The authors of this paper initially believed that having all intermediate hash values bottleneck through size_t would increase the collision rate and reduce performance. Geoff Pike and Austin Appleby agreed that size_t does increase the collision rate and reduce performance slightly, but not enough to justify the extra complexity of avoiding it. So, algorithmically, it would be fine to have all hash functions return size_t.

However, Austin argued that "if hash() doesn't behave like a function in the mathematical 'y = f(x)' sense due to the intentionally unstable implementation, then hash_code should not behave like a number and should be as opaque as possible" in order to avoid user confusion. We've decided to have the three hash utility functions return a new hash_code object:

namespace std {
  class hash_code {
    // Implementation-specific state; probably size_t.
  public:
    // For compatibility with existing hash functions
    hash_code(size_t value);
    // Copyable, assignable.
    hash_code(const hash_code&);
    hash_code& operator=(const hash_code&);
    // Explicit conversion to size_t.
    explicit operator size_t() const;
  }
  bool operator==(const hash_code&, const hash_code&);
  bool operator!=(const hash_code&, const hash_code&);
  hash_code hash_value(const hash_code&);
}

For backward compatibility with existing hash tables using std::hash<T>, its operator() still needs to return size_t.

Future work

A single hash function that's easy to implement isn't enough for all use cases. The standard library should offer a collection of stable Fingerprint algorithms so that users can share their results between different implementation, and get more specialized characteristics in the output. These algorithms need a way to traverse user-defined types, and the above hash_value() function doesn't directly generalize. It is outside the scope of this paper to discuss approaches to generic fingerprinting algorithms.

Acknowledgements

Thanks to Geoff Pike (the author of CityHash) and Austin Appleby (the author of MurmurHash) for providing expert guidance. Thanks to Lawrence Crowl, Matt Austern, and Richard Smith for helping with the C++ interfaces and editing this paper.