
Customizing TR1 unordered_map Hashing and Equality Functions

A common problem people run into when using the C++ std::tr1::unordered_map or std::tr1::unordered_set is figuring out how to use a custom or aggregate type as the key in the hash table.

There isn’t really any official documentation about this. You can find some forum and StackOverflow discussions that may help, but I’ll run through a detailed example here too. Let’s start by taking a look at how the unordered_map class is defined:

template<class Key,
         class T,
         class Hash = hash<Key>,
         class Pred = std::equal_to<Key>,
         class Alloc = std::allocator<std::pair<const Key, T> > >
  class unordered_map;

The template parameters of the class are:

  • class Key: key type (duh)
  • class T: value type
  • class Hash: hash function for type Key
  • class Pred: an equality predicate for type Key
  • class Alloc: the memory allocator for the stored std::pair<const Key, T> elements

The functor named by Hash produces a hash value given a reference to a Key. This is the value the map uses to locate the appropriate bucket in which to insert the corresponding value. The functor named by Pred is a predicate; it returns true if two Key values (as opposed to their hash values) are equal. If the Key type is a class with operator== () defined, the default std::equal_to<Key> predicate will work as-is.
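For example, with a key type that TR1 already provides a hash<> specialization for (the integral types, floating-point types, and std::string), the defaults work out of the box. A quick sketch; note that the header location varies by toolchain, with GCC using <tr1/unordered_map>:

#include <string>
#include <tr1/unordered_map>

void example ()
{
  // the default hash<std::string> and equal_to<std::string> are used here
  std::tr1::unordered_map<std::string, int> counts;
  counts["apple"] += 1;
}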

Changing the Hash and Pred types is easy, but non-obvious. The most common reasons for doing so are using a custom type as the key of the hash table, as I mentioned above, or swapping in a custom hash function to achieve better performance from the container.

Let’s say we have the following setup:

#include <stdint.h>
#include <tr1/unordered_map>

typedef struct
{
  uint32_t a;
  double x;
  double y;
} AggregateKey;

{ ... }

AggregateKey k;

{...}

std::tr1::unordered_map<AggregateKey, int> M;
M[k] = 1;

{...}

Attempting to compile this will generate a ton of error output. Depending on the situation, it’ll complain about Pred with something like

error: no match for ‘operator==’ in ‘__x == __y’

or about Hash with

undefined reference to `std::tr1::hash<AggregateKey>::operator()(AggregateKey) const'

Fix the issue by defining the following functors:

typedef struct
{
  std::size_t operator() (const AggregateKey &k) const { return my_hash_fnct (k); }
} AggregateKeyHash;

typedef struct
{
  bool operator() (const AggregateKey &x, const AggregateKey &y) const { return my_eq_test (x, y); }
} AggregateKeyEq;

AggregateKey k;

{...}

// Now, hash value generation and equality testing are
// defined for the AggregateKey type, so declare the
// map using the functors above in the object's template list
std::tr1::unordered_map<AggregateKey, int, AggregateKeyHash, AggregateKeyEq> M;
M[k] = 1;

{...}
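The my_hash_fnct and my_eq_test calls above are placeholders for whatever hashing and comparison logic fits your type. As a minimal sketch (the mixing constants follow boost::hash_combine and are just one reasonable choice), they might look like this:

#include <tr1/functional>

std::size_t my_hash_fnct (const AggregateKey &k)
{
  // combine the members' individual TR1 hashes into one value
  std::size_t seed = std::tr1::hash<uint32_t> () (k.a);
  seed ^= std::tr1::hash<double> () (k.x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  seed ^= std::tr1::hash<double> () (k.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  return seed;
}

bool my_eq_test (const AggregateKey &x, const AggregateKey &y)
{
  // memberwise comparison; the usual caveats about comparing
  // doubles with == apply here
  return x.a == y.a && x.x == y.x && x.y == y.y;
}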

The same logic applies to the std::tr1::unordered_set, with the exception that its template list doesn’t include a class T value type. Also note that you can grab a map’s hash function and equality predicate at any time from the object:

std::tr1::unordered_map<int, int> M;

const std::tr1::unordered_map<int, int>::hasher &hfn = M.hash_function ();
const std::tr1::unordered_map<int, int>::key_equal &eqfn = M.key_eq ();

std::size_t h = hfn (123);
bool b = eqfn (1, 2);

Hash Functions

Hash functions are a particular interest of mine, and have been ever since back in university when I began writing password crackers for GNU/Linux systems. I like talking about them so I figured I’d ramble a little bit on here about the subject.

There are two principal uses for these functions:

  1. Cryptography
  2. Hash table mapping

These two uses should probably be considered mutually exclusive: a crypto-grade hash function is (relatively) expensive to compute, and a quick hash-table hash function takes no steps to prevent reversibility. In cryptography, hash functions are primarily used for things like computing (relatively) unique digests of messages or files, or creating digital signatures and certificates. The most common algorithms in this category are MD5 and SHA-1. However, both of these algorithms have had flaws discovered by security researchers and are considered compromised for critical cryptographic use. Vulnerabilities have yet to be discovered in the SHA-2 family and WHIRLPOOL, which are currently considered safer alternatives.

Hash functions are typically described by an expression such as this:

h : {0,1}* -> {0,1}^n

Meaning, a hash function h(M) takes an arbitrarily long binary message M and computes a binary digest of fixed length n. Because we’re mapping an infinite space of binary messages onto a finite set of outputs, we can’t expect the hash of every message to be unique; this is the Pigeonhole Principle. But there are some requirements h(M) must meet in order to be considered a secure hash function:

  1. Pre-image resistance: Given a hash output H, it should be computationally infeasible to find an M such that h(M) = H. This should require about 2^n hashing operations.
  2. Second pre-image resistance: Given M and its output h(M) = H, it should be computationally infeasible to find another M’ such that h(M) = h(M’). Again, this should require about 2^n hashing operations.
  3. Collision resistance: It should be computationally infeasible to find any two different messages M and M’ such that h(M) = h(M’), taking at least 2^(n/2) operations. Only this many hash operations should be required because of the Birthday Paradox; for a 128-bit digest, that’s roughly 2^64 operations.

If a proposed attack on a cryptographic hash function violates one of the above properties, the system has been compromised. Note that the biggest difference between (2) and (3) above is that (3) applies to any pair of messages M, whereas (2) refers to finding a collision for a specific message.

In general, a secure hash function will satisfy the following (non-exhaustive) list of properties:

  • The algorithm should be efficient; it should be relatively quick to compute the message digest
  • The produced hash should be irreversible
  • The hash is deterministically created, and all hashes are the same size
  • The hash should indicate nothing about the input message (see avalanche effect)
  • Produced hashes should represent a good distribution over the available finite hash domain

Both MD5 and the current SHA families, as well as many other cryptographic hash functions, are based on an iterative principle called the Merkle-Damgård construction. It works something like the following:

  1. Pad the message to be hashed until its length is some multiple of some desired message block size
  2. Divide the message into blocks
  3. Iteratively mix and compress data from the message and from previous rounds together, passing through nonlinear one-way compression functions, and possibly mixed with auxiliary disturbance vectors
  4. Condense the intermediate hash data into the final hash value of some standard size and output
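As a toy illustration (emphatically not a secure hash!), the chaining loop might look like the sketch below, with compress() standing in for the one-way compression function:

#include <stdint.h>
#include <stddef.h>
#include <vector>

static uint64_t compress (uint64_t state, const unsigned char *block)
{
  // stand-in for a real one-way compression function; this is just
  // FNV-style byte mixing
  for (size_t i = 0; i < 8; ++i)
    state = (state ^ block[i]) * 0x100000001b3ULL;
  return state;
}

uint64_t toy_md_hash (const unsigned char *msg, size_t len)
{
  // 1. pad the message out to a multiple of the 8-byte block size
  std::vector<unsigned char> padded (msg, msg + len);
  padded.push_back (0x80);
  while (padded.size () % 8 != 0)
    padded.push_back (0x00);

  // 2./3. divide into blocks and iteratively compress, chaining the
  // intermediate state (seeded with an arbitrary initial mixing value)
  uint64_t state = 0xcbf29ce484222325ULL;
  for (size_t i = 0; i < padded.size (); i += 8)
    state = compress (state, &padded[i]);

  // 4. finalize and output; real designs also fold the message
  // length into this step
  return state;
}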

Graphically, it looks like this:

  • B1..6 are the blocks of the message to be hashed
  • The initial mixing values are just some numbers selected to mix in with the initial message block
  • Each function f() is a one-way compression function (or collection of them)
  • The final function g() collects and finalizes all data from the iterative rounds, and produces the hash

Although this has been the way cryptographic hash functions have traditionally been designed, some newer algorithms follow a different model. For example, MD6 uses a hash tree to allow parallel computation of the hash, and WHIRLPOOL uses a Miyaguchi-Preneel-based compression function instead of the traditional Merkle-Damgård one. There’s been some interesting research into the parallel computation of hashes; it’s been reported that parallel implementations of MD6 can reach over 1 Gb/s in processing speed. Note, however, that non-cryptographic hashes designed for speed can operate at two to three times that in a single thread of execution. MurmurHash is probably my favourite algorithm in this category.

The National Institute of Standards and Technology is actually holding a cryptographic hash function competition to replace the SHA-2 family of hash functions with a modern and more secure variant, which they’ll call SHA-3.


Bloom Filters

I’ve always thought that bloom filters were really cool data structures. They belong to a small class of data structures described as “probabilistic”, meaning there’s a trade-off between performance and accuracy. CS people know about trade-offs all too well, whether they relate to space vs. time or to approximations of expensive algorithms. Bloom filters kind of fall into both categories.

The purpose of a bloom filter is to indicate, with some chance of error, whether an element belongs to a set. The error is one-sided: the bloom filter may indicate that an element is in the set when it in fact is not (a false positive), but the reverse is not possible; if an element is in the set, the bloom filter cannot report that it is absent (no false negatives).

So how does it work?

A bloom filter requires a set of k hash functions (one of my favourite topics) to generate indices into a table. These can either be a series of fairly simple and quick functions, or some manipulation of one or a few hash functions. For example, you could run your data through SHA-1 and extract integral types to use as indices, or use a faster hash function like MurmurHash with several different seeds.
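As a sketch of that seeded approach, the k indices can be derived by re-running one seeded hash with k different seed values. Here FNV-1a stands in for a real seeded hash like MurmurHash:

#include <stdint.h>
#include <stddef.h>

static uint32_t seeded_hash (const unsigned char *data, size_t len,
                             uint32_t seed)
{
  // FNV-1a with the seed folded into the offset basis; a stand-in
  // for a real seeded hash like MurmurHash
  uint32_t h = 2166136261u ^ seed;
  for (size_t i = 0; i < len; ++i)
    h = (h ^ data[i]) * 16777619u;
  return h;
}

void bloom_indices (const unsigned char *data, size_t len,
                    size_t k, size_t table_bits, size_t *out)
{
  // generate k table indices, one per seed
  for (size_t i = 0; i < k; ++i)
    out[i] = seeded_hash (data, len, (uint32_t) i) % table_bits;
}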

Each of these k hashes is mapped to a bit in a bitmap array, which is initialized to contain all zeroes. Think of it like a hash table where each entry is hashed to multiple locations. But instead of storing the hash and the object, you set the bit at each location to 1. Because we only care about existence or non-existence in this set, we don’t need to store any additional information.

(Figure: elements hashed into the bitmap by k hash functions; shamelessly taken from Wikipedia, and yes, I checked the license first =)

So, for example, let’s say that we’ve inserted elements x, y, and z into the bloom filter, with k = 3 hash functions, as in the figure above. Each of these three elements has three bits set to 1; when a bit is already set to 1, it’s left that way. When we try to look up element w in the set, the bloom filter tells us that w isn’t an element, because at least one of its bitmap indices is 0.

This is the reason bloom filters demonstrate false positives but not false negatives. A new element may overlap with the combined bitmap entries of many different elements already in the set. However, there is no operation that sets a bit to 0, so if a bitmap entry is 0, no element that maps to that location was ever inserted into the set. As a related side note, it is extremely difficult to come up with a scheme that removes elements from these sets; usually they are just rebuilt instead.

So why is this useful, as opposed to a hashtable-based set? The space requirements of a bloom filter are significantly lower than those of a hashtable implementation, because:

  • Bloom filters use bits instead of larger elements to determine set existence
  • These bits may overlap between set entries, so not every entry has k dedicated bits

It’s also interesting that the insertion and checking operations have a time complexity of O(k), dependent solely on the number of hash functions rather than the number of elements inserted into the set. There’s no probing or other collision-resolution strategy here, as there is with hash tables. There are “collisions” in the sense of false positives, but these don’t impact performance, and with a large enough bitmap and a well-chosen k they can be greatly minimized. In addition, these data structures are typically employed in non-critical services (such as caching existence of data) under the assumption that false positives may occur.
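For concreteness, the standard analysis gives the false positive rate for an m-bit bitmap, n inserted elements, and k hash functions as approximately (1 - e^(-kn/m))^k, which is minimized by choosing k = (m/n) ln 2.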

Here’s some Python that illustrates insertion and existence-checking operations on a simple bloom filter:


# Assumption: 8 bits in a byte

def init(N, hashfnct_list):
    # allocate ceil(N / 8) zeroed bytes to hold an N-bit bitmap
    global __bitmap_vector, __fnct_list, __N
    __bitmap_vector = bytearray((N + 7) // 8)
    __fnct_list = hashfnct_list
    __N = N

def insert(data):
    # iterate over each hash function in the list and hash the data;
    # we could even do this in parallel
    for f in __fnct_list:
        # mod the hash by __N because that is the size of our
        # bitmap vector in bits
        h = f(data) % __N
        # m is the byte index; the hash's remainder mod 8 is the
        # bit index within that byte
        m = h // 8
        __bitmap_vector[m] |= 1 << (h % 8)

def contains(data):
    for f in __fnct_list:
        h = f(data) % __N
        m = h // 8
        # note the parentheses: == binds tighter than & in Python
        if (__bitmap_vector[m] & (1 << (h % 8))) == 0:
            return False
    return True

There are several variations on this basic implementation which extend the feature set of bloom filters.