Hash functions are a particular interest of mine, and have been ever since back in university when I began writing password crackers for GNU/Linux systems. I like talking about them so I figured I’d ramble a little bit on here about the subject.

There are two principal uses for these functions:

- Cryptography
- Hash table mapping

These two uses should probably be considered mutually exclusive. A crypto-based hash function is (relatively) expensive to compute, and a quick hash-table hash function does not actively take steps to prevent reversibility. In cryptography, hash functions are primarily used for things like computing (relatively) unique digests of messages or files, or creating digital signatures and certificates. The most common algorithms belonging to this category are MD5 and SHA-1. However, both of these algorithms have had flaws discovered by security researchers, and are considered compromised for critical cryptographic use. Vulnerabilities have yet to be discovered in the SHA-2 family and WHIRLPOOL which are currently considered safer alternatives.

Hash functions are typically described by an expression such as this:

*h* : {0,1}* -> {0,1}^{n}

Meaning, a hash function *h(M)* takes an arbitrarily long binary message *M* and computes a finite binary sequence of length *n*. Because we’re mapping from a space of infinite binary combinations to a finite string, we can’t expect that the hash of every message is unique. This is the Pigeonhole Principle. But there are some requirements *h(M)* must meet in order to be considered a *secure *hash function:

*Pre-image resistance*: Given hash output*H*, it should be computationally infeasible to find an*M*such that*h(M)*=*H.*This should require about 2^{n }operations.*Second pre-image resistance:*Given*M*and its output from*h(M) -> H*, it should be computationally infeasible to find another*M’*such that*h(M) = h(M’)*. Again, this should require about 2^{n}hashing operations.*Collision resistance:*It should be computationally infeasible to find any two different messages*M*and*M’*such that*h(M) = h(M’),*taking at least 2^{n/2 }operations. It is likely that only these many hash operations would be required because of the Birthday Paradox.

If a proposed attack on a cryptographic hash function violates one of the above properties, the system has been compromised. Note that the biggest difference between (2) and (3) above is that (3) applies to *any* pair of messages *M*, whereas (2) refers to finding a collision for a specific message.

In general, a secure hash function will satisfy the following (non-exhaustive list) of properties:

- The algorithm should be efficient; it should be relatively quick to compute the message digest
- The produced hash should be irreversible
- The hash is deterministically created, and all hashes are the same size
- The hash should indicate nothing about the input message (see avalanche effect)
- Produced hashes should represent a good distribution over the available finite hash domain

Both MD5 and the current SHA families, as well as many other cryptographic hash functions, are based on an iterative principle called the Merkle-Damgaard construction. It works something like the following:

- Pad the message to be hashed until its length is some multiple of some desired message block size
- Divide the message into blocks
- Iteratively mix and compress data from the message and from previous rounds together, passing through nonlinear one-way compression functions, and possibly mixed with auxiliary disturbance vectors
- Condense the intermediate hash data into the final hash value of some standard size and output

Graphically, it looks like this:

- B1..6 are the blocks of the message to be hashed
- The initial mixing values are just some numbers selected to mix in with the initial message block
- Each function
*f()*is a one-way compression function (or collection of them) - The final function
*g()*collects and finalizes all data from the iterative rounds, and produces the hash

Although this has been the way cryptographic hash functions have traditionally been designed, there are some newer algorithms which have a different model. For example, MD6 uses a hash tree to allow parallel computation of the hash. WHIRLPOOL uses a Miyaguchi-Preneel based compression function instead of the traditional Merkle-Damgaard based one. There’s been some interesting research into the parallel computation of hashes; its been reported that parallel implementations of MD6 can reach over 1 Gb/s in processing speed. Note, however, that non-cryptographic hashes designed for speed can operate at two to three times that in a single thread of execution. MurmurHash is probably my favourite algorithm of this category.

The National Institute of Standards and Technology is actually holding a cryptographic hash function competition to replace the SHA-2 family of hash functions with a modern and more secure variant, which they’ll call SHA-3.

January 8th, 2011 at 17:52

+1 : MurmurHash! Also my fav.