I’ve always thought that bloom filters were really cool data structures. They belong to a small class of data structures described as “probabilistic”, meaning, there’s a trade-off between performance and accuracy. CS people know about trade-offs all too well, whether they’re related to space/time or approximations of expensive algorithms. Bloom filters kind of fall into both categories.
The purpose of a bloom filter is to indicate, with some chance of error, whether an element belongs to a set. This error refers to the fact that it is possible that the bloom filter indicates some element is in the set, when it in fact is not in the set (false positive). The reverse, however, is not possible – if some element is in the set, the bloom filter cannot indicate that it is not in the set (false negative).
So how does it work?
A bloom filter requires a set of k hash functions (one of my favourite topics) to generate indices into a table. These can either be a series of fairly simple and quick functions, or some manipulation of one or few hash functions. For example, you could run your data through SHA1 and extract integral types to use as indices. Or, use a faster hash function like murmur hash with several different seeds.
Each of these k hashes is mapped to a bit in a bitmap array, which is initialized to contain all zeroes. Think of it like a hash table where each entry is hashed to multiple locations. But instead of storing the hash and the object, you set the bit at each location to 1. Because we only care about existence or non-existence in this set, we don’t need to store any additional information.
So, for example, lets say that we’ve inserted elements x, y, and z into the bloom filter, with k = 3 hash functions, like above. Each of these three elements have three bits each set to 1. In the case when a bit is already set to 1, its left that way. When we try to look up element w in the set, the bloom filter tells us that w isn’t an element because at least one of its bitmap indices is 0.
This is the reason why bloom filters demonstrate false positives but not false negatives. A new element may overlap with the combined bitmap entries of many different elements already in the set. However, there is no operation that sets a bit to 0. So if a bitmap entry is 0, there was never an element inserted into this set that mapped to that location. As a related sidenote, it is extremely difficult to come up with a scheme that removes elements from these sets; usually they are just rebuilt instead.
So why is this useful, as opposed to a hashtable-set type data structure? The space requirements of a bloom filter are significantly lower than a hashtable implementation. This is because
- Bloom filters use bits instead of larger elements to determine set existence
- These bits may overlap between set entries, so not every entry has k dedicated bits
Its also interesting because the insertion and checking operations have a time complexity of O(k), based solely on the number of hash functions, instead of the number of elements inserted into the set. There’s no probing or other strategies used here to deal with collisions, as with hash tables. There are “collisions” in the sense of false positives, but these do not impact performance; however with a large enough bitmap and a high enough k value, these can be greatly minimized. In addition, these data structures are typically employed in non-critical services (such as caching existence of data) under the assumption that false positives may occur.
Here’s some python-esque pseudocode that illustrates insertion and existence-checking operations on a simple bloom filter:
# Assumption: 8 bits in a byte def init(N, hashfnct_list): __bitmap_vector = alloc_bytes ((N + 1) / 8 ) __fnct_list = hashfnct_list __N = N def insert(data): # iterate over each hash function in the list and hash the data # we could even do this in parallel for f in __fnct_list: # mod the hash by __N because that is the size # of our bitmap vector h = f(data) % __N m = h / 8 # m is the byte index, shift hash's remainder mod 8 # to find the bit index __bitmap_vector[m] = __bitmap_vector[m] | (1 << (h % 8)) def contains(data): for f in __fnct_list: h = f(data) % __N m = h / 8 if __bitmap_vector[m] & (1 << (h % 8)) == 0: return False return True
There are several variations on this basic implementation which increase the feature set of bloom filters. You can find more details here.