Tag Archives: computer science

Virtual Memory (Theory vs Application part 3)

I think I was reading Reddit a little while ago when a pretty interesting article from Communications of the ACM by Poul-Henning Kamp came up. It might be a little sensationalist, but it’s a pretty valid examination of what you can do for program performance when you consider how modern operating systems manage virtual memory. Kamp spent a lot of time building Varnish, and although I’d never heard of it before, it sounds like a very impressive utility. It’s an HTTP accelerator and cache whose performance apparently blows Squid out of the water.

The meat of the article discusses how conventional thinking about the implementation of algorithms and data structures on modern computers is broken because they fail to consider the consequences of cache misses and page faults when accessing data.

On a modern multi-issue CPU, running at some gigahertz clock frequency, the worst-case loss is almost 10 million instructions per VM page fault. If you are running with a rotating disk, the number is more like 100 million instructions. What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n2) algorithm, which avoids page faults, will run circles around it.

Kamp provides some images and diagrams to visualize what he’s talking about. As a traditionally more systems-focused guy, I might be a little biased, but I feel that modern CS curricula pump out students who have absolutely no idea how computers actually work. Knowing the asymptotically optimal algorithm for everything is great; but if your practical implementation sucks because you’re causing paging all over the place, what good is it? A reasonable mixture makes sense – and the curriculum could make room by removing that course on CVS (and maybe even calculus – barf!)

I discussed a little earlier how more intelligent practical implementations can have enormous effects over more naive solutions, even though they remain asymptotically equal. Imagine wasting a hundred million instructions because your algorithm’s implementation unnecessarily left some node on disk!

Bloom Filters

I’ve always thought that Bloom filters were really cool data structures. They belong to a small class of data structures described as “probabilistic”, meaning there’s a trade-off between performance and accuracy. CS people know about trade-offs all too well, whether they’re related to space/time or approximations of expensive algorithms. Bloom filters kind of fall into both categories.

The purpose of a Bloom filter is to indicate, with some chance of error, whether an element belongs to a set. The error is one-sided: it is possible that the filter indicates some element is in the set when it in fact is not (a false positive). The reverse, however, is not possible – if some element is in the set, the Bloom filter cannot indicate that it is not in the set (a false negative).

So how does it work?

A Bloom filter requires a set of k hash functions (one of my favourite topics) to generate indices into a table. These can either be a series of fairly simple and quick functions, or some manipulation of one or a few stronger hash functions. For example, you could run your data through SHA-1 and extract integral types to use as indices. Or, use a faster hash function like MurmurHash with several different seeds.
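As an aside, a well-known trick for faking k independent hash functions out of a single strong one is “double hashing”: split one digest into two integers h1 and h2 and use h1 + i·h2 as the i-th hash. Here’s a minimal sketch of that idea (the function name and the choice of SHA-1 are mine, just for illustration):

```python
import hashlib

def bloom_indices(data, k, m):
    # Split one SHA-1 digest into two integers, then combine them
    # as h1 + i*h2 to simulate k independent hash functions.
    digest = hashlib.sha1(data).digest()
    h1 = int.from_bytes(digest[:10], "big")
    h2 = int.from_bytes(digest[10:], "big")
    return [(h1 + i * h2) % m for i in range(k)]
```

Each call deterministically yields k indices into an m-bit table, without needing k genuinely different hash functions.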

Each of these k hashes is mapped to a bit in a bitmap array, which is initialized to contain all zeroes. Think of it like a hash table where each entry is hashed to multiple locations. But instead of storing the hash and the object, you set the bit at each location to 1. Because we only care about existence or non-existence in this set, we don’t need to store any additional information.

[Figure: an example Bloom filter – shamelessly taken from Wikipedia (and yes, I checked the license first =)]

So, for example, let’s say that we’ve inserted elements x, y, and z into the Bloom filter, with k = 3 hash functions, like above. Each of these three elements has three bits set to 1. When a bit is already set to 1, it’s left that way. When we try to look up element w in the set, the Bloom filter tells us that w isn’t an element because at least one of its bitmap indices is 0.

This is the reason why Bloom filters demonstrate false positives but not false negatives. A new element may overlap with the combined bitmap entries of many different elements already in the set. However, there is no operation that sets a bit back to 0. So if a bitmap entry is 0, there was never an element inserted into this set that mapped to that location. As a related side note, it is extremely difficult to come up with a scheme that removes elements from these sets; usually they are just rebuilt instead.

So why is this useful, as opposed to a hashtable-backed set? The space requirements of a Bloom filter are significantly lower than those of a hashtable implementation. This is because

  • Bloom filters use bits instead of larger elements to determine set existence
  • These bits may overlap between set entries, so not every entry has k dedicated bits
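To put a number on that saving, there is a standard sizing rule (well known in the literature, not derived in this post): to hold n elements at a target false-positive rate p, you need about m = −n·ln(p) / (ln 2)² bits. A quick sketch:

```python
import math

def bits_needed(n, p):
    # classic sizing rule: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)
```

For 1,000 elements at a 1% false-positive rate this works out to roughly 9.6 bits per element – far less than a hash table that must store keys or pointers.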

It’s also interesting because the insertion and checking operations have a time complexity of O(k), based solely on the number of hash functions, instead of the number of elements inserted into the set. There’s no probing or other collision-resolution strategy here, as with hash tables. There are “collisions” in the sense of false positives, but these do not impact performance, and with a large enough bitmap and a high enough k value they can be greatly minimized. In addition, these data structures are typically employed in non-critical services (such as caching the existence of data) under the assumption that false positives may occur.
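The interplay between the bitmap size m, the element count n, and k can be quantified with the standard approximation (1 − e^(−kn/m))^k for the false-positive rate; the k that minimizes it is about (m/n)·ln 2. A sketch of both formulas:

```python
import math

def false_positive_rate(m, n, k):
    # standard approximation: (1 - e^(-k*n/m)) ** k
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    # the rate is minimized near k = (m/n) * ln 2
    return max(1, round(m / n * math.log(2)))
```

With 10 bits per element, for instance, the best k works out to 7 and the false-positive rate lands below 1%.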

Here’s some Python-esque pseudocode that illustrates the insertion and existence-checking operations on a simple Bloom filter:

# Assumption: 8 bits in a byte

def init(N, hashfnct_list):
 # round up so that all N bits fit in the byte array
 __bitmap_vector = alloc_bytes((N + 7) // 8)
 __fnct_list = hashfnct_list
 __N = N

def insert(data):

 # iterate over each hash function in the list and hash the data
 # we could even do this in parallel
 for f in __fnct_list:

  # mod the hash by __N because that is the size
  # of our bitmap vector, in bits
  h = f(data) % __N
  m = h // 8

  # m is the byte index; shift by the hash's remainder mod 8
  # to find the bit index within that byte
  __bitmap_vector[m] = __bitmap_vector[m] | (1 << (h % 8))

def contains(data):
 for f in __fnct_list:

  h = f(data) % __N
  m = h // 8

  # if any one of the k bits is 0, the element was never inserted
  if (__bitmap_vector[m] & (1 << (h % 8))) == 0:
   return False

 return True
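For completeness, here is a self-contained, runnable Python version of the same idea. The class name and the use of Python’s built-in hash with per-function seeds are my own choices for illustration, not part of the pseudocode above (note that Python randomizes string hashing per process, so a filter built this way is only valid within a single run):

```python
class BloomFilter:
    def __init__(self, n_bits, seeds):
        # one seed per simulated hash function;
        # round up so all n_bits fit in whole bytes
        self.n_bits = n_bits
        self.seeds = seeds
        self.bitmap = bytearray((n_bits + 7) // 8)

    def _indices(self, data):
        # derive one bit index per seed from the built-in hash
        for s in self.seeds:
            yield hash((s, data)) % self.n_bits

    def insert(self, data):
        for h in self._indices(data):
            self.bitmap[h // 8] |= 1 << (h % 8)

    def contains(self, data):
        # a single 0 bit proves the element was never inserted
        return all(self.bitmap[h // 8] & (1 << (h % 8))
                   for h in self._indices(data))

bf = BloomFilter(1024, seeds=[1, 2, 3])
bf.insert("x")
bf.insert("y")
print(bf.contains("x"))  # True: no false negatives
```

Because membership is just k bit tests, lookups never touch the original elements at all – which is exactly why the structure is so compact.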

There are several variations on this basic implementation which increase the feature set of bloom filters. You can find more details here.