Bloom Filters
samwho keyboard logo

Bloom Filters

The problems we solve as programmers are rarely unique. When we want to store user data, we can reach for one of dozens of suitable databases. When we want to make a list of unique values, we reach for a set. When we want to look up data by a key, we reach for a map.

Everyone has a set of tools they have used before, and can use again on future problems. Growing this set helps you become a better programmer, able to solve harder problems. In this post, I'm going to teach you about a new tool. It's a niche tool that won't apply to many problems, but when it does apply you'll find it invaluable. It's called a "bloom filter."

Wait! Before you read on, this post assumes you know what a hash function is. If you don't, it's highly recommended that you read this post first.

# What bloom filters can do

Bloom filters are used to check if an item is contained in a set. Using a bloom filter might look like this:

let bf = new BloomFilter();
bf.add("Haskie");
bf.add("Sage");
bf.contains("Haskie"); //=> true

Looks a lot like using a Set doesn't it? It has some critical differences, though. Bloom filters are a probabalistic data structure, which is a fancy way of saying that it isn't guaranteed to be accurate. It's possible for a bloom filter to claim that an item is contained within when in reality it is not. This is called a false-positive.

A data structures that lies to you?! How could that possibly be useful?

It's all about what you get in return. Bloom filters are extremely space-efficient. They don't actually store all of the items you put in to them, so they can be a lot smaller than an equivalent Set.

Real-world users of bloom filters include Akamai, who use them to avoid caching web pages that are accessed once and never again. They do this by storing all page accesses in a bloom filter, and only writing them into cache if the bloom filter says they've been seen before. This does result in some pages being cached on the first access, but that's fine because it's still an improvement. It would be impractical for them to store all page accesses in a Set, so they accept the small false-positive rate in favour of the significantly smaller bloom filter. Akamai released a paper about this that goes into the full details if you're interested.

Google's BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with "not found" without ever needing to read from disk. Sometimes the bloom filter will claim a key is in the database when it isn't, but this is fine because when that happens a disk access will confirm the key in fact isn't in the database. This is a performance optmisation that is only possible thanks to the space-efficiency of bloom filters.

# How bloom filters work

Inside, bloom filters contain an array of bits. At first, all of the bits are set to 0. We're going to represent that as a grid of circles, with each circle representing a bit. Our bloom filters in this post are going to have 32 bits.

To add an item to the bloom filter, we're going to hash it with 3 different hash functions, then use the 3 resulting values to set 3 bits. If you're not familiar with hashing, I recommend reading my post about it before continuing.

For this post I'm choosing to use 3 of the SHA family of hash functions: sha1, sha256, and sha512. Here's what our bloom filter looks like if we add the value "foo" to it:

The bits in positions 15, 16 and 27 have been set. We take the hash value of "foo" for each of our 3 hash functions and modulo it by the number of bits in our bloom filter (32). We get 27 with sha1, 15 with sha256 and 16 with sha512. The table below shows what's happening, and you can try inputting your own values to see what bits they would set if added.

Go ahead and add a few of your own values to our bloom filter and see what happens.


powered by buttondown