The problems we solve as programmers are rarely unique. When we want to store user data, we can reach for one of dozens of suitable databases. When we want to make a list of unique values, we reach for a set. When we want to look up data by a key, we reach for a map.
Everyone has a set of tools they have used before, and can use again on future problems. Growing this set helps you become a better programmer, able to solve harder problems. In this post, I'm going to teach you about a new tool. It's a niche tool that won't apply to many problems, but when it does apply you'll find it invaluable. It's called a "bloom filter."
Wait! Before you read on, this post assumes you know what a hash function is. If you don't, it's highly recommended that you read this post first.
# What bloom filters can do
Bloom filters are used to check if an item is contained in a set. Using a bloom filter might look like this:
let bf = new BloomFilter(); bf.add("Haskie"); bf.add("Sage"); bf.contains("Haskie"); //=> true
Looks a lot like using a
Set doesn't it? It has some critical differences,
though. Bloom filters are a probabalistic data structure, which is a fancy
way of saying that it isn't guaranteed to be accurate. It's possible for a bloom
filter to claim that an item is contained within when in reality it is not. This
is called a false-positive.
A data structures that lies to you?! How could that possibly be useful?
It's all about what you get in return. Bloom filters are extremely
space-efficient. They don't actually store all of the items you put in to
them, so they can be a lot smaller than an equivalent
Real-world users of bloom filters include Akamai, who
use them to avoid caching web pages that are accessed once and never again. They
do this by storing all page accesses in a bloom filter, and only writing them
into cache if the bloom filter says they've been seen before. This does result
in some pages being cached on the first access, but that's fine because it's
still an improvement. It would be impractical for them to store all page
accesses in a
Set, so they accept the small false-positive rate in favour of
the significantly smaller bloom filter. Akamai released a
about this that goes into the full details if you're interested.
Google's BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with "not found" without ever needing to read from disk. Sometimes the bloom filter will claim a key is in the database when it isn't, but this is fine because when that happens a disk access will confirm the key in fact isn't in the database. This is a performance optmisation that is only possible thanks to the space-efficiency of bloom filters.
# How bloom filters work
Inside, bloom filters contain an array of bits. At first, all of the bits are set to 0. We're going to represent that as a grid of circles, with each circle representing a bit. Our bloom filters in this post are going to have 32 bits.
To add an item to the bloom filter, we're going to hash it with 3 different hash functions, then use the 3 resulting values to set 3 bits. If you're not familiar with hashing, I recommend reading my post about it before continuing.
For this post I'm choosing to use 3 of the
SHA family of hash
sha512. Here's what our bloom filter looks like
if we add the value "foo" to it:
The bits in positions 15, 16 and 27 have been set. We take the hash value of "foo" for each of
our 3 hash functions and modulo it by the number of bits in our bloom filter
(32). We get 27 with
sha1, 15 with
sha256 and 16 with
sha512. The table
below shows what's happening, and you can try inputting your own values to see
what bits they would set if added.
Go ahead and add a few of your own values to our bloom filter and see what happens.