Bloom Filters
Everyone has a set of tools they use to solve problems. Growing this set helps you to solve ever more difficult problems. In this post, I'm going to teach you about a tool you may not have heard of before. It's a niche tool that won't apply to many problems, but when it does you'll find it invaluable. It's called a "bloom filter."
# What bloom filters can do
Bloom filters are similar to the Set
data structure. You can add items to
them, and check if an item is present. Here's what it might look like to use
a bloom filter in JavaScript, using a made-up BloomFilter
class:
let bf = new BloomFilter();
bf.add("Ant");
bf.add("Rhino");
bf.contains("Ant"); // true
bf.contains("Rhino"); // true
While this looks almost identical to a Set
, there are some key differences.
Bloom filters are what's called a probabilistic data structure. Where a
Set
can give you a concrete "yes" or "no" answer when you call contains
, a
bloom filter can't. Bloom filters can give definite "no"s, but they can't be
certain about "yes."
In the example above, when we ask bf
if it contains "Ant"
and "Rhino"
, the
true
that it returns isn't a guarantee that they're present. We know that
they're present because we added them just a couple of lines before, but it
would be possible for this to happen:
let bf = new BloomFilter();
bf.add("Ant");
bf.add("Rhino");
bf.contains("Fox"); // true
We'll demonstrate why over the course of this post. For now, we'll say that
when bloom filters return true
it doesn't mean "yes", it means "maybe". When
this happens and the item has never been added before, it's called a
false-positive.
The opposite, claiming "no" when the answer is "yes," is called a false-negative. A bloom filter will never give a false-negative, and this is what makes them useful.
It's not strictly lying, it's just not giving you a definite answer. Let's look at an example where we can use this property to our advantage.
# When bloom filters are useful
Imagine you're building a web browser, and you want to protect users from malicious links. You could build and maintain a list of all known malicious links and check the list every time a user navigates the browser. If the link they're trying to visit is in the list, you warn the user that they might be about to visit a malicious website.
If we assume there are, say, 1,000,000 malicious links on the Internet, and each link is 20 characters long, then the list of malicious links would be 20MB in size. This isn't a huge amount of data, but it's not small either. If you have lots of users and want to keep this list up to date, the bandwidth could add up.
However, if you're happy to accept being wrong 0.0001% of the time (1 in a million), you could use a bloom filter which can store the same data in 3.59MB. That's an 82% reduction in size, and all it costs you is showing the user an incorrect warning 1 in every million links visited. If you wanted to take it even further, and you were happy to accept being wrong 0.1% of the time (1 in 1000), the bloom filter would only be 1.8MB.
This use-case isn't hypothetical, either. Google Chrome used a bloom filter for this exact purpose until 2012. If you were worried about showing a warning when it wasn't needed, you could always make an API that has the full list of malicious links in a database. When the bloom filter says "maybe," you would then make an API call to check the full list to be sure. No more spurious warnings, and the bloom filter would save you from having to call the API for every link visited.
# How bloom filters work
At its core, a bloom filter is an array of
To add an item to the bloom filter, we're going to hash it with 3 different hash
functions, then use the 3 resulting values to set 3
For this post I'm choosing to use 3 of the
SHA family of hash
functions:
The
Go ahead and
You might occasionally notice that only 2, or even 1,
Exactly right. A bloom filter with every Set
that always returns true
for contains
. It will claim to contain
everything you ask it about, even if that thing was never added.
# False-positive rates
The rate of false-positives in our bloom filter will grow as the percentage of
set
It grows slowly at first, but as we get closer to having all
x^3
, where x
is the percentage of set
3
is the number of hash functions used. To give an
example of why we calculate it with this formula, imagine we have a bloom filter
with half of its bits set, x = 0.5
. If we assume that our hash function has
an equal chance of setting any of the bits, then the chance that all 3 hash
functions set a bit that is already set is 0.5 * 0.5 * 0.5
, or x^3
.
Let's have a look at the false-positive rate of bloom filters that use different numbers of hash functions.
The problem that using lots of hash functions introduces is that it makes the
bloom filter fill up faster. The more hash functions you use, the more
It's possible to calculate how full a bloom filter will be after inserting a
number of items, based on the number of hash functions used. The graph below
assumes a bloom filter with 1000
The more hash functions we use, the faster we set all of the bits. You'll notice that
the curve tails off as more items are added. This is because the more
In practice, 1000
The lines barely leave the bottom of the graph, meaning the bloom filter will be very empty and the false-positive rate will be low. All this cost us was 12.5kB of memory, which is still a very small amount by 2024 standards.
# Tuning a bloom filter
Picking the correct number of hash functions and
The bloom filter page on Wikipedia covers the mathematics involved, which I'm going to translate into JavaScript functions for us to use. I want to stress that you don't need to understand the maths to use a bloom filter or read this post. I'm including the link to it only for completeness.
# Optimal number of bits
The following JavaScript function, which might look a bit scary but bear with
me, takes the number of items you want to store (items
) and the desired
false-positive rate (fpr
, where 1% == 0.01
), and returns how many
function bits(items, fpr) {
const n = -items * Math.log(fpr);
const d = Math.log(2) ** 2;
return Math.ceil(n / d);
}
We can see how this grows for a variety of fpr
values in the graph below.
# Optimal number of hash functions
After we've used the JavaScript above to calculate how many
function hashFunctions(bits, items) {
return Math.ceil((bits / items) * Math.log(2));
}
Pause for a second here and have a think about how the number of hash functions might grow based on the size of the bloom filter and the number of items you expect to add. Do you think you'll use more hash functions, or fewer, as the bloom filter gets larger? What about as the number of items increases?
The more items you plan to add, the fewer hash functions you should use. Yet, a larger bloom filter means you can use more hash functions. More hash functions keep the false-positive rate lower for longer, but more items fills up the bloom filter faster. It's a complex balancing act, and I am thankful that mathematicians have done the hard work of figuring it out for us.
# Caution
While we can stand on the shoulders of giants and pick the optimal number of
# Removing items from a bloom filter
We've spent the whole post talking about adding things to a bloom filter, and the optimal parameters to use. We haven't spoken at all about removing items.
And that's because you can't!
In a bloom filter, we're using
Click the buttons of the bloom filter below to see this in action. First we will add "foo", then "baz", and then we will remove "baz". Hit "clear" if you want to start again.
The end result of this sequence is a bloom filter that doesn't contain "baz",
but doesn't contain "foo" either. Because both "foo" and "baz" set
Something else you might have noticed playing with the above example is that if
you add "foo" and then attempt to remove "baz" before having added it, nothing
happens. Even though
# Counting bloom filters
While you can't remove items from a standard bloom filter, there are variants that allow you to do so. One of these variants is called a "counting bloom filter," which uses an array of counters instead of bits to keep track of items.
Now when you go through the sequence, the end result is that the bloom filter still contains "foo." It solves the problem.
The trade-off, though, is that counters are bigger than
Counting bloom filters also introduce the possibility of false-negatives, which are impossible in standard bloom filters. Consider the following example.
Because "loved" and "response" both hash to the
let bf = new CountingBloomFilter();
bf.add("loved");
bf.add("your");
bf.remove("response");
bf.contains("loved"); // false
Even though we know for sure we've added "loved" in this snippet, the call to
contains
will return false
. This sort of false-negative can't happen in a
standard bloom filter, and it removes one of the key benefits of using a bloom
filter in the first place: the guarantee of no false-negatives.
# Bloom filters in the real-world
Real-world users of bloom filters include Akamai, who
use them to avoid caching web pages that are accessed once and never again. They
do this by storing all page accesses in a bloom filter, and only writing them
into cache if the bloom filter says they've been seen before. This does result
in some pages being cached on the first access, but that's fine because it's
still an improvement. It would be impractical for them to store all page
accesses in a Set
, so they accept the small false-positive rate in favour of
the significantly smaller bloom filter. Akamai released a
paper
about this that goes into the full details if you're interested.
Google's BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with "not found" without ever needing to read from disk. Sometimes the bloom filter will say a key might be in the database when it isn't, but this is fine because when that happens a disk access will confirm the key in fact isn't in the database.
# Conclusion
Bloom filters, while niche, can be a huge optimisation in the right situation. They're a wonderful application of hash functions, and a great example of making a deliberate trade-off to achieve a specific goal.
Trade-offs, and combining simpler building blocks to create more complex, purpose-built data structures, are present everywhere in software engineering. Being able to spot where a data structure could net a big win can separate you from the pack, and take your career to the next level.
I hope you've enjoyed this post, and that you find a way to apply bloom filters to a problem you're working on.
Join the discussion on Hacker News or Lobste.rs!
# Acknowledgements
Enormous thank you to my reviewers, without whom this post would be a shadow of what you read today. In no particular order:
rylon, Indy, Aaron, Sophie, Davis, ed, Michael Drury, Anton Zhiyanov, Christoph Berger, Andrew Kingston, Tom Hall.