The problems we solve as programmers are rarely unique. When we want to store user data, we can reach for one of dozens of suitable databases. When we want to make a list of unique values, we reach for a set. When we want to look up data by a key, we reach for a map.
Everyone has a set of tools they have used before, and can use again on future problems. Growing this set helps you become a better programmer, able to solve harder problems. In this post, I'm going to teach you a new tool. It's a niche tool that won't apply to many problems, but when it does apply you'll find it invaluable. It's called a "bloom filter."
› The Problem
Before we can use a tool, we need to know the problems it solves. Applying the wrong tool to a problem ends up creating more problems.
Let's say we're building a brand new web browser. One of the features we would like our web browser to have is a malicious link warning. If the user tries to visit a link we know is bad, we show a warning page that lets them know they need to careful. The link could be a scam or trick them into installing a virus.
This being the Internet, people are doing new bad things all of the time. This list of bad links will need to be kept up to date, and our browser will need to download that list on a regular basis. Let's start out by doing this once per day.
Conservatively, we'll estimate that the Internet has 1 million links that are bad, and the average bad link is 60 characters long. That's just over 57MB of data. That might not sound like a lot, but if we were to be successful and get, say, 1 million users, we'll need to transfer 54TB every single day just to support this feature. That's 662MB of data transfer per second, 24/7. If we were to naively serve this from an Amazon S3 bucket, calculator.aws tells me it would cost us $87,710.20 per month.
Sure, we could do better by only sending changes to this list to the client. That sounds quite complicated, though. We would need to handle additions, removals, we'd need to store the full history of additions and removals so we can update any browser from any point in time. I'd prefer something simpler.
› The Solution
Bloom filters, named after their creator Burton Howard Bloom, trade certainty for size. They work just like a set, in that you can add things to them and later check to see if the set contains that thing. The downside is that they can only tell you an item might be in the set. They can't say for sure. When they tell you an item isn't in the set, however, they are always correct.
That doesn't sound very useful. Wouldn't we run the risk of showing users a warning on legitimate websites?
For this specific feature, we would use a bloom filter to know when to send a request to some remote API that can say for sure if a link is bad or not. So the bloom filter would act almost like a cache by telling us when we don't need to make an API call.
› How bloom filters work
A bloom filter is an array of bits. At first, all of the bits are set to 0.
When we add an item to the bloom filter, we hash it and modulo the result by the number of bits, then we set the resulting bit to 1.
Here's an example.