Reservoir Sampling
In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it?
The first technique that might come to mind is to mix them all up in the middle, arrange them back how they were, then pick the first 3. You can see this happen below by clicking "Shuffle."
Every time you click "Shuffle," the chart below tracks what the first 3 cards were.
At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected, and we say that this makes it "fair."
Click "Shuffle 100 times" a few times and watch the chart even out. You can reset the chart if you'd like to start over.
This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up and rearranging them won't be easy. Instead, we could leave them where they started and use a random number generator to pick 3 indices. These would be our 3 chosen cards.
We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method.
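If you want to see what that looks like in code, here's a quick sketch in Python (my own illustration, not part of the widgets above). `random.sample` picks distinct indices without replacement, so no card can be chosen twice.

```python
import random

# Pick 3 distinct indices out of 1,000,000 cards, leaving the cards
# where they are. Each index, and therefore each card, is equally likely.
indices = random.sample(range(1_000_000), 3)
print(indices)  # e.g. [487142, 21, 903410]
```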
This is better, but let me throw you a curveball: what if I were to show you one card at a time, and you had to pick one at random?
That's the curveball: you don't know how many cards I'm going to show you.
And no, you can't hold on to everything I show you: you can only hold one card at a time. You're free to swap your card with a new one any time, but you can't go back to a card you've already seen.
It's totally possible, and the technique for doing it is called reservoir sampling. There are a few places it comes in handy in the real world.
For example, let's say you're building a log collection service. Text logs, not wooden ones. This service receives logs from a bunch of other services and stores them so they can be searched centrally. One of the things that can go wrong with a log collection service is that one of those services gets noisy. Maybe it's a bad release, maybe one of your videos goes viral; either way, it starts sending you many more logs than usual.
Let's simulate this. Below you can see a stream of logs that experiences periodic spikes. A horizontal line indicates the threshold of 5 logs per second that we'd like to stay under. You can see that every so often the logs per second spikes above the threshold.
One way to deal with this is to send only a fixed, random fraction of the logs. Below we will see the same simulation again, but this time logs that don't get sent to our log collection service will be greyed out. The graph will have 2 lines: one tracking the rate of logs received, and one tracking the rate of logs actually sent.
Now we never exceed the threshold, but we're also dropping logs during the quiet periods, when we had the capacity to send them all.
What we really want to be able to do is to say: "never send us more than 5 logs per second." This would mean that during quiet periods you get all the logs, but during noisy periods you have a cap on the number of logs you receive.
The simple way to achieve this would be to store the first 5 logs you see each second, then send those 5 logs to the log collection service at the end of the second. But this isn't fair: you aren't giving all logs an equal chance of being selected.
# Maintaining fairness
We instead want to pick a fair sample of all the logs we see each second. The problem is that we don't know how many we will see.
You could store every log you see during the second and pick 5 at random when the second ends, but that method has unpredictable memory use. Large spikes of logs will mean you're storing more than usual, which could compound by creating memory pressure, excess garbage collection, and maybe even crashing your service if the spike is big enough.
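To make the memory problem concrete, here's a sketch of that store-everything approach in Python. The `offer` and `flush` names are my own; imagine `flush` being called once per second.

```python
import random

buffer = []  # grows without bound during a spike

def offer(log):
    # Store every log we see this second.
    buffer.append(log)

def flush():
    # At the end of the second, pick 5 at random and start over.
    global buffer
    chosen = random.sample(buffer, min(5, len(buffer)))
    buffer = []
    return chosen
```

The sample at the end of each second is perfectly fair, but `buffer` grows as big as the spike does, which is exactly the unpredictable memory use we want to avoid.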
Let's go back to our curveball of me showing you one card at a time. Here's a recap of the rules:
- I'll draw cards one at a time from a deck.
- Each time I show you a card, you have to choose to hold it or discard it.
- If you choose to hold the new card, you must discard the card you were already holding.
- At any point I can stop drawing cards and whatever card you're holding is the one you've chosen.
How would you play this game in a way that ensures all cards have been given an equal chance to be selected at the end?
One idea is to flip a coin for each new card to decide whether to swap. You're on the right track. Let's have a look at how the coin flip idea plays out in practice. Below you see a deck of cards. Clicking "Deal" will draw a card and place it either in the center, which is your hand, or off to the side, which is the discard pile. A coin flip decides which.
The problem this has is that while the newest card always gets a fair 50% chance at the moment it's drawn, it then has to survive a further coin flip for every card that comes after it. The earlier a card was drawn, the smaller its chance of still being in your hand at the end.
Scrub the slider below to see how the odds change as we draw more cards. Each bar represents a successive card in the deck, and the height of the bar is the chance we're holding that card at the end. Below the slider you'll see the odds we're holding the first card drawn vs the last card drawn.
Anything drawn more than about 8 cards ago has effectively a 0% chance of still being held at the end.
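You can see why with a little arithmetic. A card is picked with a 50% chance when it's drawn, and then has to survive another 50/50 flip for every card drawn after it. Here's a rough Python sketch (treating every card, including the first, the same way):

```python
# Chance that card i out of n is the one in your hand at the end:
# a 1/2 chance when drawn, halved again for each of the n - i later cards.
n = 10
for i in range(1, n + 1):
    chance = 0.5 * 0.5 ** (n - i)
    print(f"card {i:2d}: {chance:.4%}")
```

Card 10 comes out at 50%, card 9 at 25%, and by card 2 we're down to about 0.2%.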
Believe it or not, we only have to make one small change to this idea to make it fair. That small change is this:
Instead of flipping a coin to decide whether we'll hold each new card, we give each new card a 1/n chance of being held, where n is the number of cards we've seen so far.
That's all it takes. Let's have a look at the chances as you draw more cards with this new method.
Magical, right? It works by making sure the card you're holding is always exactly as likely to still be in your hand as the new card is to be selected.
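In code, the whole single-card technique fits in a few lines. Here's a minimal Python sketch (the function name is mine):

```python
import random

def hold_one(stream):
    """Fairly pick one item from a stream of unknown length."""
    held = None
    for n, item in enumerate(stream, start=1):
        # Hold the new item with probability 1/n. The first item is
        # always held (1/1 = 100%), the second half the time, and so on.
        if random.random() < 1 / n:
            held = item
    return held

print(hold_one(range(10)))  # each of 0..9 comes out 10% of the time
```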
Let's walk through the numbers. For each card drawn, only 2 numbers matter: the chance it's selected when it's drawn, and the chance it survives all of the draws that come after it.
# Card 1
The first card is easy: we're not holding anything, so we always choose to hold the first card. The chance we're holding this card is 100%.
# Card 2
This time we have a real choice. We can keep hold of the card we have, or replace it with the new one. We've said that we're going to do this with a 1/n chance, where n is the number of cards we've seen so far. So our chance of picking the new card is 1/2, or 50%, and our chance of keeping hold of the first card is 100% * 1/2, which is again 50%.
# Card 3
Regardless of whether we held or replaced on card 2, whatever card we're holding has a 50% chance of being there. This is where we start to see the power of always making sure the chance of holding the new card is equal to the chance of still holding any card we've already seen.
The new card has a 1/3 chance of being selected, so the card we're holding has a 1/3 chance of being replaced. This means that our held card has a 2/3 chance of remaining held. So its odds of "surviving" this round are 50% * 2/3, which is 1/3: exactly the same as the new card's chance of being selected.
# Card N
This pattern continues for as many cards as you want to draw. We can express both options as formulas:
- Chance of holding the new card: 1/n
- Chance of still holding the previously held card: (1/(n-1)) * (1-(1/n))
If 1/n is the chance of choosing the new card, then 1/(n-1) is the chance we chose the card we're currently holding on the previous draw, when we had seen n-1 cards (thus n-1). And the chance of not choosing the new card is the inverse of 1/n, so 1-(1/n).
Drag the slider below to substitute n with real numbers and see that the two formulas always equal each other.
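If you don't have the slider to hand, the same check is a couple of lines of Python. Using exact fractions avoids any floating-point fuzziness:

```python
from fractions import Fraction

for n in range(2, 11):
    new_card = Fraction(1, n)  # chance of taking the new card
    # chosen on the previous draw, then not replaced this draw
    held_card = Fraction(1, n - 1) * (1 - Fraction(1, n))
    assert new_card == held_card  # always equal
```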
And then below are the cards again, except this time set up with our new 1/n method of calculating the odds. Does it feel fair to you?
There's a good chance that through the 2nd half of the deck, you never swap your chosen card. This feels wrong, at least to me, but mathematically it is completely fair.
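If you want to convince yourself beyond the slider, here's a short Python sketch that applies the same two steps from above, scaling every held card's chance by (1 - 1/n) and giving the new card a 1/n chance, across a whole 10-card deck:

```python
N = 10
chances = []  # chances[i] = probability card i+1 is held at the end
for n in range(1, N + 1):
    # Every card we might be holding survives with probability 1 - 1/n...
    chances = [c * (1 - 1 / n) for c in chances]
    # ...and the new card is selected with probability 1/n.
    chances.append(1 / n)

print(chances)  # every entry works out to 1/10 (up to floating point)
```

Every card ends with the same 10% chance, no matter when it was drawn.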
# Choosing multiple cards
Now that we know how to select a single card fairly, we can extend this to selecting multiple cards. There are 2 changes we need to make:
- Rather than new cards having a 1/n chance of being selected, they now have a k/n chance, where k is the number of cards we want to hold.
- When we decide to replace a card, we choose one of the cards we're holding at random.
So our new previous card selection formula becomes k/(n-1), because we're now holding k cards. Then the chance that any given card we're holding doesn't get replaced is still 1-(1/n): the new card is selected with probability k/n, and when it is, it replaces each of our k held cards with probability 1/k, so each held card is replaced with probability (k/n) * (1/k) = 1/n.
Let's see how this plays out with real numbers.
- Chance of the new card being held: k/n
- Chance of a previously held card remaining held: (k/(n-1)) * (1-(1/n))
The fairness still holds, and will hold for any k and n pair.
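The same exact-fraction check from before works here too:

```python
from fractions import Fraction

for k in range(1, 6):
    for n in range(k + 1, 12):
        new_card = Fraction(k, n)
        held_card = Fraction(k, n - 1) * (1 - Fraction(1, n))
        assert new_card == held_card  # equal for every k and n pair
```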
In practice, a nice way to implement this is to use an array of size k, and for each new card generate a random integer between 0 and n-1. If the random number is less than k, we replace the card at that index with the new card. Otherwise, we ignore the new card.
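Put together, that recipe might look like this Python sketch (it matches the classic "Algorithm R" formulation of reservoir sampling):

```python
import random

def reservoir_sample(stream, k):
    """Hold k items from a stream of unknown length, each equally likely."""
    held = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            held.append(item)  # always hold the first k items
        else:
            # A random index in [0, n) is less than k with probability k/n,
            # and when it is, it also picks which held item to replace.
            i = random.randrange(n)
            if i < k:
                held[i] = item
    return held

print(reservoir_sample(range(1_000_000), 3))
```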
# Applying this to log sampling
Let's take what we now know about reservoir sampling and apply it to our log collection service from earlier. The way we're going to do this is to create a reservoir of size 5, and every second we will flush whatever the reservoir sampler has chosen to the log collection service.
The pattern this creates is going to look quite different to the random sampling, because we're going to be sending logs in chunks of 5, but what you should notice is that we never send more than the threshold of 5 logs per second.
This method doesn't drop any logs during quiet periods, and maintains a cap on the number of logs sent during spikes. The best of both worlds. It also doesn't store any more logs than it needs to, so it won't run out of memory.
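Here's how that might look in code, as a rough sketch. The `offer` and `flush` names are my own, and `flush` stands in for the once-per-second timer that sends the chosen logs:

```python
import random

class LogSampler:
    """Keep at most k logs per window, each log equally likely to be kept."""

    def __init__(self, k):
        self.k = k
        self.reservoir = []
        self.seen = 0

    def offer(self, log):
        self.seen += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(log)
        else:
            i = random.randrange(self.seen)
            if i < self.k:
                self.reservoir[i] = log

    def flush(self):
        # Called once per second: hand over the sample and reset.
        chosen = self.reservoir
        self.reservoir, self.seen = [], 0
        return chosen
```

Memory use is fixed at k logs no matter how big the spike gets.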
# Conclusion
Reservoir sampling is one of my favourite algorithms. It's elegant, and allows you to solve problems you might at first think can't be solved. I certainly didn't think it was possible to create a fair sample from a set of unknown size. Not only is it possible, the mathematics behind it are relatively simple.