Reservoir Sampling

In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it?

The first technique that might come to mind is to mix them all up in the middle, arrange them back how they were, then pick the first 3. You can see this happen below by clicking "Shuffle."

Every time you click "Shuffle," the chart below tracks what the first 3 cards were.

At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected, and we say that this makes it "fair."

Click "Shuffle 100 times" a few times and watch the chart even out. You can reset the chart if you'd like to start over.

This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up and rearranging them won't be easy. Instead, we could leave them where they started and use a random number generator to pick 3 indices. These would be our 3 chosen cards.
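As a rough sketch in code (Python here; the post itself is language-agnostic), picking 3 indices out of 1 million could look like this. `random.sample` draws distinct values, so we never pick the same card twice:

```python
import random

# Pick 3 distinct indices out of 1 million "cards" without
# moving or shuffling anything.
chosen = random.sample(range(1_000_000), k=3)
print(chosen)  # three distinct indices, e.g. [412305, 90107, 778210]
```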

We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method.

This is better, but let me throw you a curveball: what if I were to show you one card at a time, and you had to pick one at random?

How many are you going to show me?

That's the curveball: you don't know.

Can I hold on to all the cards you give me and then pick 1 at the end?

No, you can only hold on to one card at a time. You're free to swap your card with a new one any time, but you can only hold one card at a time and you can't go back to a card you've already seen.

That's not possible! And why would I ever need to do this anyway?

It's totally possible, and there are a few places it comes in handy in the real world.

For example, let's say you're building a log collection service. Text logs, not wooden ones. This service receives logs from a bunch of other services and stores them so they can be searched centrally. One of the things that can go wrong with a log collection service is that you get a noisy service. Maybe it's a bad release, maybe one of your videos goes viral; either way, it starts sending you many more logs than usual.

Let's simulate this. Below you can see a stream of logs that experiences periodic spikes. A horizontal line indicates the threshold of logs per second that the log collection service can handle, which in this example is 5 logs per second.

You can see that every so often the logs per second spike above the threshold. One way to deal with this is "sampling": deciding to send only a fraction of the logs to the log collection service. Let's send 10% of the logs.

Below we will see the same simulation again, but this time logs that don't get sent to our log collection service will be greyed out. The graph will have 2 lines: a black line for the logs we send to the log collection service and a grey line for the total number of logs our noisy service outputs.

Now we never exceed the threshold on the black line, which means we never overwhelm our log collection service. However, in the quieter periods we're still throwing away 90% of the logs when we don't need to.

What we really want to be able to do is to say: "never send us more than 5 logs per second." This would mean that during quiet periods you get all the logs, but during noisy periods you have a cap on the number of logs you receive.

The simple way to achieve this would be to store the first 5 logs you see each second, then send those 5 logs to the log collection service at the end of each second. But this isn't fair: you aren't giving all logs an equal chance of being selected.

# Maintaining fairness

We instead want to pick a fair sample of all the logs we see each second. The problem is that we don't know how many we will see.

1 second isn't a long time, can't we just store all the messages we see and then use the select method from above?

You could, but that method has unpredictable memory use. Large spikes of logs will mean you're storing more than usual, which could compound by creating memory pressure, excess garbage collection, and maybe even crashing your service if the spike is big enough.

Let's go back to our curveball of me showing you one card at a time. Here's a recap of the rules:

  1. I'll draw cards one at a time from a deck.
  2. Each time I show you a card, you have to choose to hold it or discard it.
  3. If you were already holding a card, you discard that card before replacing it with the new card.
  4. At any point I can stop drawing cards and whatever card you're holding is the one you've chosen.

How would you play this game in a way that ensures all cards have been given an equal chance to be selected at the end?

How about we flip a coin every time a new card is drawn? If it's heads, we keep the card we have. If it's tails, we swap it out for the new card.

You're on the right track. Let's have a look at how the coin flip idea plays out in practice. Below you see a deck of cards. Clicking "Deal" will draw a card and place it either in the center, which is your held card, or in the discard pile on the right.

The problem this has is that while the hold vs discard counts are roughly equal, later cards are much more likely to get chosen than earlier cards. If we draw 10 cards, the odds we're still holding the first card at the end are 50% multiplied together 10 times, or about 0.1%. In comparison, the last card drawn has a 50% chance of being held at the end. This isn't fair.

Scrub the slider below to see how the odds change as we draw more cards. Each bar represents a successive card in the deck, and the height of the bar is the chance we're holding that card at the end. Below the slider you'll see the odds we're holding the first card drawn vs the last card drawn.

Anything older than about 8 cards ago has an effectively 0% chance of being held at the end.

You said I was on the right track! How can this be the right track when I'm more likely to win the lottery than to be holding the card I saw 24 draws ago?

Because believe it or not, we only have to make one small change to this idea to make it fair. That small change is this:

Instead of flipping a coin to decide if we'll hold the card or not, instead we give each new card a 1/n chance of being held, where n is the number of cards we've seen so far.

Wait, that's it? That makes it fair?

Yep! Let's have a look at the chances as you draw more cards with this new method.

Magical, right? It works by making sure that the card you're holding is always exactly as likely to still be in your hand as the new card is to be selected.
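As a sketch, the whole single-card strategy fits in a few lines of Python (the function name `hold_one` is mine, not from the post):

```python
import random

def hold_one(cards):
    """Hold one card from a stream of unknown length, fairly.

    The nth card we see replaces the held card with probability 1/n.
    The first card is always held, because 1/1 = 1.
    """
    held = None
    for n, card in enumerate(cards, start=1):
        if random.random() < 1 / n:
            held = card
    return held
```

Running this many times over the same 10 cards should show each card ending up held roughly 10% of the time.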

Let's walk through the numbers. For each card drawn, only 2 numbers matter: the chance of keeping hold of the card we have, and the chance of replacing it with the new one.

# Card 1

The first card is easy: we're not holding anything, so we always choose to hold the first card. The chance we're holding this card is 100%.

| Hold | Replace |
|------|---------|
| 100% | -       |

# Card 2

This time we have a real choice. We can keep hold of the card we have, or replace it with the new one. We've said that we're going to do this with a 1/n chance, where n is the number of cards we've seen so far. So our chance of picking the new card is 1/2, or 50%, and our chance of keeping hold of the first card is 100% * 1/2, which is again 50%.

| Hold              | Replace   |
|-------------------|-----------|
| 100% * 1/2 = 50%  | 1/2 = 50% |

# Card 3

Regardless of whether we held or replaced on card 2, whatever card we're holding has a 50% chance of being there. This is where we start to see the power of always making sure the chances of hold vs replace are equal.

The new card has a 1/3 chance of being selected, so the card we're holding has a 1/3 chance of being replaced. This means that our held card has a 2/3 chance of remaining held. So its odds of "surviving" this round are 50% * 2/3.

| Hold                | Replace      |
|---------------------|--------------|
| 50% * 2/3 = 33.33%  | 1/3 = 33.33% |

# Card N

This pattern continues for as many cards as you want to draw. We can express both options as formulas:

| Hold                | Replace |
|---------------------|---------|
| 1/(n-1) * (1-(1/n)) | 1/n     |

If 1/n is the chance of choosing the new card, then 1/(n-1) is the chance we ended up holding our current card on the previous draw (hence n-1). The chance of not choosing the new card is the complement of 1/n, which is 1-(1/n).

Drag the slider below to substitute n with real numbers and see that the two formulas always equal each other.
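If you'd rather check it numerically than with a slider, a quick loop confirms the two formulas agree for every n:

```python
# For each n, the chance the held card survives the round,
# 1/(n-1) * (1 - 1/n), should equal the chance the new card
# is chosen, 1/n.
for n in range(2, 1_000):
    hold = (1 / (n - 1)) * (1 - 1 / n)
    replace = 1 / n
    assert abs(hold - replace) < 1e-12, n
```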

And then below are the cards again except this time set up with our new 1/n method of calculating the odds. Does it feel fair to you?

There's a good chance that through the 2nd half of the deck, you never swap your chosen card. This feels wrong, at least to me, but mathematically it is completely fair.

# Choosing multiple cards

Now that we know how to select a single card fairly, we can extend this to selecting multiple cards. There are 2 changes we need to make:

  1. Rather than new cards having a 1/n chance of being selected, they now have a k/n chance, where k is the number of cards we want to hold.
  2. When we decide to replace a card, we choose one of the cards we're holding at random.

So the chance that a given card made it into our hand on the previous draw becomes k/(n-1), because we're now holding k cards. The new card is selected with probability k/n, and if it is, it replaces one of our k held cards at random, so each held card is replaced with probability (k/n) * (1/k) = 1/n, and survives with probability 1-(1/n).

Let's see how this plays out with real numbers.

| Hold                | Replace |
|---------------------|---------|
| k/(n-1) * (1-(1/n)) | k/n     |

The fairness still holds, and will hold for any k and n pair.

In practice, a nice way to implement this is to use an array of size k and, for each new card, generate a random integer between 0 (inclusive) and n (exclusive). If the random number is less than k, we replace the card at that index with the new card. Otherwise, we ignore the new card.
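That array-based idea is often known as Algorithm R. Here's a sketch in Python (the function name `sample_k` is mine):

```python
import random

def sample_k(stream, k):
    """Hold a fair sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)   # keep everything until we have k
        else:
            j = random.randrange(n)  # uniform integer in [0, n)
            if j < k:                # happens with probability k/n
                reservoir[j] = item  # the number doubles as the index
    return reservoir
```

Each item ends up in the sample with probability k/n, where n is the total number of items seen.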

# Applying this to log sampling

Let's take what we now know about reservoir sampling and apply it to our log collection service from earlier. The way we're going to do this is to create a reservoir of size 5, and every second we will flush whatever the reservoir sampler has chosen to the log collection service.

The pattern this creates is going to look quite different to the random sampling, because we're going to be sending logs in chunks of 5, but what you should notice is that we never send more than the threshold.

This method doesn't drop any logs during quiet periods, and maintains a cap on the number of logs sent during spikes. The best of both worlds. It also doesn't store any more logs than it needs to, so it won't run out of memory.
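A hypothetical sketch of such a sampler (the class and method names are mine, and a real service would handle timing and delivery more carefully):

```python
import random

class LogSampler:
    """Keep at most k logs per window; memory use is bounded by k."""

    def __init__(self, k, flush):
        self.k = k          # reservoir size, 5 in the example above
        self.flush = flush  # callback that sends logs downstream
        self.reservoir = []
        self.seen = 0

    def add(self, log):
        self.seen += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(log)
        else:
            j = random.randrange(self.seen)
            if j < self.k:  # probability k/n for the nth log
                self.reservoir[j] = log

    def tick(self):
        # Call once per second: send the sample, start a fresh window.
        self.flush(self.reservoir)
        self.reservoir = []
        self.seen = 0
```

During a quiet second with 3 logs, all 3 get flushed; during a spike of 1,000 logs, exactly 5 do.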

# Conclusion

Reservoir sampling is one of my favourite algorithms. It's elegant, and allows you to solve problems you might at first think can't be solved. I certainly didn't think it was possible to create a fair sample from a set of unknown size. Not only is it possible, the mathematics behind it are relatively simple.