07/27/23

Retrying

The least sexy aspect of building Event-Driven Architectures

9 Min Read

Requests over the network can fail. This is something we cannot avoid, and to write robust software we need to handle these failures. One of the most common responses to a failed request is to retry it.

In this post we're going to visually explore different methods of retrying requests, demonstrating why some common approaches are dangerous and ultimately arriving at what best practice looks like. By the end of this post you will have a good understanding of what makes retry behaviour safe, and a vivid understanding of what not to do.

# Setting the stage

Let's introduce the elements involved in our visualisation. We have:

  • Requests can be thought of as HTTP requests. They can succeed or fail.
  • Load balancers route requests from clients to servers.
  • Servers accept and serve requests.
  • Clients send requests to servers via a load balancer. After getting a response, they will wait an amount of time before sending another request.
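
If it helps to see those elements as code, here's a rough model of the moving parts. It's purely illustrative: the class names and shape are mine, not anything taken from the simulations that follow.

```typescript
// A purely illustrative model; not the code behind the visualisations.
type Outcome = "success" | "failure";

class Server {
  constructor(private failureRate: number) {}

  // Accepts a request and either serves it or fails.
  handle(): Outcome {
    return Math.random() < this.failureRate ? "failure" : "success";
  }
}

class LoadBalancer {
  constructor(private servers: Server[]) {}

  // Routes a request from a client to one of the servers.
  route(): Outcome {
    const index = Math.floor(Math.random() * this.servers.length);
    return this.servers[index].handle();
  }
}

class Client {
  constructor(private balancer: LoadBalancer) {}

  // Sends a request via the load balancer, then waits a while
  // before sending the next one.
  async run(requests: number, waitMs: number): Promise<void> {
    for (let i = 0; i < requests; i++) {
      this.balancer.route();
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```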

Here's how that looks in practice.

# Basic retry handling

The most basic way to handle a failure is to do nothing. In this visualisation the server is configured to fail 50% of the time, and when a request fails the client does nothing special: it just waits its usual amount of time before sending its next request.

Pretty boring. Let's do what people tend to do when they check Sentry and notice that their service is choking on a failed request to a third-party service: retry 10 times in a tight loop.
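
As a rough sketch, that naive version looks something like this. Everything here is hypothetical; fetchThing() and its URL stand in for whatever flaky third-party call keeps showing up in Sentry.

```typescript
// Hypothetical call to the flaky third-party service.
async function fetchThing(): Promise<Response> {
  return fetch("https://third-party.example.com/thing");
}

// The tight loop: up to 10 retries, each one fired the instant
// the previous attempt fails.
async function fetchWithRetries(): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= 10; attempt++) {
    try {
      const response = await fetchThing();
      if (response.ok) return response;
      lastError = new Error(`HTTP ${response.status}`);
    } catch (error) {
      lastError = error; // network failure: retry immediately
    }
  }
  throw lastError;
}
```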

You can see now that when a request fails, it is immediately retried. From the client's perspective this is great! At a 50% failure rate, the odds of 2 failures in a row are 25%, 3 in a row 12.5%, 4 in a row 6.25%, and 5 in a row practically unheard of! We've solved our problem, right?

Right?

There's a subtle side-effect of behaving this way. Each client's individual chance of a request succeeding within 5 retries is very high, but the cost is that the overall rate of requests to our server is higher.

The above is quite intuitive. Each failed request spawns an extra request that wouldn't exist if there was no retrying.
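
We can put a rough number on it: with a per-attempt failure rate p, a first retry happens with probability p, a second with probability p², and so on. Here's a back-of-the-envelope calculation (not taken from the simulations) of how many attempts the server ends up seeing per logical request:

```typescript
// Expected attempts the server sees per logical client request, given a
// per-attempt failure rate and a retry cap. Assumes each attempt fails
// independently; a back-of-the-envelope figure only.
function expectedAttempts(failureRate: number, maxRetries: number): number {
  let total = 0;
  for (let k = 0; k <= maxRetries; k++) {
    total += Math.pow(failureRate, k); // attempt k+1 happens with probability p^k
  }
  return total;
}

console.log(expectedAttempts(0.5, 10).toFixed(3)); // ≈ 1.999 — roughly double the traffic
console.log(expectedAttempts(0.1, 10).toFixed(3)); // ≈ 1.111 — even a 10% failure rate adds ~11% load
```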

Let's add a few more clients and see this play out.

If I'm any good at tuning my simulations, you will quickly notice the server explode. Then, after it has exploded, there's a very good chance it will explode again almost immediately.

What the explosion represents is a server crashing. This can happen for all sorts of reasons in the real world, from the process running out of memory to rare segfaults that only happen when it's under stress.

Once the server has crashed once, it is almost impossible for it to recover because of the retries. If we run the same simulation with no retries, watch what happens:

If you watch for long enough, you will see this server crash as well. It's less likely, though, and when it does crash you don't get the same sudden, rapid spike of requests bouncing off the load balancer while no servers are available to serve them.

If you're sat there thinking, "Sam, this is silly, a 50% failure rate is absurdly high. This will never happen in practice," know that there are plenty of things that can cause you to start serving a much higher rate of errors: the failure of a third-party dependency, a longer-than-usual database failover, a bug that slipped through your QA process.

If you're doing naive retries and you get into this overloaded state, it's very hard to get back out.

# Retrying with a delay

So retrying in a tight loop is problematic, and we've seen why. The next thing people tend to do is add a delay between each retry: 10 retries with a sleep(1000) between them. Let's see how that fares.
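
A sketch of what that tends to look like, again with a hypothetical fetchThing() standing in for the flaky call:

```typescript
// Promise-based sleep helper.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical call to the flaky third-party service.
async function fetchThing(): Promise<Response> {
  return fetch("https://third-party.example.com/thing");
}

// Same 10 retries as before, but now with a fixed one-second pause
// before each retry attempt.
async function fetchWithDelayedRetries(): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= 10; attempt++) {
    if (attempt > 0) await sleep(1000);
    try {
      const response = await fetchThing();
      if (response.ok) return response;
      lastError = new Error(`HTTP ${response.status}`);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```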

Depending on how lucky you get, you might last a little while here, but eventually your server will explode and you'll find yourself in the same crash loop we saw with no delay between retries. If the time you wait between retries isn't longer than the average time your clients would naturally wait before sending their next request, your retries still add to the overall load.

# Adding more servers

What if we add more capacity? More servers means more redundancy, right? And more redundancy means we'll recover more easily if we get overloaded, right?

Here's the same simulation as before, but with 2 servers instead of 1.

Hopefully you didn't spend too long watching, because this one won't have any fireworks. We've doubled our capacity, so of course we're much more able to handle the load now.

Rather than doubling the load, let's go a little less than double. Here's a simulation with 18 clients to get the ball rolling.

After only a short while, it all falls apart! You'll see one server crash, followed very shortly after by the other. Then they take turns crashing, as only one recovers at a time and all requests get routed to it.

Even without fully doubling the load, we find ourselves getting into an overload situation that is hard to recover from.

# So what's the answer?

TODO: this.