Over the last few weeks I've been grappling with a bug that just wouldn't leave me alone. In the end, it took almost a week of poring over TCP dumps and RFCs to figure out what was going on and, if I'm honest, the solution is a best guess because I can't figure out how to prove the theory.
It was a fun one, though, so I felt like writing it up :)
One of the systems I work on handles the retrieval of a large number of fairly large files (tens of gigabytes in some cases). It has to fetch and process these files reliably and within a time-constrained window. The files are (supposedly) structured data in a multitude of different schemas and formats (XML, TSV, CSV, etc.) and they all need to be normalized into the same schema and format (this will henceforth be referred to as "processing").
In the past we handled the download and processing as separate steps, storing the files in shared storage so we could have multiple workers doing the downloading and processing. In an effort to decommission some servers and speed the process up, the code was rewritten so that files were streamed in and processed as the bytes came through. This eliminated the need for shared storage and the complicated communication overhead that came with it.
Among a multitude of other problems, we had a couple of files getting connection resets before finishing. About 5-6 out of over 1000. The resets happened at arbitrary points in the download but they were predictable; it was always the same files getting their connection reset.
First thought: are they coming from the same server? Nope. The same problem was happening on downloads from a couple of different locations and it wasn't limited to one protocol either; it happened on both HTTP and FTP downloads.
A bit of log tailing revealed that processing could get quite slow in some scenarios. Not all data gets processed at the same speed. Connections appeared to be resetting after more than 60 seconds of idle time. Aha! I'll just set SO_KEEPALIVE on the socket and that should stop the other end from resetting us, right?
SO_KEEPALIVE actually had no effect on what was going on at all. TCP dumps revealed that keepalive packets were being sent, but to no avail. Connections were still being reset.
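For reference, turning keepalive on looks roughly like the sketch below. The original code isn't shown in this post, so this is plain Python using the standard socket module, and make_keepalive_socket and idle_seconds are names made up for illustration. TCP_KEEPIDLE is not available on every platform, hence the hasattr guard.

```python
import socket


def make_keepalive_socket(idle_seconds=30):
    """Create a TCP socket with keepalive probes enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Assumption: start probing after idle_seconds of silence. This option
    # only exists on some platforms (Linux among them), so guard it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock
```

Keepalive probes only show the peer that the connection is still there. They say nothing about whether the receiver is ready to accept more data, which may be why they made no difference here.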
The processing we do on these files is done in batches. This means that we read a bit from the wire, then do some processing, then repeat. If keepalive wasn't good enough for the remote hosts, fine, let's see if we can be a bit more clever about how we read.
My next effort was to process in smaller batches. Because batches don't all take the same amount of time, it was necessary to time each batch and dynamically resize the batches so we always fell inside this 60-second window (I actually aimed for a 10-second window, just to be on the safe side). The idea was that if we read smaller batches more frequently, the other end would be less inclined to reset the connection.
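The timing-and-resizing idea can be sketched as follows. This is not the code from the system described here; next_batch_size, process_in_timed_batches, and all of the defaults are assumptions chosen for illustration, with the 10-second target taken from the paragraph above.

```python
import time


def next_batch_size(current_size, elapsed, target_seconds=10.0,
                    min_size=1, max_size=100_000):
    """Scale the batch so the next one should take roughly target_seconds."""
    if elapsed <= 0:
        # Too fast to measure: just double, within the upper bound.
        return min(max_size, current_size * 2)
    scaled = int(current_size * target_seconds / elapsed)
    return max(min_size, min(max_size, scaled))


def process_in_timed_batches(records, process_batch, batch_size=1000,
                             target_seconds=10.0, clock=time.monotonic):
    """Feed records to process_batch, resizing batches after timing each one."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            started = clock()
            process_batch(batch)
            elapsed = clock() - started
            batch_size = next_batch_size(len(batch), elapsed, target_seconds)
            batch = []
    if batch:
        # Flush whatever is left once the input runs out.
        process_batch(batch)
```

A batch of 1000 records that took 20 seconds would be followed by a batch of 500, and one that took 5 seconds by a batch of 2000.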
No dice. The above idea of artificially shaping the traffic is actually pretty naive. I didn't realise, until digging into this problem, how many layers of buffering there are between the application and the kernel. Just because the application is consuming at a given rate doesn't mean that data isn't being buffered lower down.
Shaping traffic at the application level would be a very hard task. Each host using TCP advertises a "window size" (more on this in a minute), starting with the initial SYN/SYN-ACK/ACK handshake, and resizes it dynamically throughout the connection's life. This information isn't communicated to the application layer. It would be very hard to know whether a read on a socket triggered a network read or not, though I suppose you could guarantee it by always reading a number of bytes equal to the size of the receive buffer... This may depend on your language's implementation of socket communication, though, and on whether you have libraries between you and the socket doing their own buffering. Food for thought.
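As a rough illustration of that last idea, the kernel's receive buffer size can be queried with SO_RCVBUF and used as the read size. This is a Python sketch with hypothetical helper names, and it is untested as a traffic-shaping technique: recv returns whatever is already buffered, up to the requested size, so it does not by itself tell you whether anything was read off the network.

```python
import socket


def receive_buffer_size(sock):
    """Return the kernel receive buffer size for this socket, in bytes."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def read_one_buffer(sock):
    """Read at most one receive buffer's worth of data from the socket."""
    return sock.recv(receive_buffer_size(sock))
```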
TCP and the sliding window
Feeling disheartened and defeated, I decided to read up on the nitty-gritty details of TCP. I was convinced we were doing something wrong somewhere. My main source of information was the fantastic TCP/IP Guide. I'll give a brief overview of the details that led me to what I believe is the answer.
TCP is a connection-oriented protocol. This means that to communicate with a remote host, you need to negotiate and maintain a connection with it. This is done by first sending a SYN (synchronise) packet, then the remote host sends a SYN-ACK (synchronise-acknowledge) back, and finally you send an ACK (acknowledge). This establishes the connection between you and the remote host and the socket remains open until a FIN (finish) or a RST (reset) packet is sent.
TCP creates an abstraction which guarantees that between two applications you can send a stream of bytes that will be presented in the correct order and in an uncorrupted state. It guarantees this by requiring all data to be acknowledged after being sent. If the data isn't acknowledged, it gets retransmitted until it is. This is a gross oversimplification but it'll do for now.
Memory is finite. Because of this, both ends of the connection must agree on how much data can be in flight at any given time. One end may not be able to buffer as much information as the other end and we don't want to lose bytes because we've run out of buffer space. This is where "window size" comes into play.
At first, each end advertises the window size it can accept. Let's say the receiving end advertises 65 KB. This means that until data is acknowledged by the receiving end, the sending end can only transmit 65 KB of data. After it has sent 65 KB of data without receiving a reply, it must stop. Once data starts being acknowledged, it will be able to continue on its merry way.
If connections are slow, or become slow over time, we don't want one end to overwhelm the other. The sliding window can be dynamically adjusted by sending window update packets specifying the new window size. This new size can even be 0 if the receiving end is really struggling to consume the data it is being given. This is, I believe, where our problem lay.
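To make the zero-window case concrete, here is a toy model in Python. It is a deliberate simplification rather than real TCP: there are no packets, acknowledgements, or retransmissions, and simulate_receive_window and its parameters are invented for illustration. The receiver advertises whatever buffer space is left, and the sender never sends more than the last advertised window.

```python
def simulate_receive_window(buffer_capacity, send_per_tick,
                            consume_per_tick, ticks):
    """Return the window the receiver advertises after each tick (toy model)."""
    buffered = 0
    window = buffer_capacity  # the initially advertised window
    advertised = []
    for _ in range(ticks):
        # The sender may not exceed the last advertised window.
        sent = min(send_per_tick, window)
        buffered += sent
        # The application drains the buffer at its own pace.
        buffered -= min(consume_per_tick, buffered)
        # The receiver advertises whatever space is left.
        window = buffer_capacity - buffered
        advertised.append(window)
    return advertised
```

With consume_per_tick set to 0, which stands in for an application stuck in a long processing batch, the advertised window drops to zero as soon as the buffer fills and stays there until the application reads again.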
The problem with zero windows
After a nudge from a co-worker on the subject of us possibly looking like we're trying to perform a denial of service attack, I found the following note:
Part of the Transmission Control Protocol (TCP) specification (RFC 1122) allows a receiver to advertise a zero byte window, instructing the sender to maintain the connection but not send additional TCP payload data. The sender should then probe the receiver to check if the receiver is ready to accept data. Narrow interpretation of this part of the specification can create a denial-of-service vulnerability. By advertising a zero receive window and acknowledging probes, a malicious receiver can cause a sender to consume resources (TCP state, buffers, and application memory), preventing the targeted service or system from handling legitimate connections.
Sounds quite a lot, if not exactly, like what we're doing (but legitimately!). Bugger. In all of the packet dumps I inspected, the connection reset packets were sent directly after we tried to re-open a small window after having a zeroed window for longer than a minute.
My guess is that this is a warning sign for some DoS prevention software. We're tripping this rule and they're killing us (rightly so!).
The fix is quite simple: stop reading so slowly. I always knew this is what it would come to but I didn't want to resort to it without knowing why. In the end I implemented code that downloaded to the file system but started processing the file straight away, waiting on more bytes if we got to the end before the download was finished. This ensures that the download happens at full speed but processing doesn't need to wait for the download to be finished.
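A minimal sketch of that approach in Python, assuming line-oriented input and a downloader running in a separate thread. The names download_to_file and follow_lines, the polling interval, and the chunk size are all hypothetical; the real implementation isn't shown in this post.

```python
import threading
import time


def download_to_file(chunks, path, download_finished):
    """Write incoming chunks to disk at full speed, then flag completion."""
    try:
        with open(path, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
                handle.flush()  # make the bytes visible to the reader
    finally:
        # Set the flag even on failure so the reader never waits forever.
        download_finished.set()


def follow_lines(path, download_finished, poll_seconds=0.05):
    """Yield lines from path, waiting for more bytes until the download ends."""
    pending = b""
    with open(path, "rb") as handle:
        while True:
            # Sample the flag before reading so a final write is never missed.
            finished = download_finished.is_set()
            chunk = handle.read(64 * 1024)
            if chunk:
                pending += chunk
                # Keep any trailing partial line for the next read.
                *lines, pending = pending.split(b"\n")
                yield from lines
            elif finished:
                break
            else:
                time.sleep(poll_seconds)
    if pending:
        # The file did not end with a newline; emit the last line as is.
        yield pending
```

The file needs to exist before follow_lines opens it. Here download_finished is a threading.Event shared between the two sides, and chunks would be whatever iterable the HTTP or FTP client exposes for the response body.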
Still, I've no idea how to confirm this theory. It just fits neatly is all. If you've read this post to the end and you know more than I do about these things, I'd love to have a chat with you :)