You Need to Know Networking

Knowing a bit about networking will pay dividends in your software engineering career. I’ve seen many engineers write off problems their service is having during incidents as “a networking problem” without much thought as to what their service is actually doing. I’ve been in meetings where a senior engineer has said “I don’t want to have to learn networking to be able to call other services.” I’ve had arguments with teams whose design decisions overload load balancers, waking me up at odd hours for issues that I could do nothing about. What I thought was “obvious” about how computers communicate with each other turns out not to be for many folks. And that’s okay.

I’m suggesting that there’s another way to think about networking - rather than abstract it away as a “black box” or a cloud on a diagram, embrace it and understand it. Just like many other surfaces in this service-oriented chaos that is modern software architecture, there’s a lot going on behind the scenes that can lead to different optimizations in your code. Runtime complexity, database indexing, API design, and the choice of language or technology to power a stack are choices we scrutinize and pay deep attention to while developing a service. I think how the service utilizes the network deserves a seat at the table, too.

Naive Assumptions

The fastest algorithm for serving 15-second cat videos, powered by the most efficient compression technology and served from an in-memory cache, will still perform like garbage if it has to establish a new connection to another server every time it serves a response. There’s overhead in opening a new socket at both the OS level and the software session level. This overhead can add up in ways that will cause future you to curse past you for implementing things the way you did.

Let’s look at an extreme case of what sort of failures and latency can happen with this hypothetical example:

Your service has gotten the request for a cat video, and is ready to serve it from memory. It needs to log to another analytics service in order to catalogue that the user watched this video. Let’s say that the analytics service is a RESTful API listening over HTTPS, with a ~5ms network latency to it. And we’ll assume that DNS resolution, in and of itself a source of strife, has completed.

The video-sending server opens up a new socket to the analytics server. Or… it tries to. Did you know that there’s a limit on how many file descriptors a process can have open? And a fixed range of ephemeral ports, which caps the number of active network connections available? If those limits are hit, your service could now be stuck in an error loop retrying connections to the analytics service, while also being unable to service new connections because there simply aren’t enough ports!
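If you want to see those ceilings for yourself, here’s a minimal sketch in Go (Unix-only, and my own illustration rather than anything from this example) that prints the file descriptor limit for the current process. On Linux, the ephemeral port range is visible in /proc/sys/net/ipv4/ip_local_port_range, and ss -s will give you a rough count of the sockets currently in use.

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        // RLIMIT_NOFILE is the per-process cap on open file descriptors,
        // which includes every open socket.
        var rl syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
            fmt.Println("getrlimit failed:", err)
            return
        }
        // Cur is the soft limit the process actually hits; Max is the hard
        // ceiling it could be raised to.
        fmt.Printf("file descriptor limit: soft=%d hard=%d\n", rl.Cur, rl.Max)
    }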

But let’s say that a session gets set up. The TCP three-way handshake happens, which is already 15ms of latency (5ms each for your service sending a SYN, the other service responding with a SYN-ACK, and your service responding with an ACK). Then, since we’re using HTTPS, the TLS key exchange and session setup have to happen. Let’s say this takes another 40ms, plus 5ms of latency for the network call. We’re now at 60ms, and we haven’t even sent any data yet!

Now we’re at the point where we can send data to the analytics service, so we PUT our payload through. The analytics service says their P99 SLA is 40ms to record a call. They’re having a good day and hitting that SLA, so their response takes 40ms to generate. But since there’s 5ms of network latency each way, our service really ends up waiting another 10ms on top of that: 5ms for the request to reach them, and 5ms for the response to come back to us.

We’re now eating at least 110ms of unavoidable latency on every request, just to serve a video that’s already sitting in memory. Is that acceptable? What happens when we have thousands of requests per second to this service?
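If you want to see where that time actually goes on your own services, Go’s net/http/httptrace package lets you timestamp each phase of a request: DNS, TCP connect, TLS handshake, first response byte. Here’s a rough sketch (the URL is just a placeholder):

    package main

    import (
        "crypto/tls"
        "fmt"
        "net/http"
        "net/http/httptrace"
        "time"
    )

    func main() {
        start := time.Now()
        trace := &httptrace.ClientTrace{
            DNSDone: func(httptrace.DNSDoneInfo) {
                fmt.Println("dns done:          ", time.Since(start))
            },
            ConnectDone: func(network, addr string, err error) {
                fmt.Println("tcp connect done:  ", time.Since(start))
            },
            TLSHandshakeDone: func(tls.ConnectionState, error) {
                fmt.Println("tls handshake done:", time.Since(start))
            },
            GotFirstResponseByte: func() {
                fmt.Println("first byte:        ", time.Since(start))
            },
        }

        // Placeholder URL; point this at something you actually call.
        req, err := http.NewRequest("GET", "https://example.com/", nil)
        if err != nil {
            fmt.Println("building request failed:", err)
            return
        }
        req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

        resp, err := http.DefaultTransport.RoundTrip(req)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        resp.Body.Close()
    }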

Be Less Contrived

Sure, this was a contrived example. Connection reuse and TLS session resumption mitigate most of it: you keep a set of established sockets open, hold onto their TLS state, and re-use them to securely send subsequent requests, cutting out most of that latency. But the first call will always pay that setup cost.
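In Go, for example, that reuse mostly comes down to holding onto one http.Client (and its http.Transport) for the life of the process instead of building a new one per request. A minimal sketch, with the analytics endpoint and pool sizes invented purely for illustration:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    // One shared client for the whole process. Creating a new Client (or a new
    // Transport) per request throws away the pool of warm connections.
    var analyticsClient = &http.Client{
        Timeout: 2 * time.Second,
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,               // keep some connections to the analytics host warm
            IdleConnTimeout:     90 * time.Second, // recycle connections that sit idle too long
        },
    }

    func recordView(videoID string) error {
        // Hypothetical endpoint standing in for the analytics service above.
        resp, err := analyticsClient.Post(
            "https://analytics.example.com/views?video="+videoID,
            "application/json", nil)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        // Draining the body lets the connection go back into the pool for reuse.
        io.Copy(io.Discard, resp.Body)
        return nil
    }

    func main() {
        if err := recordView("cat-123"); err != nil {
            fmt.Println("analytics call failed:", err)
        }
    }

After the first request to a given host, later calls can skip the TCP and TLS handshakes entirely as long as an idle connection is still sitting in the pool.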

You’d be surprised at how much time people spend trying to explain “latency spikes” for their shiny serverless function, without realizing that there’s always an initial cost for communicating over the network. And that’s even without going into things like Nagle’s algorithm or language-specific startup times and server initialization.

The point I’m trying to drive home is that knowing a bit about how networking works could have led to a better design (asynchronous requests, connection pooling, offline analysis, batching) that has very little to do with networking in and of itself.
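As a sketch of what one of those designs might look like: the request path just drops an event onto a buffered channel, and a background goroutine batches events and flushes them on its own schedule. Everything here (the event shape, batch size, flush interval) is made up for illustration:

    package main

    import (
        "fmt"
        "time"
    )

    type viewEvent struct {
        VideoID string
        At      time.Time
    }

    // Buffered channel so the request path never blocks on analytics.
    var events = make(chan viewEvent, 10000)

    // recordView is what the video handler calls; it never touches the network.
    func recordView(videoID string) {
        select {
        case events <- viewEvent{VideoID: videoID, At: time.Now()}:
        default:
            // Buffer full: drop the event rather than slow down video serving.
        }
    }

    // flushLoop batches events and sends them by size or by time, whichever
    // comes first.
    func flushLoop() {
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        batch := make([]viewEvent, 0, 500)
        for {
            select {
            case ev := <-events:
                batch = append(batch, ev)
                if len(batch) == cap(batch) {
                    sendBatch(batch)
                    batch = batch[:0]
                }
            case <-ticker.C:
                if len(batch) > 0 {
                    sendBatch(batch)
                    batch = batch[:0]
                }
            }
        }
    }

    func sendBatch(batch []viewEvent) {
        // In a real service this would be a single call to the analytics API
        // over a pooled connection; here we just print.
        fmt.Printf("flushing %d events\n", len(batch))
    }

    func main() {
        go flushLoop()
        recordView("cat-123")
        time.Sleep(2 * time.Second) // let the demo flusher run before exiting
    }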

Don’t “Blame the Network”

Networking setups (especially cloud-flavored ones) lack transparency as to what’s going on. It’s frustrating operating at a higher level of abstraction - that frustration is partly what led me to leave my previous job. That said, having some data points to go on when you’re troubleshooting a production issue will help you when you ask the networking team or service for help. Here are some things to think about:

  • What’s the error message your stack is emitting? Is it a timeout, a connection refused, or a malformed response? (There’s a sketch of telling these apart below.)
  • Does this happen sporadically or every time?
  • What does your latency look like? Is that normal?
  • What’s the load of your service when this happens?

These questions can help narrow down where things are going wrong. They can each provide more actionable investigation paths, and can lead to looking at your code and finding inefficiencies that will improve performance across the board.
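On the first question in particular, your language usually gives you enough to tell those cases apart. A rough sketch in Go (the URL is a placeholder, and the hints are generalizations to point your investigation, not a diagnosis):

    package main

    import (
        "errors"
        "fmt"
        "net"
        "net/http"
        "syscall"
        "time"
    )

    // classify turns a failed HTTP call into a rough hint about where to look.
    func classify(err error) string {
        var dnsErr *net.DNSError
        var netErr net.Error
        switch {
        case errors.As(err, &dnsErr):
            return "DNS failure: bad name, or resolution is broken from this host"
        case errors.Is(err, syscall.ECONNREFUSED):
            return "connection refused: the host answered, but nothing is listening on that port"
        case errors.As(err, &netErr) && netErr.Timeout():
            return "timeout: the target may be down, overloaded, or dropping packets"
        default:
            return "something else: read the full error text carefully"
        }
    }

    func main() {
        client := &http.Client{Timeout: 2 * time.Second}
        // Placeholder URL standing in for whatever dependency is failing.
        resp, err := client.Get("https://analytics.example.com/health")
        if err != nil {
            fmt.Println(classify(err))
            return
        }
        resp.Body.Close()
    }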

Where Do I Go From Here?

The best way to learn networking is to experiment and explore what you don’t know. Look into how your OS handles network calls. Learn about how handshakes in TCP and TLS work. Understand what sort of errors your framework will return when there’s a networking issue while calling RPCs. Figure out how DNS works at a high level and how it fits into finding what server to call. If you have a load balancer in front of your app, figure out how it’s configured (is it balancing at the HTTP level or at the TCP connection level?). Be familiar with the OSI model, as the industry loves using it to describe different parts of the stack. Pay attention to what affordances your language gives you for session reuse (e.g. Golang’s http.Transport). And consider how your service behaves when its customers are accessing it through a crappy cell phone network across the country.
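A good first experiment is small: time a single DNS lookup from wherever your service actually runs and see whether the number matches your intuition. A tiny sketch (the hostname is a placeholder):

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func main() {
        start := time.Now()
        // Placeholder hostname; try the dependencies your service really calls.
        addrs, err := net.DefaultResolver.LookupHost(context.Background(), "example.com")
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        fmt.Printf("resolved %v in %v\n", addrs, time.Since(start))
    }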

Folks might rightly say that some of this stuff isn’t “networking” because it doesn’t have to do with how many bits are being sent across a link at a given time - after all, TLS and DNS are hardly a networking problem. I think that taking a holistic view of what a developer has to consider when their service talks to something else is more productive than memorizing CIDR notation and subnetting, evaluating the efficacy of jumbo frames, or knowing how a firewall evaluates rule sets. These concepts are important, but there’s often much lower-hanging fruit that leads to better design decisions and blurs the line between networking and application behavior.