Load balancing
A load balancer is the bouncer in front of your service: it sees every connection, decides which backend it belongs to, and quietly skips the ones that are sick. At small scale that's a couple of HAProxy boxes. At Google or Cloudflare scale it's anycast IPs in front of XDP packet steerers in front of HTTP proxies in front of application servers, with consistent-hashing tables tying it all together so that removing one machine reshuffles a few thousand connections instead of a few million. This deep dive walks the layers — L4 versus L7, Maglev, anycast, health checks, graceful drain, the DDoS angle, and why sticky sessions are usually a smell.
What a load balancer actually does
Strip away the marketing and a load balancer sits in the middle of a conversation: clients on one side, a pool of backends on the other. For every connection or request, it picks a backend and forwards traffic there. It has three jobs, and they're really the same job at different timescales.
The first is to distribute load. If you have ten backends and a thousand requests per second, you want roughly a hundred each. Naive round-robin gets you most of the way there for uniform workloads; least-connections or power-of-two-choices does better when requests differ in cost. The second is to isolate failures: when one backend turns sick — out of memory, stuck GC, half-open TCP — the LB has to notice and route around it before users do. The third is to absorb topology change: backends come up, drain, get killed, scale out, scale in. The LB has to track the live set and reshuffle traffic without restarting the world.
Everything else — TLS termination, HTTP routing, A/B testing, rate limiting, WAF — sits on top of those three jobs. If the basics are wrong, no amount of L7 polish will save you.
L4 vs L7 — the fundamental split
Load balancers come in two flavours, and the difference is whether they read the payload or just shuffle packets.
An L4 load balancer works at the transport layer. It sees a TCP or
UDP packet, looks at the five-tuple — (protocol, src IP, src port, dst IP, dst
port) — picks a backend, rewrites the destination (or encapsulates the
packet), and forwards it. It doesn't terminate TLS. It doesn't parse HTTP. It doesn't
know whether the payload is a GET request or a video frame. Because it touches so
little, it's fast: an XDP-based L4 LB like Katran can push roughly
10 million packets per second per core, with per-packet latency in
the low microseconds.
An L7 load balancer works at the application layer. It terminates the TCP connection (and the TLS session above it), parses the HTTP request, and can route on path, host, headers, cookies, or anything it can read off the wire. It can do A/B tests, canary deploys, header rewrites, retry logic, request mirroring, circuit breaking. All of that costs CPU: a tuned Envoy or NGINX manages around 100k requests per second per core with full TLS termination, two orders of magnitude less than the L4 layer.
client ─── BGP <a class="il" href="/simulators/anycast">anycast</a> ───▶ [ L4 LB pool ] // Maglev / Katran
│
ECMP within DC
│
▼
[ L7 LB pool ] // Envoy / NGINX
│
▼
[ application ]The split exists because the two jobs want different hardware and different blast radii. L4 absorbs the volume — DDoS, SYN floods, the 10 Gbps of garbage some botnet points at you — and only forwards. L7 sees clean, terminated, decrypted requests from a much smaller surface. Run L7 with no L4 in front and your TLS handshakes become the front line, which is a bad place for them to be.
Algorithms — round-robin, least-connections, hash-based
Once the LB has the request, it has to pick a backend. Four families of algorithms are in widespread use; they differ in how much state they keep and how gracefully they degrade.
| Algorithm | State | Good for | Falls over when |
|---|---|---|---|
| Round-robin | Counter | Uniform backends, uniform requests | One backend gets slow — queue depth balloons |
| Least-connections | Per-backend live count | Long-lived connections, mixed request cost | Long-tail requests skew the count |
| Power-of-two-choices | Two random samples | Close to least-conn at a fraction of the cost | Pathologically uneven backends |
| Consistent hash / Maglev | Hash table | Session affinity, cache locality | Hot keys (one client dominates a backend) |
Round-robin is the default for a reason — it's trivial, keeps no per-request state, and is fine when backends are interchangeable and requests are cheap. The moment one backend gets slow (GC pause, paged-out heap, disk hiccup), round-robin keeps feeding it the same share and the queue depth there explodes. Least-connections (or least-outstanding-requests) fixes that by routing to whichever backend currently has the fewest in-flight requests, at the cost of a little bookkeeping per connection.
Power-of-two-choices is a neat compromise: for each request, pick
two backends at random and send to the less loaded of the two. Mitzenmacher's 1996
thesis showed this gets you almost all the benefit of full least-connections at a
fraction of the cost — the max load grows as log log N instead of
log N / log log N. NGINX's random two least_conn and Envoy's
LEAST_REQUEST with a configurable choice count both implement it.
Hash-based routing is what you reach for when a request or connection needs to land on the same backend every time — for session affinity, for cache locality, or because the backend is sharded by key. That's where consistent hashing earns its keep.
Consistent hashing — when backends come and go
Naive hashing — backend = hash(key) % N — has a brutal failure mode:
the moment you change N, almost every key remaps to a different backend.
Add a backend and roughly (N-1)/N of your cache evaporates. For a 10-node
cluster, removing one node moves 90% of the keyspace. For anything cache-shaped,
that's catastrophic.
Consistent hashing, from Karger et al.'s 1997 paper at MIT, fixes this. Picture the hash space as a ring from 0 to 2^32 - 1. Hash each backend onto the ring (multiple positions per backend, so the arcs are evenly sized — these are usually called virtual nodes or vnodes, typically 100–200 per real backend). To route a key, hash it onto the ring and walk clockwise until you hit the first backend. That's the owner.
Remove a backend and only the arc it owned remaps — roughly K/N of the
keyspace, where K is the total keys and N is the backend
count. Adding a backend is symmetric. Akamai built their whole CDN on this; cache
shards (Memcached, Redis Cluster), database partitioners (DynamoDB, Cassandra), and
L4 load balancers all use variants of the same idea.
The drawback of the basic ring is load imbalance — with only a few backend positions you get arcs of very uneven size, and one backend ends up with twice the traffic of its neighbour. Virtual nodes (100–200 hash positions per real backend) smooth this out by sampling more of the ring. Maglev (next section) is a different way to get the same uniformity guarantee with O(1) lookup.
Maglev — Google's hash-based L4 LB
Maglev is the L4 load balancer Google described in their NSDI 2016 paper. It replaced the old hardware appliances and now fronts most Google services. Cloudflare's Unimog (2020) and Meta's Katran (open-source XDP, 2018) are direct descendants.
Maglev's core idea: instead of walking a ring, build a fixed-size lookup
table with M entries (Google uses M = 65537, a
small prime). Each backend gets a deterministic permutation of the table indices, and
they take turns filling empty slots in their permutation order until the table is
full. The result: each backend owns roughly M/N entries, evenly spread,
and lookup is a single hash plus a table read — O(1), branch-free, friendly to XDP and
DPDK.
// pseudocode, after table build
backend = table[ hash5tuple(packet) % 65537 ]The hash is over the five-tuple, so every packet of the same TCP flow lands on the same backend with no per-connection state on the LB. That's the magic: Maglev is stateless. Every Maglev instance in the cluster builds the same table from the same backend list, so a packet can land on any LB and still reach the right backend. No conntrack, no sync, no flow-pinning sidechannel.
When the backend set changes, the table is rebuilt. Maglev's permutation algorithm
keeps disruption low: empirically, removing one backend out of N moves
roughly 1/N of the table entries, the theoretical minimum. So most
existing flows keep landing on the same backend; only the ones that hashed into the
dead backend's slots get redistributed.
L4 vs L7 — what packets actually see
The diagram is the difference in one frame. L4 sees the IP and TCP/UDP headers and nothing else; the payload is opaque, which is exactly what makes it fast and protocol-agnostic. L7 terminates TLS, decodes HTTP, and can do almost anything with the result — but it has paid for every cryptographic and parsing step to get there.
Anycast — one IP, many locations
Anycast is the trick that lets 8.8.8.8 be one IP
address served from hundreds of physical locations around the world. The same prefix
(say, 8.8.8.0/24) is advertised over BGP from many points of presence;
each network on the internet picks the route with the shortest AS path, so the
client's packets land at the nearest PoP without the client doing anything. DNS root
servers, Cloudflare, Fastly, CloudFront, and most modern CDNs all use
anycast as the front door.
For stateless protocols like DNS, anycast Just Works. Each query is a single UDP packet; whichever PoP receives it answers, and that's the end of the conversation. The next query from the same client might go to a different PoP if BGP reconverges, and nobody cares.
For stateful protocols like TCP and QUIC, things are trickier. A TCP connection's packets have to keep landing on the same backend for its whole lifetime, but a BGP route change midway through can flip a flow to a different PoP — and even within the right PoP, the upstream router uses ECMP to spread traffic across many L4 LBs. Two saving graces: ECMP within the DC is usually flow-hashed (so packets of the same five-tuple go to the same Maglev instance), and the Maglev table is identical on every instance (so even if a flow hits a different LB, the same backend is picked). Inter-PoP route changes are still a hazard — Cloudflare's blog has good post-mortems on the rare cases this matters.
Health checks — telling backends apart
A load balancer that doesn't know which backends are healthy is just a reliable way to send traffic to dead machines. Health checks come in two flavours, and you almost always want both.
Active checks are polls: the LB hits /healthz on each
backend every 5–10 seconds and marks it down if it sees a non-2xx response or a
timeout. The upside is predictability (you control exactly what gets checked) and a
clean signal. The downside is lag: a backend that goes sick at t=0 might not be caught
until t=10s, so ten seconds of traffic gets routed to a dead box.
Passive checks watch live traffic: if a backend returns 5xx, RSTs,
or hits a connect timeout above some rate, the LB ejects it. The upside is speed —
failures show up in real time. The downside is over-reaction: a transient burst (a
slow database query, a brief GC) can trigger an eviction the active probe would have
ignored. The fix is to require N consecutive failures before ejection and to cap the
fraction of the pool that can be ejected at once (Envoy's panic_threshold
is the canonical name for this).
Then there's the awkward middle case — the half-open backend. The
TCP listener still accepts connections (so the LB's connect probe succeeds) but the
application never sends bytes (deadlocked, stuck on a lock, waiting forever for a
downstream that will never reply). A pure connect-style health check sails right past
it. Production stacks guard against this by making active checks application layer
(HTTP GET expecting a 200, or a gRPC Check RPC) rather than connect-only,
and by setting application-layer read timeouts so a stuck request returns an error
instead of hanging.
| Check type | Typical interval | Catches | Misses |
|---|---|---|---|
| TCP connect | 2–5 s | Crashed process, OOM | Stuck applications, slow backends |
| HTTP /healthz | 5–10 s | App-level failure, dependency loss | Bugs that don't surface on /healthz |
| Deep / synthetic | 30–60 s | Functional regressions | Per-request issues |
| Passive (live) | continuous | Fast-onset failures | Slow degradation under low traffic |
The connection-tracking problem
An L4 LB has to send every packet of a TCP flow to the same backend. Otherwise the backend that receives a mid-stream ACK has no idea which connection it belongs to and just sends a RST. The naive way to enforce this is connection tracking (conntrack): the LB keeps a hash table from five-tuple to chosen backend, filled on the SYN and consulted on every later packet.
Conntrack works fine until traffic stops fitting in memory. Linux's
nf_conntrack hash table defaults to a few hundred thousand entries; under
a DDoS — a million spoofed SYNs per second — it overflows in seconds and starts
evicting real connections. Worse, conntrack state has to be synced between LB
instances if you want failover, and that sync is its own bandwidth and consistency
problem.
Maglev's stateless hashing sidesteps the whole thing. Because every LB instance derives the same backend for the same five-tuple, there's no per-flow state to keep. The lookup table itself is small (a few hundred KB for a 65537-entry table with 16-bit backend IDs) and is the same on every LB. The trade-off shows up at backend transitions: when the backend set changes and the table is rebuilt, a small fraction of in-flight flows will hash to a different backend and get RST. For most services that's fine; for very long-lived flows (WebSocket, video streaming) you layer graceful drain on top.
Graceful drain — taking a backend out without dropping users
Rolling deploys, autoscaling events, kubectl drain, manual debugging —
there are many reasons to pull a backend from a pool while it's still serving traffic.
The wrong way is to kill the process: in-flight requests get RST, open
WebSockets are torn down, and the client sees an error. The right way is
graceful drain.
The pattern is the same everywhere:
- Mark the backend as draining in the LB. New connections stop being routed to it.
- Let existing connections finish naturally, up to a deadline (usually 30s–5min, much longer for streaming services).
- Force-close any remaining connections.
- Remove the backend from the pool and terminate.
In Kubernetes, this is what preStop hooks and
terminationGracePeriodSeconds are for. The hook should flip the pod's
readiness probe to fail (so the Service endpoint controller pulls it out of the pool)
and then sleep long enough for the LB to notice — typically 10–15 seconds for
kube-proxy or the cloud LB to converge. Without that pause the LB keeps routing
traffic to the pod while it's shutting down, and you get the exact failure the whole
dance was meant to prevent.
For long-lived stateful protocols — WebSocket, gRPC streams, video — graceful drain is the difference between a rolling deploy being invisible and being a paging event. A streaming media service with no drain drops every user mid-stream every time it deploys. With drain, the user finishes the current segment, the next segment request lands on a different backend, and they never notice.
Sticky sessions — when stateless isn't
Most of this deep dive has assumed any backend can serve any request. Sometimes that's false: an in-memory cart, a long-poll connection, an in-flight multipart upload, an RTC SFU, a WebSocket carrying a session — all of these need a client's later requests to keep landing on the same backend. That's what sticky sessions (or session affinity) buys you. There are three families.
| Method | Where the affinity lives | Breaks when |
|---|---|---|
| Cookie-based | LB-set cookie names the backend | Client drops cookies (private browsing, mobile webviews) |
| Header-based | Consistent hash on a session header (JWT subject, tenant ID) | Header missing or rotates |
| Client-IP hash | Hash on src IP | CGNAT collapses many users to one IP; mobile users roam IPs |
Client-IP hashing is the trap most teams fall into. It looks cheapest — no cookies, no headers, just hash the IP. But behind carrier-grade NAT, tens of thousands of mobile users share a single egress IP, all hashing to the same backend. One mobile carrier can melt one of your backends while the rest sit idle, and nobody can see why. The NAT deep dive covers CGNAT in detail; for load balancing, just assume mobile traffic shares IPs.
The modern answer is to externalise session state — keep it in Redis, Memcached, or a row in a small KV store — so any backend can pick up where another left off. Then you can route round-robin and not care. Treat sticky sessions as a stopgap for protocols that can't be made stateless (WebSocket, WebRTC), not as an architectural choice.
DDoS at the LB
The load balancer is the front door, which makes it the punching bag too. There are three layers of defence, each aimed at a different kind of attack.
Anycast spreads volumetric attacks across the whole edge. A botnet with 1 Tbps of capacity, pointed at an anycast IP served from 200 PoPs, lands roughly 5 Gbps on each — well within absorption capacity, with no single facility taking the brunt. This is why CDNs shrug off attacks the origin would have buckled under.
SYN cookies defend against SYN floods, where an attacker sends a
stream of SYNs from spoofed source IPs and never completes the handshake. A naive
server allocates a SYN_RCVD entry for each, the table overflows, and real clients can
no longer get through. SYN cookies make the handshake stateless on the server side:
the SYN-ACK's sequence number is a cryptographic function of the connection tuple plus
a rotating secret. The server allocates nothing. If a real client's ACK comes back,
the server verifies the cookie and rebuilds the connection state from scratch. The
kernel turns this on automatically under SYN-table pressure
(net.ipv4.tcp_syncookies = 2 to force it always-on).
Per-IP rate limiting at L4 catches the noisy single sources: this IP is sending 10,000 SYN/s, drop everything from it. XDP makes this cheap — a few thousand CPU cycles per packet — and Cloudflare's L4Drop is a well-documented example.
Layer 7 attacks — HTTP floods, slow-loris, complex request floods built to exhaust application CPU — are a different beast. Volumetric defences don't help, because the traffic looks legitimate at the packet level. You need a Web Application Firewall, behavioural analysis, or interactive challenges (JS challenge, CAPTCHA, proof-of-work). This is where Cloudflare, Akamai, and Fastly earn their money; doing it yourself is a multi-year project.
Common mistakes
- Running L7 with no L4 in front. Your TLS handshakes become the
front line; a SYN flood or even a moderate slow-loris exhausts the proxy's
socket budget. Put an L4 stateless tier in front, even if it is just one
HAProxy box in TCP mode with
tcp_syncookiesenabled. - Forgetting graceful drain on rolling deploys. Every restart resets in-flight connections. For HTTP/1 it manifests as a tiny error rate during deploys; for WebSocket or gRPC streams it manifests as every user getting disconnected at deploy time.
- Client-IP hashing behind CGNAT. One mobile carrier hashes to one backend and melts it; the rest of the fleet sits idle. Use cookies or a session header for affinity, never client IP, in any service that takes mobile traffic.
- Connect-only health checks. The backend is half-open — TCP
accepts, the app is stuck. The LB keeps sending traffic. Use HTTP
/healthzor a gRPCCheckRPC and require a real 200 / OK response, not just a TCP handshake. - Not setting
SO_REUSEPORTfor sharded UDP listeners behind XDP. Without it, the kernel hands every packet to a single listening socket and your fancy XDP-distributed pps land on one core. With it, each worker binds its own socket and the kernel hashes flows across them. - Conntrack as the default L4 LB strategy. It works at small scale and breaks the first time you take real attack traffic. Stateless hashing (Maglev / Katran) costs more upfront and is the right default at any non-trivial scale.
Further reading
- Maglev: A Fast and Reliable Software Network Load Balancer — Eisenbud et al., NSDI 2016. The canonical paper. Read this first; everything else in this list is in conversation with it.
- Katran (Meta) — open-source XDP-based L4 LB. Read the code alongside the introductory blog post; it is the most accessible production reference implementation of Maglev-style hashing in the wild.
- Unimog — Cloudflare's edge load balancer — 2020 blog post on Cloudflare's L4 LB. Good complement to the Maglev paper: what changed when they had to deal with anycast at internet scale.
- Envoy — Load balancing overview — the L7 perspective. Round-robin, least-request, ring-hash, Maglev, all implemented in one place with knobs documented; a good cross-reference when you're picking an algorithm in production.
- SRE Workbook — Load Balancing in the Datacenter — Google's writeup of how they think about LB in production, including the distinction between L4 and L7 tiers and the case for stateless hashing.
- Brendan Gregg — BPF / XDP for networking — the foundational writeups on running packet-processing logic in the kernel, which is the layer XDP-based LBs live in.
- Karger et al. — Consistent Hashing and Random Trees — the 1997 STOC paper that started it all. Worth reading once for the proof, even if production implementations have moved on to Maglev-style tables.
- Mitzenmacher — The Power of Two Choices in Randomized Load Balancing
— the 1996 paper behind
random-two-least-conn. Short, surprising, directly applicable.