12 / 12

Stack / 12

Load balancing

A load balancer is the bouncer in front of your service: it sees every connection, decides which backend it belongs to, and quietly skips the ones that are sick. At small scale that's a couple of HAProxy boxes. At Google or Cloudflare scale it's anycast IPs in front of XDP packet steerers in front of HTTP proxies in front of application servers, with consistent-hashing tables tying it all together so that removing one machine reshuffles a few thousand connections instead of a few million. This deep dive walks the layers — L4 versus L7, Maglev, anycast, health checks, graceful drain, the DDoS angle, and why sticky sessions are usually a smell.

What a load balancer actually does

Strip away the marketing and a load balancer sits in the middle of a conversation: clients on one side, a pool of backends on the other. For every connection or request, it picks a backend and forwards traffic there. It has three jobs, and they're really the same job at different timescales.

The first is to distribute load. If you have ten backends and a thousand requests per second, you want roughly a hundred each. Naive round-robin gets you most of the way there for uniform workloads; least-connections or power-of-two-choices does better when requests differ in cost. The second is to isolate failures: when one backend turns sick — out of memory, stuck GC, half-open TCP — the LB has to notice and route around it before users do. The third is to absorb topology change: backends come up, drain, get killed, scale out, scale in. The LB has to track the live set and reshuffle traffic without restarting the world.

Everything else — TLS termination, HTTP routing, A/B testing, rate limiting, WAF — sits on top of those three jobs. If the basics are wrong, no amount of L7 polish will save you.

Why this matters. Most outages blamed on "the load balancer" are really one of these three primitives breaking. Distribution gone wrong (one backend took 90% of the traffic), failure isolation gone wrong (LB kept sending to a dead backend), or topology change gone wrong (rolling deploy dropped 2% of in-flight connections). Knowing which of the three failed tells you which knob to turn.

L4 vs L7 — the fundamental split

Load balancers come in two flavours, and the difference is whether they read the payload or just shuffle packets.

An L4 load balancer works at the transport layer. It sees a TCP or UDP packet, looks at the five-tuple — (protocol, src IP, src port, dst IP, dst port) — picks a backend, rewrites the destination (or encapsulates the packet), and forwards it. It doesn't terminate TLS. It doesn't parse HTTP. It doesn't know whether the payload is a GET request or a video frame. Because it touches so little, it's fast: an XDP-based L4 LB like Katran can push roughly 10 million packets per second per core, with per-packet latency in the low microseconds.

An L7 load balancer works at the application layer. It terminates the TCP connection (and the TLS session above it), parses the HTTP request, and can route on path, host, headers, cookies, or anything it can read off the wire. It can do A/B tests, canary deploys, header rewrites, retry logic, request mirroring, circuit breaking. All of that costs CPU: a tuned Envoy or NGINX manages around 100k requests per second per core with full TLS termination, two orders of magnitude less than the L4 layer.

client ─── BGP <a class="il" href="/simulators/anycast">anycast</a> ───▶  [ L4 LB pool ]      // Maglev / Katran
                                  │
                            ECMP within DC
                                  │
                                  ▼
                            [ L7 LB pool ]      // Envoy / NGINX
                                  │
                                  ▼
                            [ application ]

The split exists because the two jobs want different hardware and different blast radii. L4 absorbs the volume — DDoS, SYN floods, the 10 Gbps of garbage some botnet points at you — and only forwards. L7 sees clean, terminated, decrypted requests from a much smaller surface. Run L7 with no L4 in front and your TLS handshakes become the front line, which is a bad place for them to be.

Algorithms — round-robin, least-connections, hash-based

Once the LB has the request, it has to pick a backend. Four families of algorithms are in widespread use; they differ in how much state they keep and how gracefully they degrade.

Algorithm	State	Good for	Falls over when
Round-robin	Counter	Uniform backends, uniform requests	One backend gets slow — queue depth balloons
Least-connections	Per-backend live count	Long-lived connections, mixed request cost	Long-tail requests skew the count
Power-of-two-choices	Two random samples	Close to least-conn at a fraction of the cost	Pathologically uneven backends
Consistent hash / Maglev	Hash table	Session affinity, cache locality	Hot keys (one client dominates a backend)

Round-robin is the default for a reason — it's trivial, keeps no per-request state, and is fine when backends are interchangeable and requests are cheap. The moment one backend gets slow (GC pause, paged-out heap, disk hiccup), round-robin keeps feeding it the same share and the queue depth there explodes. Least-connections (or least-outstanding-requests) fixes that by routing to whichever backend currently has the fewest in-flight requests, at the cost of a little bookkeeping per connection.

Power-of-two-choices is a neat compromise: for each request, pick two backends at random and send to the less loaded of the two. Mitzenmacher's 1996 thesis showed this gets you almost all the benefit of full least-connections at a fraction of the cost — the max load grows as log log N instead of log N / log log N. NGINX's random two least_conn and Envoy's LEAST_REQUEST with a configurable choice count both implement it.

Hash-based routing is what you reach for when a request or connection needs to land on the same backend every time — for session affinity, for cache locality, or because the backend is sharded by key. That's where consistent hashing earns its keep.

Consistent hashing — when backends come and go

Naive hashing — backend = hash(key) % N — has a brutal failure mode: the moment you change N, almost every key remaps to a different backend. Add a backend and roughly (N-1)/N of your cache evaporates. For a 10-node cluster, removing one node moves 90% of the keyspace. For anything cache-shaped, that's catastrophic.

Consistent hashing, from Karger et al.'s 1997 paper at MIT, fixes this. Picture the hash space as a ring from 0 to 2^32 - 1. Hash each backend onto the ring (multiple positions per backend, so the arcs are evenly sized — these are usually called virtual nodes or vnodes, typically 100–200 per real backend). To route a key, hash it onto the ring and walk clockwise until you hit the first backend. That's the owner.

Remove a backend and only the arc it owned remaps — roughly K/N of the keyspace, where K is the total keys and N is the backend count. Adding a backend is symmetric. Akamai built their whole CDN on this; cache shards (Memcached, Redis Cluster), database partitioners (DynamoDB, Cassandra), and L4 load balancers all use variants of the same idea.

The drawback of the basic ring is load imbalance — with only a few backend positions you get arcs of very uneven size, and one backend ends up with twice the traffic of its neighbour. Virtual nodes (100–200 hash positions per real backend) smooth this out by sampling more of the ring. Maglev (next section) is a different way to get the same uniformity guarantee with O(1) lookup.

Maglev — Google's hash-based L4 LB

Maglev is the L4 load balancer Google described in their NSDI 2016 paper. It replaced the old hardware appliances and now fronts most Google services. Cloudflare's Unimog (2020) and Meta's Katran (open-source XDP, 2018) are direct descendants.

Maglev's core idea: instead of walking a ring, build a fixed-size lookup table with M entries (Google uses M = 65537, a small prime). Each backend gets a deterministic permutation of the table indices, and they take turns filling empty slots in their permutation order until the table is full. The result: each backend owns roughly M/N entries, evenly spread, and lookup is a single hash plus a table read — O(1), branch-free, friendly to XDP and DPDK.

// pseudocode, after table build
backend = table[ hash5tuple(packet) % 65537 ]

The hash is over the five-tuple, so every packet of the same TCP flow lands on the same backend with no per-connection state on the LB. That's the magic: Maglev is stateless. Every Maglev instance in the cluster builds the same table from the same backend list, so a packet can land on any LB and still reach the right backend. No conntrack, no sync, no flow-pinning sidechannel.

When the backend set changes, the table is rebuilt. Maglev's permutation algorithm keeps disruption low: empirically, removing one backend out of N moves roughly 1/N of the table entries, the theoretical minimum. So most existing flows keep landing on the same backend; only the ones that hashed into the dead backend's slots get redistributed.

Katran is the open-source version to read. Meta released Katran under Apache 2.0 in 2018. It's an XDP program (the LB lives in the Linux kernel's NIC driver path), uses Maglev-style hashing, and pushes around 10M pps per core on commodity hardware. The GitHub repo and the original blog post are the clearest concrete walkthrough of how all of this fits together.

L4 vs L7 — what packets actually see

The diagram is the difference in one frame. L4 sees the IP and TCP/UDP headers and nothing else; the payload is opaque, which is exactly what makes it fast and protocol-agnostic. L7 terminates TLS, decodes HTTP, and can do almost anything with the result — but it has paid for every cryptographic and parsing step to get there.

Anycast — one IP, many locations

Anycast is the trick that lets 8.8.8.8 be one IP address served from hundreds of physical locations around the world. The same prefix (say, 8.8.8.0/24) is advertised over BGP from many points of presence; each network on the internet picks the route with the shortest AS path, so the client's packets land at the nearest PoP without the client doing anything. DNS root servers, Cloudflare, Fastly, CloudFront, and most modern CDNs all use anycast as the front door.

For stateless protocols like DNS, anycast Just Works. Each query is a single UDP packet; whichever PoP receives it answers, and that's the end of the conversation. The next query from the same client might go to a different PoP if BGP reconverges, and nobody cares.

For stateful protocols like TCP and QUIC, things are trickier. A TCP connection's packets have to keep landing on the same backend for its whole lifetime, but a BGP route change midway through can flip a flow to a different PoP — and even within the right PoP, the upstream router uses ECMP to spread traffic across many L4 LBs. Two saving graces: ECMP within the DC is usually flow-hashed (so packets of the same five-tuple go to the same Maglev instance), and the Maglev table is identical on every instance (so even if a flow hits a different LB, the same backend is picked). Inter-PoP route changes are still a hazard — Cloudflare's blog has good post-mortems on the rare cases this matters.

QUIC and connection IDs. QUIC was designed with anycast and multi-path in mind. Each connection carries a server-chosen connection ID so the edge can steer packets to the right server even when the client's IP changes (Wi-Fi to cellular) or when ECMP would otherwise scatter them. That makes QUIC noticeably more tolerant of anycast wobble than TCP. See the QUIC deep dive for the details.

Health checks — telling backends apart

A load balancer that doesn't know which backends are healthy is just a reliable way to send traffic to dead machines. Health checks come in two flavours, and you almost always want both.

Active checks are polls: the LB hits /healthz on each backend every 5–10 seconds and marks it down if it sees a non-2xx response or a timeout. The upside is predictability (you control exactly what gets checked) and a clean signal. The downside is lag: a backend that goes sick at t=0 might not be caught until t=10s, so ten seconds of traffic gets routed to a dead box.

Passive checks watch live traffic: if a backend returns 5xx, RSTs, or hits a connect timeout above some rate, the LB ejects it. The upside is speed — failures show up in real time. The downside is over-reaction: a transient burst (a slow database query, a brief GC) can trigger an eviction the active probe would have ignored. The fix is to require N consecutive failures before ejection and to cap the fraction of the pool that can be ejected at once (Envoy's panic_threshold is the canonical name for this).

Then there's the awkward middle case — the half-open backend. The TCP listener still accepts connections (so the LB's connect probe succeeds) but the application never sends bytes (deadlocked, stuck on a lock, waiting forever for a downstream that will never reply). A pure connect-style health check sails right past it. Production stacks guard against this by making active checks application layer (HTTP GET expecting a 200, or a gRPC Check RPC) rather than connect-only, and by setting application-layer read timeouts so a stuck request returns an error instead of hanging.

Check type	Typical interval	Catches	Misses
TCP connect	2–5 s	Crashed process, OOM	Stuck applications, slow backends
HTTP /healthz	5–10 s	App-level failure, dependency loss	Bugs that don't surface on /healthz
Deep / synthetic	30–60 s	Functional regressions	Per-request issues
Passive (live)	continuous	Fast-onset failures	Slow degradation under low traffic

The connection-tracking problem

An L4 LB has to send every packet of a TCP flow to the same backend. Otherwise the backend that receives a mid-stream ACK has no idea which connection it belongs to and just sends a RST. The naive way to enforce this is connection tracking (conntrack): the LB keeps a hash table from five-tuple to chosen backend, filled on the SYN and consulted on every later packet.

Conntrack works fine until traffic stops fitting in memory. Linux's nf_conntrack hash table defaults to a few hundred thousand entries; under a DDoS — a million spoofed SYNs per second — it overflows in seconds and starts evicting real connections. Worse, conntrack state has to be synced between LB instances if you want failover, and that sync is its own bandwidth and consistency problem.

Maglev's stateless hashing sidesteps the whole thing. Because every LB instance derives the same backend for the same five-tuple, there's no per-flow state to keep. The lookup table itself is small (a few hundred KB for a 65537-entry table with 16-bit backend IDs) and is the same on every LB. The trade-off shows up at backend transitions: when the backend set changes and the table is rebuilt, a small fraction of in-flight flows will hash to a different backend and get RST. For most services that's fine; for very long-lived flows (WebSocket, video streaming) you layer graceful drain on top.

The non-obvious rule. If your LB tier is meant to survive DDoS, conntrack is a liability, not an asset. Stateless hashing — Maglev, Katran's XDP path, the Cilium L4 LB — keeps working under volumetric attack precisely because there's nothing to overflow. Stateful LBs (HAProxy, Envoy in TCP mode without careful tuning) need a stateless L4 tier in front of them.

Graceful drain — taking a backend out without dropping users

Rolling deploys, autoscaling events, kubectl drain, manual debugging — there are many reasons to pull a backend from a pool while it's still serving traffic. The wrong way is to kill the process: in-flight requests get RST, open WebSockets are torn down, and the client sees an error. The right way is graceful drain.

The pattern is the same everywhere:

Mark the backend as draining in the LB. New connections stop being routed to it.
Let existing connections finish naturally, up to a deadline (usually 30s–5min, much longer for streaming services).
Force-close any remaining connections.
Remove the backend from the pool and terminate.

In Kubernetes, this is what preStop hooks and terminationGracePeriodSeconds are for. The hook should flip the pod's readiness probe to fail (so the Service endpoint controller pulls it out of the pool) and then sleep long enough for the LB to notice — typically 10–15 seconds for kube-proxy or the cloud LB to converge. Without that pause the LB keeps routing traffic to the pod while it's shutting down, and you get the exact failure the whole dance was meant to prevent.

For long-lived stateful protocols — WebSocket, gRPC streams, video — graceful drain is the difference between a rolling deploy being invisible and being a paging event. A streaming media service with no drain drops every user mid-stream every time it deploys. With drain, the user finishes the current segment, the next segment request lands on a different backend, and they never notice.

Sticky sessions — when stateless isn't

Most of this deep dive has assumed any backend can serve any request. Sometimes that's false: an in-memory cart, a long-poll connection, an in-flight multipart upload, an RTC SFU, a WebSocket carrying a session — all of these need a client's later requests to keep landing on the same backend. That's what sticky sessions (or session affinity) buys you. There are three families.

Method	Where the affinity lives	Breaks when
Cookie-based	LB-set cookie names the backend	Client drops cookies (private browsing, mobile webviews)
Header-based	Consistent hash on a session header (JWT subject, tenant ID)	Header missing or rotates
Client-IP hash	Hash on src IP	CGNAT collapses many users to one IP; mobile users roam IPs

Client-IP hashing is the trap most teams fall into. It looks cheapest — no cookies, no headers, just hash the IP. But behind carrier-grade NAT, tens of thousands of mobile users share a single egress IP, all hashing to the same backend. One mobile carrier can melt one of your backends while the rest sit idle, and nobody can see why. The NAT deep dive covers CGNAT in detail; for load balancing, just assume mobile traffic shares IPs.

The modern answer is to externalise session state — keep it in Redis, Memcached, or a row in a small KV store — so any backend can pick up where another left off. Then you can route round-robin and not care. Treat sticky sessions as a stopgap for protocols that can't be made stateless (WebSocket, WebRTC), not as an architectural choice.

DDoS at the LB

The load balancer is the front door, which makes it the punching bag too. There are three layers of defence, each aimed at a different kind of attack.

Anycast spreads volumetric attacks across the whole edge. A botnet with 1 Tbps of capacity, pointed at an anycast IP served from 200 PoPs, lands roughly 5 Gbps on each — well within absorption capacity, with no single facility taking the brunt. This is why CDNs shrug off attacks the origin would have buckled under.

SYN cookies defend against SYN floods, where an attacker sends a stream of SYNs from spoofed source IPs and never completes the handshake. A naive server allocates a SYN_RCVD entry for each, the table overflows, and real clients can no longer get through. SYN cookies make the handshake stateless on the server side: the SYN-ACK's sequence number is a cryptographic function of the connection tuple plus a rotating secret. The server allocates nothing. If a real client's ACK comes back, the server verifies the cookie and rebuilds the connection state from scratch. The kernel turns this on automatically under SYN-table pressure (net.ipv4.tcp_syncookies = 2 to force it always-on).

Per-IP rate limiting at L4 catches the noisy single sources: this IP is sending 10,000 SYN/s, drop everything from it. XDP makes this cheap — a few thousand CPU cycles per packet — and Cloudflare's L4Drop is a well-documented example.

Layer 7 attacks — HTTP floods, slow-loris, complex request floods built to exhaust application CPU — are a different beast. Volumetric defences don't help, because the traffic looks legitimate at the packet level. You need a Web Application Firewall, behavioural analysis, or interactive challenges (JS challenge, CAPTCHA, proof-of-work). This is where Cloudflare, Akamai, and Fastly earn their money; doing it yourself is a multi-year project.

Common mistakes

Running L7 with no L4 in front. Your TLS handshakes become the front line; a SYN flood or even a moderate slow-loris exhausts the proxy's socket budget. Put an L4 stateless tier in front, even if it is just one HAProxy box in TCP mode with tcp_syncookies enabled.
Forgetting graceful drain on rolling deploys. Every restart resets in-flight connections. For HTTP/1 it manifests as a tiny error rate during deploys; for WebSocket or gRPC streams it manifests as every user getting disconnected at deploy time.
Client-IP hashing behind CGNAT. One mobile carrier hashes to one backend and melts it; the rest of the fleet sits idle. Use cookies or a session header for affinity, never client IP, in any service that takes mobile traffic.
Connect-only health checks. The backend is half-open — TCP accepts, the app is stuck. The LB keeps sending traffic. Use HTTP /healthz or a gRPC Check RPC and require a real 200 / OK response, not just a TCP handshake.
Not setting SO_REUSEPORT for sharded UDP listeners behind XDP. Without it, the kernel hands every packet to a single listening socket and your fancy XDP-distributed pps land on one core. With it, each worker binds its own socket and the kernel hashes flows across them.
Conntrack as the default L4 LB strategy. It works at small scale and breaks the first time you take real attack traffic. Stateless hashing (Maglev / Katran) costs more upfront and is the right default at any non-trivial scale.

Load balancing

What a load balancer actually does

L4 vs L7 — the fundamental split

Algorithms — round-robin, least-connections, hash-based

Consistent hashing — when backends come and go

Maglev — Google's hash-based L4 LB

L4 vs L7 — what packets actually see

Anycast — one IP, many locations

Health checks — telling backends apart

The connection-tracking problem

Graceful drain — taking a backend out without dropping users

Sticky sessions — when stateless isn't

DDoS at the LB

Common mistakes

Further reading