How load balancing spreads traffic across many servers.
Spreading traffic across a pool sounds like the easy part of distributed systems. It is not. The algorithm you pick decides who falls over first, who serves slow users, and whether "add another box" actually helps.
What is load balancing?
One IP, many backends.
Load balancing distributes incoming requests across a pool of backend servers. Layer 4 (TCP/UDP) load balancers operate on connections; Layer 7 (HTTP) load balancers operate on requests. The canonical algorithms, namely round-robin, least-connections, power-of-two-choices (P2C), weighted, and consistent hashing, make different trade-offs around fairness, tail latency, and behaviour under heterogeneous backends.
A load balancer is the address the world talks to. Behind it, a fluctuating pool of machines actually does the work. The LB's job sounds trivial (pick a backend for each request), but it carries three responsibilities that between them hold most distributed systems together.
Distribution
Spread N requests across M backends so no single one falls behind. The algorithm in Part 03 is what makes this work — or fail invisibly.
Health
Probe backends; remove the dead from rotation; add them back when they recover. The whole point of multiple boxes is to survive one of them dying — but only if the LB notices fast enough.
Insulation
Hide the pool from the client. The world sees one stable IP. You can deploy, scale, and replace backends behind the LB without anyone reconnecting. This is what makes rolling deploys possible.
L4 vs L7 load balancing
Routing by connection versus routing by request.
The first decision is which layer of the stack the LB looks at. Both have a place; both have things they cannot do.
Move bytes, fast.
The LB sees TCP connections (or UDP datagrams). It picks a backend at SYN time and forwards every byte after. It cannot route based on URL or header because it never parses them. Examples: AWS NLB, HAProxy in tcp mode, kube-proxy in iptables mode. Lowest latency, lowest CPU, and it can use direct server return (DSR) for even higher throughput.
Understand the request.
The LB terminates the connection itself, parses HTTP, then forwards. It can route /api/* → backend pool A and /static/* → backend pool B; rewrite headers; stream gRPC; speak HTTP/2 to clients and HTTP/1.1 upstream. Examples: Envoy, NGINX, AWS ALB. The cost is that every request is parsed twice: once at the LB and again at the backend. If you run the L7 box yourself, see Traefik vs Nginx for the trade-offs.
A common pattern combines the two: an L4 LB at the edge for raw throughput and DDoS handling, then an L7 LB inside the cluster for routing intelligence. Two boxes, two scopes, one stable IP for the user.
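The path-based routing that an L7 LB can do (/api/* to one pool, /static/* to another) might look roughly like the sketch below. The pool addresses, the default pool, and the `route` function are hypothetical names chosen for illustration; a real proxy such as Envoy or NGINX expresses this in configuration rather than code.

```python
from itertools import cycle

# Hypothetical pools; the addresses are placeholders, not taken from the article.
POOLS = {
    "/api/": cycle(["10.0.1.1:8080", "10.0.1.2:8080"]),   # "backend pool A"
    "/static/": cycle(["10.0.2.1:8080"]),                 # "backend pool B"
}
DEFAULT_POOL = cycle(["10.0.3.1:8080"])                   # assumed catch-all pool


def route(path: str) -> str:
    """Pick a pool by longest matching path prefix, then round-robin inside it."""
    matches = [prefix for prefix in POOLS if path.startswith(prefix)]
    if not matches:
        return next(DEFAULT_POOL)
    return next(POOLS[max(matches, key=len)])
```

An L4 balancer cannot make this decision, because the path only exists once the HTTP request has been parsed.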
The four load balancing algorithms, side by side
Four algorithms, five backends, one slow one.
Below: a five-backend pool. Backend C is slow (60 ms / request). Switch the algorithm and watch in-flight load balance — or fail to. Click a backend's health switch to simulate a failure mid-stream.
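The same scenario can be approximated offline with a small simulation. This is a minimal sketch under stated assumptions, not the simulator itself: only backend C's 60 ms comes from the text, while the 10 ms service time for the other backends, the fixed 2 ms arrival interval, and the absence of queueing are all assumed for illustration.

```python
import random
from itertools import cycle

# Only C's 60 ms comes from the text; the 10 ms figures are assumed.
SERVICE_MS = {"A": 10, "B": 10, "C": 60, "D": 10, "E": 10}


def round_robin():
    order = cycle(SERVICE_MS)                 # cycles over backend names
    return lambda in_flight: next(order)      # ignores load entirely


def least_connections():
    # Fewest in-flight requests wins; ties go to the first backend in dict order.
    return lambda in_flight: min(in_flight, key=in_flight.get)


def power_of_two_choices(seed=0):
    rng = random.Random(seed)

    def pick(in_flight):
        a, b = rng.sample(list(in_flight), 2)  # two distinct random candidates
        return a if in_flight[a] <= in_flight[b] else b

    return pick


def simulate(pick, n_requests=1000, interarrival_ms=2):
    """Feed requests at a fixed rate; return the peak in-flight count per backend."""
    finish_times = {name: [] for name in SERVICE_MS}
    peak = {name: 0 for name in SERVICE_MS}
    for i in range(n_requests):
        now = i * interarrival_ms
        for name in finish_times:              # drop requests that have completed
            finish_times[name] = [t for t in finish_times[name] if t > now]
        in_flight = {name: len(ts) for name, ts in finish_times.items()}
        chosen = pick(in_flight)
        finish_times[chosen].append(now + SERVICE_MS[chosen])
        peak[chosen] = max(peak[chosen], len(finish_times[chosen]))
    return peak


if __name__ == "__main__":
    for name, picker in [("round-robin", round_robin()),
                         ("least-connections", least_connections()),
                         ("P2C", power_of_two_choices())]:
        print(name, simulate(picker))
```

Under these assumptions round-robin lets six requests pile up on C while each fast backend holds one, because it keeps sending C its 1/5 share regardless of load. The two load-aware pickers keep C's peak lower, since they stop choosing it once it has more in flight than the alternatives.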
Health checks are harder than they look
Knowing a backend is up is not the same as knowing it works.
The LB needs to know which backends are accepting traffic. A naïve TCP connect every few seconds catches "the process is gone" but misses "the process is up but its database driver is in a deadlock and every request hangs at 30 s." Production health checks are layered.
- L4
TCP connect
Cheap, frequent (every few seconds), catches process death and port issues. Tells you nothing about whether the application is actually serving requests correctly.
- L7
HTTP /healthz
The application returns 200 OK only when its own self-checks pass — DB reachable, caches warm, internal queues drained. Best practice: two endpoints — /livez for "the process is alive at all" and /readyz for "ready to accept new work."
- passive
Outlier ejection
Watch real traffic. If a backend produces five 5xx in a row, eject it immediately — don't wait for the next health probe. Envoy's outlier detection, AWS Target Group unhealthy-host detection. Catches the partial failures L7 probes miss.
- slow start
Ramp, don't dump
When a fresh backend joins the pool, do not send it 1/N of all traffic at once — its caches are cold, its connections are unwarmed. Start it at 5% and ramp linearly over thirty seconds. NGINX slow_start; Envoy slow-start mode.
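Two of the layers above are easy to sketch in code: passive outlier ejection after five consecutive 5xx responses, and a linear slow-start ramp from 5% over thirty seconds. The class and function names here are hypothetical, and real proxies (Envoy, NGINX) add details this sketch omits, such as ejection timeouts and re-admission.

```python
class OutlierDetector:
    """Passive check: eject a backend after `threshold` consecutive 5xx responses."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_5xx = {}   # backend name -> current 5xx streak
        self.ejected = set()

    def record(self, backend: str, status: int) -> None:
        if 500 <= status <= 599:
            streak = self.consecutive_5xx.get(backend, 0) + 1
            self.consecutive_5xx[backend] = streak
            if streak >= self.threshold:
                self.ejected.add(backend)   # no need to wait for the next probe
        else:
            self.consecutive_5xx[backend] = 0   # any non-5xx resets the streak


def slow_start_weight(seconds_since_join: float,
                      floor: float = 0.05,
                      ramp_seconds: float = 30.0) -> float:
    """Fraction of a full traffic share for a new backend: linear from floor to 1.0."""
    if seconds_since_join >= ramp_seconds:
        return 1.0
    progress = max(seconds_since_join, 0.0) / ramp_seconds
    return floor + (1.0 - floor) * progress
```

Multiplying a backend's normal weight by `slow_start_weight(...)` gives it 5% of its share on arrival, about half at fifteen seconds, and the full share from thirty seconds on.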
Session affinity and stickiness
Sending a user back to the same backend every time.
Stateless backends are easy to load-balance: any backend can serve any request. WebSockets, server-sent events, in-memory sessions, and shard-locked queries are not stateless — they need the next request from the same user to land on the same backend or the connection breaks.
Affinity is implemented either by hashing a stable identifier (source IP, a cookie the LB sets, a header from the application) or by consistent-hashing a request key. The cost is sharp: when a backend dies, its sessions are lost. Affinity must be paired with graceful drain, sticky-cookie expiry, and a connection-aware deploy strategy.
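Consistent hashing limits the damage when a backend leaves: only the keys that lived on that backend move. The ring below is a minimal sketch; the class name, the use of SHA-256, and the 100 virtual nodes per backend are assumptions chosen for illustration rather than anything the text prescribes.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, backends, vnodes=100):
        # Each backend owns `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as an integer: stable across processes.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Comparing `HashRing(["A", "B", "C"])` with `HashRing(["A", "B"])` shows the property: every key that mapped to A or B keeps its backend, and only C's keys are reassigned.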
Cheap and broken.
The LB hashes the client IP. Works until users sit behind a corporate NAT — then thousands of users share one IP and pile onto one backend. Or until your mobile user moves from Wi-Fi to LTE and changes IP mid-session.
Application-aware.
The LB sets a cookie on the first response (SRV=A) and routes future requests with that cookie back to backend A. Survives NAT, survives IP change, expires when you say it does. The default for any L7 LB you'd deploy in 2026.
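The difference between the two approaches shows up as soon as many users share one address. The sketch below is illustrative only: the backend names, the `SRV` cookie handling, and the example NAT address (from a documentation range) are assumptions, and a real LB would also sign or encode the cookie.

```python
import hashlib
from collections import Counter
from itertools import cycle

BACKENDS = ["A", "B", "C", "D"]   # hypothetical pool


def by_source_ip(client_ip: str) -> str:
    """Source-IP affinity: the same IP always hashes to the same backend."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


class CookieAffinity:
    """Cookie affinity: assign a backend on first visit, honour the cookie afterwards."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = cycle(self.backends)

    def route(self, cookies: dict):
        backend = cookies.get("SRV")
        if backend not in self.backends:   # first visit, or backend left the pool
            backend = next(self._next)
        return backend, {"SRV": backend}   # cookie to set on the response


if __name__ == "__main__":
    nat_ip = "203.0.113.7"                 # 1000 users behind one corporate NAT
    print(Counter(by_source_ip(nat_ip) for _ in range(1000)))
    lb = CookieAffinity(BACKENDS)
    print(Counter(lb.route({})[0] for _ in range(1000)))
```

With source-IP hashing all thousand users land on a single backend; with the cookie, each new user is assigned independently and then sticks to that backend regardless of IP.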
Pool churn and graceful drain
Adding and removing backends without dropping live requests.
Backends come and go: scale-out adds new ones, deploys rotate every one in turn, crashes remove them without warning. The LB has to handle each case without dropping live requests.
A normal shutdown follows the drain sequence: the orchestrator marks the backend not ready; the LB stops sending new requests; in-flight requests are allowed to complete (typically 30–60 s); then the process exits. A SIGTERM with a long enough terminationGracePeriodSeconds (the Kubernetes pod setting) is what makes rolling deploys non-disruptive.
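The drain sequence can be modelled as a small state machine. This is a single-threaded sketch with hypothetical names; a real server would tie `begin_drain` to its SIGTERM handler and readiness endpoint, and would need locking around the counter.

```python
class Backend:
    """Tracks readiness and in-flight requests for one backend during a drain."""

    def __init__(self, name: str):
        self.name = name
        self.ready = True       # what the readiness check reports to the LB
        self.in_flight = 0
        self.exited = False

    def start_request(self) -> bool:
        if not self.ready:      # step 2: the LB sends no new requests here
            return False
        self.in_flight += 1
        return True

    def finish_request(self) -> None:
        self.in_flight -= 1

    def begin_drain(self) -> None:
        self.ready = False      # step 1: marked not ready (for example on SIGTERM)

    def try_exit(self, elapsed_s: float, grace_s: float = 30.0) -> bool:
        """Steps 3 and 4: exit once in-flight work is done or the grace period ends."""
        if self.in_flight == 0 or elapsed_s >= grace_s:
            self.exited = True
        return self.exited
```

The grace period is the safety valve: if a request never finishes, the process still exits at `grace_s`, which is why the period has to be longer than the slowest request you intend to let complete.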
Crashes don't follow this pattern. The first signal the LB has is failed requests. With outlier ejection (Part 04) and a fast retry budget at the client, the broken backend is ejected within a second or two and traffic shifts to the rest of the pool. See the autoscaling guide for the metric loop that decides when to add and remove backends in the first place.
Global server load balancing (GSLB): balancing across regions
Routing each user to the nearest healthy datacenter.
A single L7 LB scales to hundreds of thousands of requests per second, but it lives in one region. Global Server Load Balancing (GSLB) is what routes a user in Tokyo to a datacenter in Tokyo and a user in London to one in Frankfurt — without their browser ever knowing.
There are two common implementations. In DNS-based GSLB, the authoritative DNS server returns the IP of the nearest healthy region, with a sub-minute TTL so failover is fast. In anycast, multiple regions announce the same IP, and BGP routes packets to the topologically nearest one. Cloudflare and Google rely heavily on anycast; AWS Route 53 latency-based routing is DNS-based GSLB.
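The DNS-based variant reduces to a small decision: among healthy regions, answer with the one closest to the client, and keep the TTL short. The sketch below is illustrative only; the region names, the latency table, and the documentation-range IPs are all hypothetical, and real services measure latency rather than reading it from a static table.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    ip: str               # documentation-range addresses, placeholders only
    healthy: bool = True


# Hypothetical latency table in milliseconds: (client location, region name).
LATENCY_MS = {
    ("tokyo", "ap-tokyo"): 8, ("tokyo", "eu-frankfurt"): 240,
    ("london", "ap-tokyo"): 230, ("london", "eu-frankfurt"): 15,
}


def resolve(client_location: str, regions, ttl_s: int = 30):
    """Return (ip, ttl) for the lowest-latency healthy region."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise LookupError("no healthy region to return")
    best = min(healthy, key=lambda r: LATENCY_MS[(client_location, r.name)])
    return best.ip, ttl_s
```

The short TTL matters for failover: once a region is marked unhealthy, clients keep using the cached answer until it expires, so the TTL bounds how long they stay pointed at a dead region.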
For deeper context, see the DNS guide on how authoritative servers work and the BGP guide on what anycast actually means.
Where load balancers fail in surprising ways
The failures that surprise you.
The LB is the single point through which every request flows. When it misbehaves, every request misbehaves. Three classes of failure are worth knowing by name.
Thundering herd after a restart
An LB restarts, and for thirty seconds all traffic flows through one half of the pool, which has cold caches and pools that are still warming. Latency spikes; clients retry; each retry hits the same overloaded half. Slow-start (Part 04) and client jitter break the cycle.
Uneven hashing behind a NAT
Source-IP hashing behind a single corporate NAT gateway can produce a pool where 95% of users hash to one backend. The LB is doing exactly what it was told; the inputs were the problem. Use cookie or header hashing instead.
Retries multiply at every layer
Client retries × LB retries × inner-service retries: during an outage the attempts multiply at every layer, so a modest failure rate turns into a flood of traffic. Set a retry budget at the LB ("≤ 10% of requests may retry"); return 503 with Retry-After when overload is detected. See the retry-strategy simulator.
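The multiplication is worth working through once. If each of three layers makes one attempt plus two retries, a single user request can become 3 × 3 × 3 = 27 attempts against the failing service. The sketch below shows that arithmetic and one possible shape for a retry budget; the class and its integer-percent bookkeeping are assumptions for illustration, not a specific proxy's implementation.

```python
def worst_case_attempts(retries_per_layer) -> int:
    """Attempts reaching the innermost service if every layer exhausts its retries."""
    attempts = 1
    for retries in retries_per_layer:
        attempts *= 1 + retries   # one original attempt plus its retries
    return attempts


class RetryBudget:
    """Allow retries only while they stay within a percentage of requests seen."""

    def __init__(self, max_retry_percent: int = 10):
        self.max_retry_percent = max_retry_percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_retry(self) -> bool:
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 > self.max_retry_percent * self.requests:
            return False          # budget spent: fail fast instead of retrying
        self.retries += 1
        return True
```

With a 10% budget and 100 requests recorded, ten retries are allowed and the eleventh is refused, which caps the extra load at the LB no matter how many requests are failing.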
Cloud load balancers compared: AWS ALB/NLB, GCP, Azure
What the managed services actually offer.
- AWS ALB (Application Load Balancer)
- L7 HTTP/HTTPS. WebSocket and HTTP/2 supported, gRPC supported (since 2020). ~$0.0225/hour + LCU charges. Routes by host header, path, query string, or HTTP header. Native integration with WAF, Cognito, OIDC.
- AWS NLB (Network Load Balancer)
- L4 TCP/UDP/TLS. Static IP per AZ, supports millions of requests per second, ultra-low latency. Source IP preserved by default for instance targets. Best fit: gaming servers, IoT, real-time bidding.
- GCP Global Load Balancer
- Anycast IPv4 in front of multi-region backend. Single global IP routed to the nearest healthy region. Probably the simplest cross-region LB story among the big three.
- Azure Application Gateway / Front Door
- Application Gateway is regional L7 with WAF; Front Door is global anycast (similar to GCP GLB or Cloudflare). Often deployed together: Front Door at the edge, App Gateway per region.
- Cloudflare Load Balancing
- DNS-based + anycast L4/L7. Health-check-aware traffic shifting between regions or providers. Often used for multi-cloud or hybrid backends.
The session-affinity gotcha. All cloud LBs offer sticky sessions via cookie or source-IP hashing, but the trade-offs are subtle. Cookie-based stickiness (the typical default) breaks if the client clears cookies; source-IP stickiness breaks behind proxies and NAT. Many deployments end up with subtly broken stickiness because the underlying assumption (each user has one consistent IP) doesn't hold on mobile.
Load balancing is one of those topics that sounds dull until something melts. The algorithm choices themselves are well understood (round-robin, least-conn, P2C, hash), but the production failure modes are subtle, and the consequences cascade. Pick the algorithm to match your backends' shape; pair it with health checks deep enough to catch real failure; pair it with drains long enough to let in-flight requests finish. Most of the rest is the LB doing the boring, important thing it always does.
Read further.
- Mitzenmacher · 2001 · The Power of Two Choices in Randomized Load Balancing. The result that justifies P2C: exponentially better load distribution than uniform random, with one extra coin flip per request.
- Envoy · Load Balancing, Architecture Overview. The most thorough operational treatment of load balancing in any modern proxy's docs. Covers locality awareness, slow-start, panic mode.
- AWS Builders' Library · Timeouts, retries, and backoff with jitter. Why retry amplification (Part 08·03) happens and how to bound it. Pairs with the retry-strategy simulator.
- Semicolony guide · DNS, how names resolve. DNS-based GSLB lives at the authoritative tier. Read this if Part 07's "DNS-based GSLB" needs more grounding.
- Semicolony guide · BGP, the glue. Anycast is the routing technique that lets one IP live in many places at once. Foundation for global-scale LBs.
- Semicolony simulator · Load Balancer, richer simulator. A more elaborate sandbox for the algorithms above, with adjustable backend latencies, burst arrivals, and failure injection.