Load balancing — System Design Handbook

A load balancer is the floor of every distributed system. Above it, servers come and go. Below it, traffic keeps flowing as if nothing happened.

Horizontal scaling distributes traffic across many backends, but something has to do the distributing. A user types example.com; the request has to land on one of fifty servers, and the choice of which has consequences. The load balancer is the most critical, most invisible piece of any distributed system. This page covers the algorithms, the layers (L4 vs L7), the health checks that make it self-healing, and the failure modes that take systems down.

The reverse-proxy load balancer is the seam where many clients meet many servers. Clients address it as a single endpoint; it owns the decision of which backend serves each request.

What a load balancer actually does

Three jobs, in order of importance. Distribute incoming requests across a pool of backends so no single one is overwhelmed. Detect failure by health-checking those backends and removing dead ones from rotation before users notice. Terminate shared concerns — TLS, HTTP/2, compression — once at the edge instead of N times across the fleet.

Every cloud provider sells one (ALB on AWS, GCLB on Google, Application Gateway on Azure). Every infrastructure team runs at least one (HAProxy, NGINX, Envoy, Traefik). And modern service meshes (Istio, Linkerd) embed a small one as a sidecar next to every pod. The mechanics are the same; only the deployment topology changes.

Distribute: Pick a backend per request: round-robin, least-connections, hash-based, or weighted. The algorithm choice has outsized impact on tail latency and is one of the few load-balancer settings worth defending in a review.
Detect: Probe each backend on a schedule. If a probe fails N times in a row, mark the backend unhealthy and stop sending it traffic. When probes recover, re-add it gradually.
Terminate: Decode TLS once, parse HTTP once, log once, rate-limit once. The backend speaks plain HTTP/1.1 over a private network and does not even know what protocol the client used.

L4 vs L7 — the only architectural choice that matters

Load balancers operate at one of two layers of the OSI stack. The choice changes what the LB knows about the traffic it's routing, and therefore what it can do with it.

	L4 (transport)	L7 (application)
Sees	IP + port	HTTP method, path, headers, cookies, body
Can route by	Connection	Path, host, header, cookie, query string
TLS	Passes through (or terminates)	Always terminates
Throughput	Higher (kernel-level)	Lower (user-space parsing)
Latency added	~10 µs	~100 µs to 1 ms
Examples	AWS NLB, IPVS, HAProxy in tcp mode	NGINX, Envoy, ALB, Traefik

Use L4 when you need to load-balance non-HTTP protocols (databases, MQTT, raw TCP), maximise throughput, or preserve the client IP all the way to the backend. Use L7 for everything else — most public traffic, anything that benefits from path-based routing, anything that wants to be A/B tested or canary-rolled. In practice big systems run both: a thin L4 load balancer at the very edge for DDoS scrubbing and TLS, then a fleet of L7 load balancers behind it doing the application-aware work.
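To make the difference concrete, here is a minimal sketch of the routing decision each layer is able to make. The pool names, the rule table, the `x-canary` header, and the function names are all hypothetical; the point is only that the L4 function has nothing but the connection tuple to work with, while the L7 function can inspect the parsed request.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Connection:
    """What an L4 balancer can see: addresses and ports only."""
    client_ip: str
    client_port: int


@dataclass
class HttpRequest:
    """What an L7 balancer can see after parsing the request."""
    host: str
    path: str
    headers: dict = field(default_factory=dict)


# Hypothetical pools and rules, for illustration only.
L4_POOL = ["10.0.0.1:5432", "10.0.0.2:5432"]
L7_RULES = [
    # (host, path prefix, pool name); first match wins
    ("example.com", "/api/", "api-pool"),
    ("example.com", "/static/", "static-pool"),
]
DEFAULT_POOL = "web-pool"


def route_l4(conn: Connection) -> str:
    """Pick a backend from the connection tuple alone."""
    key = f"{conn.client_ip}:{conn.client_port}".encode()
    return L4_POOL[zlib.crc32(key) % len(L4_POOL)]


def route_l7(req: HttpRequest) -> str:
    """Pick a pool using application-level fields."""
    # Header-based canary routing is only possible at L7.
    if req.headers.get("x-canary") == "1":
        return "canary-pool"
    for host, prefix, pool in L7_RULES:
        if req.host == host and req.path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

Calling `route_l7(HttpRequest("example.com", "/api/users"))` selects the API pool, while `route_l4` can only spread connections without knowing what they carry.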

The algorithms

The choice of algorithm decides how evenly traffic spreads, how well long-tail latency behaves, and whether a single misbehaving backend can take the system down. There are six algorithms you should know.

Round-robin

Cycle through backends 1, 2, 3, 1, 2, 3. Trivial. Works when backends are identical and requests cost the same. Falls apart when request costs vary: a backend stuck on a slow request keeps receiving its full share, and the pile-up goes unseen.
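A minimal sketch of the cycle, assuming a fixed list of backends; the class and method names are hypothetical. Note that the picker keeps no information about how busy each backend is, which is exactly why slow requests go unnoticed.

```python
import itertools


class RoundRobin:
    """Hands out backends in a fixed repeating order."""

    def __init__(self, backends):
        # itertools.cycle repeats the sequence forever.
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


rr = RoundRobin(["app-1", "app-2", "app-3"])
first_six = [rr.pick() for _ in range(6)]
```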

Weighted round-robin

Some backends get more requests because they have more CPU. Use weights when the fleet is heterogeneous (m6i.large + m6i.xlarge mixed) or you're shifting traffic to a new generation of instance.
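One way to implement this is the "smooth" weighted round-robin, which interleaves picks instead of sending bursts to the heaviest backend. The sketch below is an illustration under assumed names and weights, not the implementation of any particular product.

```python
class SmoothWeightedRoundRobin:
    """Weighted picker that spreads heavy backends evenly through the cycle."""

    def __init__(self, weights):
        # weights: mapping of backend name to a positive integer weight
        self._weights = dict(weights)
        self._current = {name: 0 for name in self._weights}
        self._total = sum(self._weights.values())

    def pick(self):
        # Every backend earns credit in proportion to its weight.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # The backend with the most credit serves the request...
        chosen = max(self._current, key=self._current.get)
        # ...and pays back the total, so the others catch up.
        self._current[chosen] -= self._total
        return chosen


# Hypothetical fleet: the xlarge instance has twice the capacity.
wrr = SmoothWeightedRoundRobin({"m6i.large": 1, "m6i.xlarge": 2})
picks = [wrr.pick() for _ in range(6)]
```

Over any full cycle each backend receives requests in proportion to its weight; here the xlarge instance takes four of every six.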

Least-connections

Send the next request to the backend with the fewest open connections. Self-corrects for slow requests. The default choice for most production HTTP services.
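A minimal sketch, assuming the balancer can count open connections per backend; the class and method names are hypothetical. A backend stuck on slow requests holds its connections longer, so its count stays high and it naturally receives less new work.

```python
class LeastConnections:
    """Routes each request to the backend with the fewest open connections."""

    def __init__(self, backends):
        self._open = {backend: 0 for backend in backends}

    def acquire(self):
        # Ties resolve to the first backend in insertion order.
        backend = min(self._open, key=self._open.get)
        self._open[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request finishes or the connection closes.
        self._open[backend] -= 1


lc = LeastConnections(["app-1", "app-2"])
a = lc.acquire()      # app-1 (both idle, first wins)
b = lc.acquire()      # app-2
lc.release(b)         # app-2 finishes quickly
c = lc.acquire()      # app-2 again, since app-1 is still busy
```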

Least-response-time

Combine connection count with observed P50 latency. Better behaviour during partial degradation, but harder to reason about and more sensitive to noisy latency measurements.
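The sketch below illustrates the idea with one simplification: it tracks an exponentially weighted moving average of latency instead of a true P50, because a running median needs more bookkeeping. That substitution, the scoring formula (open connections plus one, multiplied by average latency), and all names and defaults are assumptions made for illustration.

```python
class LeastResponseTime:
    """Scores backends by open connections and a moving latency average."""

    def __init__(self, backends, alpha=0.2, initial_latency=0.05):
        self._alpha = alpha  # weight given to the newest latency sample
        self._open = {b: 0 for b in backends}
        self._latency = {b: initial_latency for b in backends}

    def _score(self, backend):
        # Lower is better: few open connections and low recent latency.
        return (self._open[backend] + 1) * self._latency[backend]

    def acquire(self):
        backend = min(self._open, key=self._score)
        self._open[backend] += 1
        return backend

    def release(self, backend, observed_latency):
        self._open[backend] -= 1
        old = self._latency[backend]
        # EWMA update (stands in for the P50 described in the text).
        self._latency[backend] = (1 - self._alpha) * old + self._alpha * observed_latency
```

A small `alpha` smooths out noisy samples at the cost of reacting more slowly to real degradation, which is the tuning difficulty this algorithm brings.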

IP / cookie hash (sticky)

Hash the client IP or a session cookie and pin them to one backend. Cheap session-affinity. Required when backends hold per-user state in memory.
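A minimal sketch of hash-based pinning, assuming the key is a client IP or session cookie value; the function name and pool are hypothetical. The same key always maps to the same backend as long as the pool does not change.

```python
import hashlib


def sticky_backend(client_key: str, backends: list[str]) -> str:
    """Map a client IP or session cookie to one backend, deterministically."""
    digest = hashlib.sha256(client_key.encode()).digest()
    # Use the first 8 bytes as an integer, then fold onto the pool size.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


pool = ["app-1", "app-2", "app-3"]
chosen = sticky_backend("203.0.113.7", pool)
```

Because the index depends on `len(backends)`, adding or removing a backend remaps most clients, which is the weakness consistent hashing addresses.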

Consistent hash

Hash a stable key (user ID, cache key) onto a ring. Adding or removing one backend reshuffles only about 1/N of keys instead of nearly all of them, as plain modulo hashing would. Foundation of distributed caches and shard routing.
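The ring can be sketched in a few lines. This version places several virtual nodes per backend to even out the distribution; the virtual-node count, the hash function, and all names are assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a point on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        self._vnodes = vnodes
        self._ring = []    # sorted (point, backend) pairs
        self._points = []  # the points alone, for bisect
        for backend in backends:
            self.add(backend)

    def _rebuild(self):
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    def add(self, backend):
        # Each backend owns many small arcs rather than one large one.
        self._ring.extend(
            (ring_hash(f"{backend}#{i}"), backend) for i in range(self._vnodes)
        )
        self._rebuild()

    def remove(self, backend):
        self._ring = [(p, b) for p, b in self._ring if b != backend]
        self._rebuild()

    def lookup(self, key: str) -> str:
        # Walk clockwise to the next point, wrapping past the end.
        i = bisect.bisect_right(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["cache-1", "cache-2", "cache-3", "cache-4"])
keys = [f"user-{i}" for i in range(2000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("cache-4")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that lived on the removed backend change owner; every other key stays where it was, so caches on the surviving backends remain warm.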

Health checking — the part that makes it self-healing

An algorithm that distributes traffic perfectly across N backends still sends a third of traffic to the dead one if N=3 and one is dead. Health checks are what convert "I have backends" into "I have healthy backends."

Active probes: The LB makes its own request — usually GET /healthz — every few seconds. Cheap, well-understood, but the probe doesn't experience real traffic, so a backend can be passing probes while failing real requests.
Passive probes: The LB observes outcomes of real requests. Three 5xx in a row → mark unhealthy. Catches problems active probes miss but can flap during incidents and isolate good backends if traffic is bursty.
Outlier detection: Envoy's term for "kick out backends whose error rate is N standard deviations above peers." Combines passive probing with a statistical threshold.
Slow start: When a previously-unhealthy backend rejoins, ramp its weight from 0 to 1.0 over 30-60 seconds. Prevents a cold backend from being flooded the moment it returns.

Real systems use all of the above together. The probe shape matters: /healthz should hit the same code path real traffic does — connect to the database, touch the cache — not just return 200. A probe that always passes is worse than no probe.
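The "fails N times in a row" rule used by active probes is a small state machine. Below is a minimal sketch of it, without any networking; the class name and the `fall` and `rise` parameter names are hypothetical, and the defaults (three failures, two successes) are example values.

```python
class HealthTracker:
    """Tracks one backend's health from a stream of probe results."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall      # consecutive failures before marking unhealthy
        self.rise = rise      # consecutive successes before marking healthy
        self.healthy = True
        self._streak = 0      # consecutive results contradicting current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            # Result agrees with the current state, so reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self._streak >= threshold:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
probes = [False, False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
```

Requiring a streak in both directions stops a single dropped probe from ejecting a backend and a single lucky probe from re-adding a broken one.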

The hard cases

Sticky sessions break drains. If you've pinned 5,000 users to a backend with cookie hash, you cannot drain it gracefully — those users will keep landing on it until their cookie expires. Mitigation: short-TTL cookies, store session state in Redis instead of memory, or use consistent hashing instead of pinned hashing.

The thundering herd on backend recovery. A backend rejoins the pool, the LB sends it 1/N of traffic instantly, and it falls over because its caches are cold. Mitigation: slow start (weight ramp), pre-warm with synthetic traffic before re-adding, or use a two-state "ready vs serving" model, where ready means probes pass and serving means the backend has been ramped up to full weight.
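The weight ramp itself is simple arithmetic. A minimal sketch, assuming a linear ramp and a hypothetical function name; real balancers may use a different curve.

```python
def slow_start_weight(seconds_since_rejoin: float, ramp_seconds: float = 30.0) -> float:
    """Fraction of its normal weight a rejoining backend should receive."""
    if seconds_since_rejoin <= 0:
        return 0.0
    if seconds_since_rejoin >= ramp_seconds:
        return 1.0
    # Linear ramp from 0 to 1.0 over the ramp window.
    return seconds_since_rejoin / ramp_seconds


halfway = slow_start_weight(15.0)   # 0.5 of normal weight
```

The result can be multiplied into whatever weight the balancing algorithm already uses, so a backend 15 seconds into a 30-second ramp receives half of its usual share.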

The LB itself. A single-node load balancer is a single point of failure. Production deployments run an HA pair (active/standby with VRRP) or a multi-instance fleet behind anycast. The cloud LBs hide this by being fleet-managed, but if you run NGINX yourself, plan for it.

Cross-region and global load balancing

Within one region, you have a load balancer. Across regions, you have DNS-based or anycast-based global routing. The two common approaches:

GeoDNS: The DNS server returns a different A record based on the client's resolver location — US clients get the us-east IP, EU clients get the eu-west IP. Simple, works everywhere, but DNS caching can leave clients pinned to a dead region for minutes.
Anycast: The same IP is announced from multiple regions; BGP routes clients to the nearest one. CDNs and Cloudflare's whole business depend on this. Failover is seconds, not minutes, but it requires owning your own IP space and BGP relationships — usually a managed-service buy, not a build.

Practical defaults

Default to L7 unless you have a specific reason for L4. Path routing, retries, and observability are worth the added latency.
Default algorithm: least-connections, with weighted variants only when the fleet is heterogeneous.
Active health probes every 5-10 seconds, three failures to mark unhealthy, two successes to mark healthy.
Probe a real endpoint that touches the database. Don't probe a static /ping that always returns 200.
Slow-start every backend that rejoins the pool. 30 seconds is a sensible default.
Run the LB in HA pairs. If you cannot afford an HA pair, use a managed cloud LB.
Terminate TLS at the LB, speak plain HTTP/1.1 to backends over a private network. Re-encrypting on the inside doubles work for negligible benefit.
