A load balancer is the floor of every distributed system. Above it, servers come and go. Below it, traffic keeps flowing as if nothing happened.
Horizontal scaling distributes traffic across many backends, but something has to do the distributing. A user types example.com; the request has to land on one of fifty servers, and the choice of which has consequences. The load balancer is the most critical, most invisible piece of any distributed system. This page covers the algorithms, the layers (L4 vs L7), the health checks that make it self-healing, and the failure modes that take systems down.
What a load balancer actually does
Three jobs, in order of importance. Distribute incoming requests across a pool of backends so no single one is overwhelmed. Detect failure by health-checking those backends and removing dead ones from rotation before users notice. Terminate shared concerns — TLS, HTTP/2, compression — once at the edge instead of N times across the fleet.
Every cloud provider sells one (ALB on AWS, GCLB on Google, Application Gateway on Azure). Every infrastructure team runs at least one (HAProxy, NGINX, Envoy, Traefik). And modern service meshes (Istio, Linkerd) embed a small one as a sidecar next to every pod. The mechanics are the same; only the deployment topology changes.
- Distribute
- Pick a backend per request: round-robin, least-connections, hash-based, or weighted. The algorithm choice has outsized impact on tail latency and is one of the few load-balancer settings worth defending in a review.
- Detect
- Probe each backend on a schedule. If a probe fails N times in a row, mark the backend unhealthy and stop sending it traffic. When probes recover, re-add it gradually.
- Terminate
- Decode TLS once, parse HTTP once, log once, rate-limit once. The backend speaks plain HTTP/1.1 over a private network and does not even know what protocol the client used.
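The "Distribute" job can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer implements it; the `Backend`, `RoundRobin`, and `LeastConnections` names are made up for the example, and the demo assumes requests stay open so that connection counts only grow.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    open_connections: int = 0  # requests currently in flight on this backend


class RoundRobin:
    """Cycle through backends in a fixed order, ignoring their load."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self._backends = list(backends)

    def pick(self):
        return min(self._backends, key=lambda b: b.open_connections)


backends = [Backend("a"), Backend("b"), Backend("c")]
backends[0].open_connections = 40  # "a" is stuck behind slow requests

rr = RoundRobin(backends)
lc = LeastConnections(backends)

rr_picks = [rr.pick().name for _ in range(6)]

lc_picks = []
for _ in range(6):
    chosen = lc.pick()
    chosen.open_connections += 1  # the request stays open for this demo
    lc_picks.append(chosen.name)

print(rr_picks)  # ['a', 'b', 'c', 'a', 'b', 'c']
print(lc_picks)  # ['b', 'c', 'b', 'c', 'b', 'c']
```

Round-robin keeps handing work to the overloaded backend, while least-connections routes around it until the counts even out.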
L4 vs L7 — the only architectural choice that matters
Load balancers operate at one of two layers of the OSI stack. The choice changes what the LB knows about the traffic it's routing, and therefore what it can do with it.
| | L4 (transport) | L7 (application) |
|---|---|---|
| Sees | IP + port | HTTP method, path, headers, cookies, body |
| Can route by | Connection | Path, host, header, cookie, query string |
| TLS | Passes through (or terminates) | Always terminates |
| Throughput | Higher (kernel-level) | Lower (user-space parsing) |
| Latency added | ~10 µs | ~100 µs to 1 ms |
| Examples | AWS NLB, IPVS, HAProxy in tcp mode | NGINX, Envoy, ALB, Traefik |
Use L4 when you need to load-balance non-HTTP protocols (databases, MQTT, raw TCP), maximise throughput, or preserve the client IP all the way to the backend. Use L7 for everything else — most public traffic, anything that benefits from path-based routing, anything that wants to be A/B tested or canary-rolled. In practice big systems run both: a thin L4 load balancer at the very edge for DDoS scrubbing and TLS, then a fleet of L7 load balancers behind it doing the application-aware work.
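The difference in what each layer can see is easiest to show side by side. The sketch below is illustrative only: the `Connection` and `HttpRequest` types, the pool names, the `x-canary` header, and the routing table are all hypothetical, and header names are assumed to be lowercased already.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Everything an L4 balancer can see."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


@dataclass
class HttpRequest:
    """What an L7 balancer sees after parsing the request."""
    host: str
    path: str
    headers: dict = field(default_factory=dict)


def l4_route(conn, pool):
    """Pick a backend from the connection tuple alone."""
    key = f"{conn.src_ip}:{conn.src_port}->{conn.dst_ip}:{conn.dst_port}"
    digest = hashlib.sha256(key.encode()).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


# Hypothetical routing table: (path prefix, pool name), most specific first.
L7_ROUTES = [("/api/", "api-pool"), ("/static/", "static-pool"), ("/", "web-pool")]


def l7_route(req):
    """Pick a pool from request content: a canary header first, then the path."""
    if req.headers.get("x-canary") == "1":
        return "canary-pool"
    for prefix, pool_name in L7_ROUTES:
        if req.path.startswith(prefix):
            return pool_name
    return "web-pool"


pool = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
conn = Connection("198.51.100.7", 51234, "203.0.113.1", 443)

print(l4_route(conn, pool))  # same tuple always maps to the same backend
print(l7_route(HttpRequest("example.com", "/api/users")))  # api-pool
print(l7_route(HttpRequest("example.com", "/home", {"x-canary": "1"})))  # canary-pool
```

The L4 function never looks past addresses and ports, which is why it works for any TCP protocol; the L7 function needs a parsed request, which is why it can do canaries and path routing.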
The algorithms
The choice of algorithm decides how evenly traffic spreads, how well long-tail latency behaves, and whether a single misbehaving backend can take the system down. There are six algorithms you should know.
- Round-robin
- Cycle through backends 1, 2, 3, 1, 2, 3. Trivial. Works when backends are identical and requests cost the same. Falls apart when one backend gets stuck on slow requests: round-robin cannot see the pile-up and keeps sending it new work.
- Weighted round-robin
- Give backends with more capacity a proportionally larger share of requests. Use weights when the fleet is heterogeneous (m6i.large + m6i.xlarge mixed) or you're shifting traffic to a new generation of instance.
- Least connections
- Send the next request to the backend with the fewest open connections. Self-corrects for slow requests. The default choice for most production HTTP services.
- Least response time
- Combine connection count with observed P50 latency. Better behaviour during partial degradation, but harder to reason about and more sensitive to noisy measurements.
- IP or cookie hash (sticky sessions)
- Hash the client IP or a session cookie and pin that client to one backend. Cheap session affinity. Required when backends hold per-user state in memory.
- Consistent hashing
- Hash a stable key (user ID, cache key) onto a ring. Adding or removing one backend reshuffles only about 1/N of keys instead of nearly all of them. Foundation of distributed caches and shard routing.
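The "only 1/N of keys move" property of ring-based hashing can be checked with a small experiment. This is a sketch under assumptions: the `HashRing` class, the choice of SHA-256, and the 100 virtual nodes per backend are illustrative choices, not taken from any specific product.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes  # virtual nodes per backend smooth out the split
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # position -> backend name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user-{i}" for i in range(10_000)]

ring = HashRing(["b1", "b2", "b3", "b4"])
before = {k: ring.lookup(k) for k in keys}
ring.add("b5")
after = {k: ring.lookup(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
# Naive alternative for comparison: hash modulo the number of backends.
naive_moved = sum(1 for k in keys if _hash(k) % 4 != _hash(k) % 5)

print(f"ring:   {moved / len(keys):.0%} of keys moved")        # roughly 1/5
print(f"modulo: {naive_moved / len(keys):.0%} of keys moved")  # roughly 4/5
```

Every key that moves on the ring moves to the new backend; the other backends keep the keys they already had, which is what keeps caches warm during a resize.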
Health checking — the part that makes it self-healing
An algorithm that distributes traffic perfectly across N backends still sends a third of traffic to the dead one if N=3 and one is dead. Health checks are what convert "I have backends" into "I have healthy backends."
- Active probes
- The LB makes its own request, usually GET /healthz, every few seconds. Cheap, well-understood, but the probe doesn't experience real traffic, so a backend can be passing probes while failing real requests.
- Passive probes
- The LB observes outcomes of real requests. Three 5xx in a row → mark unhealthy. Catches problems active probes miss but can flap during incidents and isolate good backends if traffic is bursty.
- Outlier detection
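- Envoy's term for ejecting backends that misbehave relative to their peers, for example hosts whose success rate falls a configured number of standard deviations below the pool average. It layers a statistical threshold on top of passive probing.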
- Slow start
- When a previously-unhealthy backend rejoins, ramp its weight from 0 to 1.0 over 30-60 seconds. Prevents a cold backend from being flooded the moment it returns.
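The probe thresholds and the slow-start ramp fit together as a small state machine per backend. The sketch below assumes a linear ramp and uses a hypothetical `BackendHealth` class with illustrative defaults (three failures to eject, two successes to rejoin, a 30-second ramp); real load balancers differ in the details.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BackendHealth:
    fail_threshold: int = 3          # consecutive failures before ejection
    rise_threshold: int = 2          # consecutive successes before rejoining
    slow_start_seconds: float = 30.0
    healthy: bool = True
    fails: int = field(default=0, init=False)
    successes: int = field(default=0, init=False)
    rejoined_at: Optional[float] = field(default=None, init=False)

    def record_probe(self, ok: bool, now: float) -> None:
        """Feed in one probe result (active or passive) with its timestamp."""
        if ok:
            self.fails = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.rise_threshold:
                self.healthy = True
                self.rejoined_at = now  # slow start begins here
        else:
            self.successes = 0
            self.fails += 1
            if self.healthy and self.fails >= self.fail_threshold:
                self.healthy = False
                self.rejoined_at = None

    def weight(self, now: float) -> float:
        """Share of normal traffic this backend should get, from 0.0 to 1.0."""
        if not self.healthy:
            return 0.0
        if self.rejoined_at is None:
            return 1.0  # never ejected, no ramp needed
        elapsed = now - self.rejoined_at
        return min(1.0, elapsed / self.slow_start_seconds)


b = BackendHealth()
for t in (0, 5, 10):
    b.record_probe(False, now=t)   # third failure ejects the backend
print(b.healthy, b.weight(10))     # False 0.0

b.record_probe(True, now=15)
b.record_probe(True, now=20)       # second success re-admits it
print(b.weight(20), b.weight(35), b.weight(60))  # 0.0 0.5 1.0
```

Requiring consecutive results in both directions is what damps flapping: a single failed probe or a single lucky success does not change the backend's state.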
Real systems use all of the above together. The probe shape matters: /healthz should hit the same code path real traffic does — connect to the database, touch the cache — not just return 200. A probe that always passes is worse than no probe.
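A probe handler that exercises real dependencies might look like the following sketch. The `healthz` function and the two check functions are hypothetical stand-ins; in a real service each check would do a cheap real operation, such as a trivial query on a pooled database connection.

```python
def healthz(checks):
    """Run each dependency check and return (status_code, details)."""
    details = {}
    for name, check in checks.items():
        try:
            check()                  # a check signals failure by raising
            details[name] = "ok"
        except Exception as exc:     # any failure marks the backend unready
            details[name] = f"fail: {exc}"
    status = 200 if all(v == "ok" for v in details.values()) else 503
    return status, details


def check_database():
    """Stand-in for a real round trip to the database."""
    return None


def check_cache():
    """Stand-in that simulates an unreachable cache."""
    raise ConnectionError("cache unreachable")


print(healthz({"database": check_database}))
# (200, {'database': 'ok'})
print(healthz({"database": check_database, "cache": check_cache}))
# (503, {'database': 'ok', 'cache': 'fail: cache unreachable'})
```

Returning the per-dependency details alongside the status code makes it easier to see why a backend was pulled from rotation.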
The hard cases
A backend is "ready" once its probes pass and "serving" once it has been ramped up.
Cross-region and global load balancing
Within one region, you have a load balancer. Across regions, you have DNS-based or anycast-based global routing. The two common approaches:
- GeoDNS
- The DNS server returns a different A record based on the client's resolver location — US clients get the us-east IP, EU clients get the eu-west IP. Simple, works everywhere, but DNS caching can leave clients pinned to a dead region for minutes.
- Anycast
- The same IP is announced from multiple regions; BGP routes clients to the nearest one. CDNs and Cloudflare's whole business depend on this. Failover is seconds, not minutes, but it requires owning your own IP space and BGP relationships — usually a managed-service buy, not a build.
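The GeoDNS decision, including what happens when a region is down, can be sketched as a lookup with a fallback. Everything here is an assumption made for illustration: the region table, the continent codes, the 60-second TTL, and the `resolve` function. The addresses come from ranges reserved for documentation and do not point at real services.

```python
# Hypothetical region table keyed by region name.
REGIONS = {
    "us-east": {"ip": "203.0.113.10", "serves": {"NA", "SA"}},
    "eu-west": {"ip": "198.51.100.10", "serves": {"EU", "AF"}},
}
DEFAULT_REGION = "us-east"
TTL_SECONDS = 60  # a lower TTL shortens how long clients stay pinned after failover


def resolve(resolver_continent, healthy_regions):
    """Return (ip, ttl) for the A record a GeoDNS server might hand out."""
    preferred = next(
        (name for name, r in REGIONS.items() if resolver_continent in r["serves"]),
        DEFAULT_REGION,
    )
    if preferred in healthy_regions:
        chosen = preferred
    else:
        # Fail over to the first healthy region, if there is one.
        chosen = next((name for name in REGIONS if name in healthy_regions), None)
        if chosen is None:
            chosen = preferred  # nothing is healthy: answer anyway
    return REGIONS[chosen]["ip"], TTL_SECONDS


everything_up = {"us-east", "eu-west"}
print(resolve("EU", everything_up))  # ('198.51.100.10', 60)
print(resolve("EU", {"us-east"}))    # ('203.0.113.10', 60)
print(resolve("AS", everything_up))  # ('203.0.113.10', 60)
```

The server can change its answer immediately, but clients that cached the old record keep using it until the TTL expires, which is the pinning problem described above.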
Practical defaults
- Default to L7 unless you have a specific reason for L4. Path routing, retries, and observability are worth the added latency.
- Default algorithm: least-connections, with weighted variants only when the fleet is heterogeneous.
- Active health probes every 5-10 seconds, three failures to mark unhealthy, two successes to mark healthy.
- Probe a real endpoint that touches the database. Don't probe a static /ping that always returns 200.
- Slow-start every backend that rejoins the pool. 30 seconds is a sensible default.
- Run the LB in HA pairs. If you cannot afford an HA pair, use a managed cloud LB.
- Terminate TLS at the LB, speak plain HTTP/1.1 to backends over a private network. Re-encrypting on the inside doubles work for negligible benefit.