Load Balancer Simulator: four ways to spread the work.

A load balancer spreads incoming requests across a pool of backends so no single server gets buried. Round-robin, least-connections, random, and power-of-two-choices all chase the same goal with different math. Run them on one arrival rate, watch the imbalance, and see why P2C wins.

[Interactive simulator: an algorithm selector (P2C shown), a total in-flight counter, and six backend columns, b0 to b5, each showing its in-flight count and a lifetime "served" total.]

What you're looking at

Each column is one backend; its bar height and the bold number track requests in flight right now, with "served" counting the lifetime total below it. Every Dispatch sends one request to the backend the current algorithm picks; that request finishes after a random 300 to 1200 ms, so a column drains on its own. Burst 30 fires a quick volley, and Auto-load keeps a steady stream running so you can watch the shape settle rather than a single frame.

Start on Random and hold Auto-load: the bars jitter, and one column routinely spikes far above the rest while another sits idle. Switch to Power-of-2-choices with the same stream and the columns tighten into a band almost as even as Least-connections, even though P2C only ever looks at two backends per pick and keeps no global ordering. That near-tie is the surprise. The jump from one random choice to two collapses the worst-case pile-up; going to three would barely move the bars further.


What is a load balancer?

One name, many backends.

A load balancer distributes incoming requests across a pool of backend servers so no single server is overwhelmed. Modern load balancers operate at Layer 4 (TCP/UDP) or Layer 7 (HTTP) and use one of five canonical algorithms — round-robin, least-connections, P2C (power-of-two-choices), consistent hashing, or weighted variants. Every web service at scale has at least one in front of it.

Imagine a small web service that took off. Three years ago it ran on a single VM with a public IP and a domain pointing straight at it. Around fifty thousand users a day was the soft ceiling; beyond that, the CPU pegged and response times went bad. The fix that bought another year was a bigger machine. The fix after that was a second copy of the service on a second machine. Suddenly there is a question that did not exist before: when a user types your domain name into their browser, which of the two machines should answer?

You need a small piece of software that sits in front of every backend and answers exactly that question, request by request. It accepts a connection on behalf of the fleet, picks a healthy backend, forwards the request, and returns the response. Done well, the client never knows the dispatch happened: it sees one IP, one TLS certificate, one consistent latency. Done badly, you get the load-balancer failure modes that fill engineering postmortems for years afterward: uneven backend load, sticky-but-stale sessions, a slow backend that drags the whole tail latency upward, a health check that is too slow to react when a process actually dies.

The simulator above shows what the choice of dispatch algorithm does to a fleet of six backends under bursty traffic. Round-robin spreads requests evenly by count but ignores how busy each backend already is: give one backend a slow request and its next turn still comes around on schedule. Least-connections sends each new request to whichever backend has the fewest in-flight calls, the textbook fix for heterogeneous service times. Random-with-two-choices (sometimes called "power of two choices" or P2C) picks two backends at random and sends to whichever has fewer in-flight; it is almost as good as least-connections but needs no global view of the pool and parallelises trivially across dispatchers. Random by itself is the baseline you compare to. Watch the bars: when service times are skewed, the difference between algorithms is the difference between a 50 ms p99 and a 500 ms p99.
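The four policies can be compared offline with a few lines of code. The sketch below is not the page's simulator; it is a minimal tick-based approximation under stated assumptions: one arrival per tick, six backends, and a service time of 3 to 12 ticks standing in for the 300 to 1200 ms described above. The names `pick` and `simulate` are illustrative.

```python
import random

def pick(policy, in_flight, state, rng):
    """Return the index of the backend chosen by `policy`."""
    n = len(in_flight)
    if policy == "round_robin":
        state["next"] = (state.get("next", -1) + 1) % n
        return state["next"]
    if policy == "random":
        return rng.randrange(n)
    if policy == "least_conn":
        return min(range(n), key=lambda i: in_flight[i])
    if policy == "p2c":
        a, b = rng.sample(range(n), 2)      # two distinct random candidates
        return a if in_flight[a] <= in_flight[b] else b
    raise ValueError(policy)

def simulate(policy, backends=6, ticks=5000, seed=1):
    """One arrival per tick; each request holds its backend for 3..12 ticks.
    Returns the highest in-flight count any backend reached."""
    rng = random.Random(seed)
    finish = [[] for _ in range(backends)]  # completion ticks per backend
    state, peak = {}, 0
    for t in range(ticks):
        for q in finish:
            q[:] = [f for f in q if f > t]  # drop requests that have finished
        in_flight = [len(q) for q in finish]
        i = pick(policy, in_flight, state, rng)
        finish[i].append(t + rng.randint(3, 12))
        peak = max(peak, len(finish[i]))
    return peak

for policy in ("round_robin", "random", "least_conn", "p2c"):
    print(policy, simulate(policy))
```

Because service times here are nearly uniform, round-robin looks better than it would under a heavy-tailed workload; widening the `randint` range is the quickest way to see it degrade.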

Two anchoring numbers. A modern L7 reverse proxy (NGINX, Envoy, HAProxy, Caddy) handles roughly one hundred thousand to one million HTTP requests per second per node, depending on TLS termination cost and rule complexity. A modern L4 dispatcher built on XDP and eBPF (Cloudflare's Unimog, Meta's Katran) handles on the order of ten million packets per second per core because it never copies the payload to user space. Production stacks chain these: an L4 box at the edge, an L7 fleet behind it, and a sidecar inside each pod doing the final-mile dispatch with the freshest health information. The rest of this article walks through that stack, the algorithms each layer uses, and the failure modes that pay each layer's salary.

[Figure: incoming requests fanned to backends. Clients reach the load balancer (one IP, one cert), which picks a healthy backend among backends 1 to 4. Backend 3 is sick: it failed its health check, was drained from rotation, and new requests skip it.]

Origins — load balancing is a sixty-year problem

Spreading work, a sixty-year problem.

Load balancing predates the web. The mainframe era split jobs across CPUs by queue length; AT&T's Distributed Computing System (1973) and the multi-CPU schedulers of OS/360 already faced the question every load balancer still asks: given N workers and a stream of arriving tasks of unknown service time, where should the next task go? The Bell Labs and IBM literature of the 1970s answers that question for hard real-time and batch contexts; the routing primitives invented there — round-robin, shortest queue, randomised dispatch — transferred almost unchanged into network load balancing thirty years later.

The networking flavour took shape with Cisco's LocalDirector in 1996, F5's BIG-IP in 1997, and Foundry Networks' ServerIron in 1999 — all hardware appliances that sat in front of the web farm and rewrote destination addresses on the fly. The idea was modest: take what an Ethernet switch already did and add per-flow stickiness plus health probing. It worked because the alternative — round-robin DNS — was slow to react when a backend died (TTLs measured in minutes) and gave clients no way to keep a session pinned to one server. The dedicated layer-4 dispatcher was invisible to clients, switched flows in microseconds, and could yank a sick backend out of rotation in seconds.

The next inflection was software. HAProxy 1.0 shipped in 2001; Willy Tarreau wrote it as an explicit alternative to the appliance market: same job, on commodity Linux, with full source. By the late 2000s NGINX had picked up reverse-proxy duties at scale; by the mid-2010s Lyft's Envoy (open-sourced 2016) had reframed the load balancer as a sidecar primitive that every microservice mesh would adopt. Today's stack is layered: a software L4 dispatcher at the edge (Cloudflare's Unimog and Meta's Katran on XDP/eBPF, Google's Maglev on kernel bypass) hands traffic to an L7 proxy fleet (Envoy, NGINX, Microsoft's YARP from 2021) that hands it to the in-cluster mesh sidecar that hands it, finally, to a pod. The decisions feel different at each layer but the underlying scheduling problem (which backend gets the next request) is the same problem the OS/360 dispatcher solved in 1968.

What unifies the entire history is the brutal asymmetry of the workload. Service times are heavy-tailed: a few requests take 100× longer than the median. A scheduler that ignores in-flight work concentrates the long tails onto whichever unlucky backend they land on, and the resulting queue inflates the latency for every short request behind them. Sixty years of progress amount to incrementally smarter ways of refusing to be that ignorant scheduler.


L4 vs L7 load balancing — two layers, two trade-offs

Two layers, two trade-offs.

The first design decision in any load-balancer architecture is which OSI layer to inspect. Layer 4 routes on the TCP or UDP five-tuple — source IP, source port, destination IP, destination port, protocol — without ever decrypting or parsing what the bytes mean. Layer 7 terminates the application protocol (HTTP, gRPC, WebSocket, MQTT) and routes on its semantic content: path, header, cookie, JWT claim, body fragment. The choice cascades through every other property of the system.
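The difference in routing keys can be sketched in a few lines. This is an illustrative model only: `l4_pick`, `l7_pick`, the fleet names, and the example addresses are hypothetical, and the modulo in `l4_pick` is a simplification (real L4 dispatchers typically use a consistent hash so that membership changes disturb fewer flows).

```python
import hashlib

def l4_pick(five_tuple, backends):
    """L4: hash the connection 5-tuple; the payload is never inspected."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

def l7_pick(path, routes, default_pool):
    """L7: longest-prefix match on the parsed HTTP request path."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    return routes[max(matches, key=len)] if matches else default_pool

# (source IP, source port, destination IP, destination port, protocol)
flow = ("203.0.113.7", 51514, "198.51.100.1", 443, "tcp")
routes = {"/api/checkout": "checkout-fleet", "/static/": "static-fleet"}

print(l4_pick(flow, ["b0", "b1", "b2"]))          # same flow, same backend, every packet
print(l7_pick("/static/app.js", routes, "web-fleet"))
print(l7_pick("/about", routes, "web-fleet"))
```

The L4 function sees only addresses and ports, so every request on that connection lands on one backend; the L7 function can decide per request because it has already parsed the path.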

L4 is fast and stupid. A modern eBPF-based L4 dispatcher like Meta's Katran or Cloudflare's Unimog processes ten million packets per second per core because it never copies the payload into userspace; the kernel XDP hook examines the headers and rewrites the destination MAC or IP in place. AWS's NLB handles tens of millions of flows per zone with a sub-millisecond addition to round-trip time. The trade-off is that L4 cannot tell a logged-in user from an anonymous one, cannot send /api/checkout to one fleet and /static/* to another, and cannot retry a failed request — a TCP connection is opaque, so the failed bytes are lost.

L7 is slower and richer. NGINX, HAProxy, Envoy, AWS ALB, GCP HTTP(S) load balancing, and Cloudflare's edge all terminate TLS, parse HTTP, and route on application semantics. Throughput drops to one hundred thousand to one million requests per second per node depending on TLS handshake cost, header parsing, and routing-rule complexity. In return you get content-based routing, per-request retries with idempotency awareness, request/response buffering, deep healthchecks against application endpoints, and observability that distinguishes 503 from 504 from 599. Modern L7 stacks support HTTP/2 and HTTP/3 multiplexing, which means a single client connection can dispatch hundreds of in-flight requests onto whatever backends the load balancer chooses, request by request, rather than once per connection.

Most production architectures chain the two. The edge layer is L4 for raw throughput and DDoS absorption; an internal L7 fleet handles content routing, retry, and per-service policy; a service-mesh sidecar (Envoy, Linkerd2-proxy, Cilium) inside the pod does final-mile load balancing with the freshest view of backend health. AWS's canonical pattern is NLB → ALB → service. Cloudflare's is Magic Transit (L3/L4) → CDN → Workers (L7). Google's is Maglev (L4) → GFE (L7) → service. Each layer answers a different question with a different latency budget.

Property | L4 (TCP/UDP) | L7 (HTTP/gRPC)
Routing keys | 5-tuple | headers, path, cookies, body
Throughput | 10M+ pps/core | 100k–1M req/s/node
TLS | pass-through | terminate
Sticky | source IP only | cookie, header, JWT claim
Examples | AWS NLB, Maglev, Katran, IPVS | AWS ALB, Envoy, NGINX, YARP
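[Figure: L4 passthrough vs L7 termination. L4 path: client sends TLS bytes to an L4 balancer (NLB / Maglev, 5-tuple hash), which forwards the same TLS bytes to a backend that terminates TLS; the payload is opaque and one core carries millions of flows. L7 path: client sends TLS to an L7 balancer (ALB / Envoy), which terminates TLS, parses the path, routes, and speaks HTTP/2 to the backend pool; TLS terminates at the balancer, enabling content routing.]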

Load balancing algorithms — round-robin, P2C, least-connections, hashing

Five algorithms, one scheduling problem.

Round-robin is the textbook starter: a counter, a modulo, an O(1) pick. It wastes nothing in steady state when service times are uniform and backends are identical — but real workloads are heavy-tailed, real backends are heterogeneous (different generations of EC2 instances, different cache warmth, different JVM heap pressure), and round-robin's ignorance of in-flight work concentrates long requests on whichever backend they happen to land on. Weighted round-robin patches half of that by emitting each backend an integer number of times per cycle proportional to its capacity weight; it solves heterogeneity but still ignores the queue.
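A weighted cycle that emits a heavy backend several times in a row creates its own small bursts. One common refinement, shown here as an assumption rather than as what any particular product does internally, is the "smooth" weighted round-robin popularised by NGINX: every pick adds each backend's weight to a running score, the highest score wins, and the winner pays back the total weight. The function name and weights are illustrative.

```python
from itertools import islice

def smooth_wrr(weights):
    """Yield backend names forever, interleaved in proportion to weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        for name, w in weights.items():
            current[name] += w                 # everyone earns its weight
        chosen = max(current, key=current.get) # ties go to the earliest name
        current[chosen] -= total               # the winner pays the full cycle
        yield chosen

print(list(islice(smooth_wrr({"a": 5, "b": 1, "c": 1}), 7)))
# → ['a', 'a', 'b', 'a', 'c', 'a', 'a']
```

Over each cycle of seven picks, "a" still gets five, but never more than two in a row. Like plain round-robin, the scheme never looks at in-flight work.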

Least-connections tracks active in-flight count per backend and picks the lowest. The implementation is O(N) per dispatch unless you maintain a min-heap. It adapts beautifully to long-tail service times — a backend stuck on a slow request stops receiving new ones until it drains. Least-time (NGINX Plus, Linkerd) extends the idea by combining in-flight count with an exponentially weighted moving average of recent response time, so a backend that is technically idle but historically slow loses traffic before its queue rises. Both schemes need centralised state, which is fine for an in-process load balancer but a coordination headache for a horizontally scaled fleet of dispatchers.
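The two ideas differ only in the scoring function. The sketch below is a minimal model, not the NGINX Plus or Linkerd implementation: the `Backend` class, the smoothing factor `alpha`, and the cost formula `(in_flight + 1) * ewma_ms` are assumptions chosen to show how a smoothed latency term breaks ties that a bare in-flight count cannot.

```python
class Backend:
    def __init__(self, name, alpha=0.2):
        self.name = name
        self.in_flight = 0
        self.ewma_ms = 0.0          # exponentially weighted response time
        self.alpha = alpha

    def start(self):
        self.in_flight += 1

    def finish(self, latency_ms):
        self.in_flight -= 1
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

    def cost(self):
        # An idle but historically slow backend still scores worse than an idle fast one.
        return (self.in_flight + 1) * self.ewma_ms

def least_conn(pool):
    """O(N) scan for the lowest in-flight count; ties go to the first backend."""
    return min(pool, key=lambda b: b.in_flight)

def least_time(pool):
    """Same scan, but weighted by recent response time."""
    return min(pool, key=Backend.cost)

slow, fast = Backend("slow"), Backend("fast")
for backend, latency in ((slow, 100.0), (fast, 10.0)):
    backend.start()
    backend.finish(latency)

print(least_conn([slow, fast]).name)   # both idle, so the first one wins: slow
print(least_time([slow, fast]).name)   # latency history breaks the tie: fast
```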

Random is the surprising baseline. Throw a uniform dart at the backend pool. Mean load is perfect; variance is awful. With n requests across n bins, the heaviest bin grows like ln n / ln ln n: the leading term alone is about 3.6 for a thousand bins and about 5.3 for a million, and lower-order terms push observed maxima a few requests higher. The scheduler is stateless, scales linearly with dispatcher count, and never coordinates, which is exactly the property that makes it the right starting point for the next idea.

Power-of-two-choices (Mitzenmacher's tutorial paper, 2001; Vöcking's tighter bounds, 2003; Azar, Broder, Karlin, Upfal's foundational result, 1994) picks two backends uniformly at random and selects the less loaded of the two. The worst-case bin drops from about ln n / ln ln n to about ln ln n / ln 2. Going from two choices to three barely helps; the first jump captures essentially all of the gain. The algorithm needs no coordination across dispatchers, costs two random reads of a load counter per dispatch, and is the scheduler behind Twitter's Finagle, Linkerd's proxy, Envoy's least-request policy, and HAProxy's balance random, which draws two servers by default and keeps the less loaded one.
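The effect of the number of choices is easy to check empirically. The sketch below is the classical balls-into-bins experiment, assuming n balls, n bins, and d candidates sampled with replacement per ball; `max_load` is an illustrative name. It prints the heaviest bin for one, two, and three choices and deliberately states no expected values, since they vary with the seed.

```python
import random

def max_load(n, d, seed=0):
    """Throw n balls into n bins; each ball samples d bins and joins the emptiest."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]  # with replacement
        best = min(candidates, key=lambda i: bins[i])
        bins[best] += 1
    return max(bins)

for d in (1, 2, 3):
    print(f"d={d}: heaviest bin holds {max_load(100_000, d)}")
```

The interesting comparison is the size of the drop from d=1 to d=2 against the drop from d=2 to d=3.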

Consistent hashing is the algorithm you reach for when stickiness matters: cache affinity, session affinity, sharded state. Hash the request key (user ID, session cookie, source IP for the unfortunate L4 case) onto a ring; place each backend at k hashed points (virtual nodes) spread around that ring; route the request to the next backend clockwise from its hash. Adding or removing one backend re-keys only about 1/N of the keyspace, not all of it. Maglev hashing (Eisenbud et al, NSDI 2016) uses a permutation table instead of a ring, achieving lower memory overhead and a more uniform key distribution at the cost of a more expensive table rebuild on membership change. Earliest-deadline-first scheduling and weighted-fair queueing (Demers, Keshav, Shenker, 1989) are cousins that show up in router quality-of-service rather than in load balancers proper, but the underlying scheduling intuition is the same family.
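A ring with virtual nodes fits in a dozen lines. This is a minimal sketch, assuming SHA-256 as the hash, 100 points per backend, and a sorted list searched with `bisect`; the class name `HashRing` and the key format are illustrative, and no production library's API is implied.

```python
import bisect
import hashlib

def _h(s):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend owns `vnodes` hashed points on the ring.
        self._points = sorted((_h(f"{b}#{i}"), b) for b in backends for i in range(vnodes))
        self._keys = [point for point, _ in self._points]

    def lookup(self, key):
        # First point clockwise from the key's hash, wrapping past the end.
        i = bisect.bisect_right(self._keys, _h(key)) % len(self._keys)
        return self._points[i][1]

keys = [f"user-{i}" for i in range(2000)]
before = HashRing(["b0", "b1", "b2", "b3"])
after = HashRing(["b0", "b1", "b2", "b3", "b4"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
print(f"{len(moved) / len(keys):.0%} of keys moved, all to:",
      {after.lookup(k) for k in moved})
```

Every key that moves lands on the new backend; the other four keep whatever they had, which is the property that protects cache hit rates during a scale-up.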

[Figure: power of two choices, pick two at random and route to the lighter. A request arrives with backends A (12 in flight), B (5), C (8), and D (4). The two random picks are A and D; D wins (4 < 12) and its in-flight count becomes 5. Worst-case bin: ln ln n / ln 2, with no coordination between dispatchers.]
Algorithm | State | Variance | Where
Round-robin | counter | high (heavy tail) | NGINX, classic
Least-connections | per-backend gauge | low | HAProxy, NGINX Plus
Power-of-two | 2 random gauges | very low | Envoy, Linkerd, Finagle
Maglev hash | permutation table | stable on reshuffle | Google, Katran
Ring hash | sorted ring | stable, cache-affine | Vimeo CDN, Akamai

Load balancers in production — F5, NGINX, HAProxy, Envoy, ELB, Maglev

Real systems, at scale.

Maglev is Google's L4 software load balancer described in Maglev: A Fast and Reliable Software Network Load Balancer (Eisenbud, Yi, Contavalli, Smith, Kononov, et al; NSDI 2016). It runs on commodity Linux, sustains roughly ten gigabits per second per box at line rate, uses Maglev hashing for stable backend assignment across reconfigurations, and survives connection state loss because the hash function is consistent enough that a restarted Maglev arrives at the same backend choice for the same flow. Google's frontend network has used Maglev since 2008.
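The table-population step of Maglev hashing can be sketched compactly. This follows the structure described in the paper (each backend gets an offset and a skip that define a permutation of the slots, and backends take turns claiming their next preferred free slot), but the details here are assumptions for illustration: SHA-256 with string salts as the two hash functions, and a toy prime table size of 251 rather than the much larger tables used in production.

```python
import hashlib

def _h(s, salt):
    return int.from_bytes(hashlib.sha256((salt + s).encode()).digest()[:8], "big")

def maglev_table(backends, m=251):
    """Build a lookup table of prime size m mapping slot -> backend name."""
    offset = [_h(b, "offset") % m for b in backends]
    skip = [_h(b, "skip") % (m - 1) + 1 for b in backends]   # 1..m-1, coprime with prime m
    nxt = [0] * len(backends)       # position in each backend's permutation
    table = [None] * m
    filled = 0
    while True:
        for i, b in enumerate(backends):
            # Walk backend i's permutation until it reaches a free slot.
            slot = (offset[i] + nxt[i] * skip[i]) % m
            while table[slot] is not None:
                nxt[i] += 1
                slot = (offset[i] + nxt[i] * skip[i]) % m
            table[slot] = b
            nxt[i] += 1
            filled += 1
            if filled == m:
                return table

def lookup(table, flow_key):
    """Any dispatcher holding the same table maps a flow to the same backend."""
    return table[_h(flow_key, "flow") % len(table)]

table = maglev_table(["b0", "b1", "b2"])
print({b: table.count(b) for b in ("b0", "b1", "b2")})
print(lookup(table, "203.0.113.7:51514->198.51.100.1:443"))
```

Because backends claim slots in strict rotation, their shares of the table differ by at most one slot, which is where the uniform key distribution comes from.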

Katran is Meta's open-source eBPF/XDP L4 load balancer, deployed across Facebook's edge starting in 2017 and open-sourced in 2018. It uses XDP to process packets before the kernel networking stack, achieving roughly ten times the throughput of IPVS at one quarter the CPU. The same XDP toolkit underlies Cloudflare's Unimog (described in their 2020 engineering blog) and parts of GitHub's edge.

Envoy, written by Matt Klein at Lyft and open-sourced in September 2016, became the default service-mesh data plane after CNCF graduation in 2018. Its xDS control-plane API decoupled configuration from the proxy binary, letting Istio, Consul Connect, AWS App Mesh, and Google Cloud Service Mesh all drive Envoy without forking it. Envoy's cluster default policy is ROUND_ROBIN, but its LEAST_REQUEST policy samples two hosts per pick by default when host weights are equal, which is literally the algorithm this simulator runs.

HAProxy 1.0 launched in 2001; client-side HTTP/2 arrived in 1.8 (2017) and end-to-end HTTP/2 in the 2.0 release (2019); experimental HTTP/3 over QUIC followed in the 2.6 line (2022). It runs the load-balancing fleet at Stack Overflow, GitHub, Reddit (historical), and a long tail of latency-sensitive sites. NGINX 1.0 shipped in 2011 from Igor Sysoev's earlier work; the open-source variant supports least_conn, IP hash, and weighted round-robin; NGINX Plus adds least-time and active health checks.

AWS exposes three managed load balancers. The Classic Load Balancer (CLB, 2009, deprecated in 2018 but still serving traffic) was the first managed offering. The Application Load Balancer (ALB, 2016) is a managed L7 with content routing, AWS WAF integration, and Amazon Cognito authentication. The Network Load Balancer (NLB, 2017) is a managed L4 that scales to tens of millions of flows per zone with sub-millisecond latency. GCP's offering is split between Cloud Load Balancing (L7, global anycast, the same Maglev technology) and the regional internal L4 service.

In Kubernetes, the load-balancing layer is split between kube-proxy for in-cluster service VIPs and an external Service of type LoadBalancer for ingress. Three data paths are in common use: kube-proxy's original iptables mode (linear rule matching, fine to a few thousand services), its ipvs mode (generally available since Kubernetes 1.11, kernel-level hash table, scales to tens of thousands of services), and the eBPF-based Cilium data path (Isovalent; 1.0 released in 2018), which replaces kube-proxy's iptables and IPVS rules with kernel-resident eBPF programs and connection-tracking maps. MetalLB provides a software load-balancer for bare-metal clusters that don't have a cloud provider's VIP service; it speaks BGP or layer-2 ARP to claim service IPs.

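# Envoy cluster · power-of-two-choices, panic threshold, outlier ejection
# Fragment: a complete cluster also needs a load_assignment listing its endpoints.
clusters:
- name: orders
  type: STRICT_DNS
  lb_policy: LEAST_REQUEST
  least_request_lb_config:
    choice_count: 2              # sample two hosts per pick (P2C)
  common_lb_config:
    healthy_panic_threshold:
      value: 50.0                # below 50% healthy, route to all hosts
  outlier_detection:             # passive checks on live traffic
    consecutive_5xx: 5           # eject after five 5xx responses in a row
    interval: 10s                # how often the ejection sweep runs
    base_ejection_time: 30s      # first ejection lasts 30 s, longer on repeats
    max_ejection_percent: 50     # never eject more than half the hosts
  health_checks:                 # active probes
  - timeout: 1s
    interval: 5s
    interval_jitter: 1s          # randomise probe timing across dispatchers
    unhealthy_threshold: 3       # three failed probes mark a host unhealthy
    healthy_threshold: 2         # two passing probes re-admit it
    http_health_check:
      path: /healthz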

Health checks — a backend is healthy until proven sick

A backend is healthy until proven sick.

Health checking is where a load balancer earns or loses its keep. Get the tuning wrong and the dispatcher routes to a dead pod (false-healthy: visible 5xx storms) or removes every live one at the same moment (false-unhealthy: the whole fleet drops out and the load balancer has nowhere to send traffic). The literature treats it as three knobs and a topology choice.

Active checks probe each backend on a schedule. The interval is typically five to thirty seconds for HTTP and ten to sixty for raw TCP; the unhealthy threshold is two or three consecutive failures (one is too jumpy — a single packet drop ejects a healthy backend); the healthy threshold for re-admission is at least the unhealthy threshold so flapping backends get filtered. Envoy's interval_jitter spreads probes pseudo-randomly so 100 dispatchers don't all probe the same backend in the same millisecond.
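The threshold logic is a small state machine. The sketch below is a generic model, not Envoy's or HAProxy's code: `HealthTracker` and `next_probe_delay` are hypothetical names, and the defaults of three consecutive failures to eject and three consecutive successes to re-admit are assumptions taken from the ranges above.

```python
import random

class HealthTracker:
    """Flip state only after N consecutive probe results that contradict it."""
    def __init__(self, unhealthy_threshold=3, healthy_threshold=3):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True          # healthy until proven sick
        self._streak = 0             # consecutive results contradicting current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0         # result agrees with the current state
            return self.healthy
        self._streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy

def next_probe_delay(interval_s, jitter_s, rng=random):
    """Base interval plus a random offset so dispatchers do not probe in lockstep."""
    return interval_s + rng.uniform(0.0, jitter_s)

tracker = HealthTracker()
probes = [False, True, False, False, False, True, True, True]
print([tracker.record(ok) for ok in probes])
# → [True, True, True, True, False, False, False, True]
```

The single failure at the start is absorbed; only the run of three ejects the backend, and only a run of three successes brings it back.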

Passive checks infer health from real traffic. Envoy calls this outlier detection; HAProxy exposes it through the observe directive; open-source NGINX does it with max_fails and fail_timeout. With Envoy's defaults, a backend that returns five consecutive 5xx responses is ejected from the pool for thirty seconds, then re-admitted on probation. The advantage is that passive checks see exactly what real users see; the disadvantage is that they react slowly when traffic is sparse and can amplify false positives during a partial outage.
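A passive detector needs nothing but the status codes already flowing through the dispatcher. The sketch below assumes the simplest trigger, a streak of consecutive 5xx responses, with an ejection time that grows each time the same backend is ejected; `OutlierDetector` and its parameter names are illustrative, and timestamps are passed in explicitly so the logic is easy to test.

```python
class OutlierDetector:
    def __init__(self, consecutive_5xx=5, base_ejection_s=30.0):
        self.consecutive_5xx = consecutive_5xx
        self.base_ejection_s = base_ejection_s
        self.errors = 0              # current streak of 5xx responses
        self.ejections = 0           # how many times this backend was ejected
        self.ejected_until = 0.0

    def on_response(self, status, now):
        if 500 <= status <= 599:
            self.errors += 1
            if self.errors >= self.consecutive_5xx:
                self.ejections += 1
                # Repeat offenders stay out longer each time.
                self.ejected_until = now + self.base_ejection_s * self.ejections
                self.errors = 0
        else:
            self.errors = 0          # any success resets the streak

    def is_ejected(self, now):
        return now < self.ejected_until

detector = OutlierDetector()
for _ in range(5):
    detector.on_response(503, now=10.0)
print(detector.is_ejected(now=39.9), detector.is_ejected(now=40.0))   # → True False
```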

Shallow versus deep probes ask different questions. A shallow check (GET /health → 200) tells you the process is up. A deep check (GET /healthz with a database round-trip and a cache ping) tells you the process can serve real requests. Kubernetes draws this distinction explicitly with livenessProbe (shallow — restart the container if it fails) and readinessProbe (deep — remove from the service endpoint list but don't restart). Confusing the two is one of the most common Kubernetes misconfigurations: a deep liveness probe restarts pods every time the database wobbles, turning a database hiccup into a cascading restart storm.

Connection draining handles the deregistration side. When a backend is being removed (deploy, scale-down, autoscaler decision) the load balancer stops sending it new connections but lets existing in-flight requests finish. AWS ALB's deregistration delay defaults to 300 seconds; Istio's terminationDrainDuration for its Envoy sidecar defaults to five seconds. The pairing with Kubernetes' preStop hook and the terminationGracePeriodSeconds field is the most failure-prone configuration in the cloud-native stack; mismatched values produce dropped connections during routine deploys.

The classical pathology is the cascading-removal trap: a deep health check that depends on a shared dependency (a database, an auth service) ejects every backend at once when that dependency wobbles. Envoy and HAProxy both expose a panic threshold — a fraction (typically 50 percent) below which the load balancer abandons the healthy filter entirely and routes to all backends as if all were healthy, on the principle that load-shedding to a slow shared dependency is preferable to routing all traffic to one survivor. This is the single most important load-balancer setting nobody touches.
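The panic rule itself is a one-line comparison. A minimal sketch, assuming health is tracked as a simple name-to-flag mapping and that panic triggers when the healthy fraction falls strictly below the threshold; `routable` is a hypothetical helper, not an API of any proxy.

```python
def routable(backends, panic_threshold=0.5):
    """backends: dict of name -> healthy flag. Returns the names eligible for traffic."""
    healthy = [name for name, ok in backends.items() if ok]
    if len(healthy) < panic_threshold * len(backends):
        # Panic mode: the health signal is no longer trusted, so use everyone.
        return list(backends)
    return healthy

print(routable({"a": True, "b": True, "c": False, "d": False}))   # → ['a', 'b']
print(routable({"a": True, "b": False, "c": False, "d": False}))  # → ['a', 'b', 'c', 'd']
```

In the second call only a quarter of the fleet looks healthy, so the function stops filtering rather than pile all traffic onto "a".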

Liveness vs readiness, in one rule

A liveness probe failing should restart the pod. A readiness probe failing should drain it from the load balancer. If your liveness check pings the database, a database wobble restarts every pod and turns a five-minute incident into a thirty-minute one. Liveness shallow, readiness deep.


Session affinity — when the dispatcher has to remember

When the dispatcher has to remember.

Stateless backends are the easy case. Hard cases need stickiness: WebSocket and Server-Sent Events keep a single TCP connection alive for hours; gRPC's HTTP/2 multiplexes hundreds of streams on a single connection; cache-affine systems (sharded Redis, Memcached, in-memory session stores) require the same key to land on the same backend or the cache miss rate skyrockets. Stickiness comes in three flavours, each with a different cost.

IP hash is the L4 fallback. Hash the source IP to pick a backend and route every packet from that client there. It is implemented in the kernel (IPVS's source-hashing scheduler, ipvsadm --scheduler sh), processes packets at line rate, and works for any protocol. The flaw is mobile NAT: tens of thousands of mobile clients behind a single carrier-grade NAT all hash to the same backend, hot-spotting it. The same flaw appears with corporate proxies and any single-IP egress. Use it for narrow workloads where the IP space is naturally well-distributed.

Cookie/header stickiness is the L7 standard. The load balancer sets a cookie on the first response (Envoy's cookie hash policy, ALB's AWSALB cookie, NGINX Plus's sticky cookie) or the application sets one explicitly (signed JWT with a backend-id claim). Subsequent requests carry the cookie; the load balancer reads it and routes deterministically. The advantage is that stickiness survives client IP changes (mobile network handoff) and that it works through proxies. The disadvantage is that the cookie outlives the backend that issued it, so deregistering a backend forces a fresh cookie on its sticky users.

Consistent hashing is the cache-affine choice. Hash the request key on a ring; route to the next backend clockwise. Adding a backend re-keys only 1/N of traffic; removing one re-keys the same fraction. Use the request key (user ID, tenant ID, cache key) as the hash input rather than the source IP. Bounded-load consistent hashing (Mirrokni, Thorup, Zadimoghaddam, 2016) caps the per-backend load at a small constant times the average; if the next backend is full, traffic spills to the next available one on the ring. Used in Vimeo's CDN, Akamai's edge logic, and Google Cloud's load balancer.
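The bounded-load rule adds one loop to an ordinary ring lookup. The sketch below is a simplified batch version under stated assumptions: one ring point per backend (no virtual nodes), a fixed key set known up front so the cap can be computed once, and a balance factor `c` of 1.25. The published algorithm handles keys arriving and leaving online; `assign` is an illustrative name.

```python
import bisect
import hashlib
import math

def _h(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def assign(keys, backends, c=1.25):
    """Place keys on a ring; a backend at capacity spills to the next one clockwise."""
    ring = sorted((_h(b), b) for b in backends)
    points = [point for point, _ in ring]
    cap = math.ceil(c * len(keys) / len(backends))   # per-backend ceiling
    load = {b: 0 for b in backends}
    placement = {}
    for key in keys:
        i = bisect.bisect_right(points, _h(key)) % len(ring)
        while load[ring[i][1]] >= cap:               # full: keep walking clockwise
            i = (i + 1) % len(ring)
        load[ring[i][1]] += 1
        placement[key] = ring[i][1]
    return placement, load

placement, load = assign([f"user-{i}" for i in range(1000)], ["b0", "b1", "b2", "b3"])
print(load)   # no backend exceeds ceil(1.25 * 1000 / 4) = 313
```

With a single point per backend the raw ring would be badly skewed; the cap is what keeps the busiest backend within 25 percent of the average here.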

Direct server return (DSR), sometimes called direct routing or triangular routing, lets the response bypass the load balancer entirely. The dispatcher rewrites the destination MAC of incoming packets but leaves the destination IP unchanged; the backend, which is configured to claim the VIP on a loopback alias, accepts the packet and replies directly to the client over its default route. The asymmetric path means the load balancer sees only ingress traffic, halving its bandwidth budget — useful when responses dominate (video, large file downloads). LVS-DR, Maglev, Katran all support DSR; AWS NLB does it implicitly because it preserves source IP through a flow-table lookup. The cost is operational complexity: DSR requires layer-2 reachability between dispatcher and backend, and it interacts badly with stateful firewalls that expect to see both directions of the conversation.


Where load balancers fall apart — slow drains, herd, asymmetric backends

Where load balancers fall apart.

Thundering herd at startup. When a fresh backend joins a busy pool under least-connections, every dispatcher picks it for the next request because its in-flight count is zero. A burst of dispatch decisions lands on the same backend in the same millisecond and the cold start (JIT warmup, cache fill, JVM tier-up compilation) collapses under the load. Envoy's slow-start mode (slow_start_config with a slow_start_window of, for example, thirty seconds) ramps a freshly added backend's effective weight up to its configured value over that window; HAProxy has the same idea under slowstart.
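A slow-start ramp is a function of how long the backend has been in the pool. The sketch below assumes a linear ramp over a thirty-second window and a floor of 10 percent so that a new backend receives some traffic from the first second; both the floor and the function name `effective_weight` are assumptions for illustration, and real proxies offer other ramp shapes.

```python
def effective_weight(configured, age_s, window_s=30.0, min_fraction=0.1):
    """Linear ramp from min_fraction * configured up to configured over window_s."""
    if age_s >= window_s:
        return configured
    fraction = max(min_fraction, age_s / window_s)
    return configured * fraction

for age in (0, 3, 15, 30, 120):
    print(f"{age:>3}s -> weight {effective_weight(100, age):.0f}")
```

A weighted picker that divides in-flight count by this weight will see the newcomer as "expensive" at first and shift load onto it gradually as the ramp completes.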

Hash drift on resharding. Plain modular hashing (hash(key) mod N) re-keys (N-1)/N of all traffic when one backend joins or leaves — almost the entire keyspace. Consistent hashing limits this to 1/N. The Discord 2017 outage is the canonical example: a deploy added a single Redis shard, plain modular hashing rebalanced 96 percent of cache keys, the cache miss rate spiked, the database fell over, and the entire chat backend went dark for forty minutes.
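The size of the reshuffle under modular hashing is easy to measure. This sketch assumes SHA-256 as a stand-in for whatever hash a real cache client uses and synthetic key names; `moved_fraction` is an illustrative helper. Growing from 24 to 25 shards should move close to 24/25 of the keys.

```python
import hashlib

def _h(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def moved_fraction(keys, n_before, n_after):
    """Share of keys whose shard changes under hash(key) mod N."""
    moved = sum(1 for k in keys if _h(k) % n_before != _h(k) % n_after)
    return moved / len(keys)

keys = [f"key-{i}" for i in range(20000)]
print(f"24 -> 25 shards: {moved_fraction(keys, 24, 25):.1%} of keys change shard")
```

A key stays put only when its hash leaves the same remainder modulo both 24 and 25, which happens for 24 residues out of every 600, or 4 percent of keys.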

The retry storm. A single backend hiccups; retries from upstream double the load on the surviving backends; one of those falls behind; more retries; the entire fleet is processing five times the offered load. AWS's 2015 DynamoDB outage (postmortem published 2015-09-20) was triggered exactly this way. The mitigation is layered: bounded retries with jitter, retry budgets (Google SRE Book chapter 22), and circuit breakers (Envoy's circuit_breakers thresholds and outlier_detection, Hystrix's deprecated but influential model) that fail fast rather than retry into a saturated backend.
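Two of those mitigations fit in a short sketch. The retry budget below caps retries at a fixed ratio of the requests seen so far, and the backoff uses "full jitter" (a uniform draw up to an exponentially growing ceiling). The class name, the 10 percent ratio, and the base and cap values are assumptions for illustration; a production budget would also age out old counts over a time window.

```python
import random

class RetryBudget:
    """Allow retries only while they stay within `ratio` of the requests seen."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self):
        self.requests += 1

    def try_retry(self):
        if self.retries + 1 > self.ratio * self.requests:
            return False             # budget spent: fail fast instead of retrying
        self.retries += 1
        return True

def backoff_s(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))

budget = RetryBudget(ratio=0.1)
for _ in range(20):
    budget.on_request()
print([budget.try_retry() for _ in range(3)])   # → [True, True, False]
```

With twenty requests observed, the budget admits two retries and rejects the third, so a struggling fleet sees at most 10 percent extra load instead of a multiple of it.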

Slowloris and friends. A malicious client opens a connection, sends bytes one per second, and never finishes the request. An L4 load balancer happily holds the connection open; a backend with limited per-process file-descriptor budget exhausts it. NGINX's client_header_timeout and client_body_timeout (60 s by default) and HAProxy's timeout http-request and timeout client exist for exactly this reason, and hardened configurations commonly tighten them to ten to thirty seconds. AWS ALB's idle timeout (default 60 s, max 4000 s) is the production knob.

Half-open connections after failover. When a load balancer fails over to a hot standby, the new dispatcher has no record of in-flight TCP connections. Existing client sockets keep sending packets and the new dispatcher silently drops them, a problem that takes minutes to surface as a TCP retransmit timeout on the client. Conntrack-based stateful failover (LVS, IPVS with sync_daemon) helps but is not free; consistent-hash dispatchers like Maglev sidestep the problem because the same flow hashes to the same backend regardless of which dispatcher handles it.

Sticky-session lock-in. A backend serves ten thousand sticky sessions and has to be drained for a deploy. With strict stickiness, all ten thousand sessions break. Cookie-based stickiness with a short TTL (10–15 minutes) plus connection draining is the usual escape hatch; consistent hashing on a session-ID key with bounded load is a more elegant one. The GitHub 2018-10-21 outage was extended in part because session affinity to the partitioned MySQL master complicated the failover topology.

