Maglev, a load balancer in software.

In 2008 Google replaced its hardware load balancers with a software design that runs on the same Linux servers as the rest of the fleet. The paper describes how: stateless consistent hashing over a 65,537-row lookup table that keeps flow-to-backend mappings stable even as backends come and go, plus direct-server-return packet forwarding. The design became the reference architecture for modern L4 software LBs: Cloudflare's Unimog, Meta's Katran, and the Cilium-based LBs in Kubernetes.

Authors Daniel E. Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, et al.

Year 2016

Venue NSDI


TL;DR

Maglev is Google's L4 load balancer. It runs on commodity hardware, scales linearly with the number of Maglev instances, and routes packets to backends using a deterministic consistent-hash lookup table. Each Maglev instance independently computes the same lookup table from the live backend set; given a packet, all Maglev instances would forward it to the same backend — so connection state doesn't need to be shared. The lookup table has 65,537 rows (a small prime); each row maps to a backend. Adding or removing a backend reshuffles about 1/N of the rows. Packets to the LB's anycast VIP arrive at any Maglev instance; the instance looks up the backend in O(1), encapsulates the packet, and forwards to the backend. The backend replies directly to the client (Direct Server Return), skipping the LB on the return path.

The problem

Google's L4 load balancing in the 2000s used dedicated hardware appliances. As traffic grew, the appliances became operationally painful: capacity planning was lumpy (you bought a new appliance for a new POP), failures required hardware swaps, and the cost per Gbps was much higher than commodity Linux servers.

The team wanted a software replacement that would scale by adding Linux boxes, run on the same fleet hardware as everything else, and provide the same operational properties as the appliances: high throughput per box, predictable latency, no per-flow state across rebalances, and connection consistency across LB instances.

The key idea

Stateless hashing is the central idea. Each Maglev instance independently maintains a lookup table mapping a deterministic hash of the packet's 5-tuple (src IP, src port, dst IP, dst port, protocol) to a backend. The lookup table is identical across all Maglev instances because all instances run the same construction algorithm on the same backend set. So a flow's packets can be sprayed across Maglev instances (via ECMP from the upstream routers) and still all land on the same backend.
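The lookup step described above can be sketched as follows. This is a minimal illustration, not the paper's code: the choice of BLAKE2b as the hash, the helper names `five_tuple_hash` and `pick_backend`, and the round-robin placeholder table are all assumptions made for the example.

```python
import hashlib
import socket
import struct

TABLE_SIZE = 65537  # row count quoted in the text


def five_tuple_hash(src_ip, src_port, dst_ip, dst_port, proto):
    """Deterministic 64-bit hash of a 5-tuple (BLAKE2b is an illustrative choice)."""
    packed = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    return int.from_bytes(hashlib.blake2b(packed, digest_size=8).digest(), "big")


def pick_backend(table, flow):
    """O(1) lookup: hash the flow, index the table."""
    return table[five_tuple_hash(*flow) % len(table)]


# Placeholder table filled round-robin; a real table would come from the
# Maglev hashing construction, which this sketch does not implement.
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
table = [backends[i % len(backends)] for i in range(TABLE_SIZE)]

# Two "instances" holding identical tables agree without sharing any state.
flow = ("198.51.100.7", 40123, "203.0.113.10", 443, 6)  # TCP to a VIP
instance_a = pick_backend(table, flow)
instance_b = pick_backend(list(table), flow)
```

Because the answer depends only on the packet and the table, it does not matter which instance ECMP delivers a given packet to.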

The lookup table has 65,537 rows, a prime large enough to spread load evenly at realistic backend counts. Adding or removing a backend re-computes the table; only about 1/N of the entries change. The Maglev hashing construction (the paper's Section 3.4) is deterministic: every Maglev instance produces the same table from the same backend set.
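A sketch of the permutation-based construction, following the offset/skip scheme the paper describes, may help here. The hash function (BLAKE2b with a salt string), the helper name `_h`, and the decision to sort backend names so that input order does not matter are assumptions of this example rather than details taken from the paper.

```python
import hashlib
from collections import Counter


def _h(name, salt):
    """Salted 64-bit hash of a backend name (illustrative choice of hash)."""
    data = (salt + ":" + name).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def build_table(backends, m=65537):
    """Fill an m-row table; m must be prime so each (offset, skip) pair
    walks every row exactly once."""
    if not backends:
        raise ValueError("need at least one backend")
    names = sorted(backends)  # assumption: sort so input order is irrelevant
    offsets = [_h(b, "offset") % m for b in names]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in names]
    next_idx = [0] * len(names)  # position in each backend's permutation
    table = [None] * m
    filled = 0
    while True:
        # Backends take turns; each claims its next preferred empty row.
        for i, name in enumerate(names):
            row = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[row] is not None:
                next_idx[i] += 1
                row = (offsets[i] + next_idx[i] * skips[i]) % m
            table[row] = name
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table


table = build_table(["be-a", "be-b", "be-c"])
counts = Counter(table)
same_set_other_order = build_table(["be-c", "be-a", "be-b"])
```

Because backends take turns claiming rows, the per-backend row counts in `counts` differ by at most one, and any instance given the same backend set builds the same table.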

Direct Server Return avoids putting the LB on the return path. A Maglev instance encapsulates the incoming packet in GRE (or IPIP) addressed to the backend's IP. The backend decapsulates and processes it, then sends the response directly to the client using the VIP as the source address. The LB sees only ingress traffic, which roughly halves its bandwidth load when requests and responses are similar in size and saves more when responses are larger.
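To make the encapsulation step concrete, here is a minimal sketch that wraps an already-formed inner IPv4 packet in an outer IPv4 header plus a basic GRE header (no optional fields). The function names, the TTL of 64, and the use of the Maglev machine's own address as the outer source are assumptions for illustration; the inner packet is left untouched, so it still carries the client as source and the VIP as destination.

```python
import socket
import struct

IPPROTO_GRE = 47         # IP protocol number for GRE
GRE_PROTO_IPV4 = 0x0800  # GRE "protocol type" for an IPv4 payload


def ipv4_checksum(header: bytes) -> int:
    """Ones' complement sum of 16-bit words, as used by the IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def gre_encapsulate(inner_packet: bytes, maglev_ip: str, backend_ip: str) -> bytes:
    """Prefix inner_packet with an outer IPv4 header and a 4-byte GRE header."""
    gre = struct.pack("!HH", 0, GRE_PROTO_IPV4)  # no flags, version 0
    total_len = 20 + len(gre) + len(inner_packet)

    def outer_header(checksum: int) -> bytes:
        return struct.pack(
            "!BBHHHBBH4s4s",
            0x45, 0, total_len,  # version 4, IHL 5, TOS 0, total length
            0, 0,                # identification, flags/fragment offset
            64, IPPROTO_GRE,     # TTL (assumed), protocol
            checksum,
            socket.inet_aton(maglev_ip),
            socket.inet_aton(backend_ip),
        )

    # Compute the checksum over the header with its checksum field zeroed.
    return outer_header(ipv4_checksum(outer_header(0))) + gre + inner_packet


inner = b"\x45" + bytes(39)  # stand-in bytes for a client packet addressed to the VIP
packet = gre_encapsulate(inner, "10.1.0.5", "10.2.0.9")
```

The backend strips the first 24 bytes, finds the original packet, and can answer the client directly from the VIP address.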

Kernel-bypass packet processing. Maglev uses a userspace network stack with shared-memory rings to the NIC, predating DPDK's popularisation but with a similar architecture. Each Maglev instance handles ~10 Gbps from a single NIC and ~10 million pps on a single core.

Connection consistency without state. Most software LBs in 2008 tracked flow state — they remembered which backend each flow was using. That state had to be replicated across LB instances. Maglev showed you can compute the answer from the same input on every instance, no replication needed. The flip side: when the backend set changes, ~1/N of flows get re-mapped. The paper measures the impact and shows it's acceptable for stateless protocols (HTTP without keepalive) and tolerable for stateful ones (with hash-based fallback for the few re-mapped flows).
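The re-mapping cost can be measured directly by building the table before and after a backend is removed and counting the rows that changed. The sketch below repeats a compact version of the permutation-based construction so it runs on its own; the hash choice, backend names, and the sorting of names are assumptions of the example, and the printed fractions depend on those choices, so no expected values are given.

```python
import hashlib


def _h(name, salt):
    """Salted 64-bit hash of a backend name (illustrative choice of hash)."""
    data = (salt + ":" + name).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def build_table(backends, m=65537):
    """Permutation-based table fill; m must be prime."""
    names = sorted(backends)  # assumption: sort so input order is irrelevant
    offsets = [_h(b, "offset") % m for b in names]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in names]
    next_idx = [0] * len(names)
    table = [None] * m
    filled = 0
    while True:
        for i, name in enumerate(names):
            row = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[row] is not None:
                next_idx[i] += 1
                row = (offsets[i] + next_idx[i] * skips[i]) % m
            table[row] = name
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table


backends = [f"backend-{i}" for i in range(10)]
removed = sorted(backends)[-1]

before = build_table(backends)
after = build_table([b for b in backends if b != removed])

owned = before.count(removed)  # rows that must move
moved = sum(1 for old, new in zip(before, after) if old != new)

print(f"rows owned by removed backend: {owned / len(before):.4f}")
print(f"rows remapped in total:        {moved / len(before):.4f}")
```

Every row owned by the removed backend has to move, so the total can never be below roughly 1/N; any gap between the two printed numbers is extra churn among surviving backends.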

Contributions

Maglev hashing. The deterministic permutation-based construction that produces a balanced lookup table with minimal disruption on backend changes. The construction is now the reference for L4 software LBs.

Stateless flow hashing at scale. No flow-state sharing across LB instances. Add a Maglev instance by adding a box and including it in the upstream router's ECMP group. Remove one the same way. Operational simplicity at the cost of accepting some flow disruption on rebalance.

Direct Server Return. Halving LB bandwidth by skipping the response path. Now standard in most modern L4 LBs.

Production deployment evidence. The paper includes operational data: Maglev handles all of Google's incoming traffic, with per-instance throughput in the tens of Gbps, latency under 100 microseconds, and 99.99% availability over the measurement period.

Criticisms and limitations

The rebalance disruption is real. When a backend is added or removed, ~1/N of flows are re-mapped. For long-lived stateful flows (WebRTC, video streams), this means dropping the connection. Production systems pair Maglev with a slower fallback (a flow-state table for the unfortunate 1/N) or use sticky-routing techniques on top.
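One way to picture the flow-state fallback is a small per-instance table that pins flows already seen to their original backend for as long as that backend stays healthy. The class below is a hypothetical sketch: the name `FlowTracker`, its methods, and the use of a plain dict keyed by flow hash are illustrative, not taken from any production system.

```python
class FlowTracker:
    """Hypothetical per-instance fallback: remember where known flows went."""

    def __init__(self, table, healthy):
        self.table = table
        self.healthy = set(healthy)
        self.flows = {}  # flow hash -> backend chosen when first seen

    def update(self, table, healthy):
        """Install a rebuilt lookup table and the current healthy set."""
        self.table = table
        self.healthy = set(healthy)

    def route(self, flow_hash):
        pinned = self.flows.get(flow_hash)
        if pinned in self.healthy:
            return pinned  # keep an existing flow where it was
        # New flow, or its old backend is gone: fall back to the table.
        backend = self.table[flow_hash % len(self.table)]
        self.flows[flow_hash] = backend
        return backend


tracker = FlowTracker(["a", "b", "c", "a"], healthy={"a", "b", "c"})
first = tracker.route(1)                 # table row 1

tracker.update(["c", "c", "c", "c"], healthy={"a", "b", "c"})
after_rebalance = tracker.route(1)       # still pinned to the old backend
new_flow = tracker.route(2)              # new flows follow the new table

tracker.update(["c", "c", "c", "c"], healthy={"a", "c"})
after_failure = tracker.route(1)         # old backend gone, so re-mapped
```

The state here lives on a single instance, so it only helps while the upstream router keeps delivering that flow to the same instance; a production version would also need entry expiry, which this sketch omits.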

The single-tier model. Maglev is L4 only — no SSL termination, no HTTP routing, no rate limiting at the application level. It hands packets to backends that do all the application work. For full L7 functionality you need a second tier (Envoy, NGINX, the application itself).

Hash collisions and adversarial flows. A determined adversary can construct flows whose 5-tuples hash to the same backend, oversaturating one backend while others are idle. The paper doesn't address this; production deployments use hashing tricks (random seed per VIP, hash-based scrubbing) to mitigate.
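A per-VIP seed can be sketched with a keyed hash: without the seed, an attacker cannot predict which row a given 5-tuple lands on. The function name `keyed_flow_hash`, the use of BLAKE2b's `key` parameter, and the seed values are assumptions for illustration. The seed has to be identical on every Maglev instance serving that VIP, otherwise instances would disagree on the backend.

```python
import hashlib
import socket
import struct


def keyed_flow_hash(seed, src_ip, src_port, dst_ip, dst_port, proto):
    """64-bit keyed hash of a 5-tuple; seed is a secret shared per VIP."""
    packed = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    digest = hashlib.blake2b(packed, digest_size=8, key=seed).digest()
    return int.from_bytes(digest, "big")


# Fixed seeds keep the example reproducible; real seeds would be random
# secrets (for example from secrets.token_bytes) distributed to all instances.
vip_seeds = {
    "203.0.113.10": bytes(16),
    "203.0.113.11": b"\x01" * 16,
}

flow = ("198.51.100.7", 40123, "203.0.113.10", 443, 6)
h_seed_a = keyed_flow_hash(vip_seeds["203.0.113.10"], *flow)
h_seed_a_again = keyed_flow_hash(vip_seeds["203.0.113.10"], *flow)
h_seed_b = keyed_flow_hash(vip_seeds["203.0.113.11"], *flow)
```

The same flow hashes identically under the same seed and differently under another, so a set of colliding 5-tuples crafted against one seed does not carry over to a VIP with a different one.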

Reliance on ECMP. Maglev assumes upstream routers do ECMP to spread flows across Maglev instances. ECMP itself can have flow-distribution issues (consistent-hash ECMP isn't universally supported). Cloudflare's Unimog write-up addresses this with an alternative routing approach.

Where it shows up today

Cloudflare Unimog (2020) — explicitly inspired by Maglev. Uses a different hashing construction but the same stateless principle.

Meta Katran — Maglev's ideas, but implemented as an XDP/eBPF program in the Linux kernel, achieving ~10× the throughput per core.

Cilium's service load balancing in Kubernetes — uses XDP/eBPF with Maglev hashing.

AWS Network Load Balancer, GCP Network Load Balancer — both use Maglev-style stateless hashing internally (per public sources).

IPVS in the Linux kernel has had a Maglev hashing scheduler (mh) since kernel 4.18.

Most modern L4 software LBs use Maglev hashing by default; the academic alternatives (rendezvous hashing, jump hashing) are used in niche cases.

Follow-up reading

Stateless Datacenter Load-balancing with Beamer — Olteanu et al · 2018 · NSDI. A different stateless approach that aims to avoid flow disruption on rebalance.
Cloudflare's Unimog Load Balancer — Cloudflare blog · 2020. How Cloudflare built their version of Maglev. Same principles, different routing.
Katran: A High-Performance Layer 4 Load Balancer — Facebook engineering blog · 2018. Meta's Maglev-inspired XDP/eBPF implementation. Open source.
Consistent Hashing and Random Trees — Karger et al · 1997 · STOC. The general consistent-hashing background. Maglev is one of several variants.
The Tail at Scale — Dean & Barroso · 2013 · CACM. Why software LBs need to care about latency variance, not just throughput. Annotated.
