The Tail at Scale: why p99 is the metric that matters

In a short CACM article, Dean and Barroso reframed how the systems community thinks about latency. The argument: when a request fans out to thousands of leaves, the user sees the slowest leaf's latency, so average latency says little about what the user experiences. With a fan-out of about a thousand, the 99.9th percentile of any one server's latency becomes roughly the median of the user's experience. The paper named the problem and prescribed the toolbox.

→ The original PDF (research.google, free)

Authors: Jeffrey Dean, Luiz André Barroso

Year: 2013

Venue: Communications of the ACM (CACM)

TL;DR

A web search request fans out to thousands of leaf servers, and the user waits for all of them. If each server's p99 is 1 second and a request hits 100 servers in parallel, the probability that at least one server takes more than a second is about 63%. A server-level p99 of 1 second therefore pushes the user-level median above 1 second. The paper enumerates the sources of variability (GC, queueing, shared resources, OS scheduler, hardware variability), shows how fan-out amplifies them, and proposes three families of solutions: reduce variability at the source, mask it within a single request (hedged requests, tied requests), and adapt across requests over longer time scales (micro-partitioning, selective replication, latency-induced probation).

The problem

By 2012 Google Search was a distributed system in which a user query was answered by thousands of leaf servers in parallel — each owning a slice of the index, all returning their top-k partial results to a top-level aggregator. The aggregator could only return to the user after every leaf responded (or after a timeout that dropped some results).

Internal latency dashboards showed that the p50 per-leaf latency was tiny: single-digit milliseconds. But the user-perceived p50 was hundreds of milliseconds. The arithmetic was unforgiving: a 1-in-1000 slow event at a single server became more likely than not (about 63%) when 1000 servers were queried at once.
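The gap between the per-leaf median and the user-perceived median can be reproduced with a toy simulation. The sketch below assumes a hypothetical leaf that answers in 3 to 7 ms except for a 1-in-1000 stall of 500 ms; the distribution, the function names and the numbers are illustrative assumptions, not measurements from the paper.

```python
import random
import statistics


def leaf_latency_ms(rng: random.Random) -> float:
    """Hypothetical leaf: 3-7 ms normally, a 500 ms stall on 1 call in 1000."""
    if rng.random() < 0.001:
        return 500.0
    return rng.uniform(3.0, 7.0)


def request_latency_ms(rng: random.Random, fan_out: int) -> float:
    """The aggregator waits for every leaf, so it sees the slowest one."""
    return max(leaf_latency_ms(rng) for _ in range(fan_out))


def median_latency_ms(fan_out: int, trials: int = 1000, seed: int = 1) -> float:
    """Median request latency over `trials` simulated requests."""
    rng = random.Random(seed)
    return statistics.median(
        request_latency_ms(rng, fan_out) for _ in range(trials)
    )


if __name__ == "__main__":
    for fan_out in (1, 100, 1000):
        print(fan_out, median_latency_ms(fan_out))
```

With a fan-out of 1 the median stays in the single-digit range. With a fan-out of 1000 most requests include at least one stalled leaf, so the median jumps to the stall latency even though each leaf is slow only 0.1% of the time.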

The conventional response — "optimise the slow path" — wasn't enough. The slow path was the tail of a perfectly healthy server's latency distribution, made up of unavoidable contributions: GC pauses, queueing waits, lock contention, OS scheduler quanta, kernel page-cache evictions. The systems community needed both a name for this problem and a toolkit for addressing it.

The key idea

The central idea is that latency variability compounds with fan-out. The paper presents the formula: if each leaf has probability p of exceeding latency L, and the request fans out to n leaves in parallel, the probability that at least one exceeds L is 1 − (1 − p)ⁿ. For p = 1% and n = 100, this is 63%. For n = 1000, it's effectively 100%.
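The formula is easy to evaluate directly. The snippet below is a minimal sketch; the function name `p_any_slow` is an illustrative choice, and the calculation assumes that leaves are slow independently of one another.

```python
def p_any_slow(p: float, n: int) -> float:
    """Probability that at least one of n independent leaves exceeds latency L,
    when each leaf exceeds L with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"p=1%, n={n}: {p_any_slow(0.01, n):.4f}")
```

The independence assumption matters: correlated slowdowns, such as a shared network path, change the numbers, although the qualitative conclusion that fan-out amplifies the tail still holds.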

The paper enumerates the variability sources: shared resources (other processes on the same machine, network paths, disks), daemons (logging, monitoring, garbage collection), per-server bursts (large requests interleaving with small ones, queueing delay from co-located tenants), and machine-to-machine heterogeneity (slower hardware, thermal throttling, soft errors).

It then categorises the responses. Reducing variability means addressing the sources directly: fewer co-located tenants, smaller GC pauses, eliminating coordinated batch jobs. Latency-aware request distribution means hedging or tying requests. A hedged request goes to one replica first; if no response arrives within a short delay, a copy goes to a second replica, the client uses whichever response arrives first, and the outstanding copy is cancelled. A tied request is enqueued on two replicas at once, each copy tagged with the identity of the other; the replica that starts executing first tells the other to cancel its copy. Information-aware scheduling means picking the replica with the lowest expected queueing delay based on observed load.
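A client-side hedge can be sketched with `asyncio`. This is a minimal illustration of the delayed variant, in which the second copy is sent only if the first has not answered within a hedge delay. The function name `hedged_call`, the `replicas` parameter and the delay value are assumptions made for the example; it does not reproduce any particular library's API, and error handling is deliberately left out.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def hedged_call(
    replicas: Sequence[Callable[[], Awaitable[str]]],
    hedge_delay_s: float,
) -> str:
    """Call replicas[0]; if it has not answered after hedge_delay_s,
    also call replicas[1]. Return the first result and cancel the loser."""
    primary = asyncio.ensure_future(replicas[0]())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        # Common case: the primary answered before the hedge timer fired.
        return primary.result()

    # Slow case: send the backup and race the two outstanding calls.
    backup = asyncio.ensure_future(replicas[1]())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop spending work on the slower copy
    return done.pop().result()
```

Cancelling the losing call only helps if the server actually stops working on it, so a real deployment would need cancellation to propagate to the replica.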

The hedged-request math. Hedging works because the tail is rare: most requests finish quickly, so only the few still outstanding after the hedge delay need a second copy, and that copy is unlikely to land in the tail as well. The paper's example is a Google benchmark that reads the values for 1,000 keys stored in a BigTable table spread across 100 servers: sending a hedging request after a 10 ms delay reduces the 99.9th-percentile latency from 1,800 ms to 74 ms while sending just 2% more requests. More generally, if the hedge timer is set at the 95th-percentile expected latency, only about 5% of requests get duplicated. The cost of the duplicated requests is small compared to the latency win.
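The trade-off between extra load and tail reduction can be checked with a small simulation. The sketch below assumes the hedge delay is set at the observed 95th percentile and uses a made-up latency distribution (5 to 15 ms normally, with 2% of calls stalling for 200 to 1000 ms). It also assumes the second copy's latency is independent of the first. None of these numbers come from the paper.

```python
import random


def sample_latency_ms(rng: random.Random) -> float:
    """Hypothetical service: 5-15 ms normally, 2% of calls stall 200-1000 ms."""
    if rng.random() < 0.02:
        return rng.uniform(200.0, 1000.0)
    return rng.uniform(5.0, 15.0)


def percentile(values, q: float) -> float:
    """Simple nearest-rank style percentile for 0 <= q < 1."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def simulate(trials: int = 20000, seed: int = 7) -> dict:
    rng = random.Random(seed)
    plain = [sample_latency_ms(rng) for _ in range(trials)]
    hedge_delay = percentile(plain, 0.95)  # hedge only the slowest ~5%

    hedged, extra_requests = [], 0
    for first in plain:
        if first <= hedge_delay:
            hedged.append(first)  # answered before the hedge timer fired
        else:
            extra_requests += 1
            second = sample_latency_ms(rng)  # independent second copy
            hedged.append(min(first, hedge_delay + second))

    return {
        "hedge_delay_ms": hedge_delay,
        "extra_load": extra_requests / trials,
        "p99_plain_ms": percentile(plain, 0.99),
        "p99_hedged_ms": percentile(hedged, 0.99),
    }


if __name__ == "__main__":
    print(simulate())
```

In this toy model the extra load stays near 5% by construction, while the p99 drops from the stall range to roughly the hedge delay plus one normal response.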

Contributions

Named the problem. The paper gave systems engineers a shared vocabulary for tail latency and made its central point hard to miss: p99 latency at one node becomes roughly p50 latency at an aggregator that fans out to 100 nodes.

Hedged requests. The technique of sending a duplicate request to a second replica, usually only after a short delay, and using the first response. Variants are now available in gRPC, Envoy, the AWS SDK, and most service meshes.

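Tied requests. A refinement of hedging in which copies of the request are enqueued on two servers at once, each copy tagged with the identity of the other. The server that starts executing first cancels the copy on its twin. This removes the hedge delay while keeping duplicated work low.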
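One way to picture cross-server cancellation for tied requests is the single-process model below. It is a sketch under simplifying assumptions: the class and function names are hypothetical, and a shared `claimed_by` field stands in for the cancellation message that real servers would exchange over the network, where a short window exists in which both copies can start.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TiedRequest:
    request_id: str
    claimed_by: Optional[str] = None  # server that started executing it


@dataclass
class Server:
    name: str
    queue: deque = field(default_factory=deque)
    executed: list = field(default_factory=list)

    def step(self) -> Optional[str]:
        """Run the next request that the twin server has not already claimed."""
        while self.queue:
            req = self.queue.popleft()
            if req.claimed_by is None:
                # Claiming stands in for sending "cancel" to the twin server.
                req.claimed_by = self.name
                self.executed.append(req.request_id)
                return req.request_id
            # The twin started this request first: drop the cancelled copy.
        return None


def enqueue_tied(req: TiedRequest, a: Server, b: Server) -> None:
    """Enqueue the same request on both servers at once."""
    a.queue.append(req)
    b.queue.append(req)
```

If server `a` has a backlog and server `b` is idle, `b` picks the request up immediately and `a` later discards its copy without doing the work.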

Variability sources catalogue. The paper's enumeration — GC, daemons, queueing, OS, hardware — is the checklist every SRE goes through when investigating a tail-latency spike. The catalogue legitimised treating these as first-class engineering concerns rather than "noise".

The fan-out latency model. The 1−(1−p)ⁿ formula is now standard in capacity planning. Once you see it, you can't un-see why microservice architectures with deep call graphs need much tighter per-service latency budgets than monoliths.
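For capacity planning the formula is often used in reverse: given a target at the aggregator, how rare must a slow response be at each leaf? The sketch below inverts 1−(1−p)ⁿ under the same independence assumption; the function name is an illustrative choice.

```python
def required_leaf_slow_prob(target: float, n: int) -> float:
    """Largest per-leaf probability p of exceeding latency L such that
    the fan-out probability 1 - (1 - p)**n stays at or below `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / n)


if __name__ == "__main__":
    # To keep slow requests at 1% with a fan-out of 100, each leaf may be
    # slow only about 0.01% of the time, i.e. a per-leaf p99.99 budget.
    print(required_leaf_slow_prob(0.01, 100))
```

This is why a latency target that looks generous at the aggregator turns into a much stricter percentile at every leaf.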

Criticisms and limitations

The paper assumes parallel fan-out. In a sequential call graph (a microservice chain) the end-to-end latency is the sum of the hops rather than the maximum, and the prescriptions differ: hedging can shorten an individual hop, but it cannot overlap calls when the next one depends on the previous one's result.

Hedged requests double the load on the slow path. In a heavily-loaded system, hedging can cascade — every slow request hedges, increasing load, slowing more requests. Production systems need feedback loops (circuit breakers, adaptive hedge thresholds) to keep hedging from amplifying outages.
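One simple guard against hedging cascades is a budget that caps hedges at a fixed percentage of primary requests. The class below is a hypothetical sketch with cumulative counters and integer arithmetic; a production version would more likely use a sliding window or token bucket, and the names are assumptions for illustration.

```python
class HedgeBudget:
    """Permit hedges only while they stay within max_percent of primaries."""

    def __init__(self, max_percent: int) -> None:
        self.max_percent = max_percent
        self.primaries = 0
        self.hedges = 0

    def record_primary(self) -> None:
        """Count one primary (non-hedge) request."""
        self.primaries += 1

    def try_hedge(self) -> bool:
        """Return True and consume budget if one more hedge is allowed."""
        if (self.hedges + 1) * 100 <= self.max_percent * self.primaries:
            self.hedges += 1
            return True
        return False  # budget exhausted: wait for the primary instead
```

When the whole system slows down and every request crosses the hedge threshold, the budget keeps the added load bounded at the configured percentage instead of doubling it.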

The paper offers techniques but no general theory of when to apply each. The choice between "reduce variability" (expensive engineering work) and "hedge" (cheap but doubles tail load) is left to judgement.

Where it shows up today

gRPC, Envoy, Istio and Linkerd offer hedging or closely related retry policies, in some cases with adaptive thresholds.

BigQuery and Spanner both use hedged requests internally to absorb tail latency from individual storage shards.

Cassandra and DynamoDB — both use "speculative retry" (their name for hedging) to mask GC pauses and stragglers.

The CNCF SLO documentation, the Google SRE books' treatment of latency, and the tail-latency dashboards in modern observability products all reflect this paper's framing, and many cite it directly.

Follow-up reading

Maglev: A Fast and Reliable Software Network Load Balancer — Eisenbud et al · 2016 · NSDI. A load balancer that explicitly cares about p99. Annotated.
The Datacenter as a Computer — Barroso, Clidaras & Hölzle · 2013. Barroso's longer book-length treatment of warehouse-scale computing.
Heracles: Improving Resource Efficiency at Scale — Lo et al · 2015 · ISCA. Google's system for co-locating latency-critical and batch workloads while protecting tail latency.
Bringing the Web up to Speed with WebAssembly — Haas et al · 2017 · PLDI. Tangential, but their predictable-runtime story is a reaction to similar tail-latency concerns in browsers.
Borg, Omega, and Kubernetes — Burns et al · 2016 · CACM. Container scheduling that has to manage tail latency through resource isolation. Annotated.
