Latency budgets & percentiles

Latency is a distribution, not a number. The methods that work in production treat it that way: pick a P99 deadline; allocate it across the layers; track each layer's contribution; treat coordinated-omission load tests as the lying tools they are. The interactive widget below shows the deadline cascade live; the rest of this page explains the percentile traps and operational details that catch every team eventually.

Try it — drag the budgets

A realistic request chain: edge, gateway, auth, cache, service, database, fan-out. Pick a top-level deadline. Allocate per-layer budgets. Watch the bar turn red when the sum overflows. Note the P99-of-N callout — that's the percentile-composition trap, made visible.

[Interactive widget: latency budget tree. Each layer eats from the request's deadline; the bar overflows when the layers exceed it.]

Top-level deadline (P99): 200 ms

Layer	Per-layer budget (serial sum)
edge / CDN	5 ms
API gateway	10 ms
auth check	8 ms
cache lookup	3 ms
service A	30 ms
database	20 ms
service B (parallel)	50 ms

Widget readout: TIGHT, 146 ms / 200 ms (73%).

P99-of-N trap: 7.7% of requests will hit P99 latency somewhere in the chain (8 sequential calls, each with a 1% chance of being slow). The P99 of the whole request is not the sum of the P99s; it is set by the chance that any one of the calls missed.

Try this. Drop the top deadline to 50 ms — half the layers turn red. Bump database calls from 1 to 5 — independent-call P99 trap rises. Real designs route the chain so fewer hops happen on the hot path; that's where these mental models earn their keep.

Why you set a budget at all

A latency budget is a single promise written as a number: "p99 of this request stays under 200 ms." It exists because without it, latency is everybody's problem and nobody's. The edge team trims 5 ms, the database team adds 10 ms for a new index, a third team bolts on a fraud check, and three releases later the page feels slow with no single change to blame. The budget turns a vague quality into an account. Each layer gets a line item, and any change that overspends its line shows up as a debit, not a mystery.

The target itself is a product decision, not an engineering one. Below roughly 100 ms a response feels instant; somewhere past 300 ms the user notices the wait; past a second they start to suspect something broke. Pick the number from the experience you owe the user, then work backwards. A 200 ms p99 for an API that a browser calls on every keystroke is tight but reasonable. The same 200 ms for a nightly batch report is absurd. The budget only means something once it is tied to a real interaction and a real percentile.

Once the top-level number exists, you decompose it. A request rarely does one thing; it walks a chain of hops — edge, gateway, auth, cache, the service itself, a database, maybe a fan-out to a dozen downstream services. The budget has to be split across that chain so the parts sum to less than the whole, with slack left over for the network and the scheduler and the surprises. Decomposition is the act of handing each hop its slice and holding it to that slice.

A 200 ms p99 split into line items. The dashed slack box is deliberate headroom for the network, the scheduler, and bad days.

Two habits make decomposition honest. Always leave a slack line (15 to 25% of the total), because the network, queueing delay, and garbage-collection pauses are real and do not announce themselves. And size each line from a measurement, not a hope. If the database p99 is already 80 ms, writing 30 ms on its line does not make it faster; it just guarantees the budget is a fiction the first time you check it. The widget above lets you push these numbers around and watch the total go red the moment the line items overflow.
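The same check is easy to run outside the widget. A minimal sketch, assuming illustrative line items and a hypothetical helper named `check_budget`; the 15% minimum slack is the low end of the range suggested above, and real line items should come from measured per-hop p99s:

```python
def check_budget(deadline_ms, line_items_ms, min_slack_fraction=0.15):
    """Return (spent_ms, slack_ms, ok) for a serial chain of per-hop p99 budgets."""
    spent = sum(line_items_ms.values())
    slack = deadline_ms - spent
    # The budget is honest only if the leftover slack meets the minimum.
    ok = slack >= min_slack_fraction * deadline_ms
    return spent, slack, ok


# Illustrative line items in milliseconds (assumed values, not measurements).
budget = {
    "edge": 5, "gateway": 10, "auth": 8, "cache": 3,
    "service": 30, "database": 20, "fan_out": 50,
}

spent, slack, ok = check_budget(200, budget)
print(f"spent={spent} ms, slack={slack} ms, ok={ok}")
```

Running a check like this in CI against the measured numbers turns an overspent line into a failed build instead of a slow page three releases later.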

Latency is a distribution

Most monitoring dashboards show the average. The average is the worst summary of latency that exists — it's pulled around by the tail and tells you nothing about the experience of any actual user.

Real systems have a distribution. P50 (median) tells you the typical experience. P95 catches "most users". P99 catches the user-visible tail — the one they remember. P99.9 catches the one that breaks the SLO. Track all four; the gap between them is the "tail factor", and the tail factor is the lever you tune.

Percentile	What it tells you	Common tail factor (P_n / P50)
P50	Median. The typical experience.	1×
P95	"Most users." Catches GC pauses, head-of-line.	2–3×
P99	The user-visible tail. SLO target for most products.	4–10×
P99.9	Catches the worst 1-in-1000. Where outage budgets get spent.	10–50×
P99.99	Almost a paranoid-mode metric. Useful for very-large fleets.	20–200×

Why averages lie. Imagine 1,000 requests. 980 take 50 ms, 20 take 5 seconds. Average: ~150 ms. P50: 50 ms. P99: 5,000 ms. The average describes no request that actually happened, and it buries the problem the tail makes visible.
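The effect is easy to reproduce. A minimal sketch, assuming 1,000 synthetic samples (980 fast, 20 slow) and a nearest-rank helper named `percentile`, which is an illustrative function and not a library call:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [50] * 980 + [5000] * 20   # synthetic data, assumed for illustration
mean_ms = sum(latencies_ms) / len(latencies_ms)

print("mean:", mean_ms)
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
```

The mean lands near 150 ms, a latency no request in the sample actually had, while p50 and p99 each return a value some request really experienced.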

The picture below is the whole argument in one frame. The bulk of requests sit in a tight cluster around the median; a thin, long tail stretches far to the right. The mean lands somewhere in the empty space between the two — describing no actual request. The percentiles, by contrast, each point at a real spot on the curve: p50 sits under the hump, p99 sits out near the start of the tail, p99.9 sits deep inside it. That gap between p50 and p99 is the number that should drive your work, because it is the difference between the experience you measure and the experience your slowest users actually get.

A right-skewed latency curve. The mean sits in no-man's-land; the percentiles each name a real point. The work lives in the gap between p50 and the tail.

Tracking the tail also forces you to store latency as a histogram, not a running average. You cannot recover a percentile from a mean and a standard deviation — the information is gone the moment you average. Production systems keep the shape of the distribution in a compact sketch such as HdrHistogram or t-digest, which hold thousands of buckets in a few kilobytes and answer "what is p99.9 over the last five minutes" without re-reading the raw data. The cost is small; the payoff is that the tail stays visible instead of being smoothed into a comforting lie. This is the same instinct that runs through logs, metrics, and traces: keep the distribution, not just the summary, because the summary is where the interesting failures hide.
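To make "keep the shape" concrete, here is a toy histogram with geometrically spaced buckets. It is a sketch of the idea only: the class name `LogHistogram` and its methods are hypothetical, and it is not the HdrHistogram or t-digest API. The useful property it shares with them is that histograms from different hosts can be merged, which percentiles cannot.

```python
import math
from collections import Counter


class LogHistogram:
    """Toy latency histogram; each bucket is about 5% wider than the previous one."""

    def __init__(self, growth=1.05):
        self.growth = growth
        self.counts = Counter()
        self.total = 0

    def record(self, latency_ms):
        # Bucket i covers [growth**i, growth**(i + 1)); values under 1 ms share bucket 0.
        index = math.floor(math.log(max(latency_ms, 1.0), self.growth))
        self.counts[index] += 1
        self.total += 1

    def merge(self, other):
        # Merging is just adding bucket counts, so per-host histograms combine exactly.
        self.counts.update(other.counts)
        self.total += other.total

    def percentile(self, p):
        """Upper bound of the bucket that holds the p-th percentile sample."""
        if self.total == 0:
            raise ValueError("histogram is empty")
        target = math.ceil(p * self.total / 100)
        seen = 0
        for index in sorted(self.counts):
            seen += self.counts[index]
            if seen >= target:
                return self.growth ** (index + 1)


host_a, host_b = LogHistogram(), LogHistogram()
for _ in range(980):
    host_a.record(50)
for _ in range(20):
    host_b.record(5000)
host_a.merge(host_b)
print(round(host_a.percentile(50), 1), round(host_a.percentile(99), 1))
```

The answers are bucket upper bounds, so they carry up to about 5% relative error; that is the trade the real sketches make too, with far better engineering.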

Latency numbers every engineer should know

You cannot decompose a budget you cannot estimate, and you cannot estimate a hop without a rough feel for what each kind of work costs. The table below is the modern version of Jeff Dean's "numbers everyone should know" — order-of-magnitude figures, not exact ones. They are the raw inputs to every budget. If a line item claims a cross-region database read in 5 ms, the numbers tell you instantly that the line is wrong, because the speed of light alone makes a round trip across a continent take tens of milliseconds before any work happens.

Operation	Rough time	What it means for a budget
L1 cache reference	~1 ns	Free. Never a budget line.
Main memory reference	~100 ns	Still free at request scale.
Read 1 MB sequentially from memory	~3 µs	Cheap; in-process work rarely shows up.
SSD random read	~16 µs	About 60 of these fit in a millisecond when issued one after another.
Round trip within a datacenter	~0.5 ms	The floor for any same-region service call.
Read 1 MB sequentially from SSD	~1 ms	Real money once you do it per request.
Disk seek (spinning)	~10 ms	Avoid on the request path entirely.
Round trip across regions (e.g. US ↔ EU)	~80–150 ms	One of these can blow a 200 ms budget alone.

The lesson hidden in the table is that the expensive things are the ones that cross a boundary: a network hop, a disk, a region. In-process computation is almost always cheap by comparison, which is why the right first move when a budget is tight is rarely "optimize the code" and usually "remove a hop" — collapse two services into one, add a cache so the database round trip disappears, or move the data closer so the cross-region trip becomes a same-region one. The numbers also explain why the gap between p50 and p99 exists at all: the median request hits the cache and skips the slow path, while the tail request misses, falls through to disk or a far region, and pays the full cost. Many slow-path costs are baked into physics, so the only way to keep them out of the tail is to keep requests off them.

The deadline-propagation rule

Per-hop timeouts do not add up safely. Service A times out at 1 s; A calls B with 500 ms; B calls C with 200 ms. Looks fine. Now a request reaches B with only 100 ms left before the user gives up, and B still waits out its full 500 ms: work continues for 400 ms past the point where anyone wants the answer.

The fix is deadline propagation: pass the absolute deadline (epoch ms), not a relative duration. Each service computes its remaining budget from the deadline minus its local clock and short-circuits if there's not enough.

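import time

class DeadlineExceeded(Exception):
    """Raised when too little budget remains to start the work."""

def now_ms():
    return time.time() * 1000.0          # epoch milliseconds

# Caller (gateway)
def gateway(req):
    deadline = now_ms() + 1000           # 1 s SLA, as an absolute time
    return service_b(req, deadline)      # propagate

# Service B
def service_b(req, deadline):
    remaining = deadline - now_ms()      # ms left when B starts
    if remaining < 50:                   # not enough budget for B's work
        raise DeadlineExceeded("B")      # short-circuit; don't even try
    return service_c(req, deadline)      # propagate the same deadline

# Service C
def service_c(req, deadline):
    remaining = deadline - now_ms()
    if remaining < 5:
        raise DeadlineExceeded("C")
    return f"result for {req}"           # do work; return result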

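gRPC supports this natively through deadlines attached to the call context. On the wire it sends the remaining time (the grpc-timeout header) and each hop rebuilds a local deadline, which sidesteps clock skew between hosts. Plain HTTP has no standard deadline header, so services carry one in a custom request header such as X-Deadline; W3C traceparent carries trace context only and has no deadline field. If you pass an absolute epoch time, the hosts' clocks have to be synchronized closely enough for the comparison to mean something. Manual plumbing is error-prone, so use the framework's mechanism if it has one.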

This connects directly to back-pressure, retries, hedging, deadlines — deadlines are one of the four primitives that have to work together, and they're the one that makes retries safe. A retry policy that fires five attempts past the deadline is amplifying load that the user has already given up on.
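A retry loop that respects the deadline looks roughly like this. It is a sketch under stated assumptions: `call_with_retries`, `TransientError`, and the 50 ms minimum budget are hypothetical names and values, and the deadline is expressed on the local monotonic clock (converted from whatever deadline was propagated in).

```python
import time


class DeadlineExceeded(Exception):
    """Too little time is left to start another attempt."""


class TransientError(Exception):
    """A failure that is worth retrying (hypothetical marker type)."""


def call_with_retries(attempt, deadline_s, min_budget_s=0.05,
                      max_attempts=5, clock=time.monotonic):
    """Call attempt(remaining_s) until it succeeds, attempts run out, or the deadline is too close."""
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline_s - clock()
        if remaining < min_budget_s:
            # The caller has effectively given up; another attempt is pure load.
            raise DeadlineExceeded("no budget left for another attempt") from last_error
        try:
            return attempt(remaining)    # pass the remaining budget down as the timeout
        except TransientError as error:
            last_error = error
    raise last_error
```

The important line is the check before each attempt: the retry count is an upper bound, and the deadline is what actually decides whether the next attempt happens.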

P99-of-N is not P99

A request that calls N independent dependencies, each with P99 latency of L ms, does not have a P99 of L. It has a P99 of approximately "the latency at which any of the N is in its tail" — which converges to something much higher than L as N grows.

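The math: each dependency has a 1% chance of being slow. With N independent calls, the chance that at least one is slow is 1 − (1 − 0.01)^N. At N = 10, that's ~9.6%. At N = 100, it's ~63%. Turned around: for the whole request to hold a P99, every call has to stay inside its own 0.99^(1/N) quantile, which is about P99.9 at N = 10 and P99.99 at N = 100. The request's tail is set by the deep tails of its dependencies, not by their P99s.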

N (parallel/serial calls)	P(any in tail)	Implication
1	1%	Per-call P99 latency shows up at request ≈ P99
5	~5%	Per-call P99 latency shows up at request ≈ P95
10	~10%	Per-call P99 latency shows up at request ≈ P90
20	~18%	Per-call P99 latency shows up at request ≈ P82
100	~63%	Per-call P99 latency sits below the request's median (≈ P37)
1000	~99.99%	Almost every request sees a tail call. Reduce fan-out or hedge.
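The table's numbers can be checked in a few lines. A minimal sketch, assuming the calls are independent and that each has the same 1% chance of landing in its tail; the function names are illustrative:

```python
def p_any_in_tail(n, per_call_tail=0.01):
    """Chance that at least one of n independent calls lands in its own tail."""
    return 1 - (1 - per_call_tail) ** n


def per_call_quantile_needed(n, request_quantile=0.99):
    """Quantile every one of n independent calls must meet for the request to meet request_quantile."""
    return request_quantile ** (1 / n)


for n in (1, 5, 10, 20, 100, 1000):
    print(f"N={n:<5} P(any in tail)={p_any_in_tail(n):6.2%}  "
          f"per-call quantile needed={per_call_quantile_needed(n):.5f}")
```

The second function is the one to budget from: it says how deep into each dependency's tail you have to look before the request as a whole can claim a P99.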

The "Tail at Scale" insight. Dean & Barroso's 2013 CACM paper coined this — it's why Google fan-out searches use hedged requests. With fan-out of 50, P99 of the slowest dependency wipes out the request's P99 budget; the only mitigation is to launch redundant copies and take whichever finishes first.

The widget above's "P99-of-N trap" panel computes this live. Bump the database call count to 5 and the percentage rises by roughly half (7.7% at 8 calls becomes about 11% at 12). Add layers and it climbs further. This is why systems with deep call chains have to either (a) reduce N, (b) tighten each call's tail so that its P99.9, not just its P99, fits the budget, or (c) hedge.

There is a deeper reason the tail behaves this way, and it lives in queueing theory. Latency is not a fixed property of a service; it is a function of how busy the service is. As utilization climbs toward 100%, queueing delay does not rise linearly — it explodes. A server at 50% utilization has short queues and a tight tail; the same server at 90% has long, bursty queues and a tail that is several times worse. So the tail you measure under light load is not the tail you get under peak load, and a budget built from quiet-hour numbers will fail exactly when it matters. The practical consequence is that you size for headroom: keep each hop comfortably below saturation so its tail stays bounded, because the last 10% of utilization buys you the worst latency of the whole curve.
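The shape of that curve shows up even in the simplest textbook model. The sketch below assumes an M/M/1 queue (Poisson arrivals, exponential service times, one server), where mean time in the system is S / (1 − ρ) and the response time is exponentially distributed. Real services are not M/M/1, so read the output as the shape of the curve, not as a prediction; the function names are illustrative.

```python
import math


def mm1_mean_ms(service_ms, utilization):
    """Mean time in system (queueing plus service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1 - utilization)


def mm1_percentile_ms(service_ms, utilization, p=0.99):
    """p-quantile of M/M/1 response time, which is exponential with the mean above."""
    return -math.log(1 - p) * mm1_mean_ms(service_ms, utilization)


for rho in (0.5, 0.7, 0.9, 0.95):
    print(f"utilization={rho:.0%}  mean={mm1_mean_ms(10, rho):6.1f} ms  "
          f"p99={mm1_percentile_ms(10, rho):7.1f} ms")
```

With a 10 ms service time, going from 50% to 90% utilization multiplies both the mean and the p99 by five in this model, which is the "last 10% buys the worst latency" effect in numbers.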

Coordinated omission — why most load tests lie

Closed-loop load tests issue a request, wait for the response, then issue the next. If a request takes 5 seconds (because the system is briefly overloaded), the client sits idle for those 5 seconds — and never measures the requests that would have arrived during the stall.

This is "coordinated omission". The reported P99 looks great because the worst latencies were silently dropped from the dataset. Reality: the user experience is much worse than the report.

Tool	Loop type	Coord-omission corrected?
`wrk`	Closed	No — reports rosy P99
`wrk2`	Open (constant arrival rate)	Yes — Gil Tene's correction
`vegeta`	Open	Yes
`k6`	Open (constant-arrival-rate scenarios)	Yes when configured for open-loop
`locust`	Closed by default	No — needs careful configuration
`JMeter`	Closed by default	No

The rule: if the tool says "users" or "concurrency" instead of "arrival rate", it's probably closed-loop and probably under-reporting tail latency. Open-loop with a fixed arrival rate (e.g. vegeta attack -rate=10000 -duration=60s, or a k6 scenario using the constant-arrival-rate executor) is the only configuration that produces honest tail numbers.
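A small deterministic simulation makes the gap visible. It is a toy under loud assumptions: a made-up server that answers in 10 ms except for one 5-second stall, a single closed-loop client, and an open-loop client on a fixed 10 ms schedule. None of the names below come from a real load-testing tool.

```python
import math

BASE_MS, STALL_START_MS, STALL_END_MS, DURATION_MS = 10, 1_000, 6_000, 10_000


def completion_ms(send_ms):
    """Toy server: 10 ms per request, but anything sent during the stall waits for it to end."""
    start = STALL_END_MS if STALL_START_MS <= send_ms < STALL_END_MS else send_ms
    return start + BASE_MS


def nearest_rank(samples, p):
    ordered = sorted(samples)
    return ordered[max(1, math.ceil(p * len(ordered) / 100)) - 1]


def closed_loop():
    """One client; the next request is sent only after the previous response arrives."""
    latencies, t = [], 0
    while t < DURATION_MS:
        done = completion_ms(t)
        latencies.append(done - t)
        t = done                      # the client sits idle through the whole stall
    return latencies


def open_loop(interval_ms=10):
    """Requests leave on a fixed schedule, whether or not earlier ones have returned."""
    return [completion_ms(t) - t for t in range(0, DURATION_MS, interval_ms)]


closed, opened = closed_loop(), open_loop()
print("closed-loop:", len(closed), "samples, p99 =", nearest_rank(closed, 99), "ms")
print("open-loop:  ", len(opened), "samples, p99 =", nearest_rank(opened, 99), "ms")
```

Same server, same stall: the closed-loop client records 500 samples with a single slow one and reports a p99 of 10 ms, while the open-loop client records all 1,000 scheduled requests and reports a p99 of 4,910 ms.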

Read this once. Gil Tene's "How NOT to Measure Latency" (YouTube — Oracle, 2015) is the talk that pushed coordinated-omission awareness into the mainstream. 45 minutes; the most-recommended SRE talk of the decade.

Hedging — the production fix for tail latency

When a single dependency's tail dominates the request's tail, hedging is the lever. Send the request to a second replica after a short delay; return whichever responds first; cancel the loser.

Property	Why it matters
Hedge delay = P95 of the call	Hedging at P50 fires on half the requests; that's 1.5× QPS forever. Hedging at P95 fires on about 5%. Hedge at the tail.
Idempotent operations only	Hedging non-idempotent operations is a duplicate-write bug.
Cancelable	The loser must be cancelable; otherwise hedging just doubles work.
Budgeted	Cap hedge rate at 5–10% of total RPS. Past that, drop. Otherwise hedging amplifies under load.

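With those four constraints, hedging at the P95 adds at most about 5% extra requests and can cut tail latency sharply. The example in "The Tail at Scale" is a BigTable benchmark in which a hedge sent after 10 ms cut 99.9th-percentile latency from 1,800 ms to 74 ms while sending about 2% more requests. See back-pressure and retries for the broader treatment.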
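The core of a hedged call fits in a dozen lines of asyncio. This is a sketch, not a production client: `hedged` is a hypothetical helper, `call` is assumed to be an idempotent zero-argument coroutine function, and the hedge-rate cap from the table is left out for brevity.

```python
import asyncio


async def hedged(call, hedge_delay_s):
    """Await call(); if it has not answered after hedge_delay_s, race it against a second copy."""
    primary = asyncio.create_task(call())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()           # fast path: no hedge was needed

    backup = asyncio.create_task(call())  # the hedge fires only for slow requests
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                     # cancel the loser so it stops consuming work
    return next(iter(done)).result()
```

Setting `hedge_delay_s` to the call's measured P95 keeps the extra traffic near 5%. Note that if the first task to finish failed, this sketch re-raises its error instead of waiting for the other copy; a real client would have to decide which behaviour it wants.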

The rest of the tail-defence toolkit

Hedging is the headline technique, but it is one of several that work together, and each addresses a different shape of tail. The point of all of them is the same: stop a slow component from spending budget the request no longer has.

Timeouts set from the budget, not from a round number. A timeout is the hard ceiling on your tail. If a call has a 35 ms line item, its timeout belongs somewhere near its p99.9, not at a comfortable 1,000 ms that lets a single stuck call burn the entire request. Set the timeout from the budget line, derive it from the deadline that was propagated in, and the tail can never run longer than you allocated. A too-generous timeout is the single most common reason a p99 budget quietly becomes a p99 fantasy.
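In code, that rule is a `min` over two numbers. A minimal sketch, assuming a hypothetical helper `call_timeout_ms` and an assumed 1.5× headroom factor standing in for "somewhere near p99.9"; the right factor comes from the call's measured distribution:

```python
def call_timeout_ms(line_item_ms, deadline_ms, now_ms, headroom=1.5):
    """Timeout for one call: a small multiple of its budget line, never more than the time left."""
    remaining = deadline_ms - now_ms
    if remaining <= 0:
        raise TimeoutError("deadline already passed")
    return min(line_item_ms * headroom, remaining)


# Plenty of time left: the budget line decides.
print(call_timeout_ms(35, deadline_ms=1_200, now_ms=1_000))
# Only 20 ms left: the propagated deadline decides.
print(call_timeout_ms(35, deadline_ms=1_020, now_ms=1_000))
```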

Concurrency limits ahead of the slow thing. When a dependency slows down, naive callers pile more and more in-flight requests onto it, which makes it slower still — the queueing-theory death spiral. A concurrency limit (a semaphore, a bounded thread pool, or an adaptive limiter that watches latency) caps the in-flight count so the dependency stays in the part of its curve where the tail is bounded. Past the cap you shed load fast instead of letting every request share the misery. This is the same idea as back-pressure, applied locally to keep one slow hop from poisoning the whole budget.
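The simplest form of such a limit is a semaphore that refuses to wait. A sketch with hypothetical names (`ConcurrencyLimit`, `Overloaded`) and a fixed cap; an adaptive limiter would adjust the cap from observed latency instead:

```python
import threading


class Overloaded(Exception):
    """Raised when the in-flight cap is reached and the call is shed."""


class ConcurrencyLimit:
    """Caps in-flight calls to a dependency and sheds the excess immediately."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: shed now, so callers never queue behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("in-flight limit reached")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

A caller catches `Overloaded` and goes straight to its fallback, which keeps the rejection cost close to zero and leaves the dependency in the part of its curve where the tail is bounded.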

Backups and fallbacks for when the fast path is gone. If the primary path blows its budget, a cheaper answer beats a late one. Serve a slightly stale cache entry instead of waiting on the database. Return the top results you have instead of the complete set. Drop the optional re-ranking step under load. Each of these trades a little quality for a bounded tail, and on the request path a bounded wrong-ish answer usually beats a perfect answer that arrives after the user gave up. The decision is a product one, but the budget is what tells you when to make it.

Load shedding at the door. When the system is past the point where it can meet the budget for everyone, the kindest thing it can do is refuse some requests immediately so the rest stay fast. Shedding the lowest-priority traffic at admission keeps the served requests inside their budget instead of degrading everyone past the SLO. A request rejected in 1 ms is a better outcome than a request that times out at 200 ms after consuming work the whole way down the chain.

SLOs and error budgets

An SLO (Service Level Objective) is the threshold for "performant enough" — "99.9% of requests under 200 ms" or "99.99% successful per quarter". The SLO is what the product agreed to. The error budget is the difference between the SLO and 100%: how much risk you can spend.

SLO	Allowed downtime / quarter	Allowed downtime / year
99%	~22 hours	~3.65 days
99.9%	~2.2 hours	~8.8 hours
99.99%	~13 minutes	~52 minutes
99.999%	~78 seconds	~5.3 minutes

Error budget burned faster than expected is a deploy-freeze signal. Error budget under-spent is a "we should ship faster" signal. The economics make the contract concrete — the SRE Workbook (free from Google) is the canonical reference.
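The table and the burn signal both fall out of one subtraction. A minimal sketch with illustrative function names; `burn_rate` follows the usual definition of observed bad fraction divided by the allowed bad fraction, so 1.0 means spending the budget exactly on schedule:

```python
def allowed_bad_minutes(slo, window_days):
    """Minutes of SLO violation the error budget permits over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(bad_fraction_observed, slo):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return bad_fraction_observed / (1 - slo)


print(f"99.99% over a year: {allowed_bad_minutes(0.9999, 365):.1f} minutes")
print(f"0.2% bad requests against a 99.9% SLO: burn rate {burn_rate(0.002, 0.999):.1f}x")
```

A sustained burn rate well above 1 is the deploy-freeze signal described above; a rate well below 1 is the room to ship faster.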

Measuring the budget you set

A budget you do not measure is a wish. The measurement has to happen at the place the user actually waits, and it has to break down by hop so you can tell which line item overspent. Two instruments do most of this work, and they answer different questions.

Per-hop histograms tell you where the time goes in aggregate. Record p50, p95, p99, and p99.9 for each hop separately, and the decomposition stops being a planning document and becomes a live ledger: you can look at the dashboard and see that the database line is running at 95 ms against an 80 ms allocation, three weeks before it becomes an incident. Distributed traces tell you where the time goes in a single slow request. When one request blows the budget, a trace shows the actual span waterfall — which call waited, which one fanned out, which one sat in a queue — so you debug the specific tail event instead of guessing from averages. Histograms find the trend; traces find the cause. You want both, and logs, metrics, and traces is where that toolkit is laid out in full.

Measure from the client's vantage point, not just the server's. Server-side timing misses the network, the connection setup, the time spent queued before the server even accepted the request, and the coordinated-omission stalls described above. Real-user monitoring at the edge, or at least timing that starts when the request leaves the client, is the only number that matches what the user feels. And remember that the budget is a percentile, so you measure it as a percentile over a window — "p99 over the last five minutes" — never as a single slow sample or a rolling mean. The whole discipline only works if the thing you watch is the same shape as the thing you promised.

One last framing worth keeping nearby: latency and throughput are not the same axis, and tuning one can wreck the other. Batching, larger queues, and higher concurrency all raise throughput while pushing latency up, because they trade a longer wait for more work per unit time. A budget is a latency constraint, so it acts as a ceiling on how far you can chase throughput before the tail breaks the SLO. The trade-off is worth understanding on its own terms — see latency vs. throughput for the full picture.

Production checklist

Pick a P99 SLO. Not an average. The number you'd report to product.
Allocate the budget. Edge, gateway, auth, cache, service, database. Sum ≤ SLO. The widget above visualises this.
Propagate the deadline. Absolute time, not duration. Use the framework's mechanism (gRPC deadlines, or a deadline header for plain HTTP).
Short-circuit when remaining < work_estimate. Drop expired requests at queue dequeue; don't pay for work the user gave up on.
Hedge at the tail. For fan-out reads, hedge at P95 with budget cap. Idempotent only.
Measure with open-loop tools. wrk2, vegeta, or k6 in constant-arrival-rate mode.
Track P50 / P95 / P99 / P99.9 separately. Mean is the worst summary. Histograms (HDR or t-digest) preserve the tail.
Set an error budget. Tie deploy speed to it. Burn it down predictably; refresh quarterly.
