02 / 05

Principle / 02

Latency vs throughput

Latency is the time one request takes. Throughput is how many requests per second the system can sustain. They are two different numbers measuring two different things, and one equation ties them together — concurrency = throughput × latency. It shows up everywhere: thread-pool sizing, queue depths, "why is it slow?" investigations.

The numbers, separately

Start with the definitions, because most confusion comes from treating these two as if they were one thing.

Latency is the time a single operation takes, measured from when it starts to when it finishes. Wall-clock time per request. It has units of time: milliseconds, microseconds, seconds. Because every request takes a slightly different amount of time, latency is a distribution, not one number, and we describe it with percentiles. P50, P95, P99 are points on that distribution. "30 ms P99" means 99 out of 100 requests come back inside 30 ms; the slowest 1% take longer.

Throughput is the number of operations the system completes per unit of time. QPS, TPS, ops/sec, requests/minute, whatever the workload calls it. It has units of count-over-time: requests per second. "5K QPS" means five thousand operations finish every second, counted across the whole system.

The cleanest way to hold the two apart is the pipe analogy. Think of water moving through a pipe. Latency is how long it takes one molecule of water to travel from one end to the other — a property of length and speed. Throughput is how many litres come out the far end each second — a property of the pipe's width and the pressure behind it. A long, fat pipe has high latency and high throughput at the same time. A short, thin straw has low latency and low throughput. The two measure different geometry of the same pipe, and you can change one without touching the other.

Swap the pipe for a highway and the same split holds. Latency is the time it takes one car to drive end to end. Throughput is the number of cars that pass a given point each minute. Adding lanes raises throughput — more cars per minute — without making any single car arrive sooner. Raising the speed limit cuts each car's travel time without changing how many lanes feed through. Lanes and speed are independent dials, which is exactly why latency and throughput are independent numbers.

Which one matters more is the wrong question. They are coupled, and the rope tying them together is Little's Law.

They are not inverses of each other

The single most common mistake is to assume throughput is just one divided by latency. If each request takes 10 ms, the reasoning goes, the system must do 100 per second. That is only true for a system that processes exactly one request at a time, end to end, with nothing else going on. Almost no real system works that way.

Throughput equals one-over-latency only on a single, serial worker. The moment you have more than one worker, or any concurrency at all, the two numbers come apart. Ten workers each taking 10 ms per request give you 1,000 requests per second while every individual request still takes 10 ms. Latency did not move; throughput went up tenfold. That is the whole point of running things in parallel.

Each number can move on its own, in either direction, and the four cases are worth naming because interviews probe all of them:

Throughput up, latency flat. Add workers, add replicas, shard the data. More requests finish per second; each one still takes the same time. The default way to scale a stateless service.
Throughput up, latency up. Batch the work. You finish more per second by grouping, but each item waits for its batch to fill. This is the trade most "go faster" knobs actually make.
Latency down, throughput flat. Make each operation faster — a tighter algorithm, a closer cache, a shorter network path. One request returns sooner; the aggregate rate may not change if you were never the bottleneck.
Latency down, throughput down. Shrink batches and flush aggressively for responsiveness. Each item leaves quickly, but you pay more per-operation overhead and the total rate drops.

Because they move independently, "we made it faster" is an ambiguous claim. Faster per request, or faster in aggregate? Those are different changes, often with opposite effects on the other number, and a sharp engineer always says which one they mean.

Little's Law

John Little proved this in 1961, and it's the most-cited result in queueing theory:

L = λ × W

L = average number of items in the system (concurrency / in-flight)
λ = arrival rate (throughput, items per unit time)
W = average time an item spends in the system (latency)

It reads like accounting, and that's what it is. Serve 5,000 requests per second (λ) at 50 ms each (W = 0.05 s) and you have 5,000 × 0.05 = 250 requests in flight on average. Not an estimate, a count. The only assumption is that the system has reached steady state.

Three rearrangements you'll use constantly:

Sizing a thread pool or connection pool. To handle λ QPS at W latency, you need at least λW workers. Fewer than that and requests pile up at the door, which adds wait time on top of service time.
Sizing for a latency budget. Fixed worker count, fixed service time, max sustainable QPS is workers / W. Past that and the queue grows without bound.
Reasoning about backlogs. Backlog of N items, arrivals at λ, each new item waits roughly N / λ seconds. Useful when something's behind and you need to answer "when will it catch up?"

Two axes, not one slider

Because latency and throughput move independently, the honest way to picture a system is on two axes, not on a single good-to-bad slider. Latency runs up one side, throughput along the bottom. Every design choice lands somewhere on that plane, and the goal is rarely the corner — it is the region that fits the workload.

A workload occupies a region on the plane. Most systems pick one of the two near corners; the fast-and-high-rate corner is the expensive one.

Reading the plane is the start of every design conversation. An interactive service — a web request, an API call a human is waiting on — lives in the low-latency region and can usually accept modest throughput per node. A bulk job — a nightly export, a training run, a log roll-up — lives in the high-throughput region and does not care if any single item takes a second. The two near corners are cheap because each one tunes for a single number. The far corner, fast and high-rate at once, is where the money goes, and the rest of this page is about why.

Try it: Little's Law calculator

Pick a target QPS, an average service time, and how much concurrency you have. The calculator shows the in-flight count, headroom left, and how the effective latency moves as you approach saturation (M/M/1 approximation).

Target QPS (λ) 5,000 req/sec Service time (W) 50 ms Available concurrency 256 workers

In-flight (L = λW) 250.0

Utilisation (ρ) 97.7%

Sustainable QPS at this latency 5120

Expected queue wait +2083.3 ms

Effective latency (service + wait) 2133.3 ms

Capacity headroom 2%

Push utilisation past ~80% and the wait term takes off. The "knee" lives at utilisation, not at QPS.

The latency–throughput curve

Plot effective latency against throughput and you get the capacity curve every system follows. From zero to ~70% utilisation, latency is flat at the service time. From 70% to 80%, the wait term kicks in and latency starts to climb. From 80% upward, it goes hyperbolic. The code hasn't changed. The system just moved from fast-and-idle to fast-and-busy to slow-and-broken by being asked to do more.

That's why capacity planning targets 80% utilisation, not 95%. The 20% gap is what absorbs bursts without the latency curve blowing up. You can run hotter than that, but you're trading headroom for hardware savings, and on the day a spike arrives you'll feel it.

The wait term that drives the knee comes straight out of queueing theory, which gives the formal model behind the M/M/1 approximation used here. If you want to set per-service targets that keep you on the flat part of the curve, the latency budgets page turns these numbers into a budget you can hold a team to.

The tail gets worse under load, and that is the whole game

The curve above is drawn for average latency, and the average hides the part that hurts. As utilisation rises, the tail — P99, P99.9 — grows far faster than the median. A queue that adds 5 ms to the average can add hundreds of milliseconds to the slowest 1%, because the slow requests are the ones that arrive while the queue is already deep, and deep queues happen more often the busier you are.

This is why "the average looks fine" is a dangerous thing to say under load. Throughput can be climbing, the mean can be steady, and the P99 can be quietly tripling because the system is spending more of its time near saturation. The two numbers in the title hide a third: the spread between typical and worst, which is the first thing to blow up when you push throughput toward the ceiling. Watch the tail, not the mean, when you decide how close to the knee you are willing to run.

Batching is the lever

The clearest way to trade one number for the other is batching, and it is worth seeing exactly what it does. Process items one at a time and each one leaves as soon as it is done — low latency, but you pay the fixed per-operation overhead (a round trip, a syscall, a kernel launch) on every single item. Group items into a batch and you pay that overhead once for the whole group, so the rate climbs. The catch is that the first item in a batch has to wait for the rest of the batch to arrive before anything goes out. Its latency is now its own service time plus the time spent waiting for the batch to fill.

Same work, two policies. Batching pays the per-item overhead once and lifts the rate, but every item carries the fill-wait as added latency.

Buffering is the same idea applied to a stream. A buffer absorbs bursts and lets the consumer pull a chunk at a time instead of one element at a time, which raises the rate it can sustain. The data that lands in the buffer first sits there until the consumer comes around, so it pays a wait. Bigger buffer, more throughput smoothing, more worst-case latency for whatever sits at the bottom of it. Every "more throughput, at the cost of latency" technique is one of these two — batching or buffering — wearing a different hat:

Database batch writes. 100 inserts in one round-trip pushes throughput up 10–50x, but each insert now waits up to a batch interval before going out.
Nagle's algorithm. Wait briefly for more bytes before sending a TCP segment. Fewer packets, higher per-message latency.
Group commit (WAL flushing). Postgres and MySQL flush the write-ahead log every N ms instead of every commit. More commit throughput, more latency per commit.
Kafka producer batching. linger.ms literally says "wait this long for a fuller batch." A dial on the latency-throughput trade.
GPU inference batching. Group incoming requests so one matmul amortises the per-request overhead. Every production inference framework does this.

Going the other way — less latency, less throughput — is just smaller batches, fewer in-flight requests, more aggressive flushing. Same dial, turned the opposite direction.

Pipelining and concurrency: throughput for free, almost

Batching trades latency away to buy throughput. Pipelining and concurrency are the other family of moves — they raise throughput without lengthening any single operation. They are worth keeping separate in your head, because they pay a different price.

Concurrency runs many independent operations at the same time. Ten workers handling ten requests in parallel finish ten times as many per second, and each request still takes its own service time. No request waits for another. This is the cleanest throughput gain there is, and it is the default for stateless request handling. The price is not latency; it is resources. You need ten workers, ten connection slots, ten times the memory in flight. Little's Law sets the floor: to run λ requests per second at W latency you need at least λW things happening at once.

Pipelining splits one operation into stages and keeps every stage busy on a different item. While stage two works on item one, stage one is already working on item two. A CPU instruction pipeline, a GPU stream, a multi-stage data job, a chain of Kafka topics — all of them do this. The key fact is that pipelining does not shorten the journey of any single item. An item still passes through every stage, so its end-to-end latency is the sum of the stages, sometimes a touch more. What changes is that a new item comes out the end every time the slowest stage finishes, so the rate is set by the bottleneck stage, not by the full path.

This is the cleanest example of the two numbers moving independently. A four-stage pipeline where each stage takes 10 ms has a per-item latency of 40 ms and, once full, produces one item every 10 ms — 100 per second. Collapse it to a single 40 ms step and the latency is identical but the rate drops to 25 per second. Same latency, four times the throughput, just from overlapping the stages. The cost is the plumbing between stages and the fact that the pipeline only pays off once it is full, so it suits steady streams, not occasional one-off requests.

The takeaway: when someone says "we sped it up," ask whether they cut the work per item or overlapped more items. The first lowers latency. The second raises throughput and usually leaves latency alone or slightly worse. They are not the same optimisation, and confusing them leads to capacity plans that do not add up.

Two traps in interviews

Conflating average and tail latency. Little's Law works on averages. A service with a 10 ms median and an 800 ms P99 has an average somewhere in between — but users feel the P99. Read The Tail at Scale before any system-design interview that goes near scale.
Forgetting queueing. "100 QPS at 20 ms, so 2 in-flight, easy." Real arrivals are bursty. The worker count has to absorb the burst, not the average. Rule of thumb: size for 2–5x the steady-state in-flight count.

The arithmetic that connects them

Little's Law in different forms tells you everything about a system in steady state:

L = λW · concurrency = arrival-rate × time-in-system
throughput = concurrency / latency · the rearrangement that matters most
utilisation = arrival-rate × service-time / servers · the dimensionless number that predicts the curve

Three concrete uses you will reach for over and over:

Sizing a worker pool. "10K requests/sec, each takes 50ms, how many threads do I need?" 10,000 × 0.050 = 500 in-flight requests on average. With bursts, size for ~1500 worker slots. That is your starting point for a Tomcat maxThreads, a goroutine pool, an asyncio semaphore.

Sizing a connection pool. "Service does 500 DB queries/sec, each query takes 8ms." 500 × 0.008 = 4 in-flight DB connections in steady state. Pool of 10-20 is plenty. Pool of 200 wastes Postgres connections and hurts everyone.

Predicting the queue. "Workers process 100 items/sec; queue arrival rate is 95 items/sec." Utilisation is 95%. From M/M/1 theory, queue length grows as ρ/(1-ρ) = 19. Average wait time is service-time × ρ/(1-ρ) = 200ms. That is why 95% utilisation feels horrible.

The shape of the latency distribution

Latency is never normally distributed. The shape you actually get is heavy-tailed — most requests are fast, a long tail is much slower. Three things this means in practice:

Median tells you nothing useful about the worst case. A service with median 10ms can easily have p99 200ms and p99.9 of 2 seconds. The factor between median and tail varies by workload but is typically 20-100×.

The tail compounds across services. If a single service has p99 = 100ms and a request fans out to 10 services in parallel, the probability that at least one hits its p99 is ~10% per request. The user-perceived p99 of the composite call is roughly the p99.99 of any individual call. That is the "Tail at Scale" insight in one sentence.

Hedged requests fight the tail with redundancy. Send the same request to two replicas, take whichever returns first. p99 collapses dramatically — at the cost of doubling the load. Google's Bigtable and Spanner use this for read paths; the math works out when the tail is 10-100× the median, which it usually is.

What to monitor. p50 tells you average user experience. p99 tells you whether 1-in-100 users will complain. p99.9 tells you whether 1-in-1000 sessions hit a usability bug. Engineering teams that watch only p50 ship products that feel broken to the loudest 1% of users — that 1% is also disproportionately the power users.

Sources of tail latency, ranked

Where the long tail comes from, roughly in order of how often it dominates:

1. Queueing under load. The M/M/1 wait term. Utilisation above 70% sends the tail through the roof. Almost every "the p99 doubled overnight" incident is utilisation creep on some queue (worker pool, connection pool, kafka topic, file descriptor table).

2. Garbage collection pauses. JVMs, Go runtimes, Node V8 all stop-the- world periodically. Sub-millisecond pauses for nursery, multi-second pauses for full GCs under memory pressure. Tunable but never eliminated.

3. Network jitter. Single-flow TCP can hit 100-300ms tails because of one dropped packet plus the RTO. Cross-region calls compound this. QUIC fixes some of it; not all.

4. Disk and SSD pauses. SSDs have garbage collection too. A write that triggers a flash block erase can stall for 50-200ms. RAID rebuilds and database compactions hit the same thing at larger scales.

5. CPU contention. Noisy-neighbour VMs, hyperthread contention, NUMA misses. Cloud instances under bursty workloads show 2-10× tail variance vs bare metal.

6. Lock contention. A handful of slow operations holding a popular lock cascades into wait times for everyone behind them. Database row locks, mutex hot spots in application code.

7. Coordinated omission. Not a real source — a measurement bug. If your load generator pauses while a request is slow, you systematically under-sample the tail. Real production traffic does not pause. Tools like wrk2, k6, and ghz handle this; ab, hand-rolled scripts, and many ad-hoc tools do not.

Which one to optimise

The choice is set by who or what is waiting. If a human is sitting in front of the result, optimise latency: a web page, an API call behind a UI, an autocomplete, a checkout. People notice tens of milliseconds and abandon at hundreds, so the per-request number is the product. Throughput here is a capacity question you solve with more nodes, not the thing you tune the code for.

If nothing is waiting on any single item — a nightly report, a re-index, a training run, a log pipeline — optimise throughput. No one cares whether one record took 5 ms or 5 seconds; they care when the whole job finishes, which is total work divided by the rate. Here you batch hard, run wide, and let individual latency drift, because cutting per-item latency would only cost you the rate you actually want.

The trap is the workload that claims to need both at once. It usually does not — it has two classes of request hiding under one name, and the fix is to split them. This is the system-level form of the same trade covered in performance vs scalability: making one request faster and making the system handle more of them are different projects with different tools.

When latency is the goal — designs that win

Workloads where latency is the primary metric (interactive UX, real-time systems, financial trading, ad bidding) have characteristic design moves:

Co-locate the compute and the data. Edge functions (Cloudflare Workers, Fastly Compute) move the code to the user. In-memory databases (Redis, MemSQL/SingleStore) keep the data in RAM. The latency cost of a network hop or a disk seek is the win.

Pre-compute aggressively. Materialized views, denormalized read models, fan-out-on-write timelines. Pay the cost on write to make read trivially fast. Twitter's home timeline writes a user's tweet to all their followers' timelines on post; the read is then a single index lookup.

Short paths, fewer hops. Each microservice boundary adds 1-10ms in practice. Workloads with 1ms latency budgets cannot afford to be 10 microservices deep; they look like monoliths with cleanly internal modules.

Speculative execution. Start the work before you are sure you need it, discard if you don't. Browser prefetch, database query plan caching, CPU branch prediction at the architecture level.

Hedged requests. The textbook tail-latency mitigation. Send two parallel requests; take the first response; cancel the other. Works when the tail is much longer than the median (typical) and the load multiplier is acceptable.

When throughput is the goal — designs that win

Workloads where throughput dominates (batch jobs, analytics, ML training, log ingestion):

Batch everything. Larger batches amortise per-operation overhead. Kafka producer linger.ms, database multi-row inserts, SIMD vector operations are all batching at different scales.

Parallelise via partitioning. Spread the work over independent shards that don't coordinate. Hadoop, Spark, MapReduce — every batch system is partitioning + a shuffle step.

Pipeline the stages. Stage 1 produces, stage 2 consumes while stage 1 continues. CPU instruction pipelines, GPU stream processing, Kafka topic chains. The latency of any single item goes up; aggregate throughput goes through the roof.

Compress before sending. If the network is the bottleneck, CPU time spent compressing pays off in transmission time saved. snappy and zstd are the modern defaults; gzip is the legacy.

Trade memory for throughput. Bigger buffers absorb more variability; ring buffers and circular queues handle higher rates than naive linked lists. The Disruptor pattern (LMAX) ships millions of messages per second per core by sizing for this.

The mistake of optimising both

A system tuned for one usually performs poorly on the other. Three places this bites:

"Real-time analytics" that needs both. Streaming systems that promise sub-second latency and millions-of-events-per-second throughput often miss both targets. The honest framing is "high-latency for some events, low-latency for others"; pick which latency you care about per query class.

OLTP databases with analytical workloads. Adding analytical queries to a row-store OLTP database (Postgres, MySQL) destroys p99 for the transactional traffic because the analytical queries pollute the buffer pool and create lock contention. The HTAP marketing claim almost never survives contact with production. Separate the workloads.

"We need low latency AND high throughput." The answer is usually a two-tier architecture: low-latency tier (in-memory cache, fast read path) backed by a high-throughput tier (persistent store, batch processing). The cache absorbs reads with low latency; the database absorbs writes with high throughput. Each tier is tuned for one number.

How this connects to the rest

This is the equation behind the "knee in the curve" from the previous principle. It also makes back-pressure non-optional: without it, queues grow, requests time out, retries stack, and the system folds. And it sits under every worked problem in the playbook — going from DAU to QPS to in-flight to instance count is just four applications of Little's Law in a row.

Related on Semicolony

← Performance vs scalability — the system-level version of this same trade-off.
Performance methods — USE, RED, queueing theory in practice.
The Tail at Scale — Dean & Barroso on why P99 matters more than the median.
Back-pressure & retries — what to do when the queue grows.
Retry strategy simulator — interactive backoff + jitter exploration.
System-design playbook — Little's Law applied to 14 real problems.

Principle 03 / 05

Availability vs consistency →

The CAP intuition with worked outage scenarios. Why "we want both" is the wrong answer.

Next principle

← back to principles → system design roadmap

Found this useful?