06 / 08

Methods / 06 · Capacity sizing

Queueing theory for engineers

Enough queueing theory to size thread pools, connection pools, and queues correctly — and to know why latency starts climbing long before utilisation hits 100%. Little's Law is the universal one; M/M/1 explains the latency curve; Pollaczek–Khinchine explains why variance hurts so much; the 80% rule is the rule of thumb most teams converge on once they've been burned twice.

The picture before the maths

Every queueing problem has the same shape, and once you see it you stop needing to memorise formulas. Work arrives at some rate. It either gets served right away or it waits in line. A server, or a pool of servers, drains the line at some rate. When arrivals outpace service for any stretch of time, the line grows; when service outpaces arrivals, the line shrinks. The whole field is the study of one thing: how long the line gets, and how long you spend in it, as a function of how hard you push the server.

Almost everything an engineer queues fits this picture. A web server with a fixed number of worker threads is a queue: requests arrive, threads serve them, and surplus requests wait in the accept backlog. A connection pool is a queue: queries arrive, a connection serves each one, and queries wait when every connection is busy. A Kafka partition, a disk's I/O scheduler, a rate-limited downstream API, the CPU run queue the kernel scheduler manages — all the same. The reason a single body of theory predicts behaviour across such different systems is that they share this structure, and the structure is what drives the numbers.

Two quantities run through everything below. The first is the arrival rate, written λ (lambda) — how many requests show up per second. The second is the service rate, written μ (mu) — how many requests a single server could finish per second if it never went idle. Their ratio, ρ = λ/μ, is utilisation: the fraction of the time the server is busy. Utilisation is the dial you are really turning when you add load or remove capacity, and it is the variable that decides whether a system feels fast or feels like it is drowning.

Little's Law — the universal one

Little's Law is the most general result in queueing theory and applies to any stable system, regardless of arrival distribution, service distribution, or queue discipline. There is no fine print about Poisson arrivals or exponential service. As long as the system is stable — nothing is piling up without bound over the long run — the law holds exactly. It says:

L = λW — the average number of items in the system equals the arrival rate times the average time an item spends in the system.

The intuition is almost a tautology once you look at it the right way. If ten people walk into a shop every minute and each stays five minutes, then at any snapshot there are about fifty people inside. The headcount inside is just the rate they enter multiplied by how long each one lingers. Nothing about the shape of arrivals or how long any individual stays matters to the averages — the bookkeeping balances regardless.

In a stable system, flow in equals flow out, and the count inside is fixed by arrival rate times dwell time. No distribution assumptions.

Three variables, no assumptions about distributions. Useful because two of the three are usually easy to measure and the third drops out:

Sizing a queue: If 1,000 requests/sec arrive (λ) and each waits 50 ms (W), the queue holds 50 items on average. Build for 100 to allow burst.
Sizing a thread pool: If 10,000 requests/sec arrive and each is in the system for 20 ms, you need on average 200 threads "in flight". Size the pool ≥ that, with headroom.
Inferring service time: Measured 5,000 requests/sec arriving and 25 in flight on average → average request takes 5 ms in the system.

Little's Law is what makes back-of-envelope capacity calculations possible at all. It also has a precise version per system: the law holds for the whole system (clients, queue, servers) and for any sub-system separately. You can apply it to just the waiting line, just the servers, or the whole pipeline end to end, and it stays true at each scope. That nesting is what lets you decompose a request chain: measure the in-flight count and rate at each hop, and the per-hop dwell time falls out without instrumenting every individual request.

A subtle trap is worth naming. Little's Law describes a stable system in steady state, and it talks about averages. It will not tell you the ninety-ninth-percentile wait, and it says nothing useful during a transient overload when the line is still growing. If you measure L and λ over a window where the queue is exploding, the W you back out is an artefact of a system that is not in steady state, not a number you can plan with. Use the law for the long-run average; use the models in the next sections for the shape of the distribution and the danger zone near saturation. This division of labour — Little's Law for the averages, the M/M models for the curve — is the right way to keep the two ideas straight.

Why utilisation kills you — the M/M/1 latency curve

M/M/1 is the simplest non-trivial queueing model: Markov arrivals (exponentially distributed), Markov service (exponentially distributed), 1 server. Most production systems aren't actually M/M/1, but its behaviour explains the shape of the latency curve nearly everywhere.

Mean response time: W = 1 / (μ − λ) where μ is the service rate and λ is the arrival rate.
Equivalently: W = (1/μ) × 1/(1 − ρ) where ρ = λ/μ is utilisation.

The 1/(1−ρ) term is the killer. As ρ approaches 1, response time goes to infinity. The curve is hyperbolic — it stays nearly flat through moderate utilisation, then climbs steeply in the last quarter of the range. This is the single most important shape in performance engineering, and most of the counter-intuitive behaviour of loaded systems falls straight out of it. Plot it once and you will recognise it everywhere: in your latency dashboards as traffic ramps, in the panic when a node drops out of a pool, in the way a five-percent traffic increase that used to be invisible suddenly doubles your P99.

W = (1/μ) × 1/(1−ρ). The curve barely moves until roughly 70–80% utilisation, then turns vertical. Operating to the right of the dashed line trades a small throughput gain for a large, accelerating latency penalty.

Utilisation (ρ)	Latency factor (vs idle)	What it feels like
10%	1.11×	Idle. Each request gets immediate service.
50%	2×	Half the time you wait for one prior request.
70%	3.33×	Noticeable queueing, but still OK.
80%	5×	The common operating sweet spot.
90%	10×	Wait times dominate service times.
95%	20×	Queue grows fast; small bursts cause big spikes.
99%	100×	Effectively unstable. One blip and you have minutes of catch-up.
100%	∞	Queue grows without bound.

The 80% rule. Plan for 70–80% sustained utilisation, not 95%. The latency penalty above 80% grows much faster than the throughput win, and you need headroom for traffic spikes anyway. This is the single most useful rule queueing theory gives you.

It helps to see why the rule is not arbitrary. The marginal cost of one more percent of utilisation is the slope of that curve, and the slope itself is proportional to 1/(1−ρ)². At 50% utilisation the slope is gentle: pushing to 51% barely moves latency. At 90% the slope is a hundred times steeper, so the same one-percent push costs a hundred times more latency. You are not buying a little risk by running hot; you are buying exponentially more risk for the same throughput, and the gradient turns against you precisely in the region where teams are tempted to squeeze out the last of their hardware budget.

Why you should never run near 100%

The flip side of the curve is a discipline, not a vibe. Running a server at 95% utilisation looks efficient on a spend report and feels responsible — the boxes are doing work, the dashboard is green, nothing is idle. But the curve says that at 95% your latency is already twenty times its idle value, and the slope under you is nearly vertical. You are standing on the wall of the hockey stick, and any disturbance that nudges ρ upward pushes you straight up the cliff.

Disturbances are not rare. They are the normal weather of a production system. A deploy briefly halves throughput while a new version warms its caches. A node falls out of the pool and its share of traffic lands on the survivors, bumping each one's ρ. A retry storm doubles the effective arrival rate for a few seconds. A garbage-collection pause stalls one worker and the line behind it grows. At 50% utilisation each of these is a shrug — there is slack to absorb the bump and drain it. At 95% the same bump tips you past saturation, the queue grows faster than you can drain it, and now you have minutes of catch-up to do while every request in flight blows its deadline. The system does not gracefully slow down; it falls over, because the curve has no gentle region near the top.

There is a second, sharper reason. A system run at sustained 100% utilisation is, by the maths, unstable: arrivals equal or exceed the rate you can serve, so the queue length has no finite steady-state value. It does not settle at a large number; it grows without bound until something — memory, a timeout, an operator — cuts it off. The mean response time in the formula is literally infinite at ρ = 1 because there is no average to converge to. This is why "the server can handle 10,000 QPS" is a dangerous thing to believe if 10,000 is the measured ceiling. The ceiling is the point of instability, not a target. The safe operating point sits well below it, in the flat part of the curve, where the system has the headroom to absorb the bumps that are coming whether you planned for them or not.

Headroom, autoscaling, and the cost of slow reactions

Headroom is the gap between where you run and where you fall over, and the curve explains exactly why it has to be generous. The reason is not that spikes are large; it is that the latency penalty of being caught at high utilisation is so steep that even a brief excursion does outsized damage. A system that lives at 50% has a full doubling of capacity in hand before latency becomes a problem. A system that lives at 85% has almost none, and its dashboards will look fine right up until they do not.

Autoscaling is the standard answer, and the curve sets its terms. Scaling is never instant: there is a delay to notice the load, a delay to launch and warm new instances, and a delay before they take traffic. During that whole window the existing fleet absorbs the spike alone, climbing its own utilisation curve. If you scale on a 90%-utilisation trigger, you are deciding to spend the entire scale-up delay sitting in the steep part of the curve, serving slow requests, while you wait for help that arrives after the damage is done. The fix is to scale on a leading indicator and a lower threshold — 60–70% utilisation, or a P95-latency budget, or queue depth — so the new capacity is online and warm before the old fleet reaches the cliff. The headroom you keep is not waste; it is the time you are buying for the autoscaler to react.

Alert and scale below the knee, not at it. If your scale-up trigger and your alert threshold both sit at 90–95% utilisation, you will page the on-call and start scaling at the moment latency is already on fire. Set the trigger at the utilisation where the curve is still flat — typically 60–70% — so the response completes before users feel anything.

M/M/c — multiple servers, surprisingly different

M/M/c is M/M/1 with c parallel servers sharing one queue. Real systems are almost always M/M/c — a thread pool of c threads, a connection pool of c connections. The behaviour at high utilisation is meaningfully better than M/M/1, because the chance that all c servers are simultaneously busy when a request arrives is much lower than the chance one is busy.

Servers (c)	Latency at 80% util	Latency at 90% util	Latency at 95% util
1 (M/M/1)	5×	10×	20×
4	~2.0×	~3.3×	~6×
16	~1.3×	~2.0×	~3.5×
64	~1.1×	~1.4×	~2.2×

Two practical takeaways. First, "more servers" buys real latency improvement at moderate utilisation, not just throughput. Second, a single large pool is dramatically better than many small pools at the same total capacity — c=16 at 90% beats four c=4 pools at 90% by a wide margin because the larger pool absorbs bursts the smaller ones can't.

The reason pooling helps is worth sitting with, because it is the most useful and least obvious result in the whole topic. A request only waits if every server is busy when it arrives. With one server, the chance of that is just ρ. With many servers sharing one queue, the chance that all of them happen to be busy at the same instant is far lower than the chance that any one is busy — busy periods on independent servers rarely line up perfectly. So a request usually finds a free server even at high average utilisation, and waiting becomes the exception rather than the rule. The large pool spends its idle capacity on whoever needs it, while small pools strand their idle capacity where it cannot help.

Same total capacity, very different latency. Splitting one pool into several strands idle capacity behind separate queues; consolidating lets any free server take the next waiting request.

This is why one shared connection pool beats one per host. Splitting a 64-connection pool into 8 pools of 8 cuts your effective latency-vs-utilisation curve by a factor of several. The shared pool pattern shows up everywhere from PgBouncer to Envoy connection pooling for the same reason. The same logic argues for fewer, larger service instances over many tiny ones when each instance keeps its own queue, and it is part of why a well-balanced fleet behind a load balancer outperforms the same machines with traffic pinned to fixed shards.

There is a limit to the free lunch. Pooling only pools work that any server can do; if requests are sticky to a particular server — a session affinity, a shard key, a warmed local cache — then you are back to many small queues no matter how the boxes are wired together. The latency win from consolidation is real only to the degree that work is fungible across the pool. When you find yourself with hot shards and cold ones, the queueing-theory read is that you have accidentally rebuilt the many-small-pools picture, and the fix is to make the work more shareable rather than to add capacity to the hot shard alone.

Pollaczek–Khinchine — variance matters more than you think

Real systems aren't M/M/anything. Service times in particular are rarely exponential — they have heavy tails (GC pauses, slow queries, retries). The Pollaczek–Khinchine formula handles M/G/1 (general service time distribution, one server):

W = ρ × (1 + C²) / (2μ(1 − ρ)) where C is the coefficient of variation of service time (stddev / mean).

The (1 + C²)/2 factor is the variance penalty. With deterministic service (C = 0), wait time is half of M/M/1. With exponential service (C = 1), you recover M/M/1. With C = 2 (a tail-heavy distribution — quite common in real services), wait time is 2.5× M/M/1. With C = 5 (very tail-heavy, think GC pauses or occasional cache misses), wait time is 13× M/M/1.

Service-time variability (C)	Wait factor vs deterministic	Real-world example
0	1×	Constant-time service. Hash lookup with no cache misses.
1 (exponential)	2×	Roughly memoryless service. Many web services.
2	5×	Mixed fast/slow paths. Common for APIs with optional joins.
5	26×	Tail-heavy. JVM with occasional GC, database with cold-cache queries.
10	~100×	Pathologically variable. A few requests dwarf the rest.

The variance lesson. Reducing the tail of service time has a quadratic effect on wait time. A GC tuning change that cuts P99 service time in half (without changing the median) can drop wait time by 4×. This is why tail-latency work compounds so well.

How this explains tail latency

Most engineers meet queueing theory because their tail is bad. The mean is fine, the median is fine, but P99 is ugly and nobody can say why. Queueing theory has the answer, and it is not one cause but a stack of them that all point the same direction.

Start with the curve. At moderate utilisation the wait time is small on average, but the distribution of waits is wide, and it gets wider as ρ climbs. The high-percentile request is not the one with a slow service time; it is the unlucky one that arrived when the queue happened to be long. Even when nothing is wrong with the server, the randomness of arrivals means some requests bunch up, and the ones at the back of a momentary bunch wait far longer than the average. As you push utilisation up, those momentary bunches get deeper and clear more slowly, so the tail stretches out long before the mean does. A system at 60% can have a perfectly healthy mean and a P99 that is already five or ten times it, purely from arrival randomness.

Then add variance in service time, which Pollaczek–Khinchine says hits wait time quadratically. A few slow requests do not just add their own slowness; they hold up everyone queued behind them, and the wait those followers inherit is the heavy contributor to the tail. This is convoy behaviour: one slow query, one GC pause, one cold cache miss stalls the server, and the line that forms behind it serves a whole batch of requests their bad P99. The slow request itself is one data point; the convoy it creates is many.

Finally, real requests fan out. A single user request often touches dozens of backends, and the user waits for the slowest of them. If each backend has a one-in-a-hundred chance of a slow response, a request that hits a hundred of them is almost certain to catch at least one — so the service's tail becomes the typical experience at the fan-out layer. This is why tail latency dominates user-facing performance and why cutting it pays off across the whole system. Allocating those tails across a request chain is the job of a latency budget, and the queueing models here are what tell you whether a given budget is even achievable at your planned utilisation.

The tail is a utilisation problem first. Before you chase individual slow requests, check where you sit on the curve. Dropping from 90% to 70% utilisation can fix a tail that no amount of per-request tuning would have touched, because most of the tail was queue wait, not service time.

Sizing a thread pool — the worked example

A common practical question: "How big should the thread pool be?" Queueing theory gives a defensible starting point that beats guessing.

# Given:
#   Expected sustained QPS:  λ = 8,000 requests/sec
#   Per-request CPU time:    1 ms (cpu-bound work)
#   Per-request I/O wait:    9 ms (database, cache, downstream RPC)
#   Total service time:      S = 10 ms
#   SLO: P95 latency < 50 ms
#   Available CPU cores:     16

# Little's Law: concurrent requests in flight
#   L = λ × S = 8000 × 0.010 = 80 requests in flight on average.

# For CPU-bound work you'd cap the pool near num_cores.
# But here, 9 of every 10 ms is I/O wait — threads are blocked, not running.
# So pool size should target the "in flight" number, not core count.

# Target: pool size = L × safety_factor
#   safety = 1.5 to absorb burst → pool size ≈ 120 threads

# Utilisation check:
#   Effective service rate per thread: μ = 1/S = 100 req/s
#   With 120 threads: total μ = 12,000 req/s
#   Utilisation ρ = 8000 / 12000 = 67%   ← well under 80%

# At ρ = 67%, M/M/c with c=120 → wait time ≈ 0.5 × S = 5 ms
# Total response time ≈ S + wait = 15 ms mean; P95 well under 50 ms.

# What if the pool is sized for "just enough" — say 100 threads?
#   ρ = 8000 / 10000 = 80%
#   Wait time ≈ 1.3 × S = 13 ms
#   Total ≈ 23 ms — still meets SLO but no headroom for spike

# What if undersized at 90 threads?
#   ρ = 8000 / 9000 = 89%
#   Wait time ≈ 2× S = 20 ms; P95 starts breaching.

# Conclusion: 120 threads gives 1.5× headroom; goes to ρ=80% at
# a 9,600 QPS spike; degrades gracefully past that.

Three principles fall out of this exercise. First, thread-pool size should be driven by Little's Law on actual service time, not by core count, when the work is I/O-bound. Second, target ρ ≈ 70–80% under expected load so that spikes don't push you past 90%. Third, the curve's shape means doubling the pool size when you're at 95% utilisation is the right move — the latency win is far larger than the resource cost.

Bufferbloat — why bigger queues make things worse

A common intuition is "add more queue to absorb bursts". Queueing theory disagrees, at least once you're past saturation. If λ > μ even briefly, the queue grows. A bigger queue just stores more latency — the requests at the back of a 10,000-deep queue have already timed out from the user's perspective, but the system keeps processing them.

The right pattern is the opposite: a small queue with an explicit drop policy. AQM (Active Queue Management) does this in network stacks; HTTP servers do it via maxConnections + 503 on overload; the Linux kernel has CoDel for TCP. The principle is the same: refuse work you cannot complete on time, so that the work you do accept finishes inside the user's deadline.

The TCP/CoDel insight. Bufferbloat in home routers caused seconds of latency because the buffers were too big — the router happily accepted megabytes of work it couldn't drain. Once CoDel started dropping packets at the queue head when delay exceeded 5 ms, latency collapsed. Same principle in any queue: short queue + early drop beats long queue + eventual timeout, every time.

Capacity planning with the curve

Capacity planning is just deciding where to sit on the curve and then keeping the fleet sized so you stay there. The method is short. Measure the service rate of one instance under realistic work — not a microbenchmark, but the real mix of requests, because the variance in that mix is what Pollaczek–Khinchine warns you about. Pick a target utilisation in the flat part of the curve, 60–70% for systems that must stay fast under spikes, maybe 80% for batch work where a little queue is fine. Divide your peak arrival rate by the per-instance service rate and the target utilisation, and that is your instance count. Round up, then add a margin for a node failing.

The number you must not skip is the one you measure rather than assume: the real service rate and its variance. The only honest way to get it is to drive the system the way production drives it, which is what load testing without lying is about. A closed-loop test that waits for each response before sending the next one quietly caps its own arrival rate and will never show you the cliff, because it cannot push ρ past 1 — it throttles itself the moment the system slows down. An open-loop test sends at a fixed rate regardless of how the system is coping, which is how real traffic behaves and the only kind of test that reveals where your curve turns vertical. If your load test cannot reproduce the hockey stick, it is hiding your real ceiling.

Tie it back to the models. Little's Law gives you the in-flight count and lets you size queues and pools. The M/M/1 curve tells you how much headroom the latency target demands. M/M/c tells you whether consolidating pools buys you room for free. Pollaczek–Khinchine tells you how much your service-time variance is costing and where tuning the tail will pay. Together they turn capacity planning from guesswork into arithmetic you can defend in a review.

Production checklist

Use Little's Law for capacity math. L = λW. Two of three are easy to measure; the third drops out. Most "how many servers" questions answer here.
Target 70–80% utilisation, not 95%. The latency penalty above 80% rises far faster than the throughput win.
Prefer one large pool over many small ones. M/M/c with c=16 absorbs bursts much better than four M/M/4s. PgBouncer, Envoy, connection-pool middleware all use this.
Cut tail variance, not just the mean. Pollaczek–Khinchine: reducing C² has a roughly squared effect on wait time. GC tuning, slow-query elimination, removing cold-cache spikes — all compound.
Size pools for in-flight requests when I/O-bound. If 90% of the time a thread is blocked on I/O, the pool can be far larger than core count without hurting throughput.
Bounded queues with explicit drop. A long queue doesn't help — it just stores latency. Cap the queue; drop with a clear 503/429 and Retry-After.
Set the alert below the cliff. If alert thresholds are at 95% utilisation, you'll page during the latency cliff. Alert at 80% sustained — that's the leading indicator.

Queueing theory for engineers

The picture before the maths

Little's Law — the universal one

Why utilisation kills you — the M/M/1 latency curve

Why you should never run near 100%

Headroom, autoscaling, and the cost of slow reactions

M/M/c — multiple servers, surprisingly different

Pollaczek–Khinchine — variance matters more than you think

How this explains tail latency

Sizing a thread pool — the worked example

Bufferbloat — why bigger queues make things worse

Capacity planning with the curve

Production checklist

Further reading

Profiling in production