A queueing system under load.

Every server you've ever deployed is a queueing system. Requests arrive at rate λ. The server processes them at rate μ. As long as λ < μ you're fine. As utilization ρ = λ/μ approaches 1, queue length and wait time blow up. Move the sliders and watch the knee.

Sliders: λ (arrivals/sec) = 80 · μ (service rate/sec) = 100 · servers c = 1

Little's Law · the one equation you have to know

L = λ · W. The average number of items in the system equals the arrival rate times the average time each item spends there. It holds for any stable queueing system, regardless of the distribution of arrivals or service times. If your service averages 100 ms per request and you sustain 50 req/s, you have 5 requests in flight on average. If the in-flight count suddenly climbs, either λ went up or W went up, and one of those numbers is your culprit.
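The arithmetic above can be checked in a few lines. This is a minimal sketch, not a monitoring tool: the function names `in_flight` and `implied_latency` are made up for illustration, and the inputs are assumed to be long-run averages from a stable system.

```python
def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W (average items in the system)."""
    return arrival_rate_per_s * time_in_system_s


def implied_latency(in_flight_count: float, arrival_rate_per_s: float) -> float:
    """Little's Law rearranged: W = L / lambda."""
    if arrival_rate_per_s <= 0:
        raise ValueError("arrival rate must be positive")
    return in_flight_count / arrival_rate_per_s


if __name__ == "__main__":
    # 50 req/s at 100 ms each: about 5 requests in flight.
    print(in_flight(50, 0.100))
    # Same 50 req/s, but 10 requests in flight: W must have doubled to 0.2 s.
    print(implied_latency(10, 50))
```

The rearranged form is the useful one in an incident: if you can measure concurrency and throughput, the average latency follows without instrumenting individual requests.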

Why the curve is non-linear · the 70% rule

W = 1 / (μ · (1 − ρ)) for M/M/1, where W is the time in system (wait plus service) and 1/μ is the service time. As ρ approaches 1, W approaches infinity: slowly at first, then catastrophically. At ρ = 0.5, W is 2× service time. At ρ = 0.8, it's 5×. At ρ = 0.9, it's 10×. At ρ = 0.95, it's 20×. This is the source of the operational rule "keep utilization below 70%": at ρ = 0.7, W is still only about 3.3× service time. You're not wasting CPU; you're buying headroom against demand spikes that would otherwise push you onto the cliff.
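To see where the multiples quoted above come from, here is a small sketch that takes W as total time in system (wait plus service), which for M/M/1 is 1/(μ − λ), or equivalently the service time 1/μ multiplied by 1/(1 − ρ). The function names are illustrative, and the model assumes Poisson arrivals and exponential service times.

```python
import math


def mm1_time_in_system(lam: float, mu: float) -> float:
    """Mean time in system for M/M/1: W = 1 / (mu - lam)."""
    if mu <= 0 or lam < 0:
        raise ValueError("need mu > 0 and lam >= 0")
    if lam >= mu:
        return math.inf  # unstable: the queue grows without bound
    return 1.0 / (mu - lam)


def slowdown(rho: float) -> float:
    """W expressed as a multiple of the bare service time: 1 / (1 - rho)."""
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1)")
    return 1.0 / (1.0 - rho)


if __name__ == "__main__":
    for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
        print(f"rho={rho:.2f}  W = {slowdown(rho):.1f}x service time")
```

Going from ρ = 0.5 to ρ = 0.7 costs about 1.3 extra service times; going from 0.9 to 0.95 costs 10. That asymmetry is the knee.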

What helps · more servers, faster service, fewer arrivals

Adding a server (c > 1) flattens the curve dramatically. With c servers the utilization becomes ρ = λ / (c · μ), so at c = 2 the effective capacity doubles, and the same arrival rate gives much lower W. Speeding up service (lower 1/μ) shifts the whole curve down. Capping arrivals via rate limiting or admission control is the ugly-but-essential safety valve when neither of the above is feasible. A common rule of thumb for latency-sensitive services is to plan on M/M/c assumptions and target ρ between 0.5 and 0.7.
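The effect of a second server can be estimated with the Erlang C formula for M/M/c. The sketch below is one way to compute it, assuming Poisson arrivals, exponential service, μ measured per server, and a single shared queue; `erlang_c` and `mmc_time_in_system` are illustrative names, not a library API.

```python
import math


def erlang_c(c: int, offered_load: float) -> float:
    """Probability that an arrival has to wait in an M/M/c queue.

    offered_load is a = lam / mu; the system is stable only when a < c.
    """
    if c < 1:
        raise ValueError("need at least one server")
    if offered_load >= c:
        return 1.0  # saturated: every arrival waits
    # Sum of a^k / k! for k = 0 .. c-1 (states with a free server).
    partial = sum(offered_load**k / math.factorial(k) for k in range(c))
    # Weight of the states where all c servers are busy.
    tail = offered_load**c / math.factorial(c) * c / (c - offered_load)
    return tail / (partial + tail)


def mmc_time_in_system(lam: float, mu: float, c: int) -> float:
    """Mean time in system for M/M/c: queueing delay plus one service time."""
    a = lam / mu
    if a >= c:
        return math.inf
    wait_in_queue = erlang_c(c, a) / (c * mu - lam)
    return wait_in_queue + 1.0 / mu


if __name__ == "__main__":
    # lam = 80/s, mu = 100/s per server, compared at c = 1 and c = 2.
    for servers in (1, 2):
        w = mmc_time_in_system(80, 100, servers)
        print(f"c={servers}: W = {w * 1000:.1f} ms")
```

With λ = 80 and μ = 100, one server gives W of 50 ms, while two servers give roughly 12 ms, close to the bare 10 ms service time. Doubling capacity cut latency by about four times, not two, because the queueing delay nearly vanished.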

Go deeper

Queueing in production →

M/M/1 vs M/M/c, Erlang formulas, USL (Universal Scalability Law) for distributed workloads, why coordinated omission hides the worst behaviour in benchmarks.
