Capacity planning — System Design Handbook

Capacity planning is the unglamorous skill that decides whether your design holds up under load. Every other chapter assumes you know how to take "10k requests per second, p99 under 50ms" and turn it into a host count, a memory budget, and a defensible answer about headroom.

The hard part is not the arithmetic. It is keeping the chain of reasoning explicit — request rate, service time, concurrency, server count, bottleneck, headroom — so a reviewer can poke at the assumptions instead of arguing with the conclusion. This chapter is the chain, with the one law that holds it together and the worked examples that make it stick.

Little's Law — the only formula you really need

For any stable queue, the average number of items in the system equals the arrival rate times the average time each item spends in the system. Written in three letters: L = λ × W. The rest of the chapter is what to do with those three letters.

L: Average number of items in the system. For a web service, concurrent requests in flight.
λ (lambda): Arrival rate. Requests per second, messages per second, jobs per second — whatever the system processes.
W: Average time an item spends in the system. For a web server, roughly the average response time.

Three uses of the same equation. Pick the two you can measure and solve for the third.

Sizing concurrency: 10 000 requests per second × 50 ms average latency = 500 concurrent requests. You need a thread pool, an event loop, or a process count that can hold 500 in flight without queueing. Half that and the system starts to back up; a tenth of that and it falls over.
Predicting latency: If your service is single-threaded with concurrency 1 and the load is 5 rps with a 20 ms service time, the worker is busy 5 × 0.020 = 0.1 of the time, effectively idle. At 50 rps × 20 ms = 1.0 the worker is fully saturated: any burst queues and waiting time grows without bound. The knee of the curve arrives well before that point.
Bounding throughput: A worker that handles one item at a time with service time S can sustain at most λ_max = 1/S items per second. A 10 ms job means 100/sec per worker, period. Anything more requires concurrency or parallelism.
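The three uses above are the same equation rearranged. A minimal Python sketch, with function names chosen here purely for illustration:

```python
def concurrency(arrival_rate_rps: float, avg_time_s: float) -> float:
    """L = lambda * W: average number of requests in flight."""
    return arrival_rate_rps * avg_time_s


def avg_time_in_system(in_flight: float, arrival_rate_rps: float) -> float:
    """W = L / lambda: average time each request spends in the system."""
    return in_flight / arrival_rate_rps


def max_throughput_per_worker(service_time_s: float) -> float:
    """lambda_max = 1 / S for a worker that handles one item at a time."""
    return 1.0 / service_time_s


if __name__ == "__main__":
    # Sizing concurrency: 10 000 rps at 50 ms average latency.
    print(concurrency(10_000, 0.050))
    # Single worker, 5 rps at 20 ms: fraction of time the worker is busy.
    print(concurrency(5, 0.020))
    # Bounding throughput: a 10 ms job on a one-at-a-time worker.
    print(max_throughput_per_worker(0.010))
```

Measure any two of the three quantities in production and the third falls out; if the computed value disagrees with what you observe, one of the measurements is wrong or the queue is not stable.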

The queueing primer — why p99 explodes before utilisation does

The arithmetic above gave you an average. Averages lie about tail latency in a way that has crashed more systems than any other single misunderstanding. The relationship between utilisation (ρ) and queueing delay is non-linear: as ρ approaches 1, waiting time goes to infinity. The M/M/1 model is wrong in detail and correct in shape, and the shape is the part you need.

For an M/M/1 queue (Poisson arrivals, exponential service times, one server), expected waiting time in the queue is:

W_q = ρ / (μ × (1 − ρ))

where  ρ = λ / μ    utilisation
       μ = 1 / S    service rate, with S = service time

Plug numbers in and you get the rule everyone running production traffic eventually internalises.

ρ = 0.50: Average wait ≈ 1 × service time. The system feels responsive.
ρ = 0.80: Average wait ≈ 4 × service time. The system feels slow under bursts.
ρ = 0.90: Average wait ≈ 9 × service time. The system is two events away from cascading failure.
ρ = 0.95: Average wait ≈ 19 × service time. Pages are firing.
ρ = 0.99: Average wait ≈ 99 × service time. The system is on fire.

This is why capacity targets sit around 60–70% utilisation at peak, not 95%. The remaining headroom is not slack — it is the buffer that keeps tail latency bounded when a burst arrives, a host dies, or a slow query lands. The classic mistake is to plan for 90% steady-state utilisation and discover the p99 doubles every Black Friday.
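The multiples in the list above follow directly from the M/M/1 formula: with μ = 1 / S, the wait expressed in units of service time is ρ / (1 − ρ). A small Python sketch that reproduces them (the function name is illustrative, and real traffic is rarely exactly M/M/1):

```python
def mm1_wait_multiple(rho: float) -> float:
    """Expected M/M/1 queueing delay as a multiple of the service time S.

    W_q = rho / (mu * (1 - rho)) and mu = 1 / S, so W_q / S = rho / (1 - rho).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("M/M/1 is only stable for 0 <= rho < 1")
    return rho / (1.0 - rho)


# Print the curve, including the 60-70% band used as a capacity target.
for rho in (0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99):
    print(f"rho={rho:.2f}  wait = {mm1_wait_multiple(rho):5.1f} x service time")
```

The useful part is the shape: going from 0.60 to 0.70 adds less than one service time of waiting, while going from 0.90 to 0.95 adds ten.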

Back-of-envelope sizing — a worked example

Take a concrete spec and walk it through. A search-suggest API: 50 000 requests per second at peak, target p99 ≤ 30 ms, headroom for a single-AZ failure.

Step 1 — Service time and concurrency

The service does a Redis lookup (~1 ms p50, ~3 ms p99), a Postgres query for personalisation when cache misses (~5 ms p50, ~15 ms p99, hit rate 90%), and serialisation plus network overhead (~1 ms). Weighted service time is roughly 0.9 × 1 + 0.1 × 5 + 1 = 2.4 ms average, with a service-side p99 of maybe 8 ms before the network roundtrip.

Little's Law: 50 000 rps × 0.0024 s = 120 concurrent requests at steady state. Each pod can comfortably handle 50 concurrent if it is an async Go or Node service with a goroutine or coroutine per request. So 120 / 50 = 2.4, which rounds up to 3 pods at theoretical minimum.
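Step 1 can be written down as a few lines of Python so the inputs are easy to challenge. This sketch mirrors the arithmetic in the text (hit path costs the Redis lookup, miss path costs the Postgres query, plus fixed overhead); the function and variable names are illustrative:

```python
import math


def weighted_service_time_ms(hit_rate: float, hit_ms: float,
                             miss_ms: float, overhead_ms: float) -> float:
    """Average service time, weighted by cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms + overhead_ms


PEAK_RPS = 50_000
PER_POD_CONCURRENCY = 50

service_ms = weighted_service_time_ms(hit_rate=0.90, hit_ms=1.0,
                                      miss_ms=5.0, overhead_ms=1.0)
in_flight = PEAK_RPS * service_ms / 1000.0  # Little's Law, ms converted to s

# Round before ceil so float noise (e.g. 2.4000000000000004) cannot add a pod.
min_pods = math.ceil(round(in_flight / PER_POD_CONCURRENCY, 9))

print(f"service time ~{service_ms:.1f} ms, {in_flight:.0f} in flight, "
      f"{min_pods} pods minimum")
```

Changing the hit rate or the per-pod concurrency here is the quickest way to see which assumption the pod count is most sensitive to.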

Step 2 — Headroom

Target utilisation at peak is 60% (the queueing rule from above). Required pod capacity is 120 / 0.6 = 200 concurrent. At 50 per pod, that is 4 pods. Headroom for losing an AZ — assume three AZs, lose one — multiplies by 1.5. So 4 × 1.5 = 6 pods minimum.
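The headroom step is two multipliers applied in sequence. A sketch under the same assumptions as the text (60% target utilisation, three AZs, lose one); the function name is hypothetical:

```python
import math


def pods_with_headroom(in_flight: float, per_pod_slots: int,
                       target_utilisation: float, az_count: int) -> tuple[int, int]:
    """Return (pods at target utilisation, pods after single-AZ-loss headroom)."""
    slots_at_target = in_flight / target_utilisation
    # Round before ceil so float noise cannot add a spurious pod.
    pods_at_target = math.ceil(round(slots_at_target / per_pod_slots, 9))
    # Losing one of N AZs leaves (N - 1) / N of the fleet, so scale by N / (N - 1).
    az_factor = az_count / (az_count - 1)
    pods_total = math.ceil(round(pods_at_target * az_factor, 9))
    return pods_at_target, pods_total


at_target, total = pods_with_headroom(in_flight=120, per_pod_slots=50,
                                      target_utilisation=0.60, az_count=3)
print(f"{at_target} pods at 60% utilisation, {total} pods with AZ headroom")
```

With two AZs instead of three the same function doubles the fleet, which is why the AZ count belongs in the plan as an explicit input.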

Step 3 — Memory and CPU

Each pod holds the connection pool (Redis: 32 conn × 8 KB = 256 KB; Postgres: 32 conn × 16 KB = 512 KB), the working set of the personalisation index (~200 MB), and runtime overhead (Go runtime: ~30 MB; framework: ~50 MB). Round up to 512 MB per pod with safety margin. CPU at 60% utilisation with 50 concurrent requests doing mostly I/O wait: 1 vCPU per pod is enough, 2 gives headroom.

Six pods at 2 vCPU and 512 MB each come to 12 vCPU and 3 GB RAM. Round to the nearest instance shape and run it on three nodes with two pods each, one node per AZ.
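The memory budget is easier to audit as an itemised sum. A sketch using the figures from this step; the dictionary keys are just labels chosen here:

```python
# Per-pod memory estimate in KB, itemised as in the text.
POD_MEMORY_KB = {
    "redis_pool": 32 * 8,                  # 32 connections x 8 KB
    "postgres_pool": 32 * 16,              # 32 connections x 16 KB
    "personalisation_index": 200 * 1024,   # ~200 MB working set
    "go_runtime": 30 * 1024,               # ~30 MB
    "framework": 50 * 1024,                # ~50 MB
}

POD_LIMIT_MB = 512   # rounded-up per-pod allocation
PODS = 6
VCPU_PER_POD = 2

estimated_mb = sum(POD_MEMORY_KB.values()) / 1024
margin_mb = POD_LIMIT_MB - estimated_mb
fleet_vcpu = PODS * VCPU_PER_POD
fleet_ram_gb = PODS * POD_LIMIT_MB / 1024

print(f"estimated {estimated_mb:.2f} MB per pod, {margin_mb:.2f} MB margin")
print(f"fleet: {fleet_vcpu} vCPU, {fleet_ram_gb:.0f} GB RAM")
```

The itemised form makes it obvious that the personalisation index dominates the budget, so that is the number to measure rather than estimate.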

Step 4 — Downstream sizing

Redis: 50 000 rps × 1 lookup = 50 000 ops/sec. A single Redis instance handles ~100 000 ops/sec. One replica per AZ gives read scaling and failover headroom. Postgres on cache miss: 5 000 rps. A modest db.m6g.xlarge sustains 10 000 simple SELECTs/sec; budget two read replicas and you can absorb a doubled miss rate.
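The downstream check is a utilisation calculation per dependency. The sketch below uses the capacities quoted in this step and assumes, for illustration only, that Postgres reads are spread evenly across the two read replicas and the primary serves none:

```python
PEAK_RPS = 50_000
LOOKUPS_PER_REQUEST = 1
MISS_RATE = 0.10                        # 90% cache hit rate
REDIS_CAPACITY_OPS = 100_000            # single instance, figure from the text
POSTGRES_REPLICA_CAPACITY_QPS = 10_000  # per replica, figure from the text
READ_REPLICAS = 2


def redis_utilisation() -> float:
    return PEAK_RPS * LOOKUPS_PER_REQUEST / REDIS_CAPACITY_OPS


def postgres_utilisation(miss_rate: float) -> float:
    # Assumption: misses are balanced evenly over the read replicas.
    capacity = READ_REPLICAS * POSTGRES_REPLICA_CAPACITY_QPS
    return PEAK_RPS * miss_rate / capacity


print(f"Redis: {redis_utilisation():.0%}")
print(f"Postgres, normal miss rate: {postgres_utilisation(MISS_RATE):.0%}")
print(f"Postgres, doubled miss rate: {postgres_utilisation(2 * MISS_RATE):.0%}")
```

Under these assumptions a doubled miss rate still leaves the replicas at half their quoted capacity; a full cache loss would not fit.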

The four resources, in order of who blows up first

Whatever the service, capacity is bounded by one of four things at any moment. The exercise is to figure out which one well before it actually saturates.

CPU: Compute-bound services — JSON serialisation at scale, anything doing crypto or compression, ML inference. The cheapest to scale: add cores or pods. Easy to spot in profiles. p99 grows smoothly with utilisation up to ~80%.
Memory: Working sets that do not fit. A cache that exceeds RAM falls off a cliff to disk. A Postgres working set bigger than shared_buffers pushes reads out to the OS page cache and, past that, to disk. The signature is a non-linear latency jump as you cross a threshold.
I/O — disk and network: fsync-bound writes (WAL, Kafka), tail-latency-sensitive reads (B-tree on cold pages), and inter-service network calls. Hardest to provision because cloud-published numbers are best-case; real fsync latency on shared EBS is often 5–10× the headline.
Coordination: Locks, leader bottlenecks, message queues with single-partition keys, hot Redis keys, single-row-update databases. The limit that does not yield to "add more pods" — it requires a redesign.

Always identify the bottleneck explicitly. "We are CPU-bound at 60% utilisation, so capacity grows linearly with pods until we hit the Redis throughput ceiling at roughly 5× this load." That is a planning statement. "Add a few more pods and see" is not.
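One way to make the bottleneck statement mechanical is to express each resource as the request rate at which it saturates and take the minimum. The sketch below is illustrative: the 10 000 rps baseline and the ceilings are hypothetical numbers shaped like the example sentence above, not measurements.

```python
def first_bottleneck(current_rps: float,
                     ceilings_rps: dict[str, float]) -> tuple[str, float]:
    """Return the resource that saturates first and the load multiple it allows."""
    name = min(ceilings_rps, key=ceilings_rps.get)
    return name, ceilings_rps[name] / current_rps


CURRENT_RPS = 10_000  # hypothetical baseline

# Hypothetical ceilings: the request rate at which each resource saturates.
ceilings = {
    "cpu": CURRENT_RPS / 0.60,   # 60% utilised today, assumed to scale linearly
    "redis": CURRENT_RPS * 5,    # throughput ceiling at roughly 5x this load
}

resource, multiple = first_bottleneck(CURRENT_RPS, ceilings)
print(f"today: {resource} saturates first, at {multiple:.2f}x current load")

# Quadrupling the pods raises the CPU ceiling; Redis does not move.
scaled = dict(ceilings, cpu=ceilings["cpu"] * 4)
resource, multiple = first_bottleneck(CURRENT_RPS, scaled)
print(f"after scaling pods: {resource} saturates first, at {multiple:.2f}x")
```

The second call shows the point of the exercise: adding pods moves the CPU ceiling but leaves the shared dependency where it was.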

Forecasting growth

Plans that hold for steady state break as the business grows. Build the projection explicitly so it is auditable.

Compound growth: Today's traffic × (1 + monthly growth rate)^months. At 10% per month, two months is 1.21×; six months is 1.77×; a year is 3.14×. Plug the multiplier for your full planning horizon into your sizing.
Peak-to-average: Most services see peak traffic 3–5× their daily average. Plan for the peak, including seasonal ones — Black Friday for commerce, end-of-quarter for B2B, Sunday evenings for streaming. Multiply steady-state sizing by the peak-to-average ratio.
Step changes: Product launches, marketing campaigns, regulatory triggers. A 10× spike for a 4-hour launch is not covered by a smooth growth projection, so provision for the spike or have autoscaling that can keep up. Autoscaling has its own lag (60–300 seconds), so a spike shorter than the lag is over before new capacity arrives.
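The multipliers quoted above (1.21×, 1.77×, 3.14×) are what 10% per month gives when it compounds. A short Python sketch of the projection, combined with the peak-to-average factor; the function names and the 1 000 rps example input are illustrative:

```python
def growth_multiplier(monthly_growth: float, months: int) -> float:
    """Compound growth: (1 + g) ** months."""
    return (1.0 + monthly_growth) ** months


def planned_peak_rps(avg_rps_today: float, monthly_growth: float,
                     months: int, peak_to_average: float) -> float:
    """Average traffic today, grown over the horizon, scaled to peak."""
    return avg_rps_today * growth_multiplier(monthly_growth, months) * peak_to_average


for months in (2, 6, 12):
    print(f"{months:>2} months at 10%/month: {growth_multiplier(0.10, months):.2f}x")

# Example: 1 000 rps daily average, six-month horizon, 3x peak-to-average.
print(f"plan for ~{planned_peak_rps(1_000, 0.10, 6, 3.0):.0f} rps at peak")
```

Step changes are deliberately left out of the function: a launch spike is a separate input to the plan, not a parameter of the growth curve.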

What a defensible capacity plan looks like

Six lines. Anyone reading should be able to verify each in under five minutes — that is the test of whether the plan is honest.

Workload. "50 000 requests per second peak, 3× average daily, with a target p99 of 30 ms."
Service time. "Weighted average 2.4 ms, derived from cache hit rate 90%, Redis lookup 1 ms, Postgres miss 5 ms."
Concurrency. "120 in flight at peak (Little's Law: 50 000 × 0.0024)."
Capacity. "Six pods at 50 concurrent each, sized for 60% utilisation and single-AZ failure."
Bottleneck. "Service is CPU-bound on JSON serialisation. Redis becomes the constraint at ~150 000 rps."
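Headroom. "With all three AZs healthy, six pods give 300 slots against 120 in flight (40% utilisation), so the sizing absorbs roughly a 2× spike before utilisation crosses 80% and pages fire. Sustained doubling of traffic requires sharding the personalisation index."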

A plan that misses any of these is hiding an assumption. Find it before something else does.

Common mistakes

Sizing on the average: Capacity planned on average traffic and average latency falls over on the first spike or the first slow query. Always size for peak × p99, not mean × mean.
Forgetting concurrency: Threads, goroutines, async event loops — each has a different cost profile. A 100 ms service handling 10 000 rps needs 1 000 concurrent slots somewhere. If your runtime cannot host that, it does not matter how many CPUs you add.
Treating cache hit rate as a constant: A cache stampede or invalidation event drops hit rate from 95% to 0% in seconds and multiplies downstream load by 20×. Plan capacity assuming hit rate degrades — at least transiently — and the downstream can absorb it.
Ignoring fan-out: One incoming request becomes ten downstream calls. Sizing the front door for 10 000 rps means sizing the downstream tier for 100 000 calls per second in total, split across services according to who receives those calls. Walk the diagram.
Conflating availability and durability headroom: Surviving an AZ failure (availability) is a 1.5× factor for three-AZ deployments. Surviving a region failure (durability + availability) is a different exercise involving cross-region replication and a separate capacity bucket. Do not mix them.
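Several of the mistakes above reduce to a multiplier that is easy to compute and easy to forget. A sketch of three of them, with hypothetical function names:

```python
def concurrent_slots(rps: float, service_time_s: float) -> float:
    """Forgetting concurrency: slots needed is rate x time (Little's Law)."""
    return rps * service_time_s


def miss_load_multiplier(normal_hit_rate: float, degraded_hit_rate: float) -> float:
    """Cache hit rate is not constant: ratio of degraded to normal miss traffic."""
    return (1.0 - degraded_hit_rate) / (1.0 - normal_hit_rate)


def downstream_calls_per_second(front_door_rps: float, fan_out: int) -> float:
    """Ignoring fan-out: total downstream calls generated per second."""
    return front_door_rps * fan_out


print(f"{concurrent_slots(10_000, 0.100):.0f} concurrent slots")
print(f"{miss_load_multiplier(0.95, 0.0):.0f}x downstream load on full cache loss")
print(f"{downstream_calls_per_second(10_000, 10):.0f} downstream calls per second")
```

Each function is one multiplication or division, which is the point: the mistakes come from skipping the step, not from the step being hard.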
