01 / 05

Principle / 01

Performance vs scalability

Performance is how fast one request gets through. Scalability is how the system holds up when you add a hundred more. At low load the two curves look the same; at the knee they diverge sharply, and the entire architecture question is which side of the knee you've decided to live on.

The two words, separately

Performance is how fast one request gets through. You can measure it with one user, one machine, and a stopwatch. P50, P95, P99 — whichever percentile you care about.

Scalability is what happens to that number when load grows. Add more concurrent users, more data, more complex queries. Does latency stay flat, climb gently, or fall off a cliff?

A single beefy box can serve one request in 5 ms and twenty in 100 ms. A sharded cluster might serve one in 50 ms and twenty thousand in… 50 ms. The first is faster at low load. The second handles more. Different things, often confused.

The knee in the curve

Plot latency (Y) against load (X). At first, latency stays flat: the system has spare capacity. Then it hits a bottleneck — a CPU pinned, a thread pool drained, a queue filling faster than it empties — and latency starts to climb. The point where it bends is the knee.

Past the knee, things go bad quickly. P99 climbs from 100 ms to 800 ms to seconds. Requests time out. Clients retry. The retries pile on, and the system collapses in a way that looks nothing like the gentle curve before the knee.

Performance work moves the knee a bit (call it 20–50%). Scalability work changes the curve's shape so the knee sits much further out, usually at the cost of a higher floor. The dashed line in the hero diagram is the scaled version: flatter for longer, but it starts higher than the single-box line.

Why optimising one usually hurts the other

What makes a single request fast often caps how many requests you can run at once. Four common examples:

One big in-memory cache. Reads are near-instant, but every reader takes the same lock, every invalidation stalls every reader, and one box's memory becomes a hard ceiling on capacity.
State pinned to the process. Skipping the external lookup on every request is fast. The catch: you can't move a user's traffic to another instance without rebuilding their state, so horizontal scaling stops working.
Synchronous service chains. Easy to reason about when each hop is fast. Under load they multiply: A's 50 ms call to B becomes a 5 s call once A's thread pool fills up and B's queue grows.
Batching. Combine ten writes into one round-trip and throughput jumps 10x. Each write now waits for the batch to fill, so per-request latency climbs.

The question isn't "which is better." It's "where do we want the knee?" Below it, performance dominates. At or past it, scalability does.

A worked example

You're designing a session store for a web service. Each request looks up a session by token, returns the user ID + permissions. Hot path. Two designs are on the whiteboard.

Design A — single Redis instance. All sessions in one Redis. Read latency under 1 ms at low load. Reads sit on an internal hash table; writes happen via Redis' single-threaded event loop. Tuned correctly, one large instance handles ~100K reads/sec.

Design B — sharded Redis cluster, 16 nodes. Sessions hashed by token to one of 16 shards. Each read costs 2–3 ms (lookup-and-route by the client driver). Cluster handles 1.6M reads/sec aggregate.

At 1K reads/sec, A wins. At 50K, A still wins but the gap is closing. At 200K, A has fallen over: Redis hits its single-thread ceiling, P99 spikes from 1 ms to seconds. B is still flat at 2–3 ms. A is more performant; B is more scalable. The right pick comes down to whether you'll actually see 200K reads/sec.

In the room. Don't say "Redis is fast" or "sharding scales." Say: "Single Redis holds up to ~100K reads/sec at sub-ms P99. Past that it cliffs. Sharded gives me 2–3 ms flat regardless of load, but it triples my floor and the ops story is messier. We expect peak X — here's the pick." The trade-off is the answer.

The two failure modes interviewers test for

Two failure modes show up over and over:

Premature scaling. Reaching for sharding, queues, replicas, microservices before doing the math that says you need them. A single Postgres on a beefy box handles most production workloads. If you can't say why one isn't enough here, you don't get to skip past it.
Scaling without a target. Adding nodes without admitting each one costs latency, complexity, and a new failure mode. "Just add replicas" is wrong if the bottleneck is write contention on one row — replicas don't help that at all.

The fix for both is the same. State the expected load. State where the knee sits in each design. Pick the cheapest one whose knee is comfortably past the load, with room to spare.

Practical heuristics

Measure before scaling. If you can't point at where the knee is today, you have nothing to compare scaled designs against.
Design for 3–10x current peak. Under 3x, you'll redesign within a quarter. Over 10x, you're paying for capacity that won't get used.
Scale the bottleneck, not the service. Reads bottlenecked? Add read replicas. Writes? Shard. CPU? More nodes. Queue? More consumers. Doubling the service overall just doubles the bill.
Watch the climb, not the cliff. A creeping P99 is the system warning you it's near the knee. By the time it has tripled, you're already past it.

Amdahl's Law — the ceiling nobody escapes

Amdahl's Law says the maximum speedup of a parallel system is bounded by the fraction that cannot be parallelised. If 5% of your workload is serial (database transaction commit, global lock, leader-only operation), and you parallelise everything else perfectly across N nodes, the maximum speedup you can ever achieve is 1 / (0.05 + 0.95/N). At N = ∞ that caps at 20×. Not 100×. Not "infinitely scalable." Twenty.

Amdahl in working numbers. 1% serial → max speedup 100×. 5% serial → max 20×. 10% serial → max 10×. 25% serial → max 4×. The curve is unforgiving at the interesting end. Most real systems have 5-15% serial portion they didn't notice — the leader-election round-trip, the global counter, the synchronised cache invalidation — and they hit a wall around 7-20× and cannot understand why.

Two consequences for design:

Find the serial portion first. Adding nodes to a workload bottlenecked by a single hot row, a single coordinator, or a single broadcast does nothing useful. Profile before scaling. The single-shard write hotspot is the single most common "we scaled and it got slower" cause in production.

Reducing the serial fraction beats adding more nodes. Going from 10% serial to 5% (sharding the global counter, eliminating the leader read, batching the synchronous commit) lifts the ceiling from 10× to 20×. Going from 16 nodes to 32 with 10% serial barely moves you (6.4× → 7.8×). Architecture work beats infrastructure work at the knee.

Universal Scalability Law — Amdahl plus the coordination tax

Neil Gunther's Universal Scalability Law (USL) extends Amdahl by adding a quadratic term for coordination overhead. The formula:

capacity(N) = N / (1 + σ(N - 1) + κN(N - 1))
where σ is the serial fraction and κ is the coherency cost (locks, cache invalidation, cross-node coordination).

Without κ, capacity rises forever (slowly). With κ, capacity rises, peaks, then falls. Real systems have a non-zero κ — every additional node spends some time coordinating with the others — and so they have a maximum capacity beyond which adding nodes makes the system slower, not faster. The retrograde region is real and shows up in any contention-heavy workload.

Three curves on the same axes. The straight dashed line is the fantasy: every node adds full capacity. The middle curve is Amdahl, where contention flattens you against a ceiling. The heavy curve is the USL, where the coherency cost pulls it back down past the peak, so beyond some N each extra node makes the system slower.

Examples in the wild: a Postgres write-heavy table peaks somewhere around 4-8 cores on a single box because lock contention dominates; an in-memory KV store on a NUMA system plateaus at 16-32 threads because cross-socket cache invalidation eats every extra core; most Java application servers showed a degradation curve past ~200 GC threads in the early 2010s. The USL fits these curves with two parameters and accurately predicts the peak.

Two directions to grow: up and out

When a single box runs out of room, there are exactly two ways to add capacity. You make the box bigger (scale up, vertical) or you add more boxes (scale out, horizontal). They are not interchangeable, and the choice locks in a different set of ceilings, costs, and failure modes for the life of the system.

Scale up grows one machine until it hits the largest box money can buy. Scale out adds identical machines and keeps going, but every box you add owes a coordination tax to the others — the κ term from the USL above.

Scale up keeps the system simple: one process, one set of logs, one failure mode, no distributed-systems tax. Its ceiling is the biggest box you can rent. Scale out has no single-box ceiling, but it trades that for the cost of keeping many boxes in agreement. The rest of this section walks each direction honestly.

Vertical scaling — the underrated default

The unfashionable answer that wins more often than its reputation suggests. A modern bare- metal box: 192 cores (AMD EPYC 9684X), 6 TB DDR5 RAM, several NVMe drives with millions of IOPS, 100 Gbps network. That single box handles workloads that would have required a small datacenter in 2010.

Where vertical scaling wins:

Stateful workloads with high coordination. Postgres on a 64-core box with 512 GB RAM does ~30K TPS for normal OLTP without sharding. A sharded multi-node setup doing the same throughput costs 5-10× as much, has 3× the operational complexity, and is slower on cross-shard queries. The vertical answer is right until you cannot fit the working set on one box.

Workloads with strict latency SLOs. Network round-trips inside a single machine are nanoseconds; between machines they are microseconds (best case, with kernel bypass) to milliseconds (typical). If your service has a 1ms p99 budget, going horizontal cost you 30% of that budget on every internal call.

Operational simplicity. One box has one set of logs, one process tree, one backup story, one failure mode you actually understand. A 16-node cluster has 16× the operational surface area; a 100-node cluster has its own SRE team. The boring vertical answer wins on team-of-three.

Where vertical hits its ceiling:

Single-box capacity ceilings. 192 cores, 6 TB RAM is roughly the upper end of commodity hardware in 2026. Beyond that you pay exponentially more for marginal increase (mainframes, exotic NUMA topologies). Most teams hit the limit somewhere around 100-500 TPS on serious write workloads.

The bigger-failure-domain problem. A 192-core box that crashes takes 192-cores-worth of work with it. A 16-node cluster of 12-core boxes loses 6% of capacity when one node dies. For workloads where a multi-minute outage during failover is unacceptable, horizontal is the answer even if the box could have held the load.

The cost-at-scale break-even. Above a certain volume the per-unit cost of a sharded cluster of commodity boxes is lower than the per-unit cost of progressively larger single boxes. The exact crossover depends on workload — for stateless web tier the crossover is low (~10K req/s); for OLTP databases it is much higher (~50-100K TPS).

Horizontal scaling — what you actually buy

What horizontal scaling actually gives you, written honestly:

Capacity beyond a single box. The obvious benefit. Spread N units of work across M nodes; throughput scales linearly until Amdahl/USL kicks in.

Smaller blast radius on failure. 1/N of capacity dies when one node dies. Failover is a load-balancer reshape, not a recovery operation.

Independent scaling per service. The read-heavy frontend and the write- heavy worker can scale independently when they are separate services on separate node pools.

What horizontal scaling costs you, also honestly:

Coordination tax (κ in USL). Every cross-node operation pays a network round-trip. A simple "get user, then get their orders" goes from one in-memory call (monolith) to two RPC calls (microservices), each with its own retry budget, timeout, failure semantics, observability overhead. The CTO who didn't budget for this is in the tail-latency-blew-up postmortem.

Operational multiplier. N nodes have N times the log volume, N times the deploy surface, N times the cost-explosion potential during the 3am page. Kubernetes, service mesh, distributed tracing all exist to mitigate this — they do not eliminate it.

The distributed-systems tax. Once your data is on more than one box you have to think about partial failure, network partitions, replication lag, eventual consistency, distributed transactions, idempotency. Each of those is its own deep topic. The single Postgres on the single box does not have any of these problems.

The cost curve — exponential vs linear vs L-shaped

Different scaling shapes produce wildly different bills. Three you will see in the wild:

Linear (best case). Cost grows in proportion to traffic. Stateless web tier behind an autoscaling group: 2x traffic, 2x nodes, 2x bill. Predictable, fine. Most horizontally-scalable workloads aim for this.

Sub-linear (great when achievable). Cost grows slower than traffic. Caching layers (one extra cache node serves disproportionately more reads), CDN edge (amortized over many origins), read replicas with high read/write ratios. Hard-won but real. The economics of every successful consumer business depend on finding sub-linear cost curves somewhere in the stack.

Super-linear (the bill that breaks teams). Cost grows faster than traffic. Cross-region replication overhead, sharded databases with high write amplification, microservices where a single user request fans out into 50 internal calls. A 10x traffic increase produces a 15-30x bill increase. This is the case where the FinOps email arrives.

L-shaped (the autoscaling cliff). Cost is roughly flat until some threshold, then jumps to a new flat. Reserved instances with on-demand burst, RDS instance size jumps, the Postgres-to-Aurora migration step, the moment you outgrow Vercel's free tier. Often unavoidable; just make sure you have seen the steps in advance.

Knowing which shape your workload sits on tells you whether the right action at 2x growth is "do nothing", "redesign the hot path", or "have the FinOps conversation now".

Two case studies, named

Twitter's monorail to microservices, then partially back. Twitter spent the early 2010s breaking up a Rails monolith into ~1500 microservices. Then in the early 2020s started consolidating services back together — the timeline service merged with several adjacent services because the cross-service latency cost (5-15ms per hop, 3-5 hops deep) exceeded the operational benefit. The architecture that won was neither monolith nor pure micro: services sized to teams, with explicit latency budgets per call. Lesson: horizontal scaling has a coordination cost; sometimes the right move is back toward vertical.

Discord's switch from Cassandra to ScyllaDB on the message store. Discord ran trillions of messages on Cassandra; the GC pauses and operational cost of the JVM- based system became the bottleneck — not raw capacity. They migrated to ScyllaDB (same data model, C++ implementation), reducing p99 by 10× on the same hardware. The lesson is about performance, not scalability — the cluster size barely changed, but the per-node efficiency moved dramatically. Engineering effort spent on per-node performance can sometimes save you from a horizontal expansion that would have been more expensive.

Reading the question in an interview

When the interviewer says "design X for Y users", the implicit ask is almost always: where does your design sit on the performance-vs-scalability axis, and have you justified it. The candidates who fail usually do one of three things:

Reach for "scale" first. Talking about sharding, queues, and microservices before establishing that a single Postgres cannot handle the load. Stating a target, doing the napkin math, and concluding "one box handles this" is a confident answer; the candidate who instead lists every distributed technology they have heard of is the candidate who has not done the math.

Confuse performance with scalability. "We'll use Redis, it's fast" is a performance claim ("Redis answers individual requests quickly"). "We'll add nodes to handle growth" is a scalability claim. They are different decisions. The good answer names which one you are optimising for and why.

Skip the cost. Every scaling decision has a cost shape. A candidate who designs a multi-region active-active system for a workload that does not need it is throwing 10× the budget at a problem that did not exist. Naming the cost shape — and the workload threshold where it becomes worth it — is the senior move.

The phrase that signals you understand the trade-off: "at expected load X this design runs comfortably on Y, with the knee at Z. If we exceed Z we add a read replica / shard / coordinator — the cost goes from Y to Y+W, and we lose property V." That is the shape the room wants to hear.

Statelessness and sharding — what actually lets you scale out

Scale-out is not free. You have to design for it, and two properties do most of the work: statelessness in the request tier and sharding in the data tier. Skip either and adding nodes stops helping.

Statelessness. A stateless service keeps no per-user state in process memory between requests. Everything it needs to handle a request either arrives in the request or gets read from a shared store. That one property is what lets a load balancer send a user's next request to any node in the pool. If instead the node holds the user's session in local memory, the balancer has to pin that user to that node, and you have just turned a flat pool into a set of single points of failure. When the node dies, the pinned users lose their state, and you cannot add capacity by adding nodes, because the existing users do not move. Push the state out — into a session store, a cache, a database — and the request tier becomes a flat, replaceable pool you can grow by changing a number.

Sharding. The request tier scales by cloning identical stateless nodes; the data tier cannot, because the data is the state. So you split the data instead. Each shard owns a slice of the keyspace, picked by hashing or ranging on a shard key, and each node holds only its slice. Reads and writes for a key go to the one shard that owns it, so total write capacity grows with shard count, which is the thing a read replica can never give you. The catch is the part the USL warned about: any operation that crosses shards — a join, a transaction, a query without the shard key — pays a coordination cost, and a badly chosen shard key concentrates traffic on one shard (a hot shard), which puts you right back on a single-box ceiling with extra steps. Good sharding is mostly good shard-key choice.

Do you have a performance problem or a scalability problem?

These get diagnosed and fixed differently, so naming which one you have is the first move. A simple test: hold load constant and look at latency, then hold the work constant and add load.

Performance problem. One request is slow even when the system is idle. A single user, no contention, and P50 is still high. The fix lives inside the request: a missing index, an N+1 query, an unbatched call, a slow serialisation. Adding nodes does nothing here — every node is equally slow.
Scalability problem. One request is fast in isolation but latency climbs as load rises. P50 at low load is fine; P99 at peak is terrible. The fix lives in the system's shape: a contended lock, a drained thread pool, a queue growing faster than it drains, a hot shard. This is where queueing theory earns its keep — once utilisation passes about 70-80%, queue time explodes non-linearly, which is the mathematical reason the knee is a cliff and not a gentle slope.

The trap is treating one as the other. Throwing nodes at a performance problem just buys more slow nodes and a bigger bill. Optimising a single request that was already fast does nothing for a system that falls over at the knee. Run both tests, then spend your effort where the curve actually bends.

Cost per request at scale

The number that ties performance and scalability to the business is cost per request: the total spend divided by the requests served. It is the lens that catches the trade-offs the latency graph hides, because the same design can look great on P99 and terrible on the bill.

Performance work usually lowers cost per request — a faster request holds a worker for less time, so one box serves more of them, so each one costs less. Scale-out work usually raises the floor on cost per request before it lowers the ceiling on capacity, because the coordination tax means some fraction of every node's work is now spent talking to other nodes rather than serving users. A 16-node sharded cluster that spends 15% of its cycles on cross-shard coordination is paying for 16 nodes but getting fewer than 14 nodes' worth of useful work, and that gap is pure cost per request.

The number to track. Watch cost per request, not just total cost. Total cost rising with traffic is normal. Cost per request rising with traffic means you are on a super-linear curve — the coordination tax is growing faster than the work — and that is the early signal that your scaling story has a leak in it before the FinOps email arrives.

How this connects to the rest

Every other principle in this section reads off this one. Latency vs throughput is the same trade-off seen per-request instead of per-system. Availability vs consistency changes which side of the knee you can recover from. Availability patterns are the techniques for pushing the knee further out.

If you find yourself adding capacity without first naming where the knee is now and where it needs to be, you're guessing. Show the math instead.

Related on Semicolony

Latency vs throughput — the same trade-off from a single-request perspective.
Queueing theory — why latency explodes past ~70% utilisation, the maths behind the knee.
Load balancing — how a stateless tier turns into a flat pool you can grow.
Sharding deep dive — the scaling technique that most often gets reached for prematurely.
Paper: The Tail at Scale — Jeff Dean on why P99 latency is a different problem from median.
System-design handbook — decision rules for caching, sharding, queueing.
System Design Roadmap — the full 15-stage arc.

Principle 02 / 05

Latency vs throughput →

Two numbers, one equation. Little's Law (concurrency = throughput × latency) is the unforgiving rule behind every capacity calculation.

Next principle

← back to principles → system design roadmap

Found this useful?