Scaling out — System Design Handbook

Scaling up has a ceiling priced in dollars. Scaling out has a ceiling priced in coordination. Pick your bill.

Vertical scaling has a ceiling, and caching only buys you so much room. Eventually the workload outgrows the biggest single box money can buy. Horizontal scaling, running many small machines instead of one big one, is the way out. The trade-off is consequential: state, sessions, deploys, failure, and observability all change shape. This page covers the patterns that make horizontal scaling work and the failure modes that make it bite.

Vertical scaling buys real headroom but the cost curve goes vertical at the top. Horizontal scaling stays cheap per unit — until coordination overhead is the new bottleneck.

Vertical vs horizontal — the actual trade-off

The textbook says vertical scaling is "buy a bigger box" and horizontal is "buy more boxes." That's true but underspecified. The real trade-off shows up when you ask what happens after each choice.

	Vertical (scale up)	Horizontal (scale out)
Cost curve	Linear-ish until ceiling, then vertical	Linear, but with overhead per node
Code changes	None	Significant — must remove shared state
Failure mode	Single box dies = total outage	One node dies = small fraction lost
Performance ceiling	The biggest available SKU	Effectively unlimited (within reason)
Deploy	Single rolling restart	Coordinated rolling deploy across N
Coordination	None	Service discovery, distributed locks, leader election
Best when	You're early; budget allows; code is hard to change	You're past the ceiling; you have ops maturity

In practice you do both. Run reasonably-sized boxes, scale them up until the next size doubles your bill for 30% more capacity, then scale out from there. The mistake is doing pure horizontal early — you pay the coordination tax before you need to.

The stateless rule — the only invariant that matters

Horizontal scaling depends on one rule: each request can be handled by any instance, with no information held in the instance from a previous request. State lives in shared services (database, cache, object store), not in the application server's memory. Break this rule and the whole architecture falls apart.

Sessions: Don't store sessions in app memory. Use signed cookies (JWTs) for stateless auth, or Redis for traditional server-side sessions that any instance can read.
Uploads in progress: Don't keep multipart uploads on local disk. Stream straight to object storage with a per-upload ID; the next request reconnecting can attach by ID.
Caches: Local in-process caches are fine when stale data is acceptable, but the source of truth must be elsewhere. For shared cache, use Redis or Memcached.
WebSockets: The connection itself is stateful but the messages don't have to be. Use a pub/sub layer (Redis, NATS) so any node can deliver to any client.
Locks: Don't use language-level locks (mutex, semaphore) for cross-instance coordination. Use Redis locks, ZooKeeper, etcd, or — better — design out the need for distributed locks.

The patterns that go with horizontal scaling

Load balancer

The front door. Distributes requests across the fleet, removes dead nodes, terminates TLS. Without it, horizontal scaling is just N machines clients can't find.

Service discovery

How instances find each other. DNS-based (Consul, Cloud Map), client-side (Eureka, Ribbon), or platform-native (Kubernetes Services). Picks one.

Distributed cache

Redis or Memcached. The shared "fast tier" every instance can hit. Pre-warm on deploy, monitor hit ratio, plan for failover.

Centralised logging

SSH-ing to one of N boxes is no longer a debugging strategy. Ship structured logs to a central store (CloudWatch, Datadog, ELK) from the start.

Distributed tracing

One request now touches 5 services across 3 zones. Tracing (OpenTelemetry, Jaeger, X-Ray) is how you correlate them. Add the SDK before you need it.

Auto-scaling

Horizontal scale only pays off if the fleet shrinks during quiet hours. Auto-scaling on CPU + RPS, with both a floor (never below N) and a ceiling (never explode the bill).

The N+1 / N+2 capacity rule

Run more capacity than you need. The convention in production is N+2: enough to lose two nodes (one failed, one in maintenance) and still serve peak traffic. N+1 is acceptable for non-critical services. N alone is a recipe for paged engineers.

The arithmetic: if peak traffic needs four nodes, run six. The headroom isn't waste — it's the buffer between "graceful degradation" and "page everyone." When you hit autoscaling triggers, you should still have enough headroom for the new instance to come up cleanly without the existing fleet falling over while it boots.

Deploy strategies for horizontal fleets

Rolling deploy: Replace instances 5-10% at a time. The default. Cheap, slow, low blast radius. Watch out for instances flapping during health checks.
Blue-green: Run two full fleets, switch traffic between them. Rollback is instant (flip the switch back). Costs 2× during the switchover.
Canary: Deploy to 1% of traffic. Watch metrics for 10 minutes. If clean, expand to 5%, then 25%, then 100%. Catches bad deploys before they hit everyone.
Shadow / dark launch: Send real traffic to the new version but discard the response. Tests behaviour without affecting users. Used for risky DB migrations and rewrites.

The hard cases

Sticky state that won't die. Horizontal scaling fails when something — a session, an in-memory queue, a long polling subscription — quietly accumulates on one instance. When it dies, that state is gone. Audit your services for "what would lose data if this process restarted right now?" The list should be empty or tiny and explicit.

The thundering herd on autoscale. Traffic surges, autoscaler spawns 50 new instances. They all boot, all hit the database for warm-up queries, all hit the cache cold. The fleet falls over harder than if you'd never scaled. Mitigations: pre-warm caches before adding instances to the LB, stagger boot times, add slow-start to the load balancer.

Coordination overhead is real. Every distributed lock, every consensus operation, every cross-region round-trip adds latency. At enough scale, the coordination tax is bigger than the work. The fix is to design fewer cross-instance dependencies — partition by user_id so each request only needs one shard, etc.

Practical defaults

Default to "vertical first, horizontal second." A reasonably-sized box covers more ground than people expect.
Make every service stateless on day one. Even if you have one instance, write it as if you'll have ten.
Run N+2 capacity for production-critical paths.
Auto-scale on CPU + RPS together. Single-signal autoscaling is brittle.
Deploy with canary or rolling at minimum. Never push to 100% in one step on a service with users.
Centralised logs and traces from day one. Adding them later is painful.
Health checks should hit a real code path (DB connection, cache, external dep). Static /ping hides slow degradations.

Scaling out.

Vertical vs horizontal — the actual trade-off

The stateless rule — the only invariant that matters

The patterns that go with horizontal scaling

The N+1 / N+2 capacity rule

Deploy strategies for horizontal fleets

The hard cases

Practical defaults