Scaling up has a ceiling priced in dollars. Scaling out has a ceiling priced in coordination. Pick your bill.
Vertical scaling has a ceiling, and caching only buys you so much room. Eventually the workload outgrows the biggest single box money can buy. Horizontal scaling, running many small machines instead of one big one, is the way out. The trade-off is consequential: state, sessions, deploys, failure, and observability all change shape. This page covers the patterns that make horizontal scaling work and the failure modes that make it bite.
Vertical vs horizontal — the actual trade-off
The textbook says vertical scaling is "buy a bigger box" and horizontal is "buy more boxes." That's true but underspecified. The real trade-off shows up when you ask what happens after each choice.
| Vertical (scale up) | Horizontal (scale out) | |
|---|---|---|
| Cost curve | Linear-ish until ceiling, then vertical | Linear, but with overhead per node |
| Code changes | None | Significant — must remove shared state |
| Failure mode | Single box dies = total outage | One node dies = small fraction lost |
| Performance ceiling | The biggest available SKU | Effectively unlimited (within reason) |
| Deploy | Single rolling restart | Coordinated rolling deploy across N |
| Coordination | None | Service discovery, distributed locks, leader election |
| Best when | You're early; budget allows; code is hard to change | You're past the ceiling; you have ops maturity |
In practice you do both. Run reasonably-sized boxes, scale them up until the next size doubles your bill for 30% more capacity, then scale out from there. The mistake is doing pure horizontal early — you pay the coordination tax before you need to.
The stateless rule — the only invariant that matters
Horizontal scaling depends on one rule: each request can be handled by any instance, with no information held in the instance from a previous request. State lives in shared services (database, cache, object store), not in the application server's memory. Break this rule and the whole architecture falls apart.
- Sessions
- Don't store sessions in app memory. Use signed cookies (JWTs) for stateless auth, or Redis for traditional server-side sessions that any instance can read.
- Uploads in progress
- Don't keep multipart uploads on local disk. Stream straight to object storage with a per-upload ID; the next request reconnecting can attach by ID.
- Caches
- Local in-process caches are fine when stale data is acceptable, but the source of truth must be elsewhere. For shared cache, use Redis or Memcached.
- WebSockets
- The connection itself is stateful but the messages don't have to be. Use a pub/sub layer (Redis, NATS) so any node can deliver to any client.
- Locks
- Don't use language-level locks (mutex, semaphore) for cross-instance coordination. Use Redis locks, ZooKeeper, etcd, or — better — design out the need for distributed locks.
The patterns that go with horizontal scaling
The front door. Distributes requests across the fleet, removes dead nodes, terminates TLS. Without it, horizontal scaling is just N machines clients can't find.
How instances find each other. DNS-based (Consul, Cloud Map), client-side (Eureka, Ribbon), or platform-native (Kubernetes Services). Picks one.
Redis or Memcached. The shared "fast tier" every instance can hit. Pre-warm on deploy, monitor hit ratio, plan for failover.
SSH-ing to one of N boxes is no longer a debugging strategy. Ship structured logs to a central store (CloudWatch, Datadog, ELK) from the start.
One request now touches 5 services across 3 zones. Tracing (OpenTelemetry, Jaeger, X-Ray) is how you correlate them. Add the SDK before you need it.
Horizontal scale only pays off if the fleet shrinks during quiet hours. Auto-scaling on CPU + RPS, with both a floor (never below N) and a ceiling (never explode the bill).
The N+1 / N+2 capacity rule
Run more capacity than you need. The convention in production is N+2: enough to lose two nodes (one failed, one in maintenance) and still serve peak traffic. N+1 is acceptable for non-critical services. N alone is a recipe for paged engineers.
The arithmetic: if peak traffic needs four nodes, run six. The headroom isn't waste — it's the buffer between "graceful degradation" and "page everyone." When you hit autoscaling triggers, you should still have enough headroom for the new instance to come up cleanly without the existing fleet falling over while it boots.
Deploy strategies for horizontal fleets
- Rolling deploy
- Replace instances 5-10% at a time. The default. Cheap, slow, low blast radius. Watch out for instances flapping during health checks.
- Blue-green
- Run two full fleets, switch traffic between them. Rollback is instant (flip the switch back). Costs 2× during the switchover.
- Canary
- Deploy to 1% of traffic. Watch metrics for 10 minutes. If clean, expand to 5%, then 25%, then 100%. Catches bad deploys before they hit everyone.
- Shadow / dark launch
- Send real traffic to the new version but discard the response. Tests behaviour without affecting users. Used for risky DB migrations and rewrites.
The hard cases
Practical defaults
- Default to "vertical first, horizontal second." A reasonably-sized box covers more ground than people expect.
- Make every service stateless on day one. Even if you have one instance, write it as if you'll have ten.
- Run N+2 capacity for production-critical paths.
- Auto-scale on CPU + RPS together. Single-signal autoscaling is brittle.
- Deploy with canary or rolling at minimum. Never push to 100% in one step on a service with users.
- Centralised logs and traces from day one. Adding them later is painful.
- Health checks should hit a real code path (DB connection, cache, external dep). Static
/pinghides slow degradations.