Handbook · Vol. IV · 2026 Track III · Going horizontal · piece 1 of 5 Primer

Track III · Going horizontal

Scaling out.

Horizontal growth patterns and the seams they introduce — stateless tiers, session storage, sticky routing, and the moment "just add more boxes" stops being enough.

Track III · Going horizontal
When one box stops being enough.
  1. Primer
    Scaling out
  2. Primer
    Load balancing
  3. Essay
    Monolith limits
  4. Primer
    Capacity planning
  5. Primer
    How to estimate cost

Scaling up has a ceiling priced in dollars. Scaling out has a ceiling priced in coordination. Pick your bill.

Vertical scaling has a ceiling, and caching only buys you so much room. Eventually the workload outgrows the biggest single box money can buy. Horizontal scaling, running many small machines instead of one big one, is the way out. The trade-off is consequential: state, sessions, deploys, failure, and observability all change shape. This page covers the patterns that make horizontal scaling work and the failure modes that make it bite.

VERTICAL · UP TO A WALL · HORIZONTAL · OUT FOREVER small medium large $$ $$$ $$$$$$ VERTICAL — diminishing return HORIZONTAL — linear $ throughput
Vertical scaling buys real headroom but the cost curve goes vertical at the top. Horizontal scaling stays cheap per unit — until coordination overhead is the new bottleneck.

Vertical vs horizontal — the actual trade-off

The textbook says vertical scaling is "buy a bigger box" and horizontal is "buy more boxes." That's true but underspecified. The real trade-off shows up when you ask what happens after each choice.

Vertical (scale up)Horizontal (scale out)
Cost curveLinear-ish until ceiling, then verticalLinear, but with overhead per node
Code changesNoneSignificant — must remove shared state
Failure modeSingle box dies = total outageOne node dies = small fraction lost
Performance ceilingThe biggest available SKUEffectively unlimited (within reason)
DeploySingle rolling restartCoordinated rolling deploy across N
CoordinationNoneService discovery, distributed locks, leader election
Best whenYou're early; budget allows; code is hard to changeYou're past the ceiling; you have ops maturity

In practice you do both. Run reasonably-sized boxes, scale them up until the next size doubles your bill for 30% more capacity, then scale out from there. The mistake is doing pure horizontal early — you pay the coordination tax before you need to.

The stateless rule — the only invariant that matters

Horizontal scaling depends on one rule: each request can be handled by any instance, with no information held in the instance from a previous request. State lives in shared services (database, cache, object store), not in the application server's memory. Break this rule and the whole architecture falls apart.

Sessions
Don't store sessions in app memory. Use signed cookies (JWTs) for stateless auth, or Redis for traditional server-side sessions that any instance can read.
Uploads in progress
Don't keep multipart uploads on local disk. Stream straight to object storage with a per-upload ID; the next request reconnecting can attach by ID.
Caches
Local in-process caches are fine when stale data is acceptable, but the source of truth must be elsewhere. For shared cache, use Redis or Memcached.
WebSockets
The connection itself is stateful but the messages don't have to be. Use a pub/sub layer (Redis, NATS) so any node can deliver to any client.
Locks
Don't use language-level locks (mutex, semaphore) for cross-instance coordination. Use Redis locks, ZooKeeper, etcd, or — better — design out the need for distributed locks.

The patterns that go with horizontal scaling

Load balancer

The front door. Distributes requests across the fleet, removes dead nodes, terminates TLS. Without it, horizontal scaling is just N machines clients can't find.

Service discovery

How instances find each other. DNS-based (Consul, Cloud Map), client-side (Eureka, Ribbon), or platform-native (Kubernetes Services). Picks one.

Distributed cache

Redis or Memcached. The shared "fast tier" every instance can hit. Pre-warm on deploy, monitor hit ratio, plan for failover.

Centralised logging

SSH-ing to one of N boxes is no longer a debugging strategy. Ship structured logs to a central store (CloudWatch, Datadog, ELK) from the start.

Distributed tracing

One request now touches 5 services across 3 zones. Tracing (OpenTelemetry, Jaeger, X-Ray) is how you correlate them. Add the SDK before you need it.

Auto-scaling

Horizontal scale only pays off if the fleet shrinks during quiet hours. Auto-scaling on CPU + RPS, with both a floor (never below N) and a ceiling (never explode the bill).

The N+1 / N+2 capacity rule

Run more capacity than you need. The convention in production is N+2: enough to lose two nodes (one failed, one in maintenance) and still serve peak traffic. N+1 is acceptable for non-critical services. N alone is a recipe for paged engineers.

The arithmetic: if peak traffic needs four nodes, run six. The headroom isn't waste — it's the buffer between "graceful degradation" and "page everyone." When you hit autoscaling triggers, you should still have enough headroom for the new instance to come up cleanly without the existing fleet falling over while it boots.

Deploy strategies for horizontal fleets

Rolling deploy
Replace instances 5-10% at a time. The default. Cheap, slow, low blast radius. Watch out for instances flapping during health checks.
Blue-green
Run two full fleets, switch traffic between them. Rollback is instant (flip the switch back). Costs 2× during the switchover.
Canary
Deploy to 1% of traffic. Watch metrics for 10 minutes. If clean, expand to 5%, then 25%, then 100%. Catches bad deploys before they hit everyone.
Shadow / dark launch
Send real traffic to the new version but discard the response. Tests behaviour without affecting users. Used for risky DB migrations and rewrites.

The hard cases

Sticky state that won't die. Horizontal scaling fails when something — a session, an in-memory queue, a long polling subscription — quietly accumulates on one instance. When it dies, that state is gone. Audit your services for "what would lose data if this process restarted right now?" The list should be empty or tiny and explicit.
The thundering herd on autoscale. Traffic surges, autoscaler spawns 50 new instances. They all boot, all hit the database for warm-up queries, all hit the cache cold. The fleet falls over harder than if you'd never scaled. Mitigations: pre-warm caches before adding instances to the LB, stagger boot times, add slow-start to the load balancer.
Coordination overhead is real. Every distributed lock, every consensus operation, every cross-region round-trip adds latency. At enough scale, the coordination tax is bigger than the work. The fix is to design fewer cross-instance dependencies — partition by user_id so each request only needs one shard, etc.

Practical defaults

  1. Default to "vertical first, horizontal second." A reasonably-sized box covers more ground than people expect.
  2. Make every service stateless on day one. Even if you have one instance, write it as if you'll have ten.
  3. Run N+2 capacity for production-critical paths.
  4. Auto-scale on CPU + RPS together. Single-signal autoscaling is brittle.
  5. Deploy with canary or rolling at minimum. Never push to 100% in one step on a service with users.
  6. Centralised logs and traces from day one. Adding them later is painful.
  7. Health checks should hit a real code path (DB connection, cache, external dep). Static /ping hides slow degradations.
Found this useful?