Orchestration & resiliency — System Design Handbook

Distributed systems fail. The discipline isn't preventing failure — it's deciding which failures are acceptable and engineering for everything else to be visible, bounded, and recoverable.

Scale alone is not enough. Distributed systems have to be resilient: network partitions, hardware failure, traffic spikes, downstream outages, and the worst category of all, operator error. This page covers the patterns and the platform that turn fragile systems into ones that recover by themselves: orchestration on Kubernetes, circuit breakers, timeouts, retries, bulkheads, graceful degradation.

The circuit breaker. Closed = traffic flows. Open = fail fast (don't pile on a sick service). Half-open = let one request through; if it works, close again.

Why distributed systems break

Distributed systems fail in ways monoliths don't. A monolith either works or doesn't — there's one process, one machine, one health state. A distributed system has dozens of processes, dozens of machines, and partial failures everywhere. Six failure modes show up over and over.

Network partition: One zone can't talk to another. Both sides are alive. Both think the other is dead. Whichever invariant you most cared about (consistency, availability) is now broken.
Slow downstream: A dependency hasn't failed; it's just slow. Threads pile up waiting for it. Your service runs out of capacity. The slowness propagates upward.
Cascading failure: One service degrades. Its callers retry harder. The retries amplify load on the next service down. Within minutes the entire dependency graph is on fire.
Thundering herd: A cache expires, 10,000 instances all hit the database simultaneously, the database falls over. Or 10,000 nodes restart at once and all hit a config service at boot.
Split brain: The cluster has a leader. The network partitions. Each side elects a new leader. Now you have two leaders, two states, and a divergence problem when the partition heals.
Operator error: A bad deploy. A wrong DNS update. A SQL DELETE with no WHERE. Studies put this at ~30% of major incidents. Build for it.

The five resilience patterns

Each pattern addresses a specific failure mode. Most production services use all five. The patterns are well-known, well-named, and worth deploying as defaults, not afterthoughts.

Timeout

Every network call must have a finite, short timeout. Default infinite timeouts are how outages last hours instead of minutes. Set the timeout shorter than the caller's timeout — always.

Retry with backoff

Retry transient failures (5xx, network errors), not idempotent failures (4xx). Exponential backoff + jitter, max 3-5 attempts. Carry an idempotency key so the retry is safe.

Circuit breaker

Track recent failures; if the rate exceeds a threshold, "open" the circuit and fail fast for a cooldown period. Stops a sick downstream from poisoning the caller's thread pool.

Bulkhead

Isolate critical paths so a noisy neighbour can't drain shared resources. Separate connection pools, separate thread pools, separate queues per dependency.

Graceful degradation

When a non-essential dependency dies, return a partial answer rather than no answer. Recommendations service down? Show the page without recommendations.

Load shedding

When overloaded, shed lowest-priority traffic before everything dies. Health-check probes get priority; the dashboard's analytics page can wait. Better partial service than total outage.

Timeouts — the rule that surprises people

Every call has a timeout, and the timeout for a downstream call is shorter than the timeout for the call above it. If the user's request has a 5-second budget, the downstream call must have less — say 4.5 seconds — so the layer above can do something useful with the remaining 500ms (return a graceful error, fall back to a cache, log the timeout).

The opposite — downstream timeouts longer than upstream — is a classic outage shape: the upstream gives up, the user retries, but the downstream is still busy serving the original abandoned request. Resources stack up; the system spirals. Always shorter going down.

Retry semantics — the part everyone gets wrong

Don't retry 4xx: 4xx means "you sent something invalid." Retrying changes nothing. Burn budget on something else.
Retry 5xx and network errors: These are transient: a node restart, a brief overload, a timeout. With backoff and jitter they usually succeed.
Idempotency keys are mandatory: If the call is non-idempotent (POST that creates a payment), the retry might double-charge. Send an idempotency key with every retryable request; the server dedupes.
Exponential + jitter: Without jitter, all retrying clients hit the recovering service at the same moment. With jitter (each backoff times a random 0-100% factor), the load is spread.
Cap retries: Three to five attempts max, total time bounded by the user's budget. Retrying forever is how a transient outage becomes a permanent one.

Kubernetes — the orchestration platform that became the default

Kubernetes is the default control plane for running distributed services. Its job: take a declarative spec ("I want 5 replicas of this image with these resources"), reconcile reality to that spec, and keep doing it forever. The resilience features it bakes in:

Liveness probes: "Is this container alive?" Failing liveness restarts the container. Best for catching deadlocks. Cheap; do it.
Readiness probes: "Is this container ready to serve traffic?" Failing readiness removes it from the load balancer without restarting. Use during slow start-up or graceful shutdown.
PodDisruptionBudgets: "Don't take more than 1 of these pods down at once for voluntary disruptions." Prevents node drains from killing a quorum-sized service.
Resource requests and limits: Reserve CPU/memory so noisy neighbours can't starve you. Limits are the bulkhead at the kernel level.
HorizontalPodAutoscaler: Scale replicas based on CPU, memory, or custom metrics. Combine with Cluster Autoscaler so the underlying nodes also scale.
Rolling updates with maxSurge/maxUnavailable: Replace pods gradually. Combined with readiness probes, deploys take seconds with zero dropped requests.

The hard cases

Retry storms. Every layer in the stack retries. A single 5xx from the bottom causes 3 retries from the layer above, 9 from the layer above that, 27 from the top. The downstream now sees 30× its real load, gets sicker, fails harder. Mitigations: retry only at the outermost edge, use circuit breakers in the middle, and budget retries with a token-bucket per request.

The graceful shutdown nobody implements. Pod gets killed mid-request. The request errors. The user sees a blip. Mitigations: SIGTERM handler that drains connections, removes the pod from the LB before stopping, finishes in-flight work. Kubernetes gives you 30 seconds by default; use them.

Liveness probes that hide problems. A liveness probe that just returns 200 detects nothing. A liveness probe that hits the database can detect deadlocks but also makes a DB outage look like a service crash, causing pods to restart and worsen the situation. Tune carefully; the probe should detect real "this process is wedged" without being one.

Practical defaults

Timeouts on every network call. Default for HTTP: 5 seconds. For internal RPC: 1 second. Tune per-call once you measure.
Retries: max 3 attempts, exponential 1s/2s/4s, jitter ±20%, only on 5xx and network errors, only on idempotent or idempotency-keyed operations.
Circuit breakers in front of every external dependency. Trip on 50% failure rate over a 30-second window. Cooldown 30s.
Liveness + readiness probes on every Kubernetes deployment, with resource requests + limits. PodDisruptionBudgets on anything serving real traffic.
Graceful shutdown: drain → finish in-flight → exit cleanly within 30s.
Game day exercises: regularly kill production pods, drop network connectivity, fill disks. The first time you find these problems should not be during an incident.
Postmortems for every page. Blameless. Read the SRE workbook for the template.

Orchestration & resiliency.

Why distributed systems break

The five resilience patterns

Timeouts — the rule that surprises people

Retry semantics — the part everyone gets wrong

Kubernetes — the orchestration platform that became the default

The hard cases

Practical defaults