Distributed systems fail. The discipline isn't preventing failure — it's deciding which failures are acceptable and engineering for everything else to be visible, bounded, and recoverable.
Scale alone is not enough. Distributed systems have to be resilient to network partitions, hardware failure, traffic spikes, downstream outages, and the worst category of all, operator error. This page covers the patterns and the platform that turn fragile systems into ones that recover by themselves: orchestration on Kubernetes, circuit breakers, timeouts, retries, bulkheads, graceful degradation, and load shedding.
Why distributed systems break
Distributed systems fail in ways monoliths don't. A monolith either works or doesn't — there's one process, one machine, one health state. A distributed system has dozens of processes, dozens of machines, and partial failures everywhere. Six failure modes show up over and over.
- Network partition
- One zone can't talk to another. Both sides are alive. Both think the other is dead. Whichever invariant you most cared about (consistency, availability) is now broken.
- Slow downstream
- A dependency hasn't failed; it's just slow. Threads pile up waiting for it. Your service runs out of capacity. The slowness propagates upward.
- Cascading failure
- One service degrades. Its callers retry harder. The retries amplify load on the next service down. Within minutes the entire dependency graph is on fire.
- Thundering herd
- A cache expires, 10,000 instances all hit the database simultaneously, the database falls over. Or 10,000 nodes restart at once and all hit a config service at boot.
- Split brain
- The cluster has a leader. The network partitions. Each side elects a new leader. Now you have two leaders, two states, and a divergence problem when the partition heals.
- Operator error
- A bad deploy. A wrong DNS update. A SQL DELETE with no WHERE. Studies put this at ~30% of major incidents. Build for it.
The six resilience patterns
Each pattern addresses a specific failure mode. Most production services use all six. The patterns are well-known, well-named, and worth deploying as defaults, not afterthoughts.
- Timeouts
- Every network call must have a finite, short timeout. Default infinite timeouts are how outages last hours instead of minutes. Set the timeout shorter than the caller's timeout, always.
- Retries
- Retry transient failures (5xx, network errors), not client errors (4xx). Use exponential backoff with jitter, capped at 3-5 attempts. Carry an idempotency key so the retry is safe.
- Circuit breakers
- Track recent failures; if the rate exceeds a threshold, "open" the circuit and fail fast for a cooldown period. Stops a sick downstream from poisoning the caller's thread pool.
- Bulkheads
- Isolate critical paths so a noisy neighbour can't drain shared resources. Separate connection pools, separate thread pools, separate queues per dependency.
- Graceful degradation
- When a non-essential dependency dies, return a partial answer rather than no answer. Recommendations service down? Show the page without recommendations.
- Load shedding
- When overloaded, shed lowest-priority traffic before everything dies. Health-check probes get priority; the dashboard's analytics page can wait. Better partial service than total outage.
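The circuit-breaker pattern described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not a production implementation: the class name CircuitBreaker, the CircuitOpenError exception, the min_calls guard, and the injectable clock are all assumptions made for the example. The threshold, window, and cooldown values mirror the defaults suggested later on this page.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical sketch: sliding-window failure rate, cooldown, single trial call."""

    def __init__(self, failure_threshold=0.5, window_seconds=30.0,
                 cooldown_seconds=30.0, min_calls=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.min_calls = min_calls      # assumption: don't trip on a tiny sample
        self.clock = clock
        self.results = deque()          # (timestamp, succeeded) pairs inside the window
        self.opened_at = None           # None means the circuit is closed

    def _record(self, succeeded):
        now = self.clock()
        self.results.append((now, succeeded))
        # Drop results that have aged out of the sliding window.
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()

    def _failure_rate(self):
        if len(self.results) < self.min_calls:
            return 0.0
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def call(self, func, *args, **kwargs):
        trial = False
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            trial = True                # cooldown elapsed: let one call probe the dependency
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(False)
            if trial or self._failure_rate() >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self._record(True)
        if trial:
            self.opened_at = None       # probe succeeded: close and start fresh
            self.results.clear()
        return result
```

While the circuit is open, callers get CircuitOpenError immediately instead of tying up a thread on a dependency that is unlikely to answer. A concurrent version would need a lock around the state and a way to admit only one trial call at a time.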
Timeouts — the rule that surprises people
Every call has a timeout, and the timeout for a downstream call is shorter than the timeout for the call above it. If the user's request has a 5-second budget, the downstream call must have less — say 4.5 seconds — so the layer above can do something useful with the remaining 500ms (return a graceful error, fall back to a cache, log the timeout).
The opposite — downstream timeouts longer than upstream — is a classic outage shape: the upstream gives up, the user retries, but the downstream is still busy serving the original abandoned request. Resources stack up; the system spirals. Always shorter going down.
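One way to keep downstream timeouts shorter than upstream ones is to pass a single deadline through the request and derive each call's timeout from what is left. The sketch below assumes a hypothetical Deadline helper; the 0.5-second reserve matches the 5-second / 4.5-second example above, and the injectable clock exists only to make the behaviour easy to check.

```python
import time


class Deadline:
    """Hypothetical helper: an absolute deadline shared by every layer handling one request."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def downstream_timeout(self, reserve_seconds=0.5):
        """Timeout to hand to the next call down, keeping reserve_seconds for this layer."""
        return max(0.0, self.remaining() - reserve_seconds)


# Usage sketch: a request with a 5-second budget.
deadline = Deadline(5.0)
timeout = deadline.downstream_timeout()   # just under 4.5 seconds at this point
# The value would then be passed to the client call, for example
# urllib.request.urlopen(url, timeout=timeout) from the standard library.
```

Because the timeout is computed from the remaining budget rather than hard-coded, a request that has already spent three seconds in a queue hands its dependency 1.5 seconds, not 4.5. A result of 0.0 means there is no time left and the layer should fall back without calling down at all.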
Retry semantics — the part everyone gets wrong
- Don't retry 4xx
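- 4xx means "you sent something invalid." Retrying the same request changes nothing. Spend the budget on something else. The usual exceptions are 408 and 429, which signal a timing or rate problem rather than a bad request.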
- Retry 5xx and network errors
- These are transient: a node restart, a brief overload, a timeout. With backoff and jitter they usually succeed.
- Idempotency keys are mandatory
- If the call is non-idempotent (POST that creates a payment), the retry might double-charge. Send an idempotency key with every retryable request; the server dedupes.
- Exponential + jitter
- Without jitter, all retrying clients hit the recovering service at the same moment. With jitter (each backoff times a random 0-100% factor), the load is spread.
- Cap retries
- Three to five attempts max, total time bounded by the user's budget. Retrying forever is how a transient outage becomes a permanent one.
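The retry rules above fit together roughly as follows. This is a sketch under stated assumptions: TransientError and ClientError are stand-ins for "5xx or network error" and "4xx", call_with_retries is a hypothetical helper, and the operation is assumed to accept an idempotency key and forward it to the server. The jitter is the 0-100% random factor described above.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Stands in for a 5xx response or a network error."""


class ClientError(Exception):
    """Stands in for a 4xx response; never retried."""


def call_with_retries(operation, max_attempts=3, base_delay=1.0, budget_seconds=5.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random.random):
    # One key for all attempts, so the server can recognise and dedupe a repeat.
    idempotency_key = str(uuid.uuid4())
    started = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise                     # cap reached: surface the failure
            # Exponential backoff (1x, 2x, 4x ...) scaled by a random 0-100% factor.
            delay = base_delay * (2 ** (attempt - 1)) * rng()
            if clock() - started + delay > budget_seconds:
                raise                     # waiting would blow the caller's budget
            sleep(delay)
        # ClientError is deliberately not caught: it propagates on the first attempt.
```

The sleep, clock, and rng parameters are there so the backoff schedule can be exercised without real waiting; in normal use the defaults apply.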
Kubernetes — the orchestration platform that became the default
Kubernetes is the default control plane for running distributed services. Its job: take a declarative spec ("I want 5 replicas of this image with these resources"), reconcile reality to that spec, and keep doing it forever. The resilience features it bakes in:
- Liveness probes
- "Is this container alive?" Failing liveness restarts the container. Best for catching deadlocks. Cheap; do it.
- Readiness probes
- "Is this container ready to serve traffic?" Failing readiness removes it from the load balancer without restarting. Use during slow start-up or graceful shutdown.
- PodDisruptionBudgets
- "Don't take more than 1 of these pods down at once for voluntary disruptions." Prevents node drains from killing a quorum-sized service.
- Resource requests and limits
- Reserve CPU/memory so noisy neighbours can't starve you. Limits are the bulkhead at the kernel level.
- HorizontalPodAutoscaler
- Scale replicas based on CPU, memory, or custom metrics. Combine with Cluster Autoscaler so the underlying nodes also scale.
- Rolling updates with maxSurge/maxUnavailable
- Replace pods gradually. Combined with readiness probes, traffic only moves to a new pod once it reports ready, so deploys can complete with zero dropped requests.
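To show how the features above sit together in one spec, here is a small script that builds a Deployment and a PodDisruptionBudget as plain dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The service name checkout, the image URL, the probe paths /healthz and /ready, the port, and every numeric value are placeholders chosen for illustration, not recommendations.

```python
import json

LABELS = {"app": "checkout"}   # hypothetical service name

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {
        "replicas": 5,
        "selector": {"matchLabels": LABELS},
        # Rolling update: add one new pod at a time, never dip below the desired count.
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
        },
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {
                "terminationGracePeriodSeconds": 30,
                "containers": [{
                    "name": "checkout",
                    "image": "registry.example.com/checkout:1.0.0",   # placeholder image
                    "ports": [{"containerPort": 8080}],
                    # Liveness failure restarts the container.
                    "livenessProbe": {
                        "httpGet": {"path": "/healthz", "port": 8080},
                        "periodSeconds": 10,
                        "failureThreshold": 3,
                    },
                    # Readiness failure only removes the pod from Service endpoints.
                    "readinessProbe": {
                        "httpGet": {"path": "/ready", "port": 8080},
                        "periodSeconds": 5,
                        "failureThreshold": 2,
                    },
                    "resources": {
                        "requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits": {"cpu": "500m", "memory": "512Mi"},
                    },
                }],
            },
        },
    },
}

pod_disruption_budget = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "checkout"},
    "spec": {"maxUnavailable": 1, "selector": {"matchLabels": LABELS}},
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [deployment, pod_disruption_budget]}
print(json.dumps(manifest, indent=2))
```

Assuming the script is saved as manifest.py, its output could be piped to kubectl apply -f - against a test cluster. A HorizontalPodAutoscaler would be a third item in the same list and is left out here to keep the sketch short.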
The hard cases
Practical defaults
- Timeouts on every network call. Default for HTTP: 5 seconds. For internal RPC: 1 second. Tune per-call once you measure.
- Retries: max 3 attempts, exponential 1s/2s/4s, jitter ±20%, only on 5xx and network errors, only on idempotent or idempotency-keyed operations.
- Circuit breakers in front of every external dependency. Trip on 50% failure rate over a 30-second window. Cooldown 30s.
- Liveness + readiness probes on every Kubernetes deployment, with resource requests + limits. PodDisruptionBudgets on anything serving real traffic.
- Graceful shutdown: drain → finish in-flight → exit cleanly within 30s.
- Game day exercises: regularly kill production pods, drop network connectivity, fill disks. The first time you find these problems should not be during an incident.
- Postmortems for every page. Blameless. Read the SRE workbook for the template.
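The graceful-shutdown default in the list above (drain, finish in-flight work, exit within 30 seconds) can be sketched as a small in-process tracker. GracefulServer and its method names are hypothetical; a real service would call shutdown() from its SIGTERM handler and return is_ready() from its readiness endpoint, and the wiring to an actual HTTP or RPC server is omitted.

```python
import threading


class GracefulServer:
    """Hypothetical sketch: tracks in-flight requests and refuses new ones while draining."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def is_ready(self):
        """What a readiness probe would report: False as soon as draining starts."""
        with self._cond:
            return not self._draining

    def handle(self, work):
        with self._cond:
            if self._draining:
                raise RuntimeError("draining: not accepting new requests")
            self._in_flight += 1
        try:
            return work()
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()   # wake shutdown() if it is waiting

    def shutdown(self, timeout=30.0):
        """Stop accepting work, then wait for in-flight requests.

        Returns True if everything finished within the timeout, False otherwise.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

If shutdown() returns False, the process has run out of grace period and should exit anyway; the 30-second default here is meant to line up with the time the platform allows before it kills the container.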