
Availability patterns

Availability is the fraction of time a system answers correctly when asked. This page is the engineer's version: the nines and the downtime they buy you, the building blocks (redundancy, failover, health checks, replication, load balancing), the patterns that protect a system under load (timeouts, retries, circuit breakers, bulkheads), and the math for how availability composes across the things you depend on. Plus the counterintuitive bit: past a point, adding redundancy makes availability worse, not better.


What "available" actually means

Availability is usually defined as uptime divided by total time: the fraction of the year a system was up and serving correct responses. That single number hides a lot. A service that returns errors for one user out of a thousand, all year long, has a very different feel from one that is perfect for 364 days and then goes dark for one. Both can report the same percentage. So before the patterns, fix what you are actually measuring, because the measurement is what the whole discipline is built to defend.

The honest definition is success rate as seen by the people who depend on you. A request that times out, returns a 500, or returns stale garbage is a failed request, even if the server process is technically alive. This is why "the box is up" and "the service is available" are not the same claim. A node can be running, listening on its port, and passing a shallow health check while every real request behind it fails because a downstream database is wedged. Availability lives at the edge where requests meet responses, not at the level of whether a process has a heartbeat.

Two clocks matter when something does break. The first is recovery time: how long from the failure starting to service being restored. The second is data loss: how much of the most recent work disappeared in the failure. The industry names these RTO (recovery time objective) and RPO (recovery point objective). A daily backup gives you an RPO of up to a day; synchronous replication gives you an RPO near zero. Cold spare hardware gives you an RTO of hours; a hot standby already serving traffic gives you an RTO of seconds. Every pattern below moves one or both of those numbers, and the cost of the pattern tracks how close to zero you push them.

One more distinction worth holding: availability is not reliability and not durability. Reliability is whether a request that should succeed does succeed. Durability is whether committed data survives. A system can be highly available and unreliable (it answers fast but often wrongly), or durable and unavailable (your data is safe on disk but nobody can reach it). When someone says "we need higher availability," the first job is to ask which of these they actually mean, because the fixes diverge fast.

The math: what each nine costs

Availability gets reported as a percentage, and the percentages cluster around a count of nines. What matters is not the number but what it means in wall-clock downtime per year. Each extra nine cuts the allowed downtime by roughly ten times, and the engineering bill to get there climbs by about the same factor. The jump from 99% to 99.9% is mostly automation. The jump from 99.99% to 99.999% is multi-region architecture, consensus replication, and an on-call culture that treats minutes as expensive.

Work the table in your head once and it sticks. A year is about 525,600 minutes. Two nines (99%) leaves you 1% of that, which is roughly 3.65 days of downtime you are allowed to spend. Three nines is about 8.8 hours. Four nines is about 52 minutes. Five nines is about 5 minutes, which is shorter than many teams' deploy windows, and that tells you something blunt: at five nines you cannot take the system down to ship, so every change has to be safe while live. The figures below break the same budget down by year, month, week, and day.

Example at 99.99% (four nines): allowed downtime is 52.6 min per year, about 4.3 min per 30-day month, 1.0 min per week, and 8.6 s per day.

99% (two nines): 3.6 days down per year. 99.99% (four nines): about 52 minutes. 99.999% (five nines): 5 minutes — shorter than most deploy windows.

| Nines | Availability | Down / year | Down / month | What it takes |
|-------|--------------|-------------|--------------|---------------|
| Two   | 99%          | 3.65 days   | 7.3 hours    | One async replica, manual recovery |
| Three | 99.9%        | 8.76 hours  | 43.8 min     | Async replica plus automated failover |
| Four  | 99.99%       | 52.6 min    | 4.4 min      | Hot standby across AZs, automated cutover |
| Five  | 99.999%      | 5.26 min    | 26 s         | Multi-region active-active, consensus replication |
| Six   | 99.9999%     | 31.5 s      | 2.6 s        | Hardware and vendor redundancy, near-perfect ops |

Notice what those numbers do to your release process. At three nines you have most of a working day each year to burn, so a careful manual rollback is survivable. At four nines, 52 minutes is the entire annual budget, which means a single bad deploy can eat the whole year in one afternoon. At five nines you cannot afford a human in the loop for routine failures at all; recovery has to be automatic and faster than a person can open a dashboard. The table is not really about percentages. It is about which failures you are still allowed to handle by hand.
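The downtime figures in this section all come from one multiplication: the unavailable fraction times the length of the period. A minimal sketch, assuming a 365-day year; the function name `allowed_downtime_seconds` is illustrative, not from any library.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignores leap years


def allowed_downtime_seconds(availability_pct: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Seconds of downtime permitted in the period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * period_seconds


for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:,.2f} min/year")
```

Passing a different `period_seconds` (a 30-day month, a week, a day) gives the shorter-window budgets.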

How availability composes: serial vs parallel

Real systems are not one box. They are chains and fans of dependencies, and the way those pieces connect decides the whole. There are two shapes, and they pull in opposite directions. Get this wrong and you will set an availability target the architecture cannot physically reach.

In series, every component must be up for the request to succeed. A request that touches the gateway, then the service, then the cache, then the database, fails if any one of them fails. The availabilities multiply, so the result is always lower than the weakest link, and adding more links only drags it down. This is why a long dependency chain is a liability even when each piece looks healthy on its own dashboard.

In parallel, the request succeeds if any one of the redundant copies is up. The failure probabilities multiply instead, so the result climbs fast. Two copies that each fail 1% of the time, and fail independently of each other, together fail only 0.01% of the time, turning two nines into four. This is the entire mathematical reason redundancy works: it converts a single point of failure into a product of small numbers.

[Figure: series versus parallel composition. Series, where every link must hold: gateway (99.9%) → service (99.9%) → cache (99.9%) → database (99.9%) = 0.999⁴ ≈ 99.6% (lower). Parallel, where any copy will do: replica (99%) beside replica (99%) = 1 − 0.01² = 99.99% (higher). Series multiplies uptime down; parallel multiplies downtime away.]
Series dependencies lower the total below the weakest link. Parallel redundancy raises it well above any single copy.

The design lesson falls out directly. Shorten serial chains wherever you can, because each hop is a tax. And put redundancy in parallel at the layers where it counts: the load balancer, the application tier, the database. A system is only as available as its longest serial path allows, and parallel sections are how you claw some of that back.
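The two composition rules fit in a few lines. This is a sketch that assumes component failures are independent, which real systems only approximate; `series` and `parallel` are illustrative names.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component is enough: multiply the failure probabilities."""
    return 1 - prod(1 - a for a in availabilities)


chain = series(0.999, 0.999, 0.999, 0.999)   # gateway, service, cache, database
pair = parallel(0.99, 0.99)                  # two redundant replicas
# A redundant pair sitting inside a serial chain: the functions nest.
mixed = series(0.999, parallel(0.99, 0.99), 0.999)

print(f"chain={chain:.4%} pair={pair:.4%} mixed={mixed:.4%}")
```

Because the functions nest, a whole architecture can be written as one expression and checked against the target before anything is built.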

Compose availability across components

N independent components in series, where any one failure takes the system down, multiply their availabilities together. That's why "99.9% on each of 10 microservices" is much weaker than it sounds.

Example: five 99.9% components in series give a composite availability of 99.501%, or about 1.8 days of effective downtime per year.

Ten 99.9% services in series land at roughly 99% composite — 88 hours of downtime per year. Microservices pay this tax until something happens in parallel.
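Running the same arithmetic backwards answers the planning question: how good must each service be for the chain to hit a target? A small sketch, assuming N identical, independent services in series; both function names are made up for illustration.

```python
def composite(per_service: float, n: int) -> float:
    """Availability of n identical services in series."""
    return per_service ** n


def required_per_service(target: float, n: int) -> float:
    """Availability each of n serial services needs for the chain to meet target."""
    return target ** (1 / n)


print(f"10 x 99.9% in series: {composite(0.999, 10):.3%}")
print(f"each of 10 must reach: {required_per_service(0.999, 10):.4%}")
```

Under these assumptions, a ten-service chain that must deliver three nines needs roughly four nines from every link.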

The building blocks

Every availability story is assembled from the same small kit of parts. Learn the parts and the named patterns later read as combinations, not new ideas.

Redundancy is the foundation: keep more than one of anything that can fail. It comes in two flavours that get confused constantly. Active-active means every copy serves live traffic at once, so a failure just removes capacity and the survivors absorb it. Active-passive means one copy serves while the others wait, and a failure triggers a promotion of a standby. Active-active gives you the fastest recovery and uses your hardware fully, but every copy must handle writes or share state, which is harder to build. Active-passive is simpler and a natural fit for systems with a single writer, like a relational database, at the cost of a cutover gap when the active dies.

Health checks are how the system finds out a copy is bad. A shallow check confirms the process is listening; a deep check exercises the real dependency path, hitting the database the way a request would. Shallow checks are cheap but lie when the process is alive yet useless. Deep checks tell the truth but cost more and can themselves cause cascades if every node hammers a struggling database to prove it is healthy. The check feeds the load balancer and the failover logic, so its quality sets a ceiling on how good your recovery can be: a system cannot route around a failure it cannot detect.

Failover is the act of moving work off a failed copy onto a healthy one. When the health check marks a node dead, the load balancer stops sending it traffic, or a standby gets promoted to active. The speed of this step is your recovery time, and the danger is twofold: failing over too eagerly on a transient blip causes flapping, and failing over to a standby that is itself overloaded or stale turns one outage into two. Good failover is deliberate, with a short confirmation window and a destination you have verified can take the load.
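One common way to get that confirmation window is to require several consecutive failed checks before marking a node down, and several consecutive passes before bringing it back. The sketch below assumes that approach; `HealthTracker` and its thresholds are hypothetical, not taken from any particular load balancer.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Flips a node's state only after a run of contradicting check results."""
    fail_threshold: int = 3   # consecutive failures before marking down
    rise_threshold: int = 2   # consecutive passes before marking up again
    healthy: bool = True
    _streak: int = 0          # consecutive results that disagree with `healthy`

    def record(self, check_passed: bool) -> bool:
        """Feed one health-check result; return the node's current state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state, reset the run
            return self.healthy
        self._streak += 1
        limit = self.fail_threshold if self.healthy else self.rise_threshold
        if self._streak >= limit:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy
```

A single failed check followed by a pass resets the streak, so a transient blip never triggers a failover.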

Replication keeps the copies in sync so a failover lands on current data. Synchronous replication confirms the write on the replica before acknowledging the client, giving an RPO of zero at the cost of write latency and a write that blocks if the replica is unreachable. Asynchronous replication acknowledges first and ships the change after, which is fast and tolerant but loses whatever was in flight when the primary died. The choice is the same dial as everywhere in this area: how much recent data are you willing to lose to keep writes fast and available.
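The difference between the two modes is only where the acknowledgement sits relative to the replica write. A toy in-memory sketch, with no networking and with all names (`Primary`, `ship`, `lost_on_crash`) invented for illustration:

```python
class Primary:
    """Toy primary that replicates to one replica, synchronously or not."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.log: list[str] = []          # writes applied on the primary
        self.replica_log: list[str] = []  # writes that reached the replica
        self._pending: list[str] = []     # acknowledged but not yet shipped

    def write(self, value: str) -> str:
        self.log.append(value)
        if self.synchronous:
            self.replica_log.append(value)  # replica has it before the ack
        else:
            self._pending.append(value)     # ack now, ship later
        return "ack"

    def ship(self) -> None:
        """Background step for async mode: send pending writes to the replica."""
        self.replica_log.extend(self._pending)
        self._pending.clear()

    def lost_on_crash(self) -> list[str]:
        """Acknowledged writes the replica would be missing if the primary died now."""
        return list(self._pending)
```

In synchronous mode `lost_on_crash()` is always empty, which is what an RPO of zero means; in asynchronous mode it holds whatever was written since the last `ship()`.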

Load balancing spreads requests across the healthy copies and is the layer that makes active-active redundancy actually pay off. It is also where health checks turn into action, pulling bad nodes out of the pool within seconds so clients never see them. The balancing policy matters under stress: round-robin is simple but sends traffic to a node that is up-but-slow, while least-connections or latency-aware policies steer away from the struggling one. The mechanics of pools, algorithms, and stickiness live on the dedicated load balancing walkthrough.
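A least-connections policy can be sketched as a one-line selection over the healthy pool. This assumes the balancer already tracks in-flight request counts per node; the function and its arguments are hypothetical.

```python
def pick_least_connections(active: dict[str, int], healthy: set[str]) -> str:
    """Return the healthy node with the fewest in-flight requests."""
    candidates = [node for node in active if node in healthy]
    if not candidates:
        raise RuntimeError("no healthy nodes in the pool")
    return min(candidates, key=lambda node: active[node])


in_flight = {"node-a": 5, "node-b": 2, "node-c": 0}
# node-c failed its health check, so it is excluded despite being idle.
print(pick_least_connections(in_flight, healthy={"node-a", "node-b"}))
```

A slow node accumulates in-flight requests, so this policy steers away from it without needing a separate latency signal.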

[Figure: a load balancer health-checks the pool. Node A and node B are serving, node C has failed its check, and standby D is promoted to refill the pool. Below them, the data tier holds a primary and a replica; replication keeps the standby current.]
Redundancy + health checks + load balancing + replication, working together. The same kit underlies every named pattern below.

Eliminating single points of failure is the discipline of finding the one box, link, or service whose death takes everything down, and putting a second one beside it. The trap is that the SPOF is rarely the obvious server. It is the one load balancer in front of the redundant fleet, the single DNS provider, the shared NAT gateway, the one database the "stateless" services all quietly depend on, the deploy pipeline that, if broken, means you cannot ship a fix. Walk the request path and the control path and ask of each step: if this dies right now, what happens? Anything that answers "everything stops" is a SPOF, redundant fleet behind it or not.

Multi-AZ and multi-region are redundancy applied to the failure domains that take out whole groups of machines at once. Availability zones are isolated datacentres within a region, with separate power and network but low latency between them, so spreading copies across three AZs survives a datacentre fire or power event with no meaningful latency cost. Regions are geographically separate, so multi-region survives an entire region going dark, but cross-region writes add tens to hundreds of milliseconds and force you to confront consistency across distance. The rule that holds for most systems: three copies across three AZs in one region covers nearly everything; reach for multi-region only when the product needs to survive losing a region.

Graceful degradation is the admission that partial service beats no service. When a dependency fails, a well-built system sheds the feature that needed it and keeps the rest working: the product page renders without the recommendations panel, search falls back to a simpler index, the feed shows cached results instead of live ones. The design choice underneath is whether to fail open or fail closed. Fail open means continue when the check is unavailable, which suits a recommendations service whose absence is harmless. Fail closed means deny when the check is unavailable, which is the only safe answer for an authorization or payment check, where serving wrongly is worse than serving nothing. Pick per feature, deliberately, and make the default for anything touching money or access be fail closed.
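The fail-open versus fail-closed decision can be made explicit at the call site rather than left to whatever the exception handler happens to do. A minimal sketch; `guarded` is a hypothetical helper, and catching every `Exception` is a simplification.

```python
from typing import Callable


def guarded(check: Callable[[], bool], fail_open: bool) -> bool:
    """Return the check's verdict, or the chosen default if the check is unavailable."""
    try:
        return check()
    except Exception:  # simplification: treat any error as "check unavailable"
        return fail_open


def unavailable_check() -> bool:
    raise ConnectionError("dependency unavailable")


show_recommendations = guarded(unavailable_check, fail_open=True)   # harmless: continue
allow_payment = guarded(unavailable_check, fail_open=False)         # money: deny
```

Making `fail_open` a required argument forces each feature to state its choice instead of inheriting one by accident.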

Three failover shapes

Cold standby

Backup hardware exists, powered off. When the primary fails, ops boots it, restores state from backup, and points traffic over. RTO of hours to days. RPO depends on backup frequency. Cheap, and fine for "we can be down overnight" workloads.

Examples: Quarterly bookkeeping DB · regulatory archive

Warm standby

Backup is running and receiving async replication. On failure, traffic cuts over in minutes. RTO in minutes, RPO of seconds to minutes (whatever was in flight when the primary fell). The default for most internal databases.

Examples: Postgres async replica · Redis primary/replica

Hot standby (active-active)

All replicas serve live traffic. A failure is invisible to clients — health checks drop the bad node from the LB pool within seconds. RTO in seconds. RPO of zero if writes are synchronously replicated, otherwise small. What every high-availability system is reaching for.

Examples: Sharded Cassandra · DynamoDB · Spanner

Three replication shapes

  • Primary-replica. One node accepts writes, N replicas serve reads. Failover means promoting a replica (election + DNS or connection redirect). The most common shape — Postgres, MySQL, MongoDB all default here.
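  • Multi-primary. Every node accepts writes; conflicts get resolved by timestamp, vector clock, or CRDT merge. Higher write availability, messier consistency story. DynamoDB Global Tables and Cassandra sit here. CockroachDB also accepts writes on every node, but it orders them through consensus rather than merging conflicts after the fact.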
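  • Leaderless. No designated primary. Each key lives on N replicas; a write waits for W acknowledgements and a read queries R replicas, with R + W > N so that every read quorum overlaps the most recent write quorum. Riak, Cassandra in some modes, Dynamo.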

The shape sets your availability ceiling. Primary-replica has a brief write-blocked window during elections. Multi-primary has no such window but produces conflicts you have to resolve. The dial is consistency story versus write availability.
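For the leaderless shape, the quorum condition and its availability cost are simple arithmetic. A sketch using the conventional N, W, R names (replicas per key, write acknowledgements, read responses); the function names are illustrative.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be down while both reads and writes still reach quorum."""
    return n - max(w, r)


print(quorums_overlap(3, 2, 2), tolerated_failures(3, 2, 2))
print(quorums_overlap(3, 1, 1), tolerated_failures(3, 1, 1))
```

Lowering W and R raises the number of failures the system can ride out, and gives up the overlap guarantee in exchange, which is the same consistency-versus-availability dial described above.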

The patterns that protect availability under load

Redundancy handles a node dying. The harder failures are the slow ones: a dependency that has not died but has gone sluggish, and is now dragging every caller down with it. Four patterns guard against that, and they work together. The point of all of them is the same: stop one struggling dependency from consuming the resources of everything that touches it.

Timeouts are the first and most neglected. Every call to anything that can fail must have a deadline, because a call with no timeout will wait forever on a wedged dependency, and while it waits it holds a thread, a connection, and a slice of memory. Enough hung calls and the caller runs out of those resources and falls over too, even though nothing was wrong with the caller. The timeout should be set from real latency data, a little above the P99, not a round number picked by feel. A timeout that is too tight turns slow-but-fine responses into failures; one that is too loose lets a single slow dependency exhaust the caller. The default of "no timeout" is the worst of all and is, depressingly, the library default in many clients.
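Deriving the timeout from measured latency might look like the sketch below. It assumes you have a list of recent latency samples in milliseconds; the 1.2 margin is an arbitrary illustrative choice, and `timeout_from_samples` is a made-up name.

```python
import statistics


def timeout_from_samples(latencies_ms: list[float], margin: float = 1.2) -> float:
    """Timeout in ms: the observed P99 latency scaled by a safety margin."""
    # quantiles(n=100) returns the 99 percentile cut points; index 98 is P99.
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return p99 * margin


samples = [float(ms) for ms in range(1, 1001)]  # stand-in for real measurements
print(f"timeout = {timeout_from_samples(samples):.0f} ms")
```

The result is what you then pass to the client, for example as the `timeout` argument (in seconds) that `urllib.request.urlopen` accepts.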

Retries with backoff recover from the transient errors that are normal in any distributed system: a brief network blip, a node mid-restart, a momentary overload. A retry often succeeds where the first attempt failed. But naive retries are dangerous, because the moment a dependency is struggling, every caller retrying immediately multiplies its load (two instant retries per failed call triples it) at the exact moment it can least afford it, turning a brownout into a full outage. The fix is exponential backoff (wait longer between each attempt) plus jitter (randomise the wait so callers do not all retry in sync). And retries must be bounded and reserved for errors that are actually transient; retrying a 400 or a validation error just wastes the dependency's capacity.
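A bounded retry with exponential backoff and jitter can be sketched as follows. `TransientError` is a hypothetical marker for retryable failures, the delay values are illustrative, and the "full jitter" variant used here (a uniform wait between zero and the backoff cap) is one of several reasonable choices.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s, resets)."""


def retry(call: Callable[[], T], max_attempts: int = 4, base_delay: float = 0.1,
          max_delay: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying only TransientError, with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded: give up after the last attempt
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter: spread callers out in time
    raise ValueError("max_attempts must be at least 1")
```

Anything that is not a `TransientError`, such as a validation failure, propagates on the first attempt without consuming any retries.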

Circuit breakers stop the retries when they are no longer helping. A breaker watches the failure rate of calls to a dependency. While failures stay low it is closed and traffic flows. When failures cross a threshold it opens and fails calls instantly without even trying, which gives the struggling dependency room to recover instead of being pounded by a wall of retries. After a cooldown it goes half-open, letting a trickle of calls through to test the water; if they succeed it closes again, if they fail it re-opens. This converts a slow, resource-draining failure into a fast one, and a fast failure is one you can degrade gracefully around. You can watch the closed-open-half-open dance in the circuit breaker simulator.

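[Figure: circuit breaker state machine. CLOSED (traffic flows) moves to OPEN (fail fast) when failure rate > threshold. OPEN moves to HALF-OPEN (test trickle) once the cooldown has elapsed. HALF-OPEN returns to CLOSED if the test calls pass, and re-opens if a test fails. The breaker turns a slow, draining failure into a fast one you can degrade around.]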
Circuit breaker states. Open trips on too many failures, half-open probes for recovery, closed is normal flow.
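The three states can be sketched in a small class. This version is deliberately simplified: it counts consecutive failures instead of tracking a failure rate, it lets every call through while half-open rather than a trickle, and it is not thread-safe. `CircuitBreaker` and `CircuitOpenError` are illustrative names, not a library API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0                        # consecutive failures seen
        self.opened_at: Optional[float] = None   # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("failing fast; dependency is cooling down")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe while half-open, or too many failures while
            # closed, (re)starts the cooldown.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0       # any success closes the breaker
        self.opened_at = None
        return result
```

Callers catch `CircuitOpenError` and serve the degraded path, which is what makes the fast failure useful.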

Bulkheads contain the blast radius so one struggling dependency cannot starve the others. The name comes from ships, where watertight compartments stop a single hull breach from sinking the whole vessel. In a service it means giving each dependency its own pool of threads or connections, so when one dependency goes slow and its pool fills up, the calls to every other dependency still have their own resources and keep working. Without bulkheads, a single slow downstream can swallow the entire shared thread pool and take down features that had nothing to do with it. With them, the failure stays in its compartment, and the feature that needed that dependency degrades while everything else holds.
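A bulkhead can be as small as one semaphore per dependency. The sketch below rejects calls immediately when a compartment is full rather than queueing them, which is one possible design choice; `Bulkhead` and `BulkheadFullError` are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment full; rejecting instead of waiting")
        try:
            return fn()
        finally:
            self._slots.release()


# One compartment per dependency: a slow search backend can fill its own
# 10 slots without touching the 20 reserved for payments.
search_bulkhead = Bulkhead(max_concurrent=10)
payments_bulkhead = Bulkhead(max_concurrent=20)
```

Rejecting instead of blocking keeps the caller's own threads free, so a saturated compartment degrades one feature rather than hanging the request.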

These four are not alternatives; they layer. A timeout bounds how long any single call can hurt you. Retries with backoff recover the transient failures inside that bound. A circuit breaker stops the retries once the dependency is clearly down. Bulkheads keep all of it confined so the failure of one dependency never spends the resources of another. Together they turn the unavoidable failures of a distributed system from cascading outages into local, survivable, gracefully-degraded blips.

Why more redundancy ≠ more availability

The intuition is: add more replicas, get more nines. True up to a point — past it, you lose nines.

Three reasons:

  1. Coordination overhead. A 5-node Raft cluster needs 3 to agree on every write. A 7-node cluster needs 4. Each added node means a larger quorum to wait for and more replication traffic from the leader, while the chance that at least one node is slow or down at any moment grows roughly linearly with node count. Past about 7 nodes you're usually paying more in latency than you gain in availability.
  2. Correlated failures. A second replica in the same rack doesn't help when the rack power goes out. A second replica in the same region doesn't help when AWS us-east-1 has a bad day. Real availability comes from diverse redundancy (across racks, AZs, regions, providers), not more redundancy.
  3. Operational surface. Each node is one more thing to misconfigure, version-drift, or rot. The chance operators introduce a fault somewhere in N nodes scales roughly with N. Three well-maintained replicas beat seven neglected ones most days.

Rule of thumb: 3 replicas in 3 AZs in 1 region covers most workloads. Go multi-region only when the product needs cross-region availability (payments, planet-scale services). Synchronous cross-region replication adds tens to hundreds of milliseconds to every write.
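Two of the three reasons can be put into numbers. The sketch below uses the standard majority-quorum arithmetic and a deliberately crude correlated-failure model, in which every copy sits inside one shared failure domain (a rack, an AZ) and node failures are otherwise independent. The function names and the model are assumptions made for illustration.

```python
def majority(n: int) -> int:
    """Votes needed for a majority quorum in an n-node cluster."""
    return n // 2 + 1


def quorum_tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return n - majority(n)


def replicated_availability(node_avail: float, copies: int,
                            shared_domain_avail: float = 1.0) -> float:
    """Availability of `copies` replicas that all live in one failure domain."""
    return shared_domain_avail * (1 - (1 - node_avail) ** copies)


for n in (3, 5, 6, 7):
    print(n, majority(n), quorum_tolerated_failures(n))

# Two or seven 99% replicas in a rack that is itself 99.9% available:
print(f"{replicated_availability(0.99, 2, 0.999):.4%}")
print(f"{replicated_availability(0.99, 7, 0.999):.4%}")
```

In this model the shared domain is a hard ceiling: going from two copies to seven barely moves the result, while moving a copy to a different domain removes the ceiling.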

SLO, SLA, error budget

An availability target is a contract between engineering and product:

  • SLO: Service Level Objective. The internal target. "P99 latency < 200 ms and availability > 99.95% over a rolling 30-day window." Set by what users actually need.
  • SLA: Service Level Agreement. The external commitment, with consequences (refunds, penalties) when violated. Usually set looser than the SLO so there's room to miss the internal target without breaching the contract.
  • Error budget. The complement of the SLO. At 99.95% you have 0.05% downtime, about 21.6 minutes per 30-day window. While budget remains, ship features. Once it is spent, freeze the deploy pipeline until you earn it back.
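The error budget is a direct calculation from the SLO and the window. A minimal sketch, assuming downtime is tracked in minutes over a rolling window; both function names are illustrative.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Total downtime the SLO permits over the window, in minutes."""
    return (1 - slo_pct / 100) * window_days * 24 * 60


def budget_remaining(slo_pct: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is overspent."""
    return error_budget_minutes(slo_pct, window_days) - downtime_minutes


budget = error_budget_minutes(99.95)
left = budget_remaining(99.95, downtime_minutes=15.0)
print(f"budget={budget:.1f} min, remaining={left:.1f} min, "
      f"deploys {'allowed' if left > 0 else 'frozen'}")
```

Wiring `budget_remaining` into the deploy pipeline is one way to turn the freeze rule from a policy into a gate.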

Google's SRE book is the reference here. The key idea: 100% availability isn't a goal, it's a sign you're not shipping fast enough. The error budget is what lets engineering move and lets product trust the movement.

Concrete examples by tier

99% — two nines

3.6 days down per year. Fine for internal admin tools, batch reports, dev/staging. One async replica is enough.

99.9% — three nines

8.8 hours per year. Fine for most B2B SaaS, content sites, blogs. Async replica plus automated failover.

99.99% — four nines

52 minutes per year. The bar for serious consumer services. Hot standby in another AZ, automated cutover, blameless postmortems.

99.999% — five nines

5 minutes per year. Telecom, payments, ad bidding, infra services. Multi-region active-active, consensus replication, chaos engineering by default.

99.9999% — six nines

31 seconds per year. Stock exchanges, life-critical systems. Hardware redundancy, multiple vendors, and an operational discipline more expensive than most companies' entire engineering budget.
