Availability vs consistency

Network partitions happen. Cables get cut, switches reboot, a region briefly loses its uplink. When one happens, a replicated system has to pick: refuse writes on the minority side to keep reads coherent, or accept writes on both sides and reconcile later. CAP says you can't have both. The real question isn't which is right — it's which one your product can survive losing.

The honest reading of CAP

Eric Brewer's 2000 keynote said you could pick two of three: consistency, availability, partition tolerance. The popular reading mangled it into a menu where you walk up and order any two, as if a team sat in a room and decided to skip partition tolerance the way you'd skip dessert. That's not what the theorem says, and the misreading has cost a lot of engineers a lot of confused design reviews. The careful version is much narrower, and once you have it the rest of the topic falls into place.

CAP, restated. A replicated stateful system that hits a network partition has to choose: reject requests on one side (giving up availability), or accept writes that may conflict across sides (giving up consistency). Partition tolerance isn't optional — partitions happen — so the only real choice is CP or AP.

The phrase that gets dropped is "during a partition." CAP is not a claim about your system in steady state. When the network is healthy, a well-built distributed database hands you strong consistency and stays available, and it does this every second of every day. The theorem only bites at the moment two nodes can no longer talk to each other and a client shows up wanting to read or write. In that moment, and only in that moment, the system faces a fork it cannot avoid: answer the client with possibly-stale or possibly-conflicting data, or refuse to answer until it can be sure the answer is right. That is the entire content of CAP. Everything else people attach to it is folklore.

It helps to be precise about the three words, because each has a technical meaning that's narrower than the English word suggests. Consistency here means linearizability: every read sees the most recent write, as if there were a single copy of the data and all operations happened in one order. This is a different promise from the "C" in ACID, which is about application invariants: "the database stays valid against its constraints." Availability means every request that reaches a non-failed node gets a non-error response, eventually. That is not "the system has good uptime," but the strict promise that a live node never replies with "I can't help you right now." Partition tolerance means the system keeps functioning when messages between nodes are dropped or delayed. Read with those definitions, the theorem is almost obvious: if the two sides of a partition can't coordinate, a node that insists on always answering must sometimes answer without knowing the truth, and a node that insists on always knowing the truth must sometimes decline to answer.

"CA" isn't a real category for distributed systems. A single non-replicated database is "CA" in a trivial way: there is one copy, so there is nothing to partition from, and the question never comes up. The moment you add a second node and replicate, you're back in CAP territory, because now there are two copies that can disagree and a network that can stop them from reconciling. People who say "we run a CA system" almost always mean "we run on one machine and haven't met the failure yet," or "we have replication but we've never thought about what happens when it breaks." Neither is a design; both are a bill that arrives later. If your data lives on more than one node and you care what happens when those nodes lose contact, you are choosing CP or AP whether or not you said so out loud.

Healthy, you get both. The fork between CP and AP opens only during a partition and closes when it heals.

Try it: partition the cluster

Two regions, replicated. Toggle the partition and switch between a CP and an AP design. Watch what each side does when the link drops. The point isn't the toy; it's the muscle memory — when you read "the network partitioned," you should picture exactly this and immediately ask which side refuses and which side keeps going. For a fuller version with read paths, quorum sizing, and conflict counts, the standalone CAP simulator lets you drive a larger cluster.

[Interactive diagram: node A (primary, us-east-1, majority side) and node B (replica, eu-west-1, minority side), joined by a replication link. In the normal state both nodes serve reads and writes.]

State: normal

Normal operation. Writes on either node replicate to the other. Reads on either side see the same data within replication lag.
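
If you can't run the widget, the same behaviour fits in a few lines of Python. This is a toy sketch, not a real replication protocol: the `Cluster`, `Node`, and `Unavailable` names are invented for illustration, and "majority" is just a flag on node A rather than an actual quorum count.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """A CP node refusing a request it cannot safely serve."""


@dataclass
class Node:
    name: str
    in_majority: bool                          # assumed side during a partition
    data: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)    # keys written while partitioned


class Cluster:
    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.a = Node("A", in_majority=True)
        self.b = Node("B", in_majority=False)

    def _peer(self, node: Node) -> Node:
        return self.b if node is self.a else self.a

    def write(self, node: Node, key: str, value: str) -> None:
        if not self.partitioned:
            # Healthy link: the write lands on both copies.
            node.data[key] = value
            self._peer(node).data[key] = value
            return
        if self.mode == "CP" and not node.in_majority:
            # CP: the minority side gives up availability.
            raise Unavailable(f"node {node.name} is on the minority side")
        # Majority side (CP) or either side (AP): accept locally only.
        node.data[key] = value
        node.dirty.add(key)

    def heal(self) -> list:
        """Reconnect the nodes and return the keys left in conflict."""
        self.partitioned = False
        both = self.a.dirty & self.b.dirty
        conflicts = sorted(k for k in both if self.a.data[k] != self.b.data[k])
        for src, dst in ((self.a, self.b), (self.b, self.a)):
            for key in src.dirty - both:
                dst.data[key] = src.data[key]   # one-sided writes just copy over
        self.a.dirty.clear()
        self.b.dirty.clear()
        return conflicts
```

Set `partitioned = True`, write the same key on both nodes, then call `heal()`. In CP mode the write on node B raises `Unavailable` and healing reports no conflicts. In AP mode both writes succeed and healing hands you a conflict to reconcile.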

CP vs AP, concretely

The labels stay abstract until you watch real systems make the call. The pattern that emerges from looking at a handful of them is that the right answer always comes from the cost of being wrong, not from the data store's marketing. Here are four, two of each, chosen because the reasoning is visible.

CP — Spanner. Google's planet-scale relational database. TrueTime, a service backed by synchronised atomic clocks and GPS receivers, gives every transaction a timestamp with a known error bound, and Paxos groups replicate each shard. Together they deliver external consistency: a global order that matches real time. During a partition the minority replicas of a shard can't form a quorum, so they stop accepting writes for that shard; clients error out and retry against the majority side, which still has a quorum and keeps going. The trade-off the product can afford is plain. Spanner backs ads billing, payments, and inventory where a transaction crossing regions has to be correct, and the brief write stall on the minority side during a partition is far cheaper than a ledger that disagrees with itself. Google paid for atomic clocks specifically so they could keep consistency without giving up too much latency. That is a CP system being honest about its choice and spending money to soften the cost.

CP — a payments ledger on Postgres. Take a service that moves money, running Postgres with synchronous replication to a standby in another availability zone. Synchronous means a write isn't acknowledged until the standby has it. If the standby becomes unreachable, the primary has a choice baked into its configuration: keep waiting (writes block, the system is CP and currently unavailable on the write path) or drop to async and proceed alone (now it's AP and risks losing the most recent commits if the primary then dies). A team that moves money configures the first. A double charge or a lost deposit is a refund, a support ticket, a regulator's attention, and a loss of trust. A short window where new charges fail and the client retries is annoying and survivable. The cost asymmetry decides it: block the write, stay correct.
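
The choice "baked into the configuration" can be modelled as a policy on the primary. The sketch below is a simplified model, not Postgres itself: `Primary`, `OnStandbyLoss`, and the method names are hypothetical, records are assumed unique, and a blocked write is represented as "no acknowledgement" rather than an actual wait.

```python
from enum import Enum


class OnStandbyLoss(Enum):
    BLOCK = "block"        # keep waiting for the standby (CP)
    GO_ASYNC = "go_async"  # acknowledge alone (AP)


class Primary:
    def __init__(self, policy: OnStandbyLoss) -> None:
        self.policy = policy
        self.standby_reachable = True
        self.local_log = []     # commits durable on the primary
        self.standby_log = []   # commits the standby has confirmed

    def commit(self, record: str) -> bool:
        """Return True if the client gets an acknowledgement."""
        if self.standby_reachable:
            self.local_log.append(record)
            self.standby_log.append(record)
            return True
        if self.policy is OnStandbyLoss.BLOCK:
            return False        # modelled as "no ack"; a real primary would wait
        self.local_log.append(record)
        return True

    def lost_if_primary_dies(self) -> list:
        """Acknowledged commits that exist only on the primary."""
        return [r for r in self.local_log if r not in self.standby_log]
```

With `BLOCK`, `lost_if_primary_dies()` stays empty no matter how long the standby is gone, and the price is that `commit` stops acknowledging. With `GO_ASYNC`, every commit acknowledged during the outage shows up in that list, which is the money you would be explaining to a customer after a failover.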

AP — a chat product on Cassandra. A messaging service storing a firehose of messages across a Cassandra cluster, typically written and read at a low consistency level such as ONE or LOCAL_QUORUM, so that neither side of a partition needs the other to acknowledge a write. During a partition both sides keep accepting writes; the ring is built to take them. (At a cluster-wide QUORUM the minority side would start rejecting writes, which is exactly the CP behaviour this design avoids.) When the partition heals, the replicas exchange what they missed, and conflicts resolve by last-write-wins on timestamps. The visible effect during an incident is small: a few messages may land briefly out of order, or a delivery receipt may lag. Almost no one notices, and no message is refused. The alternative, telling users they can't send because two data centers can't reach each other, would be a product crisis, the kind that trends on social media. Slight reordering is invisible; an outage is a headline. AP is the obvious call.
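
Last-write-wins is simple enough to sketch. The version below is an illustration of the idea, not Cassandra's reconciliation code: `Cell` and `lww_merge` are made-up names, timestamps are assumed to be writer-supplied microseconds, and ties are broken by comparing values so that both sides pick the same winner.

```python
from typing import Dict, NamedTuple


class Cell(NamedTuple):
    timestamp_us: int   # writer-supplied timestamp (assumed microseconds)
    value: str


def lww_merge(left: Dict[str, Cell], right: Dict[str, Cell]) -> Dict[str, Cell]:
    """Merge two replicas' views, keeping the newest cell for each key."""
    merged = dict(left)
    for key, cell in right.items():
        current = merged.get(key)
        # Tuple comparison: the higher timestamp wins, and equal timestamps
        # fall back to the larger value so the result is order-independent.
        if current is None or cell > current:
            merged[key] = cell
    return merged
```

Note what the merge does not do: it never reports a conflict. If both sides edited the same key during the partition, the older edit simply disappears. That is fine for a chat message and exactly the "quiet failure" you would not accept for a ledger.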

AP — a product catalogue or DNS. Reads of a catalogue tolerate a few seconds of staleness without anyone caring. A price or a description that's three seconds behind is fine; a 503 on the product page during a partition loses sales immediately. So the catalogue is served from caches and read replicas that stay up and stale rather than going dark. DNS is the same shape taken to its logical end: records have a TTL, resolvers cache aggressively, and the whole system is eventually consistent by design. When you update a record, the world sees it gradually as caches expire. Nobody would trade DNS's availability for the guarantee that every resolver has the latest record the instant you publish it. The system is AP on purpose, and the internet runs on it.
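
"Up and stale rather than dark" is a small amount of code. Here is a minimal sketch of a TTL cache that serves the last known value when the origin can't be reached. `StaleTolerantCache` and its `fetch` callback are hypothetical names, and treating `ConnectionError` as "the origin is partitioned away" is an assumption of the example.

```python
import time
from typing import Callable, Dict, Tuple


class StaleTolerantCache:
    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}   # key -> (value, expires_at)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                 # fresh enough: no origin round trip
        try:
            value = self._fetch(key)
        except ConnectionError:
            if entry is not None:
                return entry[0]             # origin unreachable: stay up and stale
            raise                           # nothing cached: no answer to give
        self._entries[key] = (value, now + self._ttl)
        return value
```

The TTL is the staleness you accept on a good day; the `except` branch is the staleness you accept on a bad one. Both are product decisions, which is why they belong in configuration someone has actually reviewed.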

Notice what the four have in common. None of them chose CP or AP because of a database brand or a blog post. Each chose by asking what a wrong answer costs against what an unavailable answer costs, and picking the cheaper failure. A ledger's wrong answer is a financial event; its unavailable answer is a retry. A chat app's wrong answer is a reordered message; its unavailable answer is a furious user. Same theorem, opposite conclusions, because the cost structure is opposite.

Push the question harder

"We need consistency" answers a question nobody asked. It's the system-design equivalent of "we need it to be fast" — true of almost everything, useless as a constraint. The job is to turn that reflex into a real decision, and the way you do that is by interrogating it. Four follow-ups force the abstraction into something you can actually build against.

Across what surface? Consistency is not a property of a company; it's a property of an operation. Most systems want strong consistency for a few operations (commit a payment, decrement inventory, change a password) and are happy with eventual for nearly everything else (load a profile picture, render a feed, count likes). Asking "do we need consistency" at the company level produces nonsense. Ask it per operation and the answer is usually obvious.
For how long? "Eventual" with a hard upper bound of 100 ms is a completely different product than unbounded eventual. The first means "occasionally a read is a tenth of a second behind"; the second means "a read might be behind by minutes during an incident, and you have no guarantee when it catches up." Bounded staleness — eventual, but never more than X behind — is usually what people actually want when they say "consistent," and it's far cheaper to provide than full linearizability.
What does a violation cost? Put a number on it. Two users seeing different leaderboard positions for half a second: zero dollars, zero tickets, nobody notices. Two users both being sold the last seat on a flight: a refund, an apology, a rebooking, and possibly a regulator. The cost of a consistency violation is the single number that should drive the whole decision, and most teams never write it down.
What does the failure look like? CP fails loudly: requests return errors, dashboards light up, on-call gets paged, and you know immediately. AP fails quietly: writes succeed on both sides and the data diverges, and you may not find out until a customer reports something strange days later. A loud failure you can detect and recover from. A quiet one rots. Pick the failure mode your team and your tooling can actually catch.
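
Writing the cost down can be as literal as the sketch below. It compares the expected cost of accepting a request that may turn out wrong against the cost of refusing it, per operation. The `Operation` fields and every number in the usage note are invented for illustration; the point is the comparison, not the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    violation_cost: float        # cost of one wrong answer
    refusal_cost: float          # cost of one refused request
    conflict_probability: float  # chance an accepted request turns out wrong


def cheaper_failure(op: Operation) -> str:
    """Pick the partition behaviour with the lower expected cost per request."""
    expected_ap_cost = op.conflict_probability * op.violation_cost
    expected_cp_cost = op.refusal_cost
    return "CP" if expected_ap_cost > expected_cp_cost else "AP"
```

With made-up numbers, `Operation("book_last_seat", 400.0, 0.50, 0.02)` comes out CP, because a 2% chance of a 400-unit mistake costs more than a refused request. `Operation("leaderboard_read", 0.0, 0.01, 0.30)` comes out AP, because a wrong answer costs nothing. Same function, opposite answers, one product.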

The real answer in a design interview is rarely "we'd use CP" or "we'd use AP." It's "CP here, because losing the write costs money. AP there, because losing the read costs a user. Here's the boundary between them, and here's how the two halves talk to each other." A candidate who draws that boundary is showing the thing the question is really testing: that they think in costs and surfaces, not in slogans.

Run this per feature, not per company. The same product lands on both sides, and the boundary between them is the design.

PACELC — the second axis

CAP has a blind spot, and Daniel Abadi named it in 2010. The theorem only describes what happens during a partition, and partitions are rare — a healthy production system spends almost all of its life with every node able to reach every other node. So if CAP were the whole story, the trade-off would almost never matter, and yet engineers wrestle with consistency every day. The reason is that there is a second trade-off that's live even when the network is perfect, and PACELC is the extension that captures it.

Read PACELC as a sentence: if Partitioned, then A or C; Else, L or C. The first half is just CAP — under a partition you pick availability or consistency. The second half is the new part. When the network is healthy and there's no partition to force your hand, you still face a choice on every read and write: latency or consistency. To give a strongly consistent answer, a node often has to talk to other nodes — confirm it has the latest write, reach a quorum, wait for a leader. That round trip costs milliseconds. If you're willing to relax consistency and answer from local state, you can reply immediately. So even with a flawless network, strong consistency has a latency tax, and PACELC makes you decide whether to pay it.
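
The latency tax is easy to put numbers on. A rough sketch, assuming the coordinating node sends to all peers in parallel and counts itself as the first acknowledgement; `ack_latency_ms` and the round-trip figures in the usage note are illustrative, not measurements.

```python
from typing import Sequence


def ack_latency_ms(local_ms: float, peer_rtts_ms: Sequence[float],
                   acks_needed: int) -> float:
    """Latency until `acks_needed` replicas have answered, local node included."""
    if not 1 <= acks_needed <= len(peer_rtts_ms) + 1:
        raise ValueError("acks_needed must be between 1 and the replica count")
    remote_acks = acks_needed - 1
    if remote_acks == 0:
        return local_ms                     # answer from local state
    # Requests go out in parallel, so we wait for the k-th fastest peer.
    kth_fastest_peer = sorted(peer_rtts_ms)[remote_acks - 1]
    return max(local_ms, kth_fastest_peer)
```

Take a local read at 1 ms with peers 35 ms and 80 ms away. Answering locally costs 1 ms, a majority of three costs 35 ms, and waiting for everyone costs 80 ms. Nothing is broken in any of those cases; that spread is the Else branch.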

Two letters per system: what it does under partition, and what it does the rest of the time. The second letter is the one you live with daily.

Walk the rows. Spanner is PC/EC: under a partition it picks consistency, and when healthy it still picks consistency and pays for it in latency, which is exactly why Google built TrueTime, to keep that healthy-state latency tax small without giving up consistency. DynamoDB in its default mode is PA/EL: available under partition, and fast (so, eventually consistent) the rest of the time, though you can ask it for a strongly consistent read when you need one. CockroachDB is PC/EC: strong in both modes, and it accepts the higher latency that comes with always coordinating. Cassandra is PA/EL: it picks the cheap, available, low-latency option in both modes, and leans on per-query tuning when you need more.

This second axis is where most of your real decisions live, because partitions are rare and healthy operation is constant. "Can we serve this read from a local replica and accept that it might be a few hundred milliseconds behind, to cut our P99 in half?" is a PACELC question — it's about the Else branch, latency versus consistency, with no partition anywhere in sight. Teams that only know CAP miss this entirely and assume consistency is free until the network breaks. It isn't; it costs latency every single day. The full CAP & PACELC deep dive has the Jepsen test results behind each of these classifications, since vendor claims and measured behaviour don't always agree.

How this maps to consistency models

CAP and PACELC tell you that you're trading consistency for something. The consistency models tell you what exactly you're getting when you dial it down, because "consistent" and "eventual" are the two ends of a spectrum with useful stops in between. You don't have to memorise the whole taxonomy to make good decisions, but you should know the shape of it, because picking AP or relaxing the Else branch means choosing a point on this line, not falling off the edge of consistency entirely.

At the strong end sits linearizability, the "C" in CAP. Every operation appears to take effect at a single instant between its start and its finish, and once a write completes, every later read everywhere sees it. The system behaves as if there were one copy of the data and one global clock. This is the strongest practical guarantee and the most expensive, because it forces coordination on the critical path. Just below it, sequential and causal consistency relax the global-clock requirement while keeping order where it matters — causal consistency, for instance, guarantees that if you saw a comment before replying to it, everyone sees the comment before your reply, even if unrelated operations show up in different orders on different nodes. Causal is often the sweet spot: cheap enough to stay fast, strong enough that the visible weirdness of eventual consistency disappears.
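
One common way to track "saw the comment before replying" is a vector clock: each event carries a per-node counter map, and one event causally precedes another if its counters are all less than or equal and at least one is strictly less. The sketch below shows only the comparison, with missing entries treated as zero; it is an illustration of the technique, not how any particular database implements causal consistency.

```python
from typing import Dict

Clock = Dict[str, int]   # node name -> number of events seen from that node


def happened_before(a: Clock, b: Clock) -> bool:
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    nodes = a.keys() | b.keys()
    no_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_greater and some_less


def concurrent(a: Clock, b: Clock) -> bool:
    """True if neither event could have seen the other."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return a_ahead and b_ahead
```

If the comment is stamped `{"alice": 1}` and the reply `{"alice": 1, "bob": 1}`, then `happened_before(comment, reply)` is true, so every replica must show the comment first. An unrelated like stamped `{"carol": 1}` is concurrent with the reply, and replicas are free to show those two in either order.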

At the weak end sits eventual consistency: if writes stop, all replicas converge to the same value, but until then a read may return any recent value, and there's no promise about how stale or how ordered. This is what most AP systems give you by default, and it's perfectly fine for a like count or a view counter. The trap is that "eventual" with no bound can mean "minutes behind during an incident," which is why bounded staleness — eventual, but never more than X seconds or X versions behind — is the model people usually want when they say they need consistency but can't pay for linearizability. It puts a ceiling on the surprise.
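
Bounded staleness is mostly a routing rule. A minimal sketch, assuming you can measure how far a replica is behind and have some way to fall back to the primary; `bounded_read` and its parameters are hypothetical names, not any store's API.

```python
from typing import Callable, Tuple


def bounded_read(replica_value: str, replica_lag_s: float, max_staleness_s: float,
                 read_primary: Callable[[], str]) -> Tuple[str, str]:
    """Serve from the replica only while it is within the staleness bound."""
    if replica_lag_s <= max_staleness_s:
        return replica_value, "replica"     # cheap, and stale by at most the bound
    return read_primary(), "primary"        # replica too far behind: pay for the round trip
```

The bound is the ceiling on the surprise. If the primary is unreachable as well, `read_primary` will fail, and the caller is back to the CAP fork: serve the over-stale replica value or refuse.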

The practical link back to CAP is this: when you choose AP during a partition, you are choosing to serve from somewhere on the weaker part of this spectrum until the partition heals and the replicas reconcile. When you choose CP, you are choosing to refuse rather than drop below linearizability. And on the Else branch, relaxing consistency for latency is exactly the act of sliding from linearizable toward causal or eventual on the everyday path. The consistency models deep dive walks each band with the anomalies it does and doesn't permit, which is the part that actually matters when you're debugging a "this shouldn't be possible" report.

Common misreadings

A handful of confusions show up again and again, in interviews and in production postmortems alike. Each one comes from treating CAP as a fixed label rather than a behaviour under a specific failure.

"We're on Postgres, so we're CP." A single Postgres instance isn't a distributed system at all, so CAP doesn't classify it. CP and AP only become meaningful once you add replication. And the moment you do, the configuration decides the answer: synchronous replication leans CP (writes wait for the replica), while asynchronous replication is effectively AP on the read path — a read served by the replica can lag the primary by an unbounded amount during trouble. "We use Postgres" tells you nothing about where you sit; "we use Postgres with async replicas and read-your-writes routing" tells you everything.
"NoSQL means eventual." Not by itself, and the belief leads people to either over-trust or under-trust their store. DynamoDB does strongly consistent reads when you ask for them, at a cost. MongoDB exposes a tunable read concern and write concern, from fast-and-loose to majority-acknowledged. Cassandra picks consistency per query with levels like ONE, QUORUM, and ALL. The storage engine's shape and the consistency you get are independent dials, and on most modern systems you set the second one per operation. Treating "NoSQL" as a synonym for "eventual" hides the dial that actually matters.
"CAP says pick two." The original sin. CAP says pick C or A during a partition, and partition tolerance is not on the menu because partitions are a fact of the physical network, not a feature you opt into. The rest of the time — which is nearly all the time — you get whatever the system is tuned for, often both strong consistency and high availability at once. Most production systems run strong-by-default and degrade gracefully under partition, which feels like having all three right up until the network breaks and the choice you encoded in your config quietly takes effect.
"Add more replicas to get consistency." Replicas buy durability and read throughput; they do not buy consistency, and past a point they make it harder, because every additional replica is one more thing that can fall behind or be on the wrong side of a partition. Consistency comes from the coordination protocol — quorums, leaders, consensus — not from the number of copies. More replicas with a weak protocol just means more places to read stale data.

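The last point, that consistency comes from the protocol and not the replica count, has a short arithmetic core: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to share at least one replica only when R + W > N. The sketch below states the rule and checks it by brute force over every possible pair of replica sets. The function names are illustrative, and real systems layer read repair, hinted handoff, and timestamps on top of this.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def overlap_guaranteed(n: int, w: int, r: int) -> bool:
    """True if every read set must intersect every write set."""
    return r + w > n


def stale_read_possible(n: int, w: int, r: int) -> bool:
    """Brute force: is there a write set and a read set with no replica in common?"""
    replicas = range(n)
    return any(
        set(write_set).isdisjoint(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

Three replicas written and read at quorum (W = 2, R = 2) always overlap. Nine replicas written and read at ONE (W = 1, R = 1) do not: `stale_read_possible(9, 1, 1)` is true, and adding a tenth replica only adds another place to read stale data.
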
Related on Semicolony

CAP & PACELC: the full deep dive with Jepsen receipts.
Consistency patterns: the five strength bands every database picks from.
Availability patterns: failover shapes and the math behind nines.
CAP simulator: partition the cluster and watch the modes break.
Paper: Building on Quicksand. Pat Helland on what eventual consistency means at Amazon scale.
Replication deep dive: sync vs async and what each costs.
