Interactive

CAP Theorem Simulator

The trilemma, tangibly. Choose any two; toggle the partition; watch the third break.

The CAP theorem states that a distributed data store can guarantee at most two of three properties (Consistency, Availability, and Partition tolerance) when faced with a network partition. Eric Brewer first stated it in a 2000 keynote; Seth Gilbert and Nancy Lynch proved it formally in 2002. The intuition is unavoidable: when two replicas can't talk to each other, every write forces a choice between accepting it (and going inconsistent) or refusing it (and going unavailable). Below, the simulator lets you push on that choice for a two-replica service, and the long-form reference covers PACELC, linearizability, fencing tokens, and where production systems (etcd, Cassandra, Spanner, DynamoDB, MongoDB, CockroachDB) actually land on each axis.

Updated · 30 min read · 5,000 words.

Throughput
342/s
p99 latency
118 ms
Violations
0

Trilemma
CAPCONSISTENCYAVAILABILITYPARTITION TOLERANCECP · HBase, etcdAP · Cassandra, DynamoCA · single-node SQLyour system
Observation
Under a network partition, a system cannot be both consistent and available.
Pick two
Event log 6 events
t=0.02 write(k=x, v=1) → primary ✓
t=0.03 replicate → r2 ✓ r3 ✓
t=0.11 partition: r3 ⇍ r1
t=0.14 read(k=x) @ r3 → timeout
t=0.21 write(k=y) blocked (quorum)
t=0.44 heal → re-sync 2 keys

What you're looking at

The three overlapping circles are the trilemma: Consistency, Availability, and Partition tolerance. The "Pick two" buttons let you select properties, and the Venn fill brightens for whatever is on. Try to turn on all three and one quietly switches off — the control refuses to hold more than two checked at once, which is the whole theorem expressed as a UI constraint. The event log below replays a concrete timeline: a write, its replication, a partition that cuts replica three off, then reads and writes that either time out or block.

Start by toggling C and A together, then reach for P. Watch which letter the control drops to make room. The labels on the diagram show where real systems sit — etcd and HBase choose CP, Cassandra and Dynamo choose AP, and plain single-node SQL gets CA only because it never faces a partition. What should surprise you is that "pick two" is misleading in practice: partitions are not optional, so every distributed store is really choosing between C and A while keeping P.


What is the CAP theorem?

When the network breaks, two replicas have to disagree.

The CAP theorem states that a distributed data store can guarantee at most two of three properties (Consistency, Availability, and Partition tolerance) when faced with a network partition. Eric Brewer stated it in a 2000 keynote; Seth Gilbert and Nancy Lynch proved it formally in 2002. The intuition is unavoidable: when two replicas can't reach each other, every write forces a choice between accepting it (and going inconsistent) or refusing it (and going unavailable). Below, the same idea worked through with two replicas and a single transatlantic write.

Imagine your service runs on two replicas: one in New York, one in London. The replicas synchronise every write so that customers in either city see the same data. Most of the time the link between them works fine; a write in New York shows up in London a few milliseconds later, and the two stay in step. The architecture survives a single-machine crash because either replica can carry the load alone.

One day the transatlantic cable hiccups. For thirty seconds, the New York replica cannot talk to London and London cannot talk to New York. Each replica is healthy on its own; the network between them is the failure. While the cable is down, a customer in New York types a new entry. Your service has a choice (and it has to make it now, not later) between two answers, both bad.

The first option: accept the write. Take the new value, store it locally, return success. The customer is happy; the New York replica has the new data. But London does not. If a customer in London reads the same key right now, they see the old value. Your two replicas have diverged. They are available to their callers but inconsistent with each other.

The second option: refuse the write. Tell the customer "service unavailable; please try again later." Both replicas stay in sync because neither has accepted the new value. Your two replicas are consistent with each other but no longer available to the caller in New York.

There is no third option. You cannot keep the replicas in sync without talking to both of them, and the network does not let you talk to both of them. The argument is not philosophical; it is a theorem. Eric Brewer first stated it in a 2000 keynote and Seth Gilbert and Nancy Lynch proved it formally in 2002. Phrased as a slogan: of the three properties Consistency, Availability, and tolerance to network Partitions, a distributed service can guarantee at most two. Network partitions happen, so what you actually choose is between CP (consistent under partition, may go unavailable) and AP (available under partition, may serve inconsistent data).

The Venn diagram in the simulator shows the trilemma. The simulator's event log walks through the timeline: write succeeds, partition occurs, the system has to choose. Modern stores name their choice up front: etcd is CP, Cassandra is AP, MongoDB lets you pick per write. The next parts unpack the proof, the precise definitions of each letter, and the messier reality of running these systems in production.

PARTITION OCCURS NOW · THE FORCED CHOICEt →cable cutsNY ✓LDN ✓APNY: hat ✓LDN: stale→ divergedCPNY: refusedLDN: refused→ unavailableno third option exists; you pick which guarantee to give up.

Origins of the CAP theorem (Brewer's 2000 PODC keynote)

A keynote slide that became canon.

Eric Brewer first stated CAP in a July 2000 keynote at the ACM Symposium on Principles of Distributed Computing (PODC) in Portland, Oregon. The slide was almost a throwaway: a Venn diagram of three properties (Consistency, Availability, Partition tolerance) with the claim that a distributed service can guarantee at most two. Brewer was the lead architect of Inktomi at the time and had spent the late 1990s wrestling with replication trade-offs in real production search clusters. The slide encoded a working engineer's intuition; he never imagined it would still be cited a quarter century later.

The intuition needed a proof, and two years later Seth Gilbert and Nancy Lynch at MIT supplied one. Their 2002 SIGACT News paper, Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, defined the three properties precisely (Consistency as linearizability, Availability as “every request to a non-failing node receives a response,” Partition tolerance as “the network is allowed to drop or delay messages arbitrarily”) and showed by an asynchronous-network adversary argument that no algorithm can satisfy all three. The proof fits on roughly two pages of careful indistinguishability reasoning. It is also one of the cleanest negative results in distributed systems, alongside Lynch and Fischer's 1985 FLP impossibility (which CAP echoes but does not subsume).

The framing (pick two of three) was sticky and slightly misleading. In the partition-free case, modern systems happily provide both linearizability and availability; the choice is forced only when packets stop flowing. Brewer himself returned to the topic in 2012 with a piece in IEEE Computer, CAP Twelve Years Later: How the “Rules” Have Changed, walking back the “two of three” reading. He argued that the real design question is what a service does during a partition: serve stale or unavailable data, and how to reconcile when the partition heals. That nuance has become the canonical reading among system builders, even as “CAP says pick two” persists in interview question banks.

The historical context matters. In 2000 the dominant programming model was a single-master relational database with a hot standby; by 2007, with Werner Vogels' Dynamo paper at SOSP and Cassandra's open-sourcing, eventual consistency had a name and an audience. CAP was the vocabulary that let architects justify, in a sentence, why their shopping cart did not insist on a global lock.

Brewer's slide also predates the modern cloud. AWS launched S3 in March 2006 and EC2 in August 2006; multi-AZ replication, regional failover, and the entire vocabulary of “availability zones” arrived after the keynote. By the time those primitives were generally available, CAP had already been internalised as a design grammar: one of the few pieces of distributed-systems theory that escaped the academy and reached every architecture review whiteboard.


PACELC: extending CAP to the no-partition case

CAP plus the no-partition case.

Daniel Abadi, then at Yale, published Consistency Tradeoffs in Modern Distributed Database System Design in IEEE Computer in February 2010, introducing PACELC. The mnemonic reads: if Partition, choose Availability or Consistency; Else, choose Latency or Consistency. CAP only constrains the partition case; PACELC names the more frequent reality: that even when the network is healthy, replicating a write to a quorum of nodes in another availability zone costs round trips, and that latency is itself a form of unavailability if your service-level objective is tight.

The taxonomy clusters real systems into four cells. PA/EL (available under partition, latency-favoring otherwise) covers Cassandra, Riak, DynamoDB with eventually consistent reads, Voldemort. PC/EC (consistency at all costs) covers HBase, BigTable, VoltDB, FoundationDB. PA/EC (available during partition, consistent when healthy) describes default MongoDB before WiredTiger and majority writes, plus several CRDT-based systems. The fourth cell, PC/EL, is rare in practice because giving up consistency under partition while paying latency tax in the steady state is the worst of both worlds.

Spanner (Google, OSDI 2012) is the famous boundary case. Spanner advertises external consistency (strict serializability) over a globally distributed database, which on a literal CAP reading is the impossible third corner. The trick is TrueTime: GPS and atomic-clock receivers in every datacenter bound clock uncertainty to roughly 7 ms, and the protocol simply waits out that uncertainty before committing. CockroachDB's HLC (hybrid logical clocks), YugabyteDB's similar approach, and FoundationDB's centralized timestamp oracle all chase the same goal with different hardware budgets. None of these systems repeal CAP; they spend latency to recover linearizability when packets are flowing, and they degrade to unavailable (the C choice) when packets are not.

Abadi's framing has aged better than CAP itself. Real architects almost never face a sustained partition between, say, two AZs in us-east-1; they face microsecond jitter, momentary degradations, and round-trip latency between regions. PACELC captures the bill that gets paid every day, not just on incident days. The 2018 follow-up paper by Abadi, “Consistency Tradeoffs in Modern Distributed Database System Design (revised)”, broadened the catalogue to include CRDT-based stores and bounded-staleness systems like Azure Cosmos DB, whose five named consistency levels (strong, bounded staleness, session, consistent prefix, eventual) can be read as a graduated PACELC dial exposed directly in the SDK.

PACELC · two-phase decision tree (Abadi, 2010)partition?P · yesE · noAvailability or Consistency?Latency or Consistency?PACassandra, RiakPCetcd, HBaseELDynamoDB (eventual)ECSpanner

Linearizability vs serializability vs strict serializability

Three "consistencies" that mean different things.

The C in CAP is linearizability, defined by Maurice Herlihy and Jeannette Wing in their 1990 TOPLAS paper Linearizability: A Correctness Condition for Concurrent Objects. It is a single-object property: every operation appears to take effect atomically at some instant between its invocation and response, and the order respects real time. Once a write returns, every subsequent read on any node must observe that write or a later one. The cost is a coordination round trip on writes; the win is that clients never see a write disappear or reorder.

Serializability is a different animal: the strongest isolation level in transactional databases, defined long before CAP by Bernstein, Hadzilacos and Goodman's 1987 textbook. It says any concurrent execution of transactions is equivalent to some serial order. It is a multi-object property and says nothing about real time. Two transactions running concurrently can be serialized in either order as long as no anomaly is visible. Most relational databases default to weaker isolation (Postgres' default is read committed; MySQL InnoDB's is repeatable read); SERIALIZABLE is opt-in.

Strict serializability (sometimes strong serializability) combines the two: serializable, plus the serial order respects real time. This is what Spanner, FoundationDB, and CockroachDB advertise; CockroachDB confusingly calls it SERIALIZABLE, which is true but understates what they deliver. Aphyr's Jepsen consistency model lattice, hosted at jepsen.io/consistency, is the canonical reference for sorting through the dozen-plus weaker variants: sequential consistency, causal+, snapshot isolation, read-your-writes, monotonic reads.

Why the vocabulary matters: vendors choose words deliberately. “Strongly consistent” in the AWS DynamoDB API means linearizable per-key; “eventually consistent” means convergence with no upper bound on staleness. “Causal consistency” in MongoDB sessions means reads observe the writer's own preceding writes plus their causal predecessors, but other clients may not. “Read-after-write” in S3 (since December 2020) is a per-key linearizable promise for new object creation, not for overwrites. Each phrase encodes a different cost profile and a different bug surface; conflating them is the source of half the postmortems Aphyr writes about.

Linearizable ≠ serializable

Linearizability is a real-time guarantee on a single object. Serializability is a multi-object guarantee with no real-time component. Strict serializability is the conjunction. A database can be serializable but not linearizable (transactions reorder freely), or linearizable per-key but not serializable across keys. CAP's "C" is the first one only — the multi-object cost lives elsewhere.


CAP classification of major databases (etcd, Cassandra, Spanner, MongoDB)

Named systems, named choices.

The CP camp prefers refusing service to serving stale data. etcd, the keystore at the heart of every Kubernetes cluster since 2014, runs Raft (Ongaro & Ousterhout, USENIX ATC 2014) over an odd-numbered ensemble (usually three or five nodes) and refuses writes when a leader cannot reach a quorum. A partition that isolates the minority side blocks all writes there until it heals. The benefit: no Kubernetes object will ever be observed in two contradictory states, which is exactly what you want when the object decides whether a Pod runs. Consul, ZooKeeper, HashiCorp Vault, and CockroachDB's replication layer make the same choice.

HBase and the older Bigtable inherit the CP posture from their single-region-master design: each tablet has exactly one server responsible for it, with HDFS replicating the underlying files. If the region server's lease in ZooKeeper lapses during a network blip, writes to that key range stall until the master reassigns the region. That stall is typically tens of seconds. The Apache HBase reference guide lists this as the “mean time to recovery” trade-off; users accept it for the benefit of single-row strong consistency.

The AP camp prefers serving stale data to refusing service. Cassandra, derived from the ideas in Werner Vogels' 2007 Dynamo paper at SOSP, lets every replica accept writes; conflicts resolve via last-write-wins on a per-cell timestamp, augmented by lightweight transactions (Paxos) when the workload needs it. A partition between two data centers means each side keeps writing; on heal, anti-entropy via Merkle trees reconciles the divergence. Riak, Voldemort, DynamoDB (in eventual mode), CouchDB, and Amazon's own internal SimpleDB ancestor are in this lineage.

MongoDB sits awkwardly: since the PV1 election protocol shipped in version 3.2 (December 2015), a replica set elects a primary via Raft-like majority; the primary handles writes, secondaries replicate. Writes with w:majority are linearizable per-document (CP), but the default w:1 can lose data on primary failover (PA/EC behaviour). The choice is per-write, not per-cluster, which is unusually flexible — and unusually easy to get wrong.

PARTITION TIMELINE · same network event, two posturesCPhealthy · writes ✓partition · minority side REJECTS writesheal · catch upAPhealthy · writes ✓partition · BOTH sides ACCEPT writes (diverge)heal · merge / reconcilet₁ · partitiont₂ · healCP pays availability for consistency; AP pays consistency for availability. The bill is itemized at heal time.
Posture Systems During partition
CPetcd, ZooKeeper, Consul, HBase, BigTable, Spanner, FoundationDB, CockroachDBminority side stops accepting writes
APCassandra, Riak, DynamoDB (eventual), CouchDB, Voldemortboth sides accept; reconcile on heal
CA (impossible)single-node SQL — only as available as the one boxno partition tolerance; one switch fails, service down
Per-write tunableMongoDB w:1 vs w:majority, Cassandra LOCAL_QUORUMchoice deferred to the call site

Failure detection in distributed systems (timeouts, heartbeats, the φ accrual detector)

How a system decides someone is gone.

CAP's adversary is the partition, but real systems do not see partitions; they see missed heartbeats. Every CP system runs a failure detector: a component that, given a stream of pings or gossip messages, declares peers up or down. The classical Chandra-Toueg result (JACM 1996) showed that the weakest failure detector sufficient for consensus is ◊S (eventually strong) and that with crashed-only failures and an asynchronous network, no perfect failure detector exists. Production systems compromise: they tune timeouts and accept the occasional false suspicion.

The most influential practical design is the φ accrual failure detector by Naohiro Hayashibara and colleagues (SRDS 2004). Instead of a binary up/down decision, it outputs a continuous suspicion value: the negative log of the probability that the next heartbeat is still in flight given the observed inter-arrival distribution. Cassandra's gossiper uses φ accrual with a default threshold of 8.0; Akka's cluster module exposes it as akka.cluster.failure-detector.threshold at 8.0 by default. Crossing the threshold marks a peer as suspected; under steady jitter the detector adapts, and a network burp does not produce a cluster-wide stampede.

Heartbeat tuning is its own subfield. Set the timeout too low and a one-second GC pause triggers spurious failover. Set it too high and a dead node keeps holding the leader lease, freezing writes. etcd's default heartbeat-interval is 100 ms and election-timeout 1000 ms; ZooKeeper's tickTime defaults to 2000 ms with syncLimit 5, giving 10-second tolerance. Kubernetes' kubelet reports node status every 10 seconds and the controller marks a node NotReady after 40 seconds; eviction follows after another 5 minutes (--pod-eviction-timeout).

The pathological case is the asymmetric partition: A can hear B, B can hear C, C can hear A, but no full conversation. Or the transient partition: a 30-second blip during which the minority side has not yet stepped down. Aphyr's Jepsen tests have repeatedly found systems that misbehave in these regimes: etcd-3.4.3 (CVE-2020-15113 family), Redis Sentinel circa 2013, MongoDB pre-PV1, RabbitMQ pause-minority, Cassandra LWT under partition. The lesson the field has internalised is that a partition is rarely a clean cut; it is a probabilistic graph that costs you whichever guarantee you wished you had hardened.

# etcd cluster — heartbeat / election timeouts (CP posture)
# /etc/etcd/etcd.conf
name: 'node1'
data-dir: '/var/lib/etcd'
initial-cluster: 'node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380'
initial-cluster-state: 'new'

heartbeat-interval: 100      # ms · leader pings followers
election-timeout: 1000       # ms · 10x heartbeat
auto-compaction-mode: 'periodic'
auto-compaction-retention: '1h'

# Reads default to LINEARIZABLE: round-trip via leader.
# 'serializable' reads bypass the round-trip but may be stale.
client-cert-auth: true
quota-backend-bytes: 8589934592  # 8 GiB

Split-brain and fencing tokens in distributed systems

Two leaders are one too many.

The single most dangerous CAP failure mode is split brain: two nodes simultaneously believing they are the leader, both accepting writes, both convinced they speak for the cluster. The resulting divergence is not just data loss; it is data loss accompanied by acknowledgement, which destroys the trust contract a transactional system makes with its callers. The 2017 GitLab.com outage, the 2012 GitHub MySQL incident, and any number of earlier MySQL replication horror stories trace to this exact pattern.

The defence is quorum. A leader is only legitimate while a strict majority of voters acknowledge its term. Raft, Paxos, and ZAB (ZooKeeper Atomic Broadcast) all enforce this: the deposed leader, on the minority side of a partition, may continue to think it is leader for one heartbeat interval, but its writes will not commit because they cannot collect a majority of acks. When it finally hears from the new term, it steps down. The sole remaining hazard is a write that the old leader accepted but never committed. The protocol's job is to ensure no such write is ever observed.

The complementary defence is fencing tokens, named in Martin Kleppmann's blog post “How to do distributed locking” (February 2016). When a coordinator hands out a lock or a leader lease, it includes a monotonic token. Every downstream operation under that lock must present the token; the storage layer rejects any operation whose token is older than the highest one it has seen. Even if the old leader survives a partition long enough to issue a write, the storage layer refuses it. Google's Chubby (Burrows, OSDI 2006) was the first widely cited system to formalise this; HDFS NameNode HA, Kubernetes lease objects, and Consul sessions all use the same pattern.

The grim corollary is STONITH (“shoot the other node in the head”), a Linux-HA term for power-fencing a suspected former leader via IPMI or a managed PDU. Pacemaker clusters and GFS2 use STONITH because the file-system or shared-disk contract is too brittle to tolerate even a transient split brain. A datacenter that runs STONITH has accepted that CAP's ugly truth applies all the way down to the BMC.

For systems where physical fencing is impossible (multi-region cloud deployments, services running on shared Kubernetes), the pattern shifts to cooperative fencing: leases stored in etcd or DynamoDB with monotonic version counters, lightweight transactions in Cassandra (CAS via Paxos rounds), or the conditional-write primitive in S3 (added 2024). The general rule is that any operation that can outlive its issuer's lease must be made idempotent or token-checked; otherwise CAP's lurking split-brain returns under a different name.


CAP in production: what Jepsen actually finds

When the marketing meets the chaos.

Kyle Kingsbury, writing as Aphyr, started Jepsen in 2013 with the post “Call me maybe: Redis,” which demonstrated that Redis Sentinel could lose acknowledged writes during a five-second partition. Jepsen has since tested forty-plus distributed systems; the project's findings are the closest the industry has to a public empirical record of CAP and PACELC compromises. The methodology is straightforward: run a cluster, drive it with a workload generator, induce partitions and clock skews using Linux iptables and libfaketime, then check the linearizability of the observed history with the Knossos / Elle checkers (Kingsbury & Alvaro, VLDB 2020).

The notable findings have shaped vendor behaviour. etcd 0.4 (2014) was found to lose writes; etcd 3.x switched to a hardened Raft implementation and has been retested cleanly. FoundationDB passed Jepsen tests on first review (an outlier) thanks to its deterministic-simulation testing approach (Will Wilson's 2014 talk “Testing Distributed Systems w/ Deterministic Simulation” is the source). CockroachDB has been re-tested every couple of releases and consistently meets its strict-serializability claim, with caveats around clock-uncertainty configuration. MongoDB pre-3.4 lost writes under partition with default settings; the PV1 protocol fixed the worst cases. Cassandra LWT (lightweight transactions) has had repeated issues under partition that the project has gradually closed.

Beyond Jepsen, the public postmortem record is rich. The 2015 AWS DynamoDB US-East outage (September 20) was a metadata-service overload, not a CAP violation, but it illustrated the cost of consistency: customers experienced elevated error rates because the system chose to refuse rather than serve possibly-stale partitions. The 2017 GitLab.com incident was a multi-hour partition between primary and secondary plus an operator error; the public log is, by the team's own account, the most candid postmortem in the genre. The 2021 Roblox 73-hour outage was an etcd/Consul-style failure detector storm under heavy load.

The lesson across these reports: CAP is not a wall you hit once during architecture review. It is a budget that gets spent every time the network jitters, every time a node garbage-collects for 800 ms, every time a release introduces a slower request path that interacts badly with timeouts. Production CAP discipline is mostly about keeping that budget legible: named in the configuration, observable in the metrics, and exercised in chaos drills.

Modern shops have institutionalised this discipline. Netflix's Chaos Monkey (2011) graduated into the Simian Army, then into Gremlin and AWS's own Fault Injection Simulator (2021); Stripe's Game Days rehearse partition scenarios on a quarterly cadence; Cloudflare's monthly “disaster recovery” drills cut a region offline and observe how the global control plane recovers. The implicit message in every postmortem (from the November 2020 Kinesis outage through the December 2021 us-east-1 cascading control-plane failure) is that CAP-level decisions made years earlier resurface during the worst hour of an incident, and the team that has rehearsed the failure mode resolves it in minutes rather than hours.


Further reading on the CAP theorem

Primary sources, in order.

Found this useful?