04 / 20

Topics / 04

CAP and PACELC

Brewer's CAP theorem is the most-cited and most-misread result in distributed systems. The honest reading is narrow: during a network partition, a replicated system must choose between consistency and availability. PACELC (Abadi, 2010) adds the part most production systems actually live with: the latency-vs-consistency tradeoff when the network is healthy, which is almost all the time.

The original CAP claim

Eric Brewer, PODC 2000 keynote. The slide read roughly: of consistency, availability, and partition tolerance, a distributed system can guarantee at most two. The talk was a working engineer's rule of thumb from Inktomi's experience, not a formal theorem, but the three-letter mnemonic took over the conversation for a decade.

"Consistency" here meant linearisability: every read returns the most recent completed write across the whole system. "Availability" meant every request from a non-failed node eventually returns a non-error response. "Partition tolerance" meant the system keeps working when the network drops messages between subsets of nodes. Brewer's claim was that you could pick CA, CP, or AP, never all three.

What CAP actually says

Seth Gilbert and Nancy Lynch formalised CAP in 2002, and the precise statement is narrower than the marketing version. It is a statement about behaviour during a partition: if the network splits the cluster, a replicated register cannot be both linearisable and available. At least one side must either block or return stale data.

The reframing that matters: partition tolerance is not an option you trade away. Networks partition. Switches reboot, BGP flaps, TCP connections hit retransmission timeouts. A system that doesn't tolerate partitions doesn't stop being subject to them. It just behaves badly when they happen. So the honest reading of CAP is not "pick two of three" but "during a partition, pick C or A. There is no third option".

Why "CA" is a category error. A single-datacentre Postgres primary isn't a CA system that has escaped CAP. It's a CP system whose availability collapses when the primary's switch fails. The partition is still there. You've chosen consistency over availability, and you've simply bought a small partition risk.

Why CAP is widely misread

Vendors marketed databases as "CP" or "AP" as if it were a fixed attribute, like horsepower. In reality, almost every replicated system is fully available when the network is healthy, and only makes a CAP-style choice during the small slice of time it is partitioned. Calling a database "AP" is shorthand for "during a partition, it will keep accepting writes and reconcile later". It says nothing about its steady-state behaviour.

The other common misread is that CAP applies to whole systems. It applies to a single read or write against a single replicated object. A database can be CP for its primary-key lookups, AP for its session store, and have a separate analytical replica that is neither. The "CP database" label hides this.

PACELC — the part you live with daily

Daniel Abadi's 2010 refinement adds CAP's missing clause. If partitioned, choose between Availability and Consistency. Else (no partition), trade off Latency and Consistency. Hence PACELC: PA/EL, PA/EC, PC/EL, PC/EC.

The "ELC" half is where most engineering decisions actually happen. Strong consistency on a geo-replicated dataset costs a cross-region round-trip on every write, and often on reads too: single-digit to low-hundreds of milliseconds depending on geography. A system that relaxes consistency in steady state (PA/EL or PC/EL) can serve from the nearest replica without coordination, usually a 10-100x latency improvement.

PACELC quadrant   Behaviour during partition   Behaviour when healthy
---------------   --------------------------   ----------------------
PA / EL           keep serving, diverge        local reads, low latency, weak consistency
PA / EC           keep serving, diverge        coordinate on writes for strong consistency
PC / EL           refuse writes on minority    local reads, low latency
PC / EC           refuse writes on minority    coordinate always — strongest, slowest

Worked examples

The labels get more concrete with real systems. Cassandra is the textbook PA/EL system: writes go to any replica with tunable consistency levels (ONE, QUORUM, ALL), and in steady state most deployments run at LOCAL_QUORUM or ONE to keep latency under 10 ms. During a partition it keeps accepting writes on both sides and reconciles with last-write-wins once the partition heals.

MongoDB with the default majority write concern is closer to PC/EC: writes block until a majority of the replica set acknowledges, and the minority side stops accepting writes during a partition. CockroachDB is squarely PC/EC by design. Every write is a Raft round-trip across a majority of replicas, and minority partitions go read-only for affected ranges.

Amazon DynamoDB and the original Dynamo paper are PA/EL: eventually consistent reads by default, with an option to pay double for a strongly-consistent read that does a leader round-trip. Google Spanner is the interesting case: externally consistent (linearisable) across the planet, so PC/EC, but TrueTime's bounded clock uncertainty lets commits wait out the uncertainty window in single-digit milliseconds instead of a full cross-region round-trip. Spanner doesn't beat CAP. It makes the EC latency tolerable.

System	CAP	PACELC	Notes
Cassandra	AP	PA / EL	Tunable per-query; LWW reconciliation
DynamoDB	AP	PA / EL	Strong reads available at 2x cost
Riak	AP	PA / EL	Vector clocks; siblings on conflict
MongoDB (majority WC)	CP	PC / EC	Minority side rejects writes during partition
CockroachDB	CP	PC / EC	Per-range Raft; serialisable by default
etcd / ZooKeeper	CP	PC / EC	Coordination service; small data, strong guarantees
Spanner	CP	PC / EC	TrueTime bounds commit-wait to single-digit ms

Harvest and yield

Armando Fox and Eric Brewer's 1999 paper "Harvest, Yield, and Scalable Tolerant Systems" came before CAP and is, in some ways, more useful. Instead of a binary CA choice, they treat harvest (the fraction of the requested data the response actually reflects) and yield (the fraction of requests answered successfully) as continuous, independent dimensions.

A search engine that returns results from 95% of its shards because one shard timed out has harvest = 0.95, yield = 1.0. A bank that refuses transactions on its minority partition has yield < 1.0 but harvest = 1.0 on what it does serve. Most real engineering decisions are about which knob to give up first, and at what point the degraded response stops being useful, not about a clean "C or A" toggle.

How often partitions actually happen

A common dismissal of CAP is "we don't have partitions, we run on AWS". The measurements say otherwise. In a multi-AZ deployment in us-east-1, inter-AZ partitions of more than a few seconds happen on the order of once a month; AWS's own postmortems for the 2011, 2015, and 2017 EBS-and-network events are public. Transient TCP-level micro-partitions, connections that stall for 5-30 seconds while retransmits drain, happen every few minutes inside a single rack under load.

Coda Hale's 2010 essay You Can't Sacrifice Partition Tolerance made the point as bluntly as it can be made: a system that "chooses CA" is one that pretends partitions don't happen, which means it has no defined behaviour when they do. Pick C or A, and define what the other side does.

The non-obvious rule. Run partition-injection tests in CI. The number of production systems whose owners cannot describe what happens during a 60-second partition is far higher than the number whose owners can. Jepsen-style testing shows the difference between "we chose CP" and "we hoped for CP".

The Jepsen taxonomy

Since 2013, Kyle Kingsbury's Jepsen project has tested dozens of databases under induced partitions and clock skew. A recurring finding: the consistency a system advertises is usually stronger than the consistency it delivers under partition. MongoDB's majority write concern lost acknowledged writes for years before the protocol was rewritten. Cassandra's lightweight transactions (LWT), the Paxos-based compare-and-set, degraded silently under load.

etcd, CockroachDB, FoundationDB, and recent versions of MongoDB hold up well in Jepsen runs. Many "AP" systems hold up fine too, because they don't claim much. The interesting failures are systems advertising PC/EC that deliver something weaker once the test use starts dropping packets. The lesson is not that distributed databases are broken. It is that the marketing label is rarely the operational guarantee.

# Pattern for partition-injection in a test cluster
# (Jepsen uses iptables; you can do the same in a docker network.)

iptables -A INPUT  -s 10.0.0.2 -j DROP
iptables -A OUTPUT -d 10.0.0.2 -j DROP
# ...drive workload for 60s, recording acks and observed reads
iptables -D INPUT  -s 10.0.0.2 -j DROP
iptables -D OUTPUT -d 10.0.0.2 -j DROP
# then check: are all acked writes visible? any lost? any read a value never written?

When to pick C, when to pick A

The decision is almost never global. It's per-operation. A shopping-cart "add item" should pick A: a briefly duplicated item is annoying, a failed checkout is revenue lost. The original Dynamo paper used this exact example. Reconcile later, deduplicate on display.

A bank account debit should pick C: two concurrent debits that both succeed during a partition are an overdraft you can't undo. For this reason most ledger systems use a single Raft group per account or per shard, and accept that the minority side refuses transactions until the partition heals.

A user profile is the boring case in the middle. Reading a stale display name is fine; writing concurrent updates from two devices and losing one is annoying but usually recoverable. Either choice works. Pick the one that costs less to operate.

Operation	Pick	Why
Shopping-cart add	A	Lost write = lost revenue; duplicates are recoverable
Bank account debit	C	Double-spend is unrecoverable
Session token validate	A	Stale auth for 30s tolerable; outage isn't
Inventory reservation	C	Overselling has real cost
User profile update	either	Operational cost dominates
Distributed lock	C	The whole point is mutual exclusion

Common misconceptions

"Spanner solves CAP." No. Spanner is a PC/EC system that refuses writes on a minority partition like any other CP system. TrueTime doesn't change the CAP tradeoff. It lowers the steady-state latency cost of being EC, which is a PACELC "E" improvement. Spanner's own documentation says it is CP.

"NoSQL means AP." False generalisation. HBase, MongoDB with majority write concern, etcd, and FoundationDB are all CP. The AP-by-default systems (Cassandra, Riak, Dynamo) made specific design choices about Dynamo-style eventual consistency. That isn't a feature of being "NoSQL".

"CAP applies to the whole system." It applies to a single read or write against one replicated object. Treat it as a per-operation property and the marketing labels stop carrying weight.

"You can have CA if you never partition." You will partition. The frequency is set by your operator's network engineering, not by your wishes. Design what happens during the partition, even if it's "go read-only and page an on-call".

CAP and PACELC

The original CAP claim

What CAP actually says

Why CAP is widely misread

PACELC — the part you live with daily

Worked examples

Harvest and yield

How often partitions actually happen

The Jepsen taxonomy

When to pick C, when to pick A

Common misconceptions

Further reading

05 — Consistency models