Quorums & majorities
A quorum is the smallest set of nodes you need to hear from before you trust a decision. The core idea is geometric: if every quorum overlaps every other quorum by at least one node, no two quorums can disagree unless that overlapping node lies to one of them. Paxos, Raft, Dynamo, PBFT, and Spanner all build on this single property. The differences are about how big the quorums are and what "lying" means.
The intuition
Pick any rule for committing a decision that needs "a majority" of N nodes to agree. With N = 5, a majority is 3. Any two majorities of size 3 drawn from a group of 5 must share at least one node. There are only 5 nodes to choose from, and 3 + 3 = 6 > 5, so the pigeonhole principle forces an overlap.
That overlap is the whole game. If decision A was approved by majority Q1, and decision B was approved by majority Q2, then some node x sits in both Q1 and Q2. Node x either approved both A and B, or it lied to one of the proposers. As long as nodes don't lie (the crash-fault model) two conflicting decisions can never both be committed.
The geometric proof, generalised
For crash faults, quorums of size N/2 + 1 overlap by at least one node and
tolerate f = (N-1)/2 failures. For Byzantine faults, where nodes can lie
freely, overlap by one isn't enough, because the one overlapping node might be
the liar. You need the overlap to contain at least one honest node.
Work it through. With N nodes and up to f Byzantine, two quorums of size q overlap by at
least 2q − N nodes. Of those, at most f can be Byzantine, so you need
2q − N ≥ f + 1, i.e. q ≥ (N + f + 1) / 2. Combine with the
requirement that the quorum itself contains a majority of honest nodes
(q − f ≥ f + 1, so q ≥ 2f + 1), and the cleanest solution is
N ≥ 3f + 1 with q = 2f + 1. With N = 4 you tolerate
f = 1; with N = 7 you tolerate f = 2.
Majority quorums (Paxos, Raft)
The classical setting: crash faults only, quorum size ⌊N/2⌋ + 1. For N = 3
that's 2; for N = 5, three; for N = 7, four. Failure tolerance is
f = ⌊(N − 1) / 2⌋. The cluster keeps making progress as long as a majority
is alive and can talk to each other.
Odd N is preferred because adding a fifth node to a 4-node cluster raises failure tolerance from one to two for the cost of one more replica, and the quorum size grows from 3 to 3, not 3 to 4, so write latency doesn't get worse. A 4-node cluster is strictly worse than 3: you pay for an extra replica, tolerate the same one failure, and risk split-votes during leader election that 3 or 5 don't.
Dynamo-style flexible quorums (R + W > N)
Amazon's 2007 Dynamo paper generalised quorums to two tunable numbers: W,
how many replicas a write must reach, and R, how many a read must
query. As long as R + W > N, every read overlaps every write by at least
one replica, so a read is guaranteed to see at least one copy with the latest written
value, which it can spot by version vector.
This decouples consistency from a fixed majority. You pick the trade-off:
| Setting | R + W > N? | Behaviour |
|---|---|---|
| N=3, R=2, W=2 | Yes (4 > 3) | Balanced strong consistency; Dynamo default |
| N=3, R=1, W=3 | Yes (4 > 3) | Fast reads, slow writes; read-heavy workloads |
| N=3, R=3, W=1 | Yes (4 > 3) | Fast writes, slow reads; write-heavy workloads |
| N=3, R=1, W=1 | No (2 ≤ 3) | Eventual consistency; lowest latency, may serve stale |
The Dynamo paper measures the cost: at N = 3 with R = W = 2, write latency is set by the second-slowest of three replicas. With R = 1, reads cost one network hop; with W = 1, writes do.
Flexible Paxos — overlap only where it matters
Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman, 2016. The observation: Paxos's safety argument only needs the Phase 1 quorum (the prepare quorum, Q1) to intersect every Phase 2 quorum (the accept quorum, Q2). It does not require Q2 to be a majority, nor does it require Q1 ∩ Q2 to contain a majority.
So you can set |Q1| + |Q2| > N instead of demanding both be majorities. With
N = 6, Q1 = 5 and Q2 = 2 is safe: any prepare-quorum of 5 overlaps any accept-quorum of 2.
That cuts the steady-state write cost from waiting on 4 nodes (majority of 6) to waiting
on 2, at the cost of expensive leader elections. Worth it when leader churn is rare,
which is the common case.
Howard's "Paxos vs Raft" line of work pushed this further. CASPaxos, EPaxos, and the "Flexible Paxos" library all build on the same insight: classical Paxos overpaid on steady-state writes for a generality nobody actually needed.
Witness replicas — voting without data
A witness replica votes in the quorum but doesn't store the data, only metadata: log positions, version vectors, commit indices. Spanner uses witnesses in some geo-distributed configurations; Cassandra's lightweight transactions can use a similar role; Azure Cosmos's multi-region setups use them too.
The math is the same. With N = 5 where 3 are full replicas and 2 are witnesses, a majority of 3 still commits a write, but only 3 nodes need disk space. You pay full storage cost for 3 instead of 5, and the witnesses still let you tolerate 2 failures and avoid split-brain. The catch: witnesses can vote but can't serve reads, so during a failover you may need to repair the data side from a surviving full replica before fully resuming service.
Byzantine quorums — 3f + 1
Castro & Liskov's PBFT (1999) made Byzantine consensus practical. With N = 3f + 1 nodes
and quorums of size 2f + 1, any two quorums overlap in
2(2f+1) − (3f+1) = f + 1 nodes, and since at most f of those can be
Byzantine, at least one is honest. That honest overlap carries the prior decision
forward, just as in crash-fault Paxos.
Numbers worth knowing: N = 4 tolerates f = 1 Byzantine; N = 7 tolerates f = 2; N = 10 tolerates f = 3. The cluster gets expensive fast: you pay for three replicas per tolerated faulty node, plus the message complexity of PBFT (O(N²) per agreement, three communication phases). Modern BFT protocols like HotStuff (used by Diem/Aptos) keep the 3f + 1 bound but cut message complexity to O(N) per phase by pipelining and using threshold signatures.
Read quorums vs write quorums
Two different guarantees. Write quorum (W) is about durability. Once a write has reached W replicas, you've committed enough copies that losing fewer than W of them doesn't lose the write. Read quorum (R) is about freshness. Reading from R replicas guarantees that at least one of them took part in the most recent overlapping write.
The dial moves cost between the two paths. R = 1, W = N is the extreme read-optimised setup: every read is a single hop, but every write has to land on every replica before it commits, so a single slow node stalls all writes. R = N, W = 1 flips it: writes are fast and cheap, reads have to scan every replica and reconcile by version.
Most production systems sit near R = W = ⌈(N+1)/2⌉ because it balances read and write latency around the same percentile of replica response times. Tail-latency-sensitive workloads sometimes push W higher (fewer surprises on later reads) at the cost of write latency.
Real-world tunings
Each system exposes the quorum dial differently. Cassandra calls them "consistency levels"; MongoDB calls them "write concerns" and "read concerns"; Dynamo (the AWS product, not the paper) hides them but uses similar mechanics under the hood.
-- Cassandra CQL — change consistency per session
CONSISTENCY QUORUM;
SELECT * FROM users WHERE id = ?;
-- Or per query (driver API). Common levels:
-- ONE — fastest, may read stale
-- QUORUM — N/2 + 1, strong consistency with R = W = QUORUM
-- LOCAL_QUORUM — quorum within the local datacentre only
-- ALL — all replicas, no failure tolerance
-- EACH_QUORUM — quorum in every datacentre (writes only)MongoDB write concerns: {w: 1} means acknowledge after the primary
only (fast, can lose writes on failover); {w: "majority"} waits for a
majority of the replica set (default for production); {w: 2} or
{w: 3} name explicit counts; {w: "all"} waits for
every replica (rare; one slow node stalls everything).
// MongoDB — durable write
db.orders.insertOne(
{ id: 42, total: 199.00 },
{ writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);
// j: true forces fsync on the primary before ack;
// "majority" requires N/2 + 1 acknowledgements.Defaults are revealing. Cassandra ships with consistency level ONE; the operator picks QUORUM if they need it. Dynamo (the paper) ran most production tables at R = 2, W = 2, N = 3. Spanner runs Paxos majorities per shard and combines them with TrueTime for external consistency. The differences reflect what each system optimises for: Cassandra for raw throughput, Dynamo for shopping-cart-style merge-friendly data, Spanner for global transactions.
Quorum failure under partition
When a network partition splits N nodes into a majority side and a minority side, only the majority side can form a quorum — by definition. The minority side has two choices:
- CP: block. Reject writes (and often reads) until the partition heals. Raft and majority-quorum systems do this. The minority side stays consistent with the majority by refusing to act.
- AP: accept and reconcile. Take writes on both sides, then merge once the partition heals (last-write-wins, CRDTs, or app-level resolution). Dynamo and Cassandra at R = W = 1 do this.
This is the CAP choice made concrete by the quorum settings. Spanner picks CP and pays for it with a globally distributed TrueTime clock. Cassandra defaults to AP and pays for it with eventual consistency and hinted handoff. Neither is wrong. They're optimised for different failure modes and different definitions of correct.
Quorum strategies compared
| Strategy | Quorum size | Tolerates | Used by |
|---|---|---|---|
| Single-master | 1 (the master) | 0 master failures | Classic MySQL primary, single-node Postgres |
| Paxos / Raft majority | ⌊N/2⌋ + 1 | ⌊(N−1)/2⌋ crashes | etcd, Consul, ZooKeeper, Spanner shards |
| Dynamo R + W > N | tunable R and W | varies by tuning | Cassandra, Riak, original Dynamo |
| Flexible Paxos | |Q1| + |Q2| > N | varies; faster steady state | Research (Howard et al.), CASPaxos variants |
| Witness replicas | majority including witnesses | same as majority, less storage | Spanner geo, Azure Cosmos, Cassandra LWT |
| Byzantine 2f + 1 (PBFT, HotStuff) | 2f + 1, with N = 3f + 1 | f Byzantine faults | Diem/Aptos, Tendermint, blockchain BFT |
Further reading
- Lamport — Paxos Made Simple — the original majority-quorum protocol, written in the gentlest form Lamport could manage.
- DeCandia et al. — Dynamo: Amazon's Highly Available Key-value Store — where R + W > N was popularised, with production numbers from Amazon.com.
- Howard, Malkhi, Spiegelman — Flexible Paxos: Quorum Intersection Revisited — the 2016 paper that loosened the majority requirement.
- Gilbert & Lynch — Brewer's Conjecture and the Feasibility of CAP — the formal proof that connects quorum choices to CAP trade-offs.
- Castro & Liskov — Practical Byzantine Fault Tolerance — PBFT, the first Byzantine consensus protocol that ran at production speed.
- Corbett et al. — Spanner: Google's Globally-Distributed Database — per-shard Paxos majorities combined with TrueTime; the production playbook for global CP.