08 / 20

Topics / 08

Quorums & majorities

A quorum is the smallest set of nodes you need to hear from before you trust a decision. The core idea is geometric: if every quorum overlaps every other quorum by at least one node, no two quorums can disagree unless that overlapping node lies to one of them. Paxos, Raft, Dynamo, PBFT, and Spanner all build on this single property. The differences are about how big the quorums are and what "lying" means.

The intuition

Pick any rule for committing a decision that needs "a majority" of N nodes to agree. With N = 5, a majority is 3. Any two majorities of size 3 drawn from a group of 5 must share at least one node. There are only 5 nodes to choose from, and 3 + 3 = 6 > 5, so the pigeonhole principle forces an overlap.

That overlap is the whole game. If decision A was approved by majority Q1, and decision B was approved by majority Q2, then some node x sits in both Q1 and Q2. Node x either approved both A and B, or it lied to one of the proposers. As long as nodes don't lie (the crash-fault model) two conflicting decisions can never both be committed.

Why this matters. "Majority" isn't just a vote-counting convention. It's a geometric guarantee that any two committed decisions share a witness, and the witness carries the truth forward. Every consensus protocol since Paxos leans on this.

The geometric proof, generalised

For crash faults, quorums of size N/2 + 1 overlap by at least one node and tolerate f = (N-1)/2 failures. For Byzantine faults, where nodes can lie freely, overlap by one isn't enough, because the one overlapping node might be the liar. You need the overlap to contain at least one honest node.

Work it through. With N nodes and up to f Byzantine, two quorums of size q overlap by at least 2q − N nodes. Of those, at most f can be Byzantine, so you need 2q − N ≥ f + 1, i.e. q ≥ (N + f + 1) / 2. Combine with the requirement that the quorum itself contains a majority of honest nodes (q − f ≥ f + 1, so q ≥ 2f + 1), and the cleanest solution is N ≥ 3f + 1 with q = 2f + 1. With N = 4 you tolerate f = 1; with N = 7 you tolerate f = 2.

Majority quorums (Paxos, Raft)

The classical setting: crash faults only, quorum size ⌊N/2⌋ + 1. For N = 3 that's 2; for N = 5, three; for N = 7, four. Failure tolerance is f = ⌊(N − 1) / 2⌋. The cluster keeps making progress as long as a majority is alive and can talk to each other.

Odd N is preferred because adding a fifth node to a 4-node cluster raises failure tolerance from one to two for the cost of one more replica, and the quorum size grows from 3 to 3, not 3 to 4, so write latency doesn't get worse. A 4-node cluster is strictly worse than 3: you pay for an extra replica, tolerate the same one failure, and risk split-votes during leader election that 3 or 5 don't.

The non-obvious rule. N = 3 tolerates 1 failure. N = 5 tolerates 2. N = 7 tolerates 3. Each extra pair of nodes buys one more failure of headroom. Most etcd and Consul clusters run at 3 or 5; beyond 7 the write-path round-trip cost stops being worth it.

Dynamo-style flexible quorums (R + W > N)

Amazon's 2007 Dynamo paper generalised quorums to two tunable numbers: W, how many replicas a write must reach, and R, how many a read must query. As long as R + W > N, every read overlaps every write by at least one replica, so a read is guaranteed to see at least one copy with the latest written value, which it can spot by version vector.

This decouples consistency from a fixed majority. You pick the trade-off:

Setting	R + W > N?	Behaviour
N=3, R=2, W=2	Yes (4 > 3)	Balanced strong consistency; Dynamo default
N=3, R=1, W=3	Yes (4 > 3)	Fast reads, slow writes; read-heavy workloads
N=3, R=3, W=1	Yes (4 > 3)	Fast writes, slow reads; write-heavy workloads
N=3, R=1, W=1	No (2 ≤ 3)	Eventual consistency; lowest latency, may serve stale

The Dynamo paper measures the cost: at N = 3 with R = W = 2, write latency is set by the second-slowest of three replicas. With R = 1, reads cost one network hop; with W = 1, writes do.

Flexible Paxos — overlap only where it matters

Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman, 2016. The observation: Paxos's safety argument only needs the Phase 1 quorum (the prepare quorum, Q1) to intersect every Phase 2 quorum (the accept quorum, Q2). It does not require Q2 to be a majority, nor does it require Q1 ∩ Q2 to contain a majority.

So you can set |Q1| + |Q2| > N instead of demanding both be majorities. With N = 6, Q1 = 5 and Q2 = 2 is safe: any prepare-quorum of 5 overlaps any accept-quorum of 2. That cuts the steady-state write cost from waiting on 4 nodes (majority of 6) to waiting on 2, at the cost of expensive leader elections. Worth it when leader churn is rare, which is the common case.

Howard's "Paxos vs Raft" line of work pushed this further. CASPaxos, EPaxos, and the "Flexible Paxos" library all build on the same insight: classical Paxos overpaid on steady-state writes for a generality nobody actually needed.

Witness replicas — voting without data

A witness replica votes in the quorum but doesn't store the data, only metadata: log positions, version vectors, commit indices. Spanner uses witnesses in some geo-distributed configurations; Cassandra's lightweight transactions can use a similar role; Azure Cosmos's multi-region setups use them too.

The math is the same. With N = 5 where 3 are full replicas and 2 are witnesses, a majority of 3 still commits a write, but only 3 nodes need disk space. You pay full storage cost for 3 instead of 5, and the witnesses still let you tolerate 2 failures and avoid split-brain. The catch: witnesses can vote but can't serve reads, so during a failover you may need to repair the data side from a surviving full replica before fully resuming service.

When to pick witnesses. Geo-distributed deployments where bandwidth to a third region is precious. Put two full replicas in your primary region, one full in a backup region, and a witness somewhere cheap. You keep the quorum math and halve the cross-region replication bytes.

Byzantine quorums — 3f + 1

Castro & Liskov's PBFT (1999) made Byzantine consensus practical. With N = 3f + 1 nodes and quorums of size 2f + 1, any two quorums overlap in 2(2f+1) − (3f+1) = f + 1 nodes, and since at most f of those can be Byzantine, at least one is honest. That honest overlap carries the prior decision forward, just as in crash-fault Paxos.

Numbers worth knowing: N = 4 tolerates f = 1 Byzantine; N = 7 tolerates f = 2; N = 10 tolerates f = 3. The cluster gets expensive fast: you pay for three replicas per tolerated faulty node, plus the message complexity of PBFT (O(N²) per agreement, three communication phases). Modern BFT protocols like HotStuff (used by Diem/Aptos) keep the 3f + 1 bound but cut message complexity to O(N) per phase by pipelining and using threshold signatures.

Read quorums vs write quorums

Two different guarantees. Write quorum (W) is about durability. Once a write has reached W replicas, you've committed enough copies that losing fewer than W of them doesn't lose the write. Read quorum (R) is about freshness. Reading from R replicas guarantees that at least one of them took part in the most recent overlapping write.

The dial moves cost between the two paths. R = 1, W = N is the extreme read-optimised setup: every read is a single hop, but every write has to land on every replica before it commits, so a single slow node stalls all writes. R = N, W = 1 flips it: writes are fast and cheap, reads have to scan every replica and reconcile by version.

Most production systems sit near R = W = ⌈(N+1)/2⌉ because it balances read and write latency around the same percentile of replica response times. Tail-latency-sensitive workloads sometimes push W higher (fewer surprises on later reads) at the cost of write latency.

Real-world tunings

Each system exposes the quorum dial differently. Cassandra calls them "consistency levels"; MongoDB calls them "write concerns" and "read concerns"; Dynamo (the AWS product, not the paper) hides them but uses similar mechanics under the hood.

-- Cassandra CQL — change consistency per session
CONSISTENCY QUORUM;
SELECT * FROM users WHERE id = ?;

-- Or per query (driver API). Common levels:
--   ONE       — fastest, may read stale
--   QUORUM    — N/2 + 1, strong consistency with R = W = QUORUM
--   LOCAL_QUORUM — quorum within the local datacentre only
--   ALL       — all replicas, no failure tolerance
--   EACH_QUORUM — quorum in every datacentre (writes only)

MongoDB write concerns: {w: 1} means acknowledge after the primary only (fast, can lose writes on failover); {w: "majority"} waits for a majority of the replica set (default for production); {w: 2} or {w: 3} name explicit counts; {w: "all"} waits for every replica (rare; one slow node stalls everything).

// MongoDB — durable write
db.orders.insertOne(
  { id: 42, total: 199.00 },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
);
// j: true forces fsync on the primary before ack;
// "majority" requires N/2 + 1 acknowledgements.

Defaults are revealing. Cassandra ships with consistency level ONE; the operator picks QUORUM if they need it. Dynamo (the paper) ran most production tables at R = 2, W = 2, N = 3. Spanner runs Paxos majorities per shard and combines them with TrueTime for external consistency. The differences reflect what each system optimises for: Cassandra for raw throughput, Dynamo for shopping-cart-style merge-friendly data, Spanner for global transactions.

Quorum failure under partition

When a network partition splits N nodes into a majority side and a minority side, only the majority side can form a quorum — by definition. The minority side has two choices:

CP: block. Reject writes (and often reads) until the partition heals. Raft and majority-quorum systems do this. The minority side stays consistent with the majority by refusing to act.
AP: accept and reconcile. Take writes on both sides, then merge once the partition heals (last-write-wins, CRDTs, or app-level resolution). Dynamo and Cassandra at R = W = 1 do this.

This is the CAP choice made concrete by the quorum settings. Spanner picks CP and pays for it with a globally distributed TrueTime clock. Cassandra defaults to AP and pays for it with eventual consistency and hinted handoff. Neither is wrong. They're optimised for different failure modes and different definitions of correct.

The trade-off in one line. A quorum cluster trades availability during partition for consistency across partition. The looser the quorum (R + W ≤ N), the more availability you get and the more reconciliation you owe later.

Quorum strategies compared

Strategy	Quorum size	Tolerates	Used by
Single-master	1 (the master)	0 master failures	Classic MySQL primary, single-node Postgres
Paxos / Raft majority	⌊N/2⌋ + 1	⌊(N−1)/2⌋ crashes	etcd, Consul, ZooKeeper, Spanner shards
Dynamo R + W > N	tunable R and W	varies by tuning	Cassandra, Riak, original Dynamo
Flexible Paxos	\|Q1\| + \|Q2\| > N	varies; faster steady state	Research (Howard et al.), CASPaxos variants
Witness replicas	majority including witnesses	same as majority, less storage	Spanner geo, Azure Cosmos, Cassandra LWT
Byzantine 2f + 1 (PBFT, HotStuff)	2f + 1, with N = 3f + 1	f Byzantine faults	Diem/Aptos, Tendermint, blockchain BFT

Quorums & majorities

The intuition

The geometric proof, generalised

Majority quorums (Paxos, Raft)

Dynamo-style flexible quorums (R + W > N)

Flexible Paxos — overlap only where it matters

Witness replicas — voting without data

Byzantine quorums — 3f + 1

Read quorums vs write quorums

Real-world tunings

Quorum failure under partition

Quorum strategies compared

Further reading

07 — Leases & fencing