Read/Write Quorum Simulator: tune N, R, W, make consistency a knob.

Replicate a value to N nodes. Read from R; write to W. If R + W > N, every read sees the latest write — Dynamo's foundational trick.

Alive
5 / 5
R+W > N
STRONG

N (replicas)
R (read quorum)
R = 3
W (write quorum)
W = 3
Replicas · click to toggle up/down
Read pass
Write pass
Verdict
overlap
Recent
— quiet —

What you're looking at

The top row sets N replicas and slides R (how many must answer a read) and W (how many must ack a write). The verdict badge reads STRONG or EVENTUAL straight off whether R + W > N. The green tiles are the replicas; click one to mark it DOWN. Each Read or Write button press checks how many replicas are still up against the quorum it needs, logs OK or FAIL with the count reached, and updates the running read-pass and write-pass percentages.

Start with N=5, R=3, W=3, so R + W = 6 and the badge says STRONG. Kill one node, then a second: reads and writes still pass, because three of five remain. Kill a third and watch both stall, since neither quorum can be met. Now drop R to 1 and W to 1: the badge flips to EVENTUAL, and operations keep succeeding even with most nodes down. The surprise is the cost of that availability — with R + W no longer over N the read set and write set can miss each other entirely, so a read may return a value older than the last write.


What is a read/write quorum?

Three copies of the same value, and they disagree.

A read/write quorum is the minimum number of replicas that must respond before a read or write is considered successful. With N replicas, choose R (read quorum) and W (write quorum) such that R + W > N — that overlap guarantees every read sees at least one replica that has the latest write. David Gifford introduced the technique at Xerox PARC in 1979; Cassandra, DynamoDB, Riak, and most quorum-based systems still use it.

Imagine you store the user's shopping cart on three different machines. You replicate so that if one machine catches fire, the cart survives on the other two. The bargain seems straightforward — keep three copies, sleep at night. The complication shows up the first time someone changes the cart.

The user adds a hat. Your write reaches replica 1, reaches replica 2, and times out on its way to replica 3. Now two replicas have the new cart and one has the old. A second later, the user refreshes the page. The read goes to whichever replica the load balancer picked — and that turns out to be replica 3. The user sees the cart without the hat. They add it again. Now replica 3 has a cart with two hats while the others have a cart with one. Each subsequent refresh might show one or two hats depending on which replica is reached. The site appears to have a memory problem; the user thinks your software is broken.

The naive fixes both fail. Always write to all three. If replica 3 is briefly unreachable, the write fails and the user can't add the hat at all — you have traded a stale-data bug for an availability outage. Always read from all three and pick the most recent. Now reads stall whenever any replica is slow, and a single sluggish machine drags down the whole site. Both extremes are uncomfortable, and most production systems live somewhere in between.

The trick the field eventually settled on is the quorum. Pick a number W of replicas that must acknowledge every write before you call the write successful, and a number R of replicas you contact on every read. As long as you pick those numbers so that R + W > N (where N is the total replica count), every read is guaranteed to touch at least one replica that holds the most recent write. The set the reader contacts and the set the writer wrote to must share a node, by the pigeonhole principle. That overlap node carries the new value into the read, and the user always sees the cart with the hat.

Better still, the knob is tunable. Want fast reads and rare-but-okay slow writes? Set R = 1 and W = N. Want fast writes? Invert. Want both reads and writes to tolerate one node being down? Pick R = W = ⌈N/2⌉ + 1, the classic majority quorum. The simulator above lets you slide all three knobs and watch what happens. The two parts that follow explain the mathematics of why this works and the production machinery — hinted handoff, vector clocks, CRDTs — that makes it survive in real distributed systems.


Why R + W > N forces overlap — the quorum invariant

Why R + W > N forces overlap.

The argument is a one-line application of the pigeonhole principle. Suppose a value is replicated on N nodes. A write places the new value on W of them and acknowledges. A subsequent read contacts R nodes. The two sets together have R + W elements, drawn from a population of N. If R + W exceeds N then the two sets cannot be disjoint, so at least one replica must appear in both. That replica holds the new value, and the read sees it.

The argument tells the reader a value is in the response set; it does not tell the reader which one is the latest. To pick the latest the reader needs a version, a timestamp, or a tie-breaking rule, and getting that right is its own engineering problem. Dynamo used vector clocks to track causality so the reader could detect when two responses were concurrent (neither one strictly more recent than the other) versus when one was a successor of the other. Cassandra defaults to last-write-wins on physical timestamp, which is simpler but loses information when clocks are skewed. Riak shipped with both options and a notion of sibling values that the application has to resolve.

The constraint is symmetric in R and W. Pushing one up pulls the other down, with the floor at one and the ceiling at N. A read-heavy workload sets R low and W high — every read is one round-trip while every write is many — so reads are cheap and writes are slow. A write-heavy workload inverts this. A balanced choice of R equal to W equal to a majority gives both reads and writes the same latency profile and tolerates a minority of the replicas being unreachable.

The other branch of the trade — R + W not exceeding N — gives up the strong-overlap guarantee in exchange for higher availability and lower latency. Reads can return without contacting any replica that holds the latest write. The system becomes eventually consistent: the new value will propagate via background read repair or anti-entropy, and a reader who pauses long enough will eventually see it. The convergence delay is workload-dependent. For some applications — caches, social-graph counters, leaderboards — this is the right trade. For others — bank ledgers, inventory of scarce goods — it is unacceptable, and the constraint must be enforced.

N=5 · W=3 (write set) · R=3 (read set) · R+W=6 > 5 ⇒ overlapwriter commits to W=3reader contacts R=3 n0 n0 n1 n1 n2 n2 n3 n3 n4 n4overlapn2pigeonhole: R + W > N forces ≥ 1 replica to appear in both sets — that replica carries the latest write into the read.

Origins — Gifford 1979, weighted voting at Xerox PARC

Gifford, weighted voting, Xerox PARC.

The quorum idea entered distributed-systems literature through David Gifford's 1979 paper Weighted Voting for Replicated Data, presented at the seventh SOSP and written while Gifford was at Xerox PARC. Gifford was working on a replicated file system and needed a principled way to reason about which subsets of replicas a read or write had to contact. His insight was to assign each replica a number of votes and to require that a read collect at least R votes and a write collect at least W votes, with the constraint that R + W exceed the total vote count. The constraint forces every read quorum to overlap every write quorum on at least one replica, and that overlap is what carries the most recent write into the read.

The framing generalises majority voting. With every replica carrying a single vote and a total vote count of N, the constraint R + W > N reduces to "the read and write sets together cover more than the population, so they must intersect." Gifford allowed unequal weighting, which lets a system give more weight to some replicas than others — useful for asymmetric topologies where some replicas are more durable than the rest. Lamport's later work on Paxos can be read as another generalisation: a quorum is any set of acceptors large enough that every two quorums share an acceptor, and Gifford's R + W > N is one specific instantiation of that property.

Quorum systems sat as a theoretical tool in the 1980s and 1990s. The structure showed up in academic systems — Coda, Ficus, Bayou — but was not central to commercial databases until the next decade. The classical relational engines used primary-copy replication and synchronous logs rather than voting; the consistency model was strong-by-construction and the durability story leaned on the underlying hardware. Quorum-based replication remained a research topic, useful for reasoning about the limits of what was possible but not the operational default.

What changed was the arrival of internet-scale workloads where availability under partition mattered more than the strongest consistency model. Amazon's shopping-cart service famously could not tolerate a write being rejected because a single replica was unreachable; the cost of an unavailable add-to-cart was measurable in lost revenue. The 2007 paper Dynamo: Amazon's Highly Available Key-value Store, by Giuseppe DeCandia and a long author list, took the Gifford framing off the shelf and built a production system around it. Dynamo's contribution was less the quorum mathematics — that was forty years old — than the engineering machinery around it.


Sloppy quorums and hinted handoff — when assigned nodes are unreachable

When the assigned nodes are unreachable, use neighbours.

The simplest read-write quorum scheme assumes every key has a fixed assignment of N replicas. A write only counts if it lands on the specific W replicas in that assignment; if any of those are unreachable, the write fails. The Dynamo paper noted that this strictness has an unwelcome consequence in production: any time a network partition isolates the canonical replicas, even briefly, writes to those keys stop succeeding even though plenty of healthy nodes are sitting idle elsewhere in the cluster. The fix is to relax the strictness.

A sloppy quorum writes to whichever W nodes are reachable, even if some of them are not in the canonical replica set for the key. The off-replica nodes record the write as a hint: a temporary record indicating that the data really belongs to a node that was unreachable at the time. When the canonical replica recovers, the holder of the hint replays it, transferring the write to its proper home. Cassandra calls this mechanism hinted handoff; the same idea appears in Riak and in DynamoDB's internal replication protocol.

The trade is real. With sloppy quorums the strict R + W > N overlap guarantee no longer holds during periods of partition: a read that hits the canonical replicas may miss writes that landed on hint-holders, and the reader sees stale data. Once hints replay, the inconsistency closes. The window of inconsistency is bounded by the hinted-handoff replay rate, which depends on the size of the hint queue and the available bandwidth between nodes. Cassandra's max_hint_window_in_ms bounds the time a hint can age before being dropped; if a recovering node was down longer than that, the hints are gone and consistency must be restored by anti-entropy.

Anti-entropy is the second mechanism that keeps replicas converging in a sloppy-quorum system. Cassandra's nodetool repair performs Merkle-tree comparisons between replicas to find divergent ranges and copy the missing data; DynamoDB runs a similar background process inside its replication tier. Read repair is the third: when a read sees disagreement among the responding replicas, the system writes the most-recent value back to the laggards, opportunistically catching up. Together, these three mechanisms — hinted handoff, anti-entropy, and read repair — give sloppy quorums their characteristic eventual-consistency profile: usually fresh, occasionally stale, always converging.

SLOPPY QUORUM · canonical n3 unreachable; n6 holds a hintwriterkey=k W=3n1 · ackn2 · ackn3 · DOWNn6 · hint"for n3, k=v"stand-in Wreplay on recovern3 · UPcaught upthree mechanisms keep sloppy quorums converging: hinted handoff, anti-entropy (Merkle repair), read repair.

Conflict resolution — last-write-wins, vector clocks, CRDTs

Last-write-wins, vector clocks, CRDTs.

Quorums tell you that the read set overlaps the write set. They do not tell you how to interpret what you see when two writes have happened concurrently. A reader that contacts R replicas may see two different values, both with the same logical recency, neither obviously the winner. Some scheme has to break the tie. The choice of conflict-resolution scheme has consequences that ripple through the application logic above the database.

The simplest scheme is last-write-wins based on a physical timestamp. Each write carries the wall-clock time at the writer; the reader picks the value with the highest timestamp. The scheme is operationally trivial and works well when clocks are tightly synchronised and concurrent writes are rare. Cassandra defaults to this. The failure mode is clock skew: a write performed earlier in real time but on a faster-clock node wins over a write performed later on a slower-clock node, silently dropping the more-recent value. NTP keeps clocks within tens of milliseconds; for higher-rate workloads this is not enough.

The second scheme is vector clocks. Each replica maintains a per-key counter, and every write increments the writer's component. A reader can compare two values' vector clocks: if one strictly dominates the other, that one is more recent; if neither dominates, the writes are concurrent and a real conflict exists. Dynamo used vector clocks; Riak shipped them as the default. The resolution of concurrent writes is delegated to the application — the database returns sibling values and the application picks one or merges them. This is operationally heavier but loses no information.

The third scheme is to use data types that resolve conflicts mathematically. Conflict-free Replicated Data Types (CRDTs), formalised by Shapiro et al. in 2011, are data structures whose merge operation is associative, commutative, and idempotent. Two replicas can apply updates in any order and the merged state is the same. G-counters, OR-sets, LWW-element-sets, and JSON CRDTs all give the database a way to combine concurrent writes deterministically without involving the application. Riak shipped CRDTs as a first-class data type, Redis Enterprise has them as a feature, and Automerge and Yjs have brought them to client-side collaborative-editing tools.

Quorums find the value, not the winner

An R + W > N overlap guarantees the latest write is in the response set — but it doesn't say which response that is. Pick a tie-breaker before you ship: physical timestamp (Cassandra's LWW), vector clocks with sibling resolution (Riak), or a CRDT that merges concurrent writes mathematically. Skipping this decision means whichever value the network happens to return first becomes truth.


Quorums in Paxos, Raft, and flexible-quorum consensus

Paxos, Raft, flexible quorums.

Read-write quorums in the Dynamo style sit alongside a different family of quorum systems used in consensus protocols. Paxos, introduced by Lamport in The Part-Time Parliament (1998), uses a quorum of acceptors to commit a value and a quorum of acceptors to read the committed value, with the constraint that any two quorums must intersect. The classical instantiation uses majority quorums: any majority intersects any other majority. Raft, published by Ongaro and Ousterhout in 2014, makes the same choice with a slightly different vocabulary — leader elections require a majority and log commits require a majority, and the same intersection guarantee carries through.

The two families are related but not identical. A Dynamo-style quorum cares about per-key overlap between read and write sets and treats the database as a key-value map. A consensus-style quorum cares about agreement on a sequence of operations across the entire replica group; the operations themselves can be arbitrary. The performance profiles differ accordingly: a Dynamo quorum read is one round trip with no leader, while a Raft read either goes through the leader (for strong consistency) or is served from any replica (for sequential or read-your-writes consistency).

Flexible Paxos, introduced by Heidi Howard and her co-authors in 2016, weakens the constraint to "every leader-election quorum must intersect every commit quorum, but two leader-election quorums need not intersect each other and two commit quorums need not intersect each other." The weakening allows commit quorums smaller than a majority — for example, with five replicas, a commit quorum of two and an election quorum of four still satisfies the intersection. The trade is that elections become more expensive while writes become cheaper, which is a sensible trade for systems where elections are rare and commits are common.

Egalitarian Paxos (EPaxos, Moraru et al., 2013) goes further by removing the leader entirely; any replica can propose any command, and conflicting commands are ordered using a dependency graph. Witness replicas in CockroachDB and Spanner serve a related role: a witness participates in voting but does not store a full copy of the data, reducing storage cost without weakening the quorum. These designs all sit in the same algebraic landscape Gifford laid out, with different choices of which quorums must intersect for a given safety property to hold.

N R W Consistency Best for
313strong (R+W=4)read-heavy, expensive writes
331strong (R+W=4)write-heavy, slower reads
322strong (R+W=4)balanced, tolerates 1 down
311eventual (R+W=2)cache, counters, low-latency
533strong (R+W=6)tolerates 2 down each side
551strong (R+W=6)single-writer durability, slow read

Quorum tuning in Cassandra and MongoDB

Cassandra ONE, QUORUM, ALL — and Mongo write concern.

Cassandra exposes the quorum constraint as a per-operation consistency level. ONE waits for a single replica to respond; TWO and THREE wait for the corresponding count; QUORUM waits for a strict majority; ALL waits for every assigned replica; LOCAL_QUORUM waits for a majority within the local datacenter, sidestepping the cost of a cross-region round trip. The choice is per-statement, which lets the application use stronger consistency for some operations (a payment confirmation) and weaker for others (a view-count increment). The combination of read level and write level determines whether the workload sees strong-consistency or eventual-consistency semantics, and the application can mix and match.

MongoDB exposes a similar knob through write concern and read concern. w:1 means a write is acknowledged after the primary persists it; w:majority means a majority of voting members must have applied the write; w:<number> waits for a fixed count. Read concern includes local, majority, and linearizable, with progressively stronger guarantees and progressively higher latency. The same trade Dynamo articulated reappears here: stronger semantics require touching more replicas which means more latency and lower availability under partition.

DynamoDB itself has moved on from the classical Dynamo design. Its current architecture uses Multi-Paxos for replication within a partition, with three replicas per partition and quorum writes. The exposed semantics are eventually consistent reads (cheap) and strongly consistent reads (read from the leader, more expensive); the underlying mechanics are no longer the original sloppy-quorum sloppy-handoff design. The shift reflects a general industry trajectory: Dynamo-style sloppy quorums won the distributed-systems debate of the late 2000s, but the engineering preference has migrated toward stronger primitives backed by consensus protocols once the operational cost of those protocols dropped.

Geo-distributed quorums are where the latency cost of strong consistency becomes uncomfortable. A quorum that spans two continents has a wall-clock floor on every operation equal to the inter-continental round trip — somewhere on the order of a hundred milliseconds. Spanner's TrueTime API uses bounded-uncertainty clocks plus Paxos quorums to give external consistency at this latency floor; CockroachDB uses HLC clocks plus Raft to give serialisable transactions with similar latency characteristics. The point is that no amount of clever quorum design escapes the speed of light; the question is just whether you pay for it on every operation, on a subset of operations, or via a careful split into local-and-global tiers.

Operators should also remember that the published latency knobs interact with topology. A five-replica deployment with three replicas in one datacenter and two in another has very different quorum-latency behaviour than a five-replica deployment with one replica in each of five regions. The same configured consistency level can mean a one-millisecond write or a two-hundred-millisecond write depending on which replicas the system happens to contact. The lesson is that quorum tuning is a shape problem as much as a number problem; the placement of replicas matters at least as much as the choice of R and W.

-- Cassandra (CQL) — per-statement consistency level
CONSISTENCY ONE;             -- 1 replica, fastest, eventual
CONSISTENCY QUORUM;          -- ⌈N/2⌉+1, classic strong
CONSISTENCY LOCAL_QUORUM;    -- majority within local DC, no cross-region RTT
CONSISTENCY EACH_QUORUM;     -- majority in EVERY DC (writes only)
CONSISTENCY ALL;             -- all N replicas — least available

INSERT INTO orders (id, total) VALUES (?, ?)
  USING CONSISTENCY LOCAL_QUORUM;

SELECT * FROM orders WHERE id = ?
  USING CONSISTENCY LOCAL_QUORUM;
-- LOCAL_QUORUM read + LOCAL_QUORUM write ⇒ R+W>N within DC

When the quorum knob is the wrong control surface

Cases where the quorum knob is the wrong control surface.

Quorum-based replication is the right tool when you need a tunable trade between consistency and availability across a small number of replicas, when individual writes are independent, and when the operations are key-value-shaped. Several other shapes do not fit. Workloads that need cross-key transactional semantics — debit one account, credit another, atomically — cannot be implemented on top of per-key quorums alone. Banking systems use SQL with serialisable isolation; cross-shard transactions in Spanner and CockroachDB use two-phase commit on top of consensus quorums. The quorum gives the per-key durability; the transaction gives the across-key ordering.

Workloads where the read-vs-write ratio is highly asymmetric often do better with a different replication shape. Read-mostly workloads with very strong freshness requirements may prefer a leader-follower replication scheme where every read goes to the leader; the leader's local state is always strongly consistent, and the followers exist for fault tolerance rather than for read scaling. Write-mostly workloads with weak freshness requirements may prefer log-structured replication where every replica accepts writes locally and the data converges via a log-shipping protocol.

Quorum schemes interact subtly with operational tasks. A backup that captures a single replica may miss writes that have not yet propagated; a backup that captures a quorum sees a consistent snapshot but pays the cost of coordinating across replicas. Schema migrations that need every replica to agree on the new schema require a different mechanism (typically a coordinated rolling upgrade); they cannot be expressed as a single quorum operation. Adding or removing a replica changes N, and the system needs a careful protocol to avoid violating the R + W > N invariant during the transition.

The final consideration is operational simplicity. A system that uses a single tunable knob — Cassandra's consistency level — is easy to reason about per query but hard to reason about as a whole, because every query in the application potentially uses a different setting and the global behaviour is the union of those settings. A system that bakes in one strong-consistency guarantee — Spanner's external consistency, CockroachDB's serialisable default — is harder to scale to weaker semantics for the few operations that need them, but easier to operate because the default is unambiguous. The right choice depends on whether the application's developers are willing to think about consistency on a per-query basis or whether they want the database to make that choice for them.

One last reminder. Quorums protect against replica failure and network partition; they do not protect against a logical bug in the application that writes the wrong value. Every replica in a quorum dutifully stores the corrupted record, and every reader sees it. The defences against that failure mode — point-in-time recovery, append-only audit logs, idempotent retries with carefully chosen keys — sit at a different layer of the stack, and quorum tuning never substitutes for them. The strongest read-after-write consistency in the world cannot rescue a write that was wrong on entry.


Further reading on read/write quorums

Primary sources, in order.

Found this useful?