
How Cassandra replicates data.

Eight nodes on a ring, no leader, and a consistency knob you turn per query. The application — not the cluster — decides how strong each read and write needs to be.


Cassandra inherits from Dynamo and Bigtable

Two papers, one database.

Cassandra is what you get when you crossbreed two of the most influential systems papers ever written. From Amazon's Dynamo paper (DeCandia, Hastorun, Jampani et al., 2007) it inherits the ring, the consistent hashing, the leaderless replication, the R + W > N consistency arithmetic, and the recovery toolkit of hinted handoff, read repair, and Merkle-tree anti-entropy. From Google's Bigtable paper (Chang, Dean, Ghemawat et al., 2006) it inherits the storage layout — memtables in RAM, log-structured merge trees on disk, immutable SSTables compacted in the background. Two ideas, one database.

Avinash Lakshman and Prashant Malik wrote the first version inside Facebook in 2008 to power inbox search. Lakshman had worked on Dynamo at Amazon before joining Facebook, which is why the genetic resemblance to Dynamo is so close. Facebook open-sourced it that same year; the Apache Software Foundation took it over in 2009; DataStax (founded by Jonathan Ellis and Matt Pfeil) commercialised it from 2010 onward. The pattern of ring + LSM went on to define a whole family of stores. ScyllaDB is the C++ rewrite by Avi Kivity (better known for writing the KVM hypervisor), and its makers claim roughly 10× the throughput of Cassandra on the same hardware via a shard-per-core architecture. DynamoDB is Amazon's managed cousin, which differs in detail (each partition has a single Paxos-elected leader) but inherits the partitioning math. Cosmos DB is Microsoft's multi-API store, which exposes the Cassandra wire protocol as one of its faces. Astra is DataStax's managed Cassandra-as-a-service.

Naming the lineage matters because the design decisions that look strange in Cassandra (no leader; per-query consistency) are not arbitrary: they are 2007 Dynamo decisions, made for a shopping cart that absolutely could not return errors during Black Friday. Last-write-wins is the one departure, since Dynamo itself tracked versions with vector clocks and Cassandra simplified that to timestamps. Read the Dynamo paper and the rest of the Cassandra design suddenly becomes self-evident.


The hash ring and consistent hashing

Every node owns a slice of the keyspace.

The ring is the entire 2⁶⁴ keyspace of the partitioner, laid out as a circle. Every node owns a range: the slice between its predecessor's token and its own token, walking clockwise. A key is hashed (Murmur3 by default since Cassandra 1.2; the older RandomPartitioner used MD5); the hash lands at a single point on the ring; the node immediately clockwise of that point owns the primary replica. The next N − 1 distinct nodes clockwise own the secondary replicas, where N is the replication factor.
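The lookup described above can be sketched in a few lines. This is a minimal illustration, not Cassandra's implementation: `token_for` uses a truncated MD5 as a stand-in because Murmur3 is not in the Python standard library, tokens are unsigned here (Cassandra's Murmur3 partitioner uses a signed 64-bit range), and the `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a 64-bit token. Stand-in hash: truncated MD5, NOT Murmur3."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 0 .. 2**64 - 1


class Ring:
    """Hypothetical ring with one token per node (no vnodes)."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so a clockwise walk is a left-to-right scan.
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def replicas(self, key, rf):
        token = token_for(key)
        # Primary owner: first node whose token is >= the key's token,
        # wrapping around to the start of the list past the last token.
        start = bisect.bisect_left(self._tokens, token) % len(self._entries)
        owners = []
        for step in range(len(self._entries)):
            node = self._entries[(start + step) % len(self._entries)][1]
            if node not in owners:  # only distinct nodes count as replicas
                owners.append(node)
            if len(owners) == rf:
                break
        return owners


# Eight hypothetical nodes with evenly spaced tokens.
ring = Ring({f"node{i}": i * (2**64 // 8) for i in range(8)})
print(ring.replicas("user:42", rf=3))
```

The result is always three consecutive nodes on the ring: the primary owner followed by its next two clockwise neighbours.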

Naïve consistent hashing has a problem: with one token per node, an eight-node cluster has eight large arcs, and adding or removing a node leaves the ring lopsided. Modern Cassandra uses virtual nodes (vnodes): each physical node owns many random tokens, set by num_tokens (256 by default through the 3.x line; 4.0 lowered the default to 16). With 256 tokens, the physical node still has the same data volume, but its responsibility is scattered across 256 small arcs instead of one big one. Adding a node now claims 256 small ranges from many neighbours rather than half the data from one neighbour. The streaming is parallelised; the hot spots smooth out.

The vnodes choice has trade-offs. With 256 vnodes per physical node and a 100-node cluster, every node is a replica for almost every other node — a single down node creates a small repair load on almost everyone. Operators of very large clusters sometimes drop num_tokens to 16 to reduce the blast radius of any single repair. Default settings are tuned for clusters in the tens of nodes, not the thousands.
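The balancing effect of vnodes is easy to see in a small simulation. The sketch below assumes tokens are picked uniformly at random, which is a simplification of how real clusters allocate them; the `ownership` function and its parameters are illustrative names.

```python
import random

RING_SIZE = 2**64


def ownership(tokens_per_node, nodes, seed):
    """Fraction of the ring owned by each node when tokens are random."""
    rng = random.Random(seed)
    entries = sorted(
        (rng.randrange(RING_SIZE), node)
        for node in range(nodes)
        for _ in range(tokens_per_node)
    )
    owned = [0] * nodes
    for i, (token, node) in enumerate(entries):
        # Each token owns the arc back to its predecessor; for i == 0 the
        # index -1 wraps to the last token, closing the circle.
        prev_token = entries[i - 1][0]
        owned[node] += (token - prev_token) % RING_SIZE
    return [share / RING_SIZE for share in owned]


for num_tokens in (1, 16, 256):
    shares = ownership(num_tokens, nodes=8, seed=1)
    print(num_tokens, f"min={min(shares):.3f}", f"max={max(shares):.3f}")
```

With one token per node the largest and smallest shares are typically far apart; with 256 tokens every node lands close to the ideal one-eighth.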


Consistency levels are per-query

R + W > N is the only formula you need.

This is the most misunderstood part of Cassandra. The cluster does not have a consistency setting. Every read and every write carries its own consistency level. CL=ONE: the coordinator returns success the moment one replica acks the write (or returns the value for a read). CL=QUORUM: the coordinator waits for ⌊RF/2⌋ + 1 acks, which is two of three at RF=3 and three of five at RF=5. CL=ALL: the coordinator waits for all RF replicas; if any is down, the operation fails. CL=EACH_QUORUM: a quorum in every datacenter of a multi-datacenter cluster, useful when you want to survive an entire DC loss without losing committed writes.
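As a rough sketch of the ack counting for a single datacenter, the function names below are illustrative and EACH_QUORUM is left out because it needs per-datacenter replica counts. The sketch also assumes every live replica acknowledges in time.

```python
def required_acks(cl, rf):
    """Number of replica acks the coordinator waits for at a given CL."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1  # floor(RF / 2) + 1: a strict majority
    if cl == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {cl}")


def write_succeeds(cl, replica_alive):
    """replica_alive holds one boolean per replica; its length is the RF."""
    rf = len(replica_alive)
    acks = sum(replica_alive)  # assumption: every live replica acks in time
    return acks >= required_acks(cl, rf)


# One of three replicas down: QUORUM still succeeds, ALL does not.
print(write_succeeds("QUORUM", [True, True, False]))
print(write_succeeds("ALL", [True, True, False]))
```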

The rule for getting consistent reads — meaning a read always sees the value of the most recent successful write — is the inequality R + W > N, where R is the read CL count, W is the write CL count, and N is the replication factor. With RF=3, QUORUM read + QUORUM write gives 2 + 2 = 4 > 3 — consistent. CL=ONE for both reads and writes gives 1 + 1 = 2, not greater than 3 — fast but eventually consistent, with windows of stale reads. CL=ALL on either side trivially satisfies the inequality but loses the leaderless property: any node down breaks the operation.
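One way to convince yourself of the inequality is to brute-force it: enumerate every possible set of W replicas that acknowledged a write and every set of R replicas a read might contact, and check whether they always share at least one replica. The helper below is a small illustration with a hypothetical name, not anything from Cassandra itself.

```python
from itertools import combinations


def always_overlaps(r, w, n):
    """True if every R-sized read set intersects every W-sized write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for r, w in [(2, 2), (1, 1), (1, 3), (3, 1), (1, 2)]:
    print(
        f"R={r} W={w} N=3:",
        f"R+W>N is {r + w > 3},",
        f"overlap guaranteed: {always_overlaps(r, w, 3)}",
    )
```

The two columns agree for every combination: a read is guaranteed to touch a replica holding the latest acknowledged write exactly when R + W > N.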

The reason this knob is per-query is that not every read needs the same guarantee. A user's session token needs strong consistency or they get logged out at random; a like-count on a post can be eventually consistent and nobody cares. The classic Dynamo argument, restated by Werner Vogels in his 2008 ACM Queue essay Eventually Consistent: forcing a system into one consistency mode wastes either correctness or performance. Let the call site decide. Latency numbers reflect the choice: CL=ONE writes typically land in 1–5 ms p99 on a healthy cluster; CL=QUORUM in 5–20 ms; CL=EACH_QUORUM across two distant DCs can be 50–200 ms because the slowest WAN round-trip dominates.


Hinted handoff, read repair, anti-entropy

Three mechanisms keep replicas converging.

Without active repair, eventual consistency drifts toward permanent inconsistency. Cassandra runs three overlapping mechanisms, each catching a different failure mode. The first is hinted handoff. When a write should go to a replica that the coordinator cannot reach — timeout, crashed JVM, network partition — the coordinator does not just drop the write. It writes a hint, a tiny record containing the missed mutation and the target node ID, into its own local hints directory. When the target node returns and is seen alive via gossip (default gossip interval: 1 second), the coordinator replays the hints. The default hinted-handoff window is three hours; if the dead node is gone longer than that, hints expire and a full anti-entropy repair is needed instead.
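A toy model of hinted handoff might look like the following. It is a deliberately simplified sketch: replicas are plain dictionaries, the `Coordinator` and `Hint` classes are hypothetical, and a hint is treated as expired once it is older than the three-hour window, which only approximates how a real cluster tracks node downtime.

```python
from dataclasses import dataclass, field

HINT_WINDOW_SECONDS = 3 * 60 * 60  # the three-hour default from the text


@dataclass
class Hint:
    target: str       # name of the replica that missed the write
    mutation: tuple   # (key, value, write_timestamp)
    stored_at: float  # coordinator time when the hint was recorded


def apply_mutation(store, key, value, ts):
    """Last-write-wins apply: keep whichever cell has the newer timestamp."""
    current = store.get(key)
    if current is None or ts > current[1]:
        store[key] = (value, ts)


@dataclass
class Coordinator:
    hints: list = field(default_factory=list)

    def write(self, replicas, down, key, value, ts, now):
        for name, store in replicas.items():
            if name in down:
                # Unreachable replica: keep the mutation locally as a hint.
                self.hints.append(Hint(name, (key, value, ts), now))
            else:
                apply_mutation(store, key, value, ts)

    def replay(self, target, store, now):
        """Replay unexpired hints for target; returns how many were applied."""
        replayed, remaining = 0, []
        for hint in self.hints:
            if hint.target != target:
                remaining.append(hint)
            elif now - hint.stored_at <= HINT_WINDOW_SECONDS:
                apply_mutation(store, *hint.mutation)
                replayed += 1
            # Expired hints are dropped; only a full repair can fix those.
        self.hints = remaining
        return replayed


stores = {"a": {}, "b": {}, "c": {}}
coordinator = Coordinator()
coordinator.write(stores, down={"c"}, key="k", value="v1", ts=100, now=0.0)
print(coordinator.replay("c", stores["c"], now=60.0), stores["c"])
```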

The second is read repair. When a read consults two or more replicas (QUORUM or higher at RF=3), the coordinator queries them in parallel, compares their responses, and resolves any disagreement by picking the cell with the latest timestamp (last-write-wins). Before returning the value to the client, the coordinator writes the latest version back to any queried replica that returned stale data; this foreground repair blocks the read rather than running asynchronously. Read traffic itself becomes a repair mechanism; hot keys self-heal continuously. In older releases, a configurable fraction of reads at CL=ONE also triggered an asynchronous background read repair against the other replicas, controlled by the read_repair_chance table option, which defaulted to 0.1 in early releases and was removed in Cassandra 4.0.
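The reconcile-then-write-back step can be sketched as below. Replicas are modelled as dictionaries mapping a key to a `(value, timestamp)` cell, and `coordinator_read` is a hypothetical name; timestamp ties, which Cassandra breaks by comparing values, are not modelled.

```python
def coordinator_read(key, replicas):
    """Read key from every given replica, repair stale ones, return the winner.

    replicas maps a replica name to its local store (a dict of
    key -> (value, timestamp)). Returns (value, repaired_replica_names).
    """
    responses = {name: store.get(key) for name, store in replicas.items()}
    present = [cell for cell in responses.values() if cell is not None]
    if not present:
        return None, []
    # Last-write-wins: the cell with the highest timestamp is authoritative.
    winner = max(present, key=lambda cell: cell[1])
    stale = sorted(name for name, cell in responses.items() if cell != winner)
    for name in stale:
        replicas[name][key] = winner  # write the newest cell back
    return winner[0], stale


replicas = {
    "a": {"k": ("old", 1)},
    "b": {"k": ("new", 2)},
    "c": {},  # this replica missed the write entirely
}
print(coordinator_read("k", replicas))
print(replicas)
```

After the first read all three replicas hold the newest cell, so a second read of the same key finds nothing left to repair.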

The third is anti-entropy, the operator-driven safety net. It runs as nodetool repair on a schedule: typical operations playbooks say weekly, and certainly within the gc_grace_seconds window (10 days by default), because tombstones are purged after that window and a replica that missed the delete can then resurrect the deleted data. Two replicas build Merkle trees over their data ranges and exchange hashes; mismatched ranges are streamed and reconciled. This is expensive (a multi-terabyte node's full repair can take days), so modern Cassandra operations split it into smaller subrange repairs; Cassandra Reaper, originally from Spotify, automates this. Skip anti-entropy for long enough and you guarantee anomalies; Aphyr's Jepsen reports against early Cassandra found exactly this class of bug, and the fix in every case was either better repair or tighter timestamp handling.
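The Merkle-tree comparison can be illustrated with a small binary hash tree. This sketch makes several assumptions that differ from a real repair: keys are bucketed by a SHA-256 hash rather than by token range, the leaf count must be a power of two, and `build_tree` and `mismatched_leaves` are hypothetical helpers.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(rows, leaves=8):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    buckets = [[] for _ in range(leaves)]
    for key in sorted(rows):
        bucket = int.from_bytes(h(key.encode())[:4], "big") % leaves
        buckets[bucket].append(f"{key}={rows[key]}")
    level = [h("|".join(bucket).encode()) for bucket in buckets]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def mismatched_leaves(a, b):
    """Walk both trees from the root, descending only where hashes differ."""
    pending = [(len(a) - 1, 0)]  # (level, index), starting at the root
    diff = []
    while pending:
        depth, idx = pending.pop()
        if a[depth][idx] == b[depth][idx]:
            continue  # identical subtree: nothing to stream
        if depth == 0:
            diff.append(idx)
        else:
            pending.append((depth - 1, 2 * idx))
            pending.append((depth - 1, 2 * idx + 1))
    return sorted(diff)


replica_a = {f"k{i}": i for i in range(50)}
replica_b = dict(replica_a, k7=999)  # one divergent row
print(mismatched_leaves(build_tree(replica_a), build_tree(replica_b)))
```

Only the bucket containing the divergent row is reported, so only that range would need to be streamed between the two replicas.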


Last-write-wins is a clock problem

Without a global clock, the database believes the timestamp it is told.

Conflict resolution in Cassandra is last-write-wins on a per-cell basis, decided by timestamp. The timestamp is either client-supplied (via the USING TIMESTAMP clause in CQL) or coordinator-supplied (the default — the coordinator stamps the write with its own wall clock at the moment it accepts the request). There is no global clock in a Cassandra cluster, and there is no Spanner-style TrueTime hardware to bound the uncertainty. NTP gets you within tens of milliseconds on a healthy fleet, but a misbehaving NTP daemon or a virtual machine fresh out of suspend can land tens of seconds off.

A misconfigured client setting timestamps from a wrong wall clock is the standard production horror story. A client whose clock is two years in the future writes a value; the value becomes the "newest" forever; no later legitimate write can overwrite it. A client whose clock is set to 1970 writes a value; every other replica believes its existing data is newer and refuses to overwrite it; the broken write looks like a successful no-op. The fix is to never let clients set timestamps in production code — always use the coordinator-supplied default — and to monitor NTP skew across the fleet aggressively. Kyle Kingsbury's Jepsen tests of Cassandra (2013 and later) repeatedly found LWW anomalies traceable to clock issues; subsequent releases tightened the timestamp story but the fundamental constraint is unchanged: a leaderless system without a global clock will always be at the mercy of the clocks it is told.
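Both failure modes fall out of a three-line merge rule. The sketch below models a single cell as a `(value, timestamp)` pair with timestamps in microseconds since the epoch; `lww_apply` and the constants are illustrative, not part of any driver API.

```python
def lww_apply(cell, value, timestamp):
    """Return the surviving cell after a write under last-write-wins."""
    if cell is None or timestamp > cell[1]:
        return (value, timestamp)
    return cell  # the incoming write is older (or tied) and is ignored


NOW = 1_700_000_000_000_000               # illustrative "current" time in µs
TWO_YEARS = 2 * 365 * 24 * 3600 * 1_000_000

# A client with a clock two years ahead poisons the cell.
cell = lww_apply(None, "good-1", NOW)
cell = lww_apply(cell, "from-the-future", NOW + TWO_YEARS)
cell = lww_apply(cell, "good-2", NOW + 1_000_000)  # legitimate later write
print(cell[0])  # from-the-future

# A client with a clock stuck at 1970 writes a silent no-op.
cell = lww_apply(None, "existing", NOW)
cell = lww_apply(cell, "from-1970", 0)
print(cell[0])  # existing
```

In neither case does the database report an error, which is why these bugs tend to surface only as missing or stuck data.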

This is why some applications layer Paxos-based lightweight transactions (Cassandra calls them LWT, the IF clause in CQL) on top for the small fraction of writes that need linearizable semantics — account creation, idempotent unique-username checks. They are roughly 4× slower than ordinary writes because every LWT runs a four-round Paxos handshake. Use them for the few writes that need correctness; let the rest ride on last-write-wins.


Cassandra vs DynamoDB vs ScyllaDB vs Cosmos DB

One paper, four products, slightly different deltas.

DynamoDB is the managed cousin Amazon kept for itself. The wire protocol and data model are different from Cassandra, but the partitioning math is the same: a key is hashed, mapped to a partition, and replicated three times across availability zones. Unlike Cassandra it runs a single leader per partition (elected via internal Paxos), which is what makes strongly consistent reads and multi-item transactions practical to offer, where Cassandra has only the single-partition lightweight transactions described above. Strongly consistent reads cost double the read units of eventually consistent ones; the consistency knob lives at the API call level, same idea, different vocabulary.

ScyllaDB is the C++ rewrite by Avi Kivity and Dor Laor, founded 2013. Same wire protocol as Cassandra, same data model, same operational mental model — drop-in compatible. The difference is the engine: built on the Seastar framework with a shared-nothing shard-per-core design, pinned threads, asynchronous I/O end to end, and no JVM. ScyllaDB's benchmarks claim roughly 10× the throughput per node, which lets a four-node Scylla cluster replace a thirty-node Cassandra cluster on the same workload. The C++ code makes operations harder to debug but the cost curve is hard to argue with.

Cosmos DB is Microsoft's multi-API answer: SQL, MongoDB, Cassandra, Gremlin and table APIs all backed by one engine. Its claim to fame is five named consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual) with explicit SLAs on each — formal versions of the Cassandra knob with money-back guarantees if violated. The underlying engine still inherits the partitioning idea from Dynamo. Choosing between these four is operational, not architectural: managed versus self-hosted, JVM versus C++, Amazon versus Azure versus your own racks. The replication math underneath is the same in all of them.
