Dynamo, eventually available.
Amazon's 2007 paper described the highly-available key-value store powering the shopping cart, the catalog, and most read-heavy services. It traded strong consistency for availability under partition — and codified the eventually-consistent KV pattern that defined a decade of distributed databases.
TL;DR
Dynamo is Amazon's internal eventually-consistent key-value store, described at SOSP 2007 by DeCandia et al. It combines consistent hashing, vector clocks, sloppy quorums, hinted handoff, and Merkle-tree anti-entropy to keep writes always-accepting — even when nodes fail or the network partitions. Amazon ran on the order of a hundred Dynamo instances by the time the paper was written, powering the shopping cart and other product surfaces where availability mattered more than serializable reads.
The problem
Amazon spent the early 2000s trying to scale a monolithic relational database that fronted the shopping cart and a long tail of other services. The reliability ceiling was the database — during the 2004 holiday peak, the cart's RDBMS brought the storefront down. Each outage cost real money, and the postmortems kept landing on the same root cause: a single primary, a single failure domain, and a consistency model that required all replicas to coordinate before accepting a write.
The team wanted three things the existing stack didn't give them. Always-on writes, so that adding an item to a cart never fails even during a partition or a node loss. Incremental scalability, so that capacity grows linearly when you add a node — no re-sharding, no offline migrations. And zero-downtime operations: nodes joining, leaving, and being replaced should be a routine event, not a deploy window.
The key idea
Dynamo's central decision was to weaken the consistency contract. For the product surfaces Amazon cared about most, accepting a write while partitioned and reconciling the resulting divergence later was better than rejecting the write and showing the user an error page. Everything else in the system follows from this choice.
Read it next to Brewer's CAP framing: under a partition, a system can keep responding (AP) or keep agreeing (CP), but not both. Dynamo picks AP and pushes the reconciliation work outward — onto vector clocks at the storage layer and onto the application code that has to merge conflicting versions when it reads.
Contributions
The paper is mostly an integration story: every piece was known in the literature, but Dynamo wired them together into a single coherent design and ran it in production. The pieces:
- Consistent hashing on a ring. The keyspace is the output range of
a hash function laid out as a ring. Each physical node owns an arc; a key's owner
is the first node clockwise from
hash(k). Adding or removing a node only re-mapsK/Nkeys (whereKis total keys andNis node count), instead of theK(N-1)/Na naive mod-N partitioning would force. - Virtual nodes for load balance. Each physical node owns roughly 100–200 small ring positions ("tokens") rather than one big arc. This smooths out hot spots, lets heterogeneous hardware carry proportional load, and makes re-balance after a node loss spread across many peers instead of one neighbour.
- Quorum knobs (N, R, W). Each key is replicated to N nodes
(typically 3). A read contacts R replicas; a write contacts W. When
R + W > Nyou get a "strong-ish" guarantee that any read sees the most recent write. Dynamo's default is N=3, R=W=2 — fast enough for the cart's latency budget, durable enough to survive one node loss without blocking. - Vector clocks for version tracking. Every put attaches a vector clock — a map from coordinator ID to a counter. When two clocks are not ordered (neither dominates the other), Dynamo returns both versions to the client and lets the application merge. For the shopping cart, merge is "union the items"; for objects where one writer wins, it's last-write-wins via wall-clock.
- Sloppy quorum + hinted handoff. Strict quorum would refuse a write when a replica is unreachable. Sloppy quorum writes to the next available node down the ring with a "hint" recording the intended home. When the intended replica comes back, the hint-holder forwards the data. This is what keeps writes always-on through transient failures.
- Anti-entropy via Merkle trees. Replicas periodically exchange Merkle-tree hashes of their key ranges. If the top hash matches, nothing diverged; if it doesn't, the protocol walks down the tree, comparing children, until it finds the leaves that disagree. The bandwidth cost is logarithmic in the size of the diverged range, not the size of the dataset.
- Gossip-based membership and failure detection. No central coordinator. Each node periodically exchanges membership state with a random peer; failed nodes are inferred from missed responses on the data path, not from a separate heartbeat. Removes a single point of failure but introduces a short window where the cluster disagrees on who's in.
Criticisms and limitations
Two decades on, the design is widely admired and widely scolded. Three lines of criticism worth knowing:
- Eventual consistency is hard to reason about. Application developers have to write merge functions and tolerate seeing stale or divergent data. The famous example, reported by Amazon's own teams, is the "deleted item comes back to the cart" bug: a vector-clock branch contains an older version that still has the item; merging by union resurrects it. The general failure pattern — deletes that lose to concurrent writes — is intrinsic to last-write-wins on eventually-consistent stores.
- Vector clocks grow without bound. Every coordinator that touches a key adds an entry. In a sloppy-quorum world that can be many nodes over time. The paper handwaves this with truncation, but in practice it caused real problems. Riak — the closest open-source Dynamo descendant — eventually replaced plain vector clocks with dotted version vectors (DVVs) precisely to bound their size while preserving causal ordering.
- R+W>N is theoretical, not practical. The paper sells the (N, R, W) tunable as a knob you turn per-request. Production Dynamo systems almost always pick one durability profile and stick with it; turning consistency knobs per-call requires application code that almost no team writes correctly. The Cassandra community came to the same conclusion: tunable consistency is a deploy-time choice, not a per-query one.
Worth adding: the paper measures latency in the 99.9th percentile and reports sub-300ms reads on its production workload. That bar moved a lot in the years that followed — modern KV stores expect sub-10ms p99 — but the framing (measure tails, not means) was itself a contribution.
Where it shows up today
| System | Relationship to Dynamo |
|---|---|
| Cassandra | Facebook's open-source descendant, now Apache. Uses consistent hashing, gossip-based membership, tunable (N, R, W), and Merkle-tree anti-entropy. Adds an SSTable / log-structured storage engine inspired by Bigtable — Dynamo's ring plus Bigtable's bottom half. |
| Riak | Basho's open-source clone. The most faithful re-implementation: same ring, same vector clocks, same sloppy quorum and hinted handoff. Later replaced vector clocks with dotted version vectors. |
| Voldemort | LinkedIn's Dynamo-style store, used heavily for the social graph in the early 2010s. |
| Aerospike | Commercial Dynamo-lineage store optimised for SSDs and low-latency ad-tech workloads. |
| DynamoDB | Amazon's 2012 managed service. Confusingly named — it shares the brand and the partitioning ideas, but the consistency model is different: DynamoDB defaults to eventually consistent and offers strongly consistent reads as an option. The 2022 USENIX ATC paper "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service" describes the production system in detail. |
The deeper inheritance is conceptual. Anyone who has read this paper recognises the moves: hash-ring partitioning in every modern sharded cache and KV store; gossip plus heartbeats in every cluster manager; conflict resolution pushed to the application in every CRDT-based collaborative editor. Dynamo's specific design has been superseded many times, but its vocabulary became the vocabulary of distributed storage.
Follow-up reading
- DeCandia et al — Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — the paper itself. Section 4 (system architecture) is the dense part; section 6 (experiences) is the part with the production numbers.
- Werner Vogels — Eventually Consistent (CACM, 2008) — the framing piece. Defines BASE, read-your-writes, monotonic reads, and the other guarantees that sit between linearizable and "no consistency at all".
- Basho — Dotted Version Vectors — Riak's fix for the unbounded vector-clock problem. Worth reading if you're building anything that needs causal history without runaway metadata.
- Cassandra docs — Dynamo-style replication — how the design choices translated into a long-lived production codebase, and where Cassandra deviated.
- Lamport — Time, Clocks, and the Ordering of Events — the vector-clock ancestor, and the right paper to read before this one.
- Spanner — the opposite trade — Google's bet that hardware clocks plus Paxos can give you strong consistency at global scale. Reading Dynamo and Spanner back-to-back is the cleanest way to see what consistency actually costs.
- Bigtable — the strong-consistency contemporary — Google's single-master, strongly-consistent table store, published a year before Dynamo. The same problem space, the opposite design choice.
- CRDTs — eventually-consistent merge by construction — what happened when the research community took Dynamo's "application merges conflicts" idea and tried to make the merge automatic.