A distributed key-value store
The most theoretical design in the playbook. Dynamo-shaped — consistent hashing for placement, sloppy-quorum reads and writes, vector clocks for conflict detection, hinted handoff for short-term failures, anti-entropy for long-term ones, gossip for membership. Walk this carefully; the same patterns reappear in news-feed sharding, notification fan-out, and any design where "what happens during a partition" is the hard question.
1 · Clarifying questions
| What's the access pattern? | Pure KV — get(key), put(key, value), delete(key). No range scans. The entire design rests on this. |
| Value sizes? | Average 1 KB, hard cap 1 MB. Bigger values warrant a different design (object storage). |
| Consistency model? | Eventually consistent with tunable read/write quorum. The product can pick (R,W,N) per call. |
| Read/write ratio? | ~10:1, but neither side is small — 100K writes/s and 1M reads/s peak. |
| Availability target? | 99.99% on writes, 99.999% on reads. Multi-region active-active is mandatory. |
| Latency budget? | P99 ≤ 10 ms in-region read, ≤ 50 ms write. Cross-region reads can be 100 ms. |
| Durability? | 11 nines (S3's claim). Three replicas in two AZs, plus async cross-region. |
| Conflict resolution? | Vector clocks; resolution is application-driven via "siblings" — the client decides on read. |
2 · Capacity math, on a napkin
| Number | Calculation | Result |
|---|---|---|
| Keys (initial) | given | 1B |
| Avg value size | given | 1 KB |
| Logical data | 1B × 1 KB | ~1 TB |
| With 3× replication | ×3 | ~3 TB |
| With overhead (vector clocks, metadata) | ×1.3 | ~4 TB |
| Per-node storage (32 nodes) | 4 TB / 32 | ~128 GB / node |
| Write QPS (peak) | given | 100K / s |
| Replication QPS | writes × 3 | 300K / s |
| Per-node write QPS | 300K / 32 | ~10K / s |
| Read QPS (peak) | given | 1M / s |
| Per-node read QPS | 1M × 1 (no replica fanout) / 32 | ~31K / s |
| Concurrency, reads | Little's: 1M × 0.005 s | ~5K in-flight |
32 nodes is enough for 4 TB at this load with comfortable headroom. The choice scales linearly with virtual nodes (vnodes) — at 256 vnodes per physical, we can grow to 256 nodes without re-hashing.
3 · API and data model
Endpoints
PUT /v1/keys/{key} # write
Body: raw bytes, ≤ 1 MB
Headers: X-Quorum-W: 2 # require 2 acks
X-Vector-Clock: <ctx> # optional, returned on prior get
→ 200 {"vc": "<ctx>"} # opaque vector clock context
→ 409 {"siblings": [...]} # if conflict and W met but quorum split
GET /v1/keys/{key} # read
Headers: X-Quorum-R: 2 # require 2 replicas
→ 200 {"value": <bytes>, "vc": "<ctx>"}
→ 200 {"siblings": [{value, vc}, ...]} # caller resolves
→ 404 {"vc": "<tombstone-ctx>"}
DELETE /v1/keys/{key} # tombstone, GC after grace period
→ 200The API exposes the consistency knobs (R, W) per call. Most product code calls with R=2, W=2, N=3 — the Dynamo defaults. Strong-read code calls R=3 (read quorum equals replica count).
On-disk layout, per node
Storage engine: RocksDB (LSM, native-code, well-instrumented)
key bucket value
─────── ─────
md5(user-key) → vnode -> SST files (sorted by key, compacted in tiers)
Per record:
key (≤ 256 B)
value (≤ 1 MB)
vector_clock (~8 B per actor, ~64 B typical)
timestamp (8 B, last-write tiebreaker)
tombstone_flag (1 B)4 · High-level architecture
A coordinator (any node) hashes the key into the ring and forwards the request to the N replicas. Each node is identical software; the differences are which vnodes they own, what's in their hint queue, and what the gossip layer has told them about peers.
5 · The hard part — placement, replication, quorum
Consistent hashing with vnodes
Consistent hashing puts each node and each key on a ring. A key belongs to the next N nodes clockwise from its hash. Vnodes — many tokens per physical node — even out the load and make rebalancing cheap.
| Choice | Value | Why |
|---|---|---|
| Hash function | MD5 (legacy) or xxHash64 (new) | Cheap, well-distributed. Cryptographic strength not needed. |
| Vnodes per physical node | 256 | Smooths load; standard deviation drops as 1/√v. |
| Replicas per key (N) | 3 | Three is the magic number — 2 fails too easily, 4 doubles cost. |
| Replica placement | Walk the ring, skipping vnodes that share a fault domain (rack / AZ) | One node down ≠ data unavailable. Multi-AZ is a hard requirement. |
- 1 node 3 @ 69.81%
- 2 node 0 @ 70.15%
- 3 node 2 @ 72.94%
Sloppy quorum: R + W > N
Each request specifies how many replicas must ack — R for reads, W for writes — against the N replicas. The classic settings:
| (N, W, R) | What it gives you | Used for |
|---|---|---|
| (3, 1, 1) | Fast, weakly consistent | Caches, telemetry, anything tolerant of lost writes |
| (3, 2, 2) | Quorum (R + W > N) — strong-ish | Default for most product calls |
| (3, 3, 1) | Read-fast, write-slow | Read-heavy, ack-when-everyone-wrote |
| (3, 1, 3) | Write-fast, read-slow | Write-heavy with consistent reads on demand |
"Sloppy" means the W replicas don't have to be the canonical N — if a primary is down, a substitute (next clockwise) accepts the write and adds a hint. This improves availability under partial failure at the cost of a brief inconsistency window.
6 · Vector clocks and conflict resolution
Concurrent writes can produce siblings — values that neither dominates the other in the partial order. Vector clocks track causality so the system knows when this has happened.
Time: T0 T1 T2 T3
Alice writes A, Bob writes B,
both see {} sees {Alice:1} Coordinator merge — concurrent: siblings
VC(A) = {Alice:1}
VC(B) = {Bob:1, Alice:1} Bob saw Alice's prior write — B dominates A
VC(B) = {Bob:1} Bob did NOT see A — concurrent → siblings
On read:
if exactly one value has the maximal VC → return it.
else → return ALL siblings; client resolves and writes back with combined VC.In production, the VC for a key with N=3 has at most a few entries (one per frequently-writing actor). Garbage-collect entries past a depth bound to keep the VC small.
7 · Hinted handoff and anti-entropy
Hinted handoff (short-term)
A node can't reach replica N5 for key k. The coordinator picks a substitute (N6), which durably stores the write plus a hint: "this belongs to N5; deliver when you can". When N5 returns, N6 hands the data over and deletes the hint.
Bounded by hint-queue size. If the queue overflows, the system falls back to anti-entropy. The hint queue itself is durable — a node restarting must not lose pending handoffs.
Anti-entropy with Merkle trees (long-term)
Periodically, replica peers compare Merkle trees of their key ranges. Differing subtrees are walked top-down until the divergent keys are isolated, then the missing values are pulled. This catches:
- Hint-queue overflows that dropped writes.
- Bit-rot or silent corruption.
- Replicas that fell behind during long partitions.
Tune the comparison cadence (every N hours per range) and the leaf granularity (rows vs. blocks). Too frequent → wasted bandwidth. Too sparse → recovery is slow.
8 · Membership and gossip
Nodes join, leave, and crash. The cluster needs an eventually-consistent view of membership without a coordinator. Gossip — each node periodically picks a random peer, exchanges versioned membership state, picks the higher version — converges in O(log N) rounds.
| State | Format | Propagation |
|---|---|---|
| Membership | node → {address, vnode tokens, version, status} | Gossip every 1 s; convergence ~5 s for 100-node cluster |
| Failure detection | φ-accrual detector (Hayashibara) | Per-node phi score; threshold-based suspect/dead |
| Schema / config | Versioned config blob | Same gossip plumbing |
φ-accrual is more nuanced than a fixed timeout: it computes a confidence score that the peer is dead, accommodating noisy networks. A node with phi above ~8 is marked suspect; above ~15, dead.
9 · Failure modes & runbook
| Failure | Symptom | Mitigation |
|---|---|---|
| One node crash | Hint queues fill on its peers; reads to that vnode 33% slower (1 of 3 replicas dead) | Auto-detected via gossip in ~5 s; coordinator routes around. Hints drained after restart. |
| One AZ down | 1/3 of replicas unreachable; reads see W − 1 acks | Quorum still met (2 of 3); no impact on reads. Writes get hints for the unreachable replicas. Restore handoff when AZ returns. |
| Region partition | Cross-region replication paused | In-region reads/writes fine. Cross-region replication backlog drains when the partition heals; conflicts resolved via vector clocks at read time. |
| Two nodes in same range crash | W=2 may not be reachable; quorum violated | Sloppy quorum picks substitutes; with N=3, R=W=2 we tolerate one loss but not two. Page on this. |
| Hot key | One vnode at 100% CPU; others idle | Detect by per-key QPS percentile; promote the hot key to in-process cache on the coordinator; or split the key across vnodes (rare; usually a product fix). |
| Hint-queue overflow | Disk filling on a node holding many hints for a long-down peer | Force-drop hints, fall back to AAE on peer return. Trigger AAE earlier when hint queue > threshold. |
| Anti-entropy bandwidth blowup | Replica catch-up saturating cross-AZ link | Rate-limit AAE per peer; throttle during peak traffic windows. |
10 · Cost & SLOs
| Line | Estimate | Note |
|---|---|---|
| Compute (32 nodes, 8 vCPU, 32 GB RAM) | ~$8K / month | m6i.2xlarge equivalents, reserved |
| Storage (32 × 500 GB NVMe) | ~$2K / month | Headroom for 4× growth |
| Inter-AZ traffic | ~$3K / month | Replication + AAE; the main hidden cost |
| Cross-region replication | ~$4K / month | Async; rate-shaped |
| Total | ~$17K / month per region | Two regions: ~$34K. About $0.34 per million ops. |
SLOs
- Read availability: 99.999%. 5 minutes/year. Multi-AZ + region failover make this achievable.
- Write availability: 99.99%. ~52 min/year. Slightly tighter because writes need quorum.
- Read P99 in-region: 10 ms. Three replicas, network + RocksDB miss is the worst case.
- Write P99 in-region: 50 ms. W=2 acks; the slowest of the two.
- Eventual-consistency window: P99 ≤ 1 s in-region, ≤ 10 s cross-region. Tracked via injected probe writes.
11 · Trade-offs & "what would you change at 10×"
| If… | Then… |
|---|---|
| 10× data (40 TB) | Add nodes (vnodes redistribute automatically). Move from m6i.2xlarge to i4i.2xlarge for direct-attached NVMe. |
| Strict consistency required | Switch to a Raft-per-shard design (CockroachDB-style). You give up some availability under partition for linearizable reads. |
| Active-active multi-region | Already true here. The CRDT crowd would push toward last-writer-wins or operational transforms; vector clocks + sibling resolution is the more honest answer. |
| Range scans needed | Different design. Layer a B-tree on top (Bigtable) or move to a partitioned-by-row-key system (HBase, FoundationDB). |
| Strong global ordering needed | Spanner-style with TrueTime and 2PC over Raft groups. ~10× the cost; rarely the right answer. |
| "What would a more senior answer add?" | The operations story: rolling upgrades without downtime, secondary-index repair, repair throttling, the disaster-recovery runbook. The senior answer "it self-heals" is true at small scale; at 1000+ nodes the operations are a full sub-system. |
Further reading
- DeCandia et al — "Dynamo: Amazon's Highly Available Key-value Store" (2007). The paper. Read it. Twice. The architectural blueprint for everything in this design.
- Lakshman & Malik — "Cassandra: A Decentralized Structured Storage System" (2010). Dynamo + Bigtable's row model. The most-deployed Dynamo descendant.
- Hayashibara et al — "The φ Accrual Failure Detector" (2004). The math behind the failure detector mentioned above. ~10 pages.
- Shapiro et al — "A comprehensive study of CRDTs" (2011). The other route to conflict-free replication. Useful contrast with vector clocks.
- Brewer — "Pushing the CAP" (2012). A decade after CAP, the tradeoff in production language. Worth re-reading before any design with quorum in it.
- Adjacent: Replication. The theoretical foundation of this section.
- Adjacent: Consistent hashing. The placement scheme, with diagrams.
- Adjacent: Vector clocks. A standalone walkthrough.