03 / 19
Playbook / 03

A distributed key-value store

The most theoretical design in the playbook. Dynamo-shaped — consistent hashing for placement, sloppy-quorum reads and writes, vector clocks for conflict detection, hinted handoff for short-term failures, anti-entropy for long-term ones, gossip for membership. Walk this carefully; the same patterns reappear in news-feed sharding, notification fan-out, and any design where "what happens during a partition" is the hard question.


1 · Clarifying questions

What's the access pattern?Pure KV — get(key), put(key, value), delete(key). No range scans. The entire design rests on this.
Value sizes?Average 1 KB, hard cap 1 MB. Bigger values warrant a different design (object storage).
Consistency model?Eventually consistent with tunable read/write quorum. The product can pick (R,W,N) per call.
Read/write ratio?~10:1, but neither side is small — 100K writes/s and 1M reads/s peak.
Availability target?99.99% on writes, 99.999% on reads. Multi-region active-active is mandatory.
Latency budget?P99 ≤ 10 ms in-region read, ≤ 50 ms write. Cross-region reads can be 100 ms.
Durability?11 nines (S3's claim). Three replicas in two AZs, plus async cross-region.
Conflict resolution?Vector clocks; resolution is application-driven via "siblings" — the client decides on read.

2 · Capacity math, on a napkin

NumberCalculationResult
Keys (initial)given1B
Avg value sizegiven1 KB
Logical data1B × 1 KB~1 TB
With 3× replication×3~3 TB
With overhead (vector clocks, metadata)×1.3~4 TB
Per-node storage (32 nodes)4 TB / 32~128 GB / node
Write QPS (peak)given100K / s
Replication QPSwrites × 3300K / s
Per-node write QPS300K / 32~10K / s
Read QPS (peak)given1M / s
Per-node read QPS1M × 1 (no replica fanout) / 32~31K / s
Concurrency, readsLittle's: 1M × 0.005 s~5K in-flight

32 nodes is enough for 4 TB at this load with comfortable headroom. The choice scales linearly with virtual nodes (vnodes) — at 256 vnodes per physical, we can grow to 256 nodes without re-hashing.

Interactive · KV store — adjust and watch Drag the sliders, watch the math.
Presets
100.0K
1.0 KB
10:1
10 ms
×3
10%
1 yr
Throughput
Write QPS, peak300.0K
Read QPS, peak3.0M
Read concurrency (Little's)30.0K
Write concurrency3.0K
Storage & cache
Raw / day8.0 TB
Raw / horizon2.87 PB
With ×3 replication8.60 PB
Hot-set cache (RAM)293.7 TB
Read egress / day80.5 TB
Formulas. Concurrency = QPS × P99 (Little's law). Storage / day = QPS × 86 400 × payload. Peak factor 3× over avg. Hot-set is the RAM working set you'd need for that read percentage.

3 · API and data model

Endpoints

PUT /v1/keys/{key} # write
 Body: raw bytes, ≤ 1 MB
 Headers: X-Quorum-W: 2 # require 2 acks
 X-Vector-Clock: <ctx> # optional, returned on prior get
→ 200 {"vc": "<ctx>"} # opaque vector clock context
→ 409 {"siblings": [...]} # if conflict and W met but quorum split

GET /v1/keys/{key} # read
 Headers: X-Quorum-R: 2 # require 2 replicas
→ 200 {"value": <bytes>, "vc": "<ctx>"}
→ 200 {"siblings": [{value, vc}, ...]} # caller resolves
→ 404 {"vc": "<tombstone-ctx>"}

DELETE /v1/keys/{key} # tombstone, GC after grace period
→ 200

The API exposes the consistency knobs (R, W) per call. Most product code calls with R=2, W=2, N=3 — the Dynamo defaults. Strong-read code calls R=3 (read quorum equals replica count).

On-disk layout, per node

Storage engine: RocksDB (LSM, native-code, well-instrumented)

key bucket value
─────── ─────
md5(user-key) → vnode -> SST files (sorted by key, compacted in tiers)

Per record:
 key (≤ 256 B)
 value (≤ 1 MB)
 vector_clock (~8 B per actor, ~64 B typical)
 timestamp (8 B, last-write tiebreaker)
 tombstone_flag (1 B)

4 · High-level architecture

A coordinator (any node) hashes the key into the ring and forwards the request to the N replicas. Each node is identical software; the differences are which vnodes they own, what's in their hint queue, and what the gossip layer has told them about peers.

5 · The hard part — placement, replication, quorum

Consistent hashing with vnodes

Consistent hashing puts each node and each key on a ring. A key belongs to the next N nodes clockwise from its hash. Vnodes — many tokens per physical node — even out the load and make rebalancing cheap.

ChoiceValueWhy
Hash functionMD5 (legacy) or xxHash64 (new)Cheap, well-distributed. Cryptographic strength not needed.
Vnodes per physical node256Smooths load; standard deviation drops as 1/√v.
Replicas per key (N)3Three is the magic number — 2 fails too easily, 4 doubles cost.
Replica placementWalk the ring, skipping vnodes that share a fault domain (rack / AZ)One node down ≠ data unavailable. Multi-AZ is a hard requirement.
Interactive · Consistent hash ring Replicas walk clockwise · skipping same-node vnodes
1 2 3 keyuser:42hash 69.27%
4
16
Replicas (3-way, clockwise)
  1. 1 node 3 @ 69.81%
  2. 2 node 0 @ 70.15%
  3. 3 node 2 @ 72.94%
Load distribution (% of ring)
node 0
26.3%
node 1
25.2%
node 2
26.0%
node 3
22.5%
Try this. Set vnodes = 1 — the load bars become uneven. Bump vnodes to 32 — they smooth out (standard deviation drops as 1/√v). Drop a node from 4 to 3 and only the keys that landed on the dropped node's vnodes move; the rest stay put. That's the "consistent" in consistent hashing.

Sloppy quorum: R + W > N

Each request specifies how many replicas must ack — R for reads, W for writes — against the N replicas. The classic settings:

(N, W, R)What it gives youUsed for
(3, 1, 1)Fast, weakly consistentCaches, telemetry, anything tolerant of lost writes
(3, 2, 2)Quorum (R + W > N) — strong-ishDefault for most product calls
(3, 3, 1)Read-fast, write-slowRead-heavy, ack-when-everyone-wrote
(3, 1, 3)Write-fast, read-slowWrite-heavy with consistent reads on demand

"Sloppy" means the W replicas don't have to be the canonical N — if a primary is down, a substitute (next clockwise) accepts the write and adds a hint. This improves availability under partial failure at the cost of a brief inconsistency window.

6 · Vector clocks and conflict resolution

Concurrent writes can produce siblings — values that neither dominates the other in the partial order. Vector clocks track causality so the system knows when this has happened.

Time: T0 T1 T2 T3
Alice writes A, Bob writes B,
both see {} sees {Alice:1} Coordinator merge — concurrent: siblings

VC(A) = {Alice:1}
VC(B) = {Bob:1, Alice:1} Bob saw Alice's prior write — B dominates A
VC(B) = {Bob:1} Bob did NOT see A — concurrent → siblings

On read:
 if exactly one value has the maximal VC → return it.
 else → return ALL siblings; client resolves and writes back with combined VC.

In production, the VC for a key with N=3 has at most a few entries (one per frequently-writing actor). Garbage-collect entries past a depth bound to keep the VC small.

The shopping-cart classic. Dynamo's example: two siblings of a cart, one with item X, one with item Y. The application-side resolution is "union of items". The point is the system doesn't try to be cleverer than the application — it surfaces conflicts and lets the product decide.

7 · Hinted handoff and anti-entropy

Hinted handoff (short-term)

A node can't reach replica N5 for key k. The coordinator picks a substitute (N6), which durably stores the write plus a hint: "this belongs to N5; deliver when you can". When N5 returns, N6 hands the data over and deletes the hint.

Bounded by hint-queue size. If the queue overflows, the system falls back to anti-entropy. The hint queue itself is durable — a node restarting must not lose pending handoffs.

Anti-entropy with Merkle trees (long-term)

Periodically, replica peers compare Merkle trees of their key ranges. Differing subtrees are walked top-down until the divergent keys are isolated, then the missing values are pulled. This catches:

  • Hint-queue overflows that dropped writes.
  • Bit-rot or silent corruption.
  • Replicas that fell behind during long partitions.

Tune the comparison cadence (every N hours per range) and the leaf granularity (rows vs. blocks). Too frequent → wasted bandwidth. Too sparse → recovery is slow.

8 · Membership and gossip

Nodes join, leave, and crash. The cluster needs an eventually-consistent view of membership without a coordinator. Gossip — each node periodically picks a random peer, exchanges versioned membership state, picks the higher version — converges in O(log N) rounds.

StateFormatPropagation
Membershipnode → {address, vnode tokens, version, status}Gossip every 1 s; convergence ~5 s for 100-node cluster
Failure detectionφ-accrual detector (Hayashibara)Per-node phi score; threshold-based suspect/dead
Schema / configVersioned config blobSame gossip plumbing

φ-accrual is more nuanced than a fixed timeout: it computes a confidence score that the peer is dead, accommodating noisy networks. A node with phi above ~8 is marked suspect; above ~15, dead.

9 · Failure modes & runbook

FailureSymptomMitigation
One node crashHint queues fill on its peers; reads to that vnode 33% slower (1 of 3 replicas dead)Auto-detected via gossip in ~5 s; coordinator routes around. Hints drained after restart.
One AZ down1/3 of replicas unreachable; reads see W − 1 acksQuorum still met (2 of 3); no impact on reads. Writes get hints for the unreachable replicas. Restore handoff when AZ returns.
Region partitionCross-region replication pausedIn-region reads/writes fine. Cross-region replication backlog drains when the partition heals; conflicts resolved via vector clocks at read time.
Two nodes in same range crashW=2 may not be reachable; quorum violatedSloppy quorum picks substitutes; with N=3, R=W=2 we tolerate one loss but not two. Page on this.
Hot keyOne vnode at 100% CPU; others idleDetect by per-key QPS percentile; promote the hot key to in-process cache on the coordinator; or split the key across vnodes (rare; usually a product fix).
Hint-queue overflowDisk filling on a node holding many hints for a long-down peerForce-drop hints, fall back to AAE on peer return. Trigger AAE earlier when hint queue > threshold.
Anti-entropy bandwidth blowupReplica catch-up saturating cross-AZ linkRate-limit AAE per peer; throttle during peak traffic windows.

10 · Cost & SLOs

LineEstimateNote
Compute (32 nodes, 8 vCPU, 32 GB RAM)~$8K / monthm6i.2xlarge equivalents, reserved
Storage (32 × 500 GB NVMe)~$2K / monthHeadroom for 4× growth
Inter-AZ traffic~$3K / monthReplication + AAE; the main hidden cost
Cross-region replication~$4K / monthAsync; rate-shaped
Total~$17K / month per regionTwo regions: ~$34K. About $0.34 per million ops.

SLOs

  • Read availability: 99.999%. 5 minutes/year. Multi-AZ + region failover make this achievable.
  • Write availability: 99.99%. ~52 min/year. Slightly tighter because writes need quorum.
  • Read P99 in-region: 10 ms. Three replicas, network + RocksDB miss is the worst case.
  • Write P99 in-region: 50 ms. W=2 acks; the slowest of the two.
  • Eventual-consistency window: P99 ≤ 1 s in-region, ≤ 10 s cross-region. Tracked via injected probe writes.

11 · Trade-offs & "what would you change at 10×"

If…Then…
10× data (40 TB)Add nodes (vnodes redistribute automatically). Move from m6i.2xlarge to i4i.2xlarge for direct-attached NVMe.
Strict consistency requiredSwitch to a Raft-per-shard design (CockroachDB-style). You give up some availability under partition for linearizable reads.
Active-active multi-regionAlready true here. The CRDT crowd would push toward last-writer-wins or operational transforms; vector clocks + sibling resolution is the more honest answer.
Range scans neededDifferent design. Layer a B-tree on top (Bigtable) or move to a partitioned-by-row-key system (HBase, FoundationDB).
Strong global ordering neededSpanner-style with TrueTime and 2PC over Raft groups. ~10× the cost; rarely the right answer.
"What would a more senior answer add?"The operations story: rolling upgrades without downtime, secondary-index repair, repair throttling, the disaster-recovery runbook. The senior answer "it self-heals" is true at small scale; at 1000+ nodes the operations are a full sub-system.

Further reading

  • DeCandia et al — "Dynamo: Amazon's Highly Available Key-value Store" (2007). The paper. Read it. Twice. The architectural blueprint for everything in this design.
  • Lakshman & Malik — "Cassandra: A Decentralized Structured Storage System" (2010). Dynamo + Bigtable's row model. The most-deployed Dynamo descendant.
  • Hayashibara et al — "The φ Accrual Failure Detector" (2004). The math behind the failure detector mentioned above. ~10 pages.
  • Shapiro et al — "A comprehensive study of CRDTs" (2011). The other route to conflict-free replication. Useful contrast with vector clocks.
  • Brewer — "Pushing the CAP" (2012). A decade after CAP, the tradeoff in production language. Worth re-reading before any design with quorum in it.
  • Adjacent: Replication. The theoretical foundation of this section.
  • Adjacent: Consistent hashing. The placement scheme, with diagrams.
  • Adjacent: Vector clocks. A standalone walkthrough.
Found this useful?