03 / 19

Playbook / 03

A distributed key-value store

The most theoretical design in the playbook. Dynamo-shaped — consistent hashing for placement, sloppy-quorum reads and writes, vector clocks for conflict detection, hinted handoff for short-term failures, anti-entropy for long-term ones, gossip for membership. Walk this carefully; the same patterns reappear in news-feed sharding, notification fan-out, and any design where "what happens during a partition" is the hard question.

1 · Clarifying questions

What's the access pattern?	Pure KV — `get(key)`, `put(key, value)`, `delete(key)`. No range scans. The entire design rests on this.
Value sizes?	Average 1 KB, hard cap 1 MB. Bigger values warrant a different design (object storage).
Consistency model?	Eventually consistent with tunable read/write quorum. The product can pick (R,W,N) per call.
Read/write ratio?	~10:1, but neither side is small — 100K writes/s and 1M reads/s peak.
Availability target?	99.99% on writes, 99.999% on reads. Multi-region active-active is mandatory.
Latency budget?	P99 ≤ 10 ms in-region read, ≤ 50 ms write. Cross-region reads can be 100 ms.
Durability?	11 nines (S3's claim). Three replicas in two AZs, plus async cross-region.
Conflict resolution?	Vector clocks; resolution is application-driven via "siblings" — the client decides on read.

2 · Capacity math, on a napkin

Number	Calculation	Result
Keys (initial)	given	1B
Avg value size	given	1 KB
Logical data	1B × 1 KB	~1 TB
With 3× replication	×3	~3 TB
With overhead (vector clocks, metadata)	×1.3	~4 TB
Per-node storage (32 nodes)	4 TB / 32	~128 GB / node
Write QPS (peak)	given	100K / s
Replication QPS	writes × 3	300K / s
Per-node write QPS	300K / 32	~10K / s
Read QPS (peak)	given	1M / s
Per-node read QPS	1M × 1 (no replica fanout) / 32	~31K / s
Concurrency, reads	Little's: 1M × 0.005 s	~5K in-flight

32 nodes is enough for 4 TB at this load with comfortable headroom. The choice scales linearly with virtual nodes (vnodes) — at 256 vnodes per physical, we can grow to 256 nodes without re-hashing.

Interactive · KV store — adjust and watch Drag the sliders, watch the math.

Presets

Write QPS 100.0K

Payload (KB) 1.0 KB

R / W ratio 10:1

P99 (ms) 10 ms

Replication ×3

Hot set % 10%

Horizon 1 yr

Throughput

Write QPS, peak300.0K

Read QPS, peak3.0M

Read concurrency (Little's)30.0K

Write concurrency3.0K

Storage & cache

Raw / day8.0 TB

Raw / horizon2.87 PB

With ×3 replication8.60 PB

Hot-set cache (RAM)293.7 TB

Read egress / day80.5 TB

Formulas. Concurrency = QPS × P99 (Little's law). Storage / day = QPS × 86 400 × payload. Peak factor 3× over avg. Hot-set is the RAM working set you'd need for that read percentage.

3 · API and data model

Endpoints

PUT /v1/keys/{key} # write
 Body: raw bytes, ≤ 1 MB
 Headers: X-Quorum-W: 2 # require 2 acks
 X-Vector-Clock: <ctx> # optional, returned on prior get
→ 200 {"vc": "<ctx>"} # opaque vector clock context
→ 409 {"siblings": [...]} # if conflict and W met but quorum split

GET /v1/keys/{key} # read
 Headers: X-Quorum-R: 2 # require 2 replicas
→ 200 {"value": <bytes>, "vc": "<ctx>"}
→ 200 {"siblings": [{value, vc}, ...]} # caller resolves
→ 404 {"vc": "<tombstone-ctx>"}

DELETE /v1/keys/{key} # tombstone, GC after grace period
→ 200

The API exposes the consistency knobs (R, W) per call. Most product code calls with R=2, W=2, N=3 — the Dynamo defaults. Strong-read code calls R=3 (read quorum equals replica count).

On-disk layout, per node

Storage engine: RocksDB (LSM, native-code, well-instrumented)

key bucket value
─────── ─────
md5(user-key) → vnode -> SST files (sorted by key, compacted in tiers)

Per record:
 key (≤ 256 B)
 value (≤ 1 MB)
 vector_clock (~8 B per actor, ~64 B typical)
 timestamp (8 B, last-write tiebreaker)
 tombstone_flag (1 B)

4 · High-level architecture

A coordinator (any node) hashes the key into the ring and forwards the request to the N replicas. Each node is identical software; the differences are which vnodes they own, what's in their hint queue, and what the gossip layer has told them about peers.

5 · The hard part — placement, replication, quorum

Consistent hashing with vnodes

Consistent hashing puts each node and each key on a ring. A key belongs to the next N nodes clockwise from its hash. Vnodes — many tokens per physical node — even out the load and make rebalancing cheap.

Choice	Value	Why
Hash function	MD5 (legacy) or xxHash64 (new)	Cheap, well-distributed. Cryptographic strength not needed.
Vnodes per physical node	256	Smooths load; standard deviation drops as 1/√v.
Replicas per key (N)	3	Three is the magic number — 2 fails too easily, 4 doubles cost.
Replica placement	Walk the ring, skipping vnodes that share a fault domain (rack / AZ)	One node down ≠ data unavailable. Multi-AZ is a hard requirement.

Interactive · Consistent hash ring Replicas walk clockwise · skipping same-node vnodes

Nodes 4

Vnodes / node 16

Key

Replicas (3-way, clockwise)

1 node 3 @ 69.81%
2 node 0 @ 70.15%
3 node 2 @ 72.94%

Load distribution (% of ring)

node 0

26.3%

node 1

25.2%

node 2

26.0%

node 3

22.5%

Try this. Set vnodes = 1 — the load bars become uneven. Bump vnodes to 32 — they smooth out (standard deviation drops as 1/√v). Drop a node from 4 to 3 and only the keys that landed on the dropped node's vnodes move; the rest stay put. That's the "consistent" in consistent hashing.

Sloppy quorum: R + W > N

Each request specifies how many replicas must ack — R for reads, W for writes — against the N replicas. The classic settings:

(N, W, R)	What it gives you	Used for
(3, 1, 1)	Fast, weakly consistent	Caches, telemetry, anything tolerant of lost writes
(3, 2, 2)	Quorum (R + W > N) — strong-ish	Default for most product calls
(3, 3, 1)	Read-fast, write-slow	Read-heavy, ack-when-everyone-wrote
(3, 1, 3)	Write-fast, read-slow	Write-heavy with consistent reads on demand

"Sloppy" means the W replicas don't have to be the canonical N — if a primary is down, a substitute (next clockwise) accepts the write and adds a hint. This improves availability under partial failure at the cost of a brief inconsistency window.

6 · Vector clocks and conflict resolution

Concurrent writes can produce siblings — values that neither dominates the other in the partial order. Vector clocks track causality so the system knows when this has happened.

Time: T0 T1 T2 T3
Alice writes A, Bob writes B,
both see {} sees {Alice:1} Coordinator merge — concurrent: siblings

VC(A) = {Alice:1}
VC(B) = {Bob:1, Alice:1} Bob saw Alice's prior write — B dominates A
VC(B) = {Bob:1} Bob did NOT see A — concurrent → siblings

On read:
 if exactly one value has the maximal VC → return it.
 else → return ALL siblings; client resolves and writes back with combined VC.

In production, the VC for a key with N=3 has at most a few entries (one per frequently-writing actor). Garbage-collect entries past a depth bound to keep the VC small.

The shopping-cart classic. Dynamo's example: two siblings of a cart, one with item X, one with item Y. The application-side resolution is "union of items". The point is the system doesn't try to be cleverer than the application — it surfaces conflicts and lets the product decide.

7 · Hinted handoff and anti-entropy

Hinted handoff (short-term)

A node can't reach replica N5 for key k. The coordinator picks a substitute (N6), which durably stores the write plus a hint: "this belongs to N5; deliver when you can". When N5 returns, N6 hands the data over and deletes the hint.

Bounded by hint-queue size. If the queue overflows, the system falls back to anti-entropy. The hint queue itself is durable — a node restarting must not lose pending handoffs.

Anti-entropy with Merkle trees (long-term)

Periodically, replica peers compare Merkle trees of their key ranges. Differing subtrees are walked top-down until the divergent keys are isolated, then the missing values are pulled. This catches:

Hint-queue overflows that dropped writes.
Bit-rot or silent corruption.
Replicas that fell behind during long partitions.

Tune the comparison cadence (every N hours per range) and the leaf granularity (rows vs. blocks). Too frequent → wasted bandwidth. Too sparse → recovery is slow.

8 · Membership and gossip

Nodes join, leave, and crash. The cluster needs an eventually-consistent view of membership without a coordinator. Gossip — each node periodically picks a random peer, exchanges versioned membership state, picks the higher version — converges in O(log N) rounds.

State	Format	Propagation
Membership	node → {address, vnode tokens, version, status}	Gossip every 1 s; convergence ~5 s for 100-node cluster
Failure detection	φ-accrual detector (Hayashibara)	Per-node phi score; threshold-based suspect/dead
Schema / config	Versioned config blob	Same gossip plumbing

φ-accrual is more nuanced than a fixed timeout: it computes a confidence score that the peer is dead, accommodating noisy networks. A node with phi above ~8 is marked suspect; above ~15, dead.

9 · Failure modes & runbook

Failure	Symptom	Mitigation
One node crash	Hint queues fill on its peers; reads to that vnode 33% slower (1 of 3 replicas dead)	Auto-detected via gossip in ~5 s; coordinator routes around. Hints drained after restart.
One AZ down	1/3 of replicas unreachable; reads see W − 1 acks	Quorum still met (2 of 3); no impact on reads. Writes get hints for the unreachable replicas. Restore handoff when AZ returns.
Region partition	Cross-region replication paused	In-region reads/writes fine. Cross-region replication backlog drains when the partition heals; conflicts resolved via vector clocks at read time.
Two nodes in same range crash	W=2 may not be reachable; quorum violated	Sloppy quorum picks substitutes; with N=3, R=W=2 we tolerate one loss but not two. Page on this.
Hot key	One vnode at 100% CPU; others idle	Detect by per-key QPS percentile; promote the hot key to in-process cache on the coordinator; or split the key across vnodes (rare; usually a product fix).
Hint-queue overflow	Disk filling on a node holding many hints for a long-down peer	Force-drop hints, fall back to AAE on peer return. Trigger AAE earlier when hint queue > threshold.
Anti-entropy bandwidth blowup	Replica catch-up saturating cross-AZ link	Rate-limit AAE per peer; throttle during peak traffic windows.

10 · Cost & SLOs

Line	Estimate	Note
Compute (32 nodes, 8 vCPU, 32 GB RAM)	~$8K / month	m6i.2xlarge equivalents, reserved
Storage (32 × 500 GB NVMe)	~$2K / month	Headroom for 4× growth
Inter-AZ traffic	~$3K / month	Replication + AAE; the main hidden cost
Cross-region replication	~$4K / month	Async; rate-shaped
Total	~$17K / month per region	Two regions: ~$34K. About $0.34 per million ops.

SLOs

Read availability: 99.999%. 5 minutes/year. Multi-AZ + region failover make this achievable.
Write availability: 99.99%. ~52 min/year. Slightly tighter because writes need quorum.
Read P99 in-region: 10 ms. Three replicas, network + RocksDB miss is the worst case.
Write P99 in-region: 50 ms. W=2 acks; the slowest of the two.
Eventual-consistency window: P99 ≤ 1 s in-region, ≤ 10 s cross-region. Tracked via injected probe writes.

11 · Trade-offs & "what would you change at 10×"

If…	Then…
10× data (40 TB)	Add nodes (vnodes redistribute automatically). Move from m6i.2xlarge to i4i.2xlarge for direct-attached NVMe.
Strict consistency required	Switch to a Raft-per-shard design (CockroachDB-style). You give up some availability under partition for linearizable reads.
Active-active multi-region	Already true here. The CRDT crowd would push toward last-writer-wins or operational transforms; vector clocks + sibling resolution is the more honest answer.
Range scans needed	Different design. Layer a B-tree on top (Bigtable) or move to a partitioned-by-row-key system (HBase, FoundationDB).
Strong global ordering needed	Spanner-style with TrueTime and 2PC over Raft groups. ~10× the cost; rarely the right answer.
"What would a more senior answer add?"	The operations story: rolling upgrades without downtime, secondary-index repair, repair throttling, the disaster-recovery runbook. The senior answer "it self-heals" is true at small scale; at 1000+ nodes the operations are a full sub-system.

A distributed key-value store

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

Endpoints

On-disk layout, per node

4 · High-level architecture

5 · The hard part — placement, replication, quorum

Consistent hashing with vnodes

Sloppy quorum: R + W > N

6 · Vector clocks and conflict resolution

7 · Hinted handoff and anti-entropy

Hinted handoff (short-term)

Anti-entropy with Merkle trees (long-term)

8 · Membership and gossip

9 · Failure modes & runbook

10 · Cost & SLOs

SLOs

11 · Trade-offs & "what would you change at 10×"

Further reading

News feed