Why distributed systems exist.
One machine has finite RAM, a finite disk, a finite CPU, and a finite blast radius. Once your data won't fit, your throughput outgrows it, or your business can't survive its failure, you spread the work across many machines. The moment you do, four new problems show up: latency (the speed of light is now in your error budget), partial failure (some nodes crash while others run), time (no two clocks agree), and scale (every coordination cost grows with N).
Distributed systems is the craft of paying these costs honestly. The classic algorithms (Paxos, Raft, vector clocks, consistent hashing, quorum reads) are each a careful account of which guarantees you keep when which assumptions break. The classic papers are the receipts.
When not to distribute. Most workloads fit on one machine. Modern hardware: 1 TB of RAM, 24 TB of NVMe, 128 cores. If your data fits, your latency budget allows it, and your blast radius is acceptable, don't distribute. The operational tax of consensus, replication, partitioning, and observability is huge.
The twelve mental models you must build.
Twelve concepts cover ~95% of distributed-systems surface. Get them into your bones in the first month. Every protocol you meet (Paxos, Raft, Dynamo, Spanner, Kafka) is a recombination of them.
The same data lives on multiple nodes. The price you pay for availability and durability; the source of every consistency question that follows.
02 Partitioning / sharding Day-zeroSplit the dataset across nodes by key. Range, hash, or consistent-hash. Once a system has more than one machine, this is unavoidable.
03 Consensus PractitionerMultiple nodes agree on a value despite failures. Paxos, Raft, Zab — different surfaces, same guarantee. The protocol underneath every coordinator.
04 Time & clocks PractitionerNo global clock. Lamport timestamps, vector clocks, hybrid logical clocks, Spanner TrueTime — different ways of pretending one exists.
05 Failure detection OperatorWhen is a node "dead"? Heartbeats, gossip, phi-accrual detectors. The honest answer is "you cannot know" — only "I have not heard from it lately".
06 Idempotence Day-zeroAn operation safe to run more than once. The single most useful property in distributed code — every retry, every queue, every API depends on it.
07 CAP PractitionerUnder network partition, choose consistency or availability — not both. Important to know, often misquoted; PACELC is the better framing.
08 Eventual consistency PractitionerAll replicas converge to the same state, given no new writes and a finite delay. AP systems live here; understanding the conflict-resolution rule is the whole job.
09 Quorum PractitionerMajority subsets that overlap. With W + R > N, every read sees the latest write. The math behind Dynamo, Cassandra, and every modern KV.
10 Leader election OperatorExactly one leader, despite failures and partitions. ZooKeeper, etcd, Consul; under the hood, always a consensus protocol with a stable lease.
11 Write-ahead log PractitionerAppend-only durability primitive. Every database, every replication protocol, every event-driven architecture has one underneath. Read it before reading anything else.
12 CRDTs ResearcherConflict-free replicated data types. Algebraic structures (semilattices) where merge is associative, commutative, idempotent — convergence without coordination.
Day-zero — your first hour.
One hour. Open Designing Data-Intensive Applications and read the first two chapters. Then download MIT 6.824 Lab 1 and run the test use. The bar is simple: you have read about distributed systems and run code that calls itself one.
# 1. Read DDIA chapters 1–2 (≈ 90 minutes) # — Reliable, scalable, maintainable applications # — Data models and query languages # 2. Clone MIT 6.824 (the labs in Go) git clone git://g.csail.mit.edu/6.5840-golabs-2024 6.824 cd 6.824/src/main # 3. Run the smoke test for Lab 1 (sequential <a class="il" href="/papers/mapreduce">MapReduce</a>) go build -buildmode=plugin ../mrapps/wc.go rm mr-out* go run mrsequential.go wc.so pg*.txt sort -k1,1 mr-out-0 | head # 4. Read the lab handout — about an hour # https://pdos.csail.mit.edu/6.824/labs/lab-mr.html # 5. (Optional, satisfying) install TLA+ Toolbox # https://lamport.azurewebsites.net/tla/toolbox.html # Open the bundled Die Hard / Hour Clock specs # Run TLC. You have now model-checked something.
Done. You have read the right two chapters, run the right starter project, and seen the shape of model checking. Everything below builds on this start.
Week-1 to Month-3 — pick a track.
After the first hour you can read distributed-systems writing without bouncing off it. Spend the next three months one track at a time, depth-first. Don't try to learn consensus and TLA+ in the same two weeks. Pick the track that maps to your job and finish it.
Linearizable, sequential, causal, eventual. Jepsen consistency hierarchy. Why "Serializable" in your SQL database is almost certainly Read Committed.
→ ReferencePaxos (read it twice). Raft (read it once — Ongaro's thesis is the right artefact). Zab. Implement Raft in Go via MIT 6.824 Lab 2.
→ ReferenceLamport clocks. Vector clocks. Hybrid logical clocks. Spanner TrueTime. Each builds a different "happens-before" out of unreliable hardware.
→ ReferenceCrash-stop, crash-recovery, omission, Byzantine. The fault model is the contract — never debug a system without naming yours first.
→ ReferenceLamport's spec language. Model-check the protocol before coding it. AWS, Microsoft, MongoDB use this on real systems — not just academia.
→ ReferenceWAL, B-trees, LSM-trees, Bloom filters, MVCC, snapshot isolation. Petrov's Database Internals is the right book; pair with Bigtable + Spanner papers.
→ Semicolony assetCircuit breakers, bulkheads, retries with backoff and jitter, hedged requests, load shedding. Read every Aphyr post; subscribe to Brendan Gregg's blog.
→ Semicolony assetThe books that matter.
DDIA. The book. The narrative thread connecting storage, replication, partitioning, transactions, and consistency. Read it cover to cover.
The companion to DDIA from the storage side. B-trees, LSM-trees, the storage-engine layer DDIA gestures at; Petrov works through it.
Concise, free, and surprisingly complete. The book to hand someone before they start DDIA.
The textbook. Drier than DDIA, broader than Petrov. Useful as a reference; the formal definitions are precise.
Patterns book from the Kubernetes co-creator. Sidecars, ambassadors, work queues — patterns at the deployment level rather than the algorithm level.
Google's SRE book. Free online. Operational philosophy plus the practices that survive contact with planet-scale traffic.
A representative ops-focused book — pick one Dynamo-style database and read deeply. The lessons generalise.
Honourable mentions: Designing for Scalability with Erlang/OTP (Cesarini/Vinoski), Data Pipelines Pocket Reference (Densmore), Database Reliability Engineering (Campbell/Majors).
Courses, labs, and where to actually learn.
- MIT · 6.824 / 6.5840Distributed SystemsRobert Morris's lectures plus four labs in Go: MapReduce, Raft, fault-tolerant KV, sharded KV. Free online — the gold-standard self-study path.
- Cambridge · Martin KleppmannDistributed Systems lecture seriesEight lectures from the author of DDIA. Free on YouTube; lecture notes downloadable. The cleanest narrative of the field on video.
- Carnegie Mellon · 15-440Distributed Systems (CMU)Public syllabus, slide decks, and project descriptions. Heavier on the operating-systems side than 6.824; complementary.
- Stanford · CS244BDistributed Systems (Stanford)Ousterhout's course. The "Raft from the source" experience — Ongaro and Ousterhout designed Raft on this whiteboard.
- Bradfield CSDistributed SystemsCohort-based, paper-driven. Pricey; the curated reading list is the value, the small-group discussion is the multiplier.
- EducativeGrokking the System Design InterviewLess academic, more "build a URL shortener / Twitter / Uber". The interview-prep angle on the same material.
The paper canon.
The fourteen papers below trace the field from 1978 to 2014. Read them roughly in order. The later ones quote the earlier ones constantly. Most are 8–20 pages.
- 01 Time, Clocks, and the Ordering of Events in a Distributed System
The paper. Logical clocks, the happens-before relation, distributed mutual exclusion. Foundational.
- 02 The Byzantine Generals Problem
Agreement when nodes can lie. Lower bound: 3f+1 nodes for f Byzantine failures.
- 03 Impossibility of Distributed Consensus with One Faulty Process (FLP)
In a fully asynchronous system, no deterministic protocol solves consensus with one crash failure. The hard ceiling.
- 04 The Part-Time Parliament
Original Paxos. Famously inscrutable. Read it once for completeness, then read Paxos Made Simple.
- 05 Towards Robust Distributed Systems
The PODC keynote where CAP was first stated as a conjecture. Two pages; everyone refers to it.
- 06 Paxos Made Simple
The teachable Paxos. Read this twice; the original once.
- 07 The Google File System
GFS. The cluster-scale-storage paper that started the wave. Master + chunkservers; primary–secondary chunk replication.
- 08 MapReduce: Simplified Data Processing on Large Clusters
Map and reduce, distributed across thousands of machines. Hadoop is the open-source descendant; Spark is the descendant of Hadoop.
- 09 Bigtable: A Distributed Storage System for Structured Data
Wide-column store on top of GFS + Chubby. The grandparent of HBase, Cassandra, and DynamoDB.
- 10 Dynamo: Amazon's Highly Available Key-value Store
Consistent hashing + vector clocks + sloppy quorums + hinted handoff. The paper every NoSQL system reaches back to.
- 11 Life Beyond Distributed Transactions
An apostate's opinion. "Don't do distributed transactions. Use entities, idempotence, and message-passing instead." Aged extremely well.
- 12 ZooKeeper: Wait-free coordination for Internet-scale systems
A coordination kernel — leader election, configuration, distributed locks — built on Zab consensus. Still the reference design.
- 13 Spanner: Google's Globally-Distributed Database
Externally consistent transactions over a planet-scale database via TrueTime — bounded clock uncertainty backed by GPS + atomic clocks.
- 14 In Search of an Understandable Consensus Algorithm (Raft)
Consensus designed for teachability. The reason every Go database written after 2014 has a Raft library.
Going further: A Comprehensive Study of Convergent and Commutative Replicated Data Types (Shapiro et al, 2011 — the CRDT paper); Calvin: Fast Distributed Transactions for Partitioned Database Systems (Thomson et al, 2012); the Bigtable: A Decade Later retrospective; and every Aphyr Jepsen analysis — long-form reports on what real databases do (and do not) under partition.
Talks worth your evening.
- Leslie Lamport · Microsoft ResearchThe TLA+ video courseLamport teaches his own specification language. Long-form, rigorous, and free. The right way to start formal methods.
- Kyle Kingsbury · QConJepsen — partition-testing real databasesAphyr's QCon talks. Funny, technical, and devastating. Watch one to internalise why your database probably loses data.
- Martin Kleppmann · CambridgeDistributed systems lecture seriesEight lectures, freely available. The "I read DDIA but I want it explained on a whiteboard" experience.
- Pat HellandLife Beyond Distributed TransactionsPat Helland is the most underrated voice in the field. The talk version of the paper. Watch with notebook open.
- Eric Brewer · 2012CAP Twelve Years Later — How the Rules Have ChangedBrewer's own retrospective on CAP. "The 2-of-3 framing is misleading; the real choice is per-operation." Five-minute read.
- Various · YouTube"What I Wish I Knew When I Was Learning Distributed Systems"A genre. Multiple talks under that title. The Denise Yu and Jordan West variants are particularly good.
Hands-on environments — where to actually run things.
Theory without runnable code is fragile. Each of these is a manageable way to make distributed systems push back when you make a mistake.
| Environment | Cost | Best for |
|---|---|---|
| TLA+ | Free | Specifying and model-checking protocols. AWS, Microsoft, MongoDB use it on real systems. Lamport's video course is the right starting point. |
| MIT 6.824 labs | Free | Implement MapReduce, Raft, fault-tolerant KV, sharded KV — in Go. The single best graded experience of building distributed systems from scratch. |
| Maelstrom | Free | Aphyr's local distributed-systems test use. Build a node binary, run it under simulated partition / latency / message loss. Perfect for solo learners. |
| Jepsen | Free, open-source | The real thing. Spin up a real cluster, hammer it under chaos, check for consistency violations. Steeper than Maelstrom; closer to what production failure looks like. |
| FoundationDB simulation | Free | Read the open-source FoundationDB code. Their deterministic simulation framework is one-of-a-kind — every test runs millions of fault-injected scenarios deterministically. The reason FDB rarely loses data. |
One-card cheat sheet.
Print it, tape it to the monitor. Ten lines that carry ninety percent of the daily mental load.
| Quorum math | W + R > N ⇒ every read overlaps the latest write (strong consistency over a quorum store). |
| CAP | Under network partition, you choose Consistency or Availability — not both. PACELC adds: Else, Latency vs Consistency. |
| FLP | In a fully async system, no deterministic protocol solves consensus with even one crash. Real systems escape via timeouts (partial synchrony). |
| Paxos / Raft fault tolerance | 2f + 1 nodes tolerate f simultaneous failures. 3 nodes → 1 failure; 5 → 2. |
| Byzantine fault tolerance | 3f + 1 nodes tolerate f Byzantine (lying) failures. PBFT, Tendermint. |
| Vector clock comparison | a < b iff every component a[i] ≤ b[i] AND at least one strict. Otherwise concurrent. |
| Raft phases | Leader election → log replication → commit on majority. One leader at a time, monotonic terms. |
| Spanner TrueTime | External consistency via bounded clock uncertainty (epsilon ≈ 7ms). Commit-wait until t + 2ε. |
| Lamport timestamp | Per-node monotonic counter. On send: increment. On receive: max(local, msg) + 1. Total order via tie-break by node ID. |
| Eight fallacies | Network is reliable; latency is zero; bandwidth is infinite; topology never changes; one admin; transport cost zero; network is homogeneous; secure by default. All wrong. |
Common mistakes that ship to production.
Patterns every team writes at least once. Read these now, so you recognise the shape later, when something on-call is misbehaving and the dashboard is no help.
- Wall clocks for ordering
- NTP steps backwards. Leap seconds happen. VMs pause for seconds at a time. Never order events by Date.now() — use logical clocks, vector clocks, or HLCs.
- Timeout = failure
- Slow is not the same as dead. A partition is not the same as a crash. The kindest assumption — "node X has not responded yet" — is also the only true one.
- Eventual consistency without conflict resolution
- Eventually means "convergence to some value". You still need to define which value wins on a concurrent write. Last-Write-Wins on free-form data loses writes silently.
- "Exactly-once" delivery
- It does not exist over a network with failures. The two attainable contracts are at-most-once and at-least-once. Exactly-once is at-least-once + idempotent processing.
- Distributed locks without fencing tokens
- A client takes the lock, GCs for 30 seconds, the lock expires, another client takes it — and the first client wakes up still believing it holds the lock. Fence every lock-protected write.
- Quorum reads but no quorum writes (or vice versa)
- Asymmetric quorums break the W + R > N invariant. Either commit to strong consistency end to end or admit you are eventually consistent and design around it.
- Last-write-wins on values that don't commute
- LWW is fine for "current temperature". It is catastrophic for "set of friends" or "shopping cart". Use a CRDT, or a per-write merge function.
- Calling SQL "Serializable" when default is Read Committed
- Postgres, MySQL, Oracle all default below Serializable. Half the bugs in distributed apps are anomalies the dev assumed could not happen at the level they didn't set.
- Believing the eight fallacies
- You can recite them and still write code that assumes the network is reliable. Re-read the list each time you ship a new RPC client.
- Reaching for consensus when single-leader replication would do
- Paxos is operationally expensive. If your workload is "writes go to one node, reads can be stale-tolerant", you do not need it. Use it when you actually need agreement.
Practice deck.
Ten cards: the questions interviewers ask, the things that bite operators in production, and the details that separate "I use distributed systems" from "I understand them".
Reading progressions
Three ordered paths through this material. Pick the one that matches where you are.
The foundations: how distributed nodes agree, and how data is replicated safely.
From B-trees to LSM trees: the structures that keep databases fast and durable at scale.
The operational patterns that keep production systems alive: backpressure, retries, circuit breakers.
Keep going.
Distributed systems rewards reading and re-reading. The same Lamport paper, read on day 30 and again on day 300, gives you different things. So will DDIA. So will every Jepsen post. The field is not large. It is dense.
Pick one real system and read its source for an afternoon. etcd, FoundationDB, CockroachDB, TiKV, NATS, Kafka: all open. Pair what you read with the paper that inspired it. Then come back to your own service and re-read your own retry, timeout, and replication code. You will rewrite some of it.