How do I learn distributed systems?

Read Designing Data-Intensive Applications (Kleppmann) for the systems-engineer view. Then watch Robert Morris's MIT 6.824 lectures (free on YouTube) and do the labs — implementing Raft from scratch is the formative exercise. Read Jepsen reports (jepsen.io) to see how real systems fail. Most engineers report 2-3 years of distributed-systems work before the canonical algorithms feel intuitive.

What is the most important paper in distributed systems?

Three foundational ones: (1) Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" (1978) — the logical clocks paper; (2) Lynch and Fischer's "Impossibility of Distributed Consensus with One Faulty Process" (1985) — the FLP impossibility; (3) Gilbert and Lynch's "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002) — the formal CAP proof.

What is consensus and why is it hard?

Consensus: a group of nodes agreeing on a single value, even when some nodes fail or messages are lost. It's hard because the network is asynchronous and unreliable — there's no shared clock, no guarantee that messages arrive, and any node can crash. Paxos (Lamport, 1989) was the first proven correct algorithm; Raft (2014) is the modern alternative, designed for understandability. Both require a strict majority (quorum) of nodes to commit any value.

Distributed Systems — Consensus, Consistency, CAP, Replication

Why distributed systems exist.

One machine has finite RAM, a finite disk, a finite CPU, and a finite blast radius. Once your data won't fit, your throughput outgrows it, or your business can't survive its failure, you spread the work across many machines. The moment you do, four new problems show up: latency (the speed of light is now in your error budget), partial failure (some nodes crash while others run), time (no two clocks agree), and scale (every coordination cost grows with N).

Distributed systems is the craft of paying these costs honestly. The classic algorithms (Paxos, Raft, vector clocks, consistent hashing, quorum reads) are each a careful account of which guarantees you keep when which assumptions break. The classic papers are the receipts.

When not to distribute. Most workloads fit on one machine. Modern hardware: 1 TB of RAM, 24 TB of NVMe, 128 cores. If your data fits, your latency budget allows it, and your blast radius is acceptable, don't distribute. The operational tax of consensus, replication, partitioning, and observability is huge.

The twelve mental models you must build.

Twelve concepts cover ~95% of distributed-systems surface. Get them into your bones in the first month. Every protocol you meet (Paxos, Raft, Dynamo, Spanner, Kafka) is a recombination of them.

01 Replication Day-zero

The same data lives on multiple nodes. The price you pay for availability and durability; the source of every consistency question that follows.

02 Partitioning / sharding Day-zero

Split the dataset across nodes by key. Range, hash, or consistent-hash. Once a system has more than one machine, this is unavoidable.

03 Consensus Practitioner

Multiple nodes agree on a value despite failures. Paxos, Raft, Zab — different surfaces, same guarantee. The protocol underneath every coordinator.

04 Time & clocks Practitioner

No global clock. Lamport timestamps, vector clocks, hybrid logical clocks, Spanner TrueTime — different ways of pretending one exists.

05 Failure detection Operator

When is a node "dead"? Heartbeats, gossip, phi-accrual detectors. The honest answer is "you cannot know" — only "I have not heard from it lately".

06 Idempotence Day-zero

An operation safe to run more than once. The single most useful property in distributed code — every retry, every queue, every API depends on it.

07 CAP Practitioner

Under network partition, choose consistency or availability — not both. Important to know, often misquoted; PACELC is the better framing.

08 Eventual consistency Practitioner

All replicas converge to the same state, given no new writes and a finite delay. AP systems live here; understanding the conflict-resolution rule is the whole job.

09 Quorum Practitioner

Majority subsets that overlap. With W + R > N, every read sees the latest write. The math behind Dynamo, Cassandra, and every modern KV.

10 Leader election Operator

Exactly one leader, despite failures and partitions. ZooKeeper, etcd, Consul; under the hood, always a consensus protocol with a stable lease.

11 Write-ahead log Practitioner

Append-only durability primitive. Every database, every replication protocol, every event-driven architecture has one underneath. Read it before reading anything else.

12 CRDTs Researcher

Conflict-free replicated data types. Algebraic structures (semilattices) where merge is associative, commutative, idempotent — convergence without coordination.

Day-zero — your first hour.

One hour. Open Designing Data-Intensive Applications and read the first two chapters. Then download MIT 6.824 Lab 1 and run the test use. The bar is simple: you have read about distributed systems and run code that calls itself one.

# 1. Read DDIA chapters 1–2 (≈ 90 minutes)
#    — Reliable, scalable, maintainable applications
#    — Data models and query languages

# 2. Clone MIT 6.824 (the labs in Go)
git clone git://g.csail.mit.edu/6.5840-golabs-2024 6.824
cd 6.824/src/main

# 3. Run the smoke test for Lab 1 (sequential <a class="il" href="/papers/mapreduce">MapReduce</a>)
go build -buildmode=plugin ../mrapps/wc.go
rm mr-out*
go run mrsequential.go wc.so pg*.txt
sort -k1,1 mr-out-0 | head

# 4. Read the lab handout — about an hour
#    https://pdos.csail.mit.edu/6.824/labs/lab-mr.html

# 5. (Optional, satisfying) install TLA+ Toolbox
#    https://lamport.azurewebsites.net/tla/toolbox.html
#    Open the bundled Die Hard / Hour Clock specs
#    Run TLC. You have now model-checked something.

Done. You have read the right two chapters, run the right starter project, and seen the shape of model checking. Everything below builds on this start.

Week-1 to Month-3 — pick a track.

After the first hour you can read distributed-systems writing without bouncing off it. Spend the next three months one track at a time, depth-first. Don't try to learn consensus and TLA+ in the same two weeks. Pick the track that maps to your job and finish it.

Consistency models

Linearizable, sequential, causal, eventual. Jepsen consistency hierarchy. Why "Serializable" in your SQL database is almost certainly Read Committed.

→ Reference

Consensus protocols

Paxos (read it twice). Raft (read it once — Ongaro's thesis is the right artefact). Zab. Implement Raft in Go via MIT 6.824 Lab 2.

→ Reference

Time & ordering

Lamport clocks. Vector clocks. Hybrid logical clocks. Spanner TrueTime. Each builds a different "happens-before" out of unreliable hardware.

→ Reference

Failure semantics

Crash-stop, crash-recovery, omission, Byzantine. The fault model is the contract — never debug a system without naming yours first.

→ Reference

TLA+ & specification

Lamport's spec language. Model-check the protocol before coding it. AWS, Microsoft, MongoDB use this on real systems — not just academia.

→ Reference

Storage internals

WAL, B-trees, LSM-trees, Bloom filters, MVCC, snapshot isolation. Petrov's Database Internals is the right book; pair with Bigtable + Spanner papers.

→ Semicolony asset

Operational patterns

Circuit breakers, bulkheads, retries with backoff and jitter, hedged requests, load shedding. Read every Aphyr post; subscribe to Brendan Gregg's blog.

→ Semicolony asset

The books that matter.

2017 · O'Reilly

Martin Kleppmann — Designing Data-Intensive Applications

DDIA. The book. The narrative thread connecting storage, replication, partitioning, transactions, and consistency. Read it cover to cover.

2019 · O'Reilly

Alex Petrov — Database Internals

The companion to DDIA from the storage side. B-trees, LSM-trees, the storage-engine layer DDIA gestures at; Petrov works through it.

2013 · free online

Mikito Takada — Distributed Systems for Fun and Profit

Concise, free, and surprisingly complete. The book to hand someone before they start DDIA.

2017 · Pearson

Tanenbaum & Van Steen — Distributed Systems: Principles and Paradigms

The textbook. Drier than DDIA, broader than Petrov. Useful as a reference; the formal definitions are precise.

2018 · O'Reilly

Brendan Burns — Designing Distributed Systems

Patterns book from the Kubernetes co-creator. Sidecars, ambassadors, work queues — patterns at the deployment level rather than the algorithm level.

2016 · O'Reilly

Beyer et al — Site Reliability Engineering

Google's SRE book. Free online. Operational philosophy plus the practices that survive contact with planet-scale traffic.

2015 · Apress

Lakshman & Khrabrov — Practical Cassandra

A representative ops-focused book — pick one Dynamo-style database and read deeply. The lessons generalise.

Honourable mentions: Designing for Scalability with Erlang/OTP (Cesarini/Vinoski), Data Pipelines Pocket Reference (Densmore), Database Reliability Engineering (Campbell/Majors).

Courses, labs, and where to actually learn.

Free

Paid (worth it)

The paper canon.

The fourteen papers below trace the field from 1978 to 2014. Read them roughly in order. The later ones quote the earlier ones constantly. Most are 8–20 pages.

01
1978 · Lamport
Time, Clocks, and the Ordering of Events in a Distributed System
The paper. Logical clocks, the happens-before relation, distributed mutual exclusion. Foundational.
02
1982 · Lamport, Shostak, Pease
The Byzantine Generals Problem
Agreement when nodes can lie. Lower bound: 3f+1 nodes for f Byzantine failures.
03
1985 · Fischer, Lynch, Paterson
Impossibility of Distributed Consensus with One Faulty Process (FLP)
In a fully asynchronous system, no deterministic protocol solves consensus with one crash failure. The hard ceiling.
04
1998 · Lamport
The Part-Time Parliament
Original Paxos. Famously inscrutable. Read it once for completeness, then read Paxos Made Simple.
05
2000 · Brewer
Towards Robust Distributed Systems
The PODC keynote where CAP was first stated as a conjecture. Two pages; everyone refers to it.
06
2001 · Lamport
Paxos Made Simple
The teachable Paxos. Read this twice; the original once.
07
2003 · Ghemawat, Gobioff, Leung
The Google File System
GFS. The cluster-scale-storage paper that started the wave. Master + chunkservers; primary–secondary chunk replication.
08
2004 · Dean & Ghemawat
MapReduce: Simplified Data Processing on Large Clusters
Map and reduce, distributed across thousands of machines. Hadoop is the open-source descendant; Spark is the descendant of Hadoop.
09
2006 · Chang et al
Bigtable: A Distributed Storage System for Structured Data
Wide-column store on top of GFS + Chubby. The grandparent of HBase, Cassandra, and DynamoDB.
10
2007 · DeCandia et al
Dynamo: Amazon's Highly Available Key-value Store
Consistent hashing + vector clocks + sloppy quorums + hinted handoff. The paper every NoSQL system reaches back to.
11
2007 · Helland
Life Beyond Distributed Transactions
An apostate's opinion. "Don't do distributed transactions. Use entities, idempotence, and message-passing instead." Aged extremely well.
12
2010 · Hunt, Konar, Junqueira, Reed
ZooKeeper: Wait-free coordination for Internet-scale systems
A coordination kernel — leader election, configuration, distributed locks — built on Zab consensus. Still the reference design.
13
2012 · Corbett et al
Spanner: Google's Globally-Distributed Database
Externally consistent transactions over a planet-scale database via TrueTime — bounded clock uncertainty backed by GPS + atomic clocks.
14
2014 · Ongaro & Ousterhout
In Search of an Understandable Consensus Algorithm (Raft)
Consensus designed for teachability. The reason every Go database written after 2014 has a Raft library.

Going further: A Comprehensive Study of Convergent and Commutative Replicated Data Types (Shapiro et al, 2011 — the CRDT paper); Calvin: Fast Distributed Transactions for Partitioned Database Systems (Thomson et al, 2012); the Bigtable: A Decade Later retrospective; and every Aphyr Jepsen analysis — long-form reports on what real databases do (and do not) under partition.

Talks worth your evening.

Hands-on environments — where to actually run things.

Theory without runnable code is fragile. Each of these is a manageable way to make distributed systems push back when you make a mistake.

Environment	Cost	Best for
TLA+	Free	Specifying and model-checking protocols. AWS, Microsoft, MongoDB use it on real systems. Lamport's video course is the right starting point.
MIT 6.824 labs	Free	Implement MapReduce, Raft, fault-tolerant KV, sharded KV — in Go. The single best graded experience of building distributed systems from scratch.
Maelstrom	Free	Aphyr's local distributed-systems test use. Build a node binary, run it under simulated partition / latency / message loss. Perfect for solo learners.
Jepsen	Free, open-source	The real thing. Spin up a real cluster, hammer it under chaos, check for consistency violations. Steeper than Maelstrom; closer to what production failure looks like.
FoundationDB simulation	Free	Read the open-source FoundationDB code. Their deterministic simulation framework is one-of-a-kind — every test runs millions of fault-injected scenarios deterministically. The reason FDB rarely loses data.

One-card cheat sheet.

Print it, tape it to the monitor. Ten lines that carry ninety percent of the daily mental load.

Quorum math	W + R > N ⇒ every read overlaps the latest write (strong consistency over a quorum store).
CAP	Under network partition, you choose Consistency or Availability — not both. PACELC adds: Else, Latency vs Consistency.
FLP	In a fully async system, no deterministic protocol solves consensus with even one crash. Real systems escape via timeouts (partial synchrony).
Paxos / Raft fault tolerance	2f + 1 nodes tolerate f simultaneous failures. 3 nodes → 1 failure; 5 → 2.
Byzantine fault tolerance	3f + 1 nodes tolerate f Byzantine (lying) failures. PBFT, Tendermint.
Vector clock comparison	a < b iff every component a[i] ≤ b[i] AND at least one strict. Otherwise concurrent.
Raft phases	Leader election → log replication → commit on majority. One leader at a time, monotonic terms.
Spanner TrueTime	External consistency via bounded clock uncertainty (epsilon ≈ 7ms). Commit-wait until t + 2ε.
Lamport timestamp	Per-node monotonic counter. On send: increment. On receive: max(local, msg) + 1. Total order via tie-break by node ID.
Eight fallacies	Network is reliable; latency is zero; bandwidth is infinite; topology never changes; one admin; transport cost zero; network is homogeneous; secure by default. All wrong.

Common mistakes that ship to production.

Patterns every team writes at least once. Read these now, so you recognise the shape later, when something on-call is misbehaving and the dashboard is no help.

Wall clocks for ordering: NTP steps backwards. Leap seconds happen. VMs pause for seconds at a time. Never order events by Date.now() — use logical clocks, vector clocks, or HLCs.
Timeout = failure: Slow is not the same as dead. A partition is not the same as a crash. The kindest assumption — "node X has not responded yet" — is also the only true one.
Eventual consistency without conflict resolution: Eventually means "convergence to some value". You still need to define which value wins on a concurrent write. Last-Write-Wins on free-form data loses writes silently.
"Exactly-once" delivery: It does not exist over a network with failures. The two attainable contracts are at-most-once and at-least-once. Exactly-once is at-least-once + idempotent processing.
Distributed locks without fencing tokens: A client takes the lock, GCs for 30 seconds, the lock expires, another client takes it — and the first client wakes up still believing it holds the lock. Fence every lock-protected write.
Quorum reads but no quorum writes (or vice versa): Asymmetric quorums break the W + R > N invariant. Either commit to strong consistency end to end or admit you are eventually consistent and design around it.
Last-write-wins on values that don't commute: LWW is fine for "current temperature". It is catastrophic for "set of friends" or "shopping cart". Use a CRDT, or a per-write merge function.
Calling SQL "Serializable" when default is Read Committed: Postgres, MySQL, Oracle all default below Serializable. Half the bugs in distributed apps are anomalies the dev assumed could not happen at the level they didn't set.
Believing the eight fallacies: You can recite them and still write code that assumes the network is reliable. Re-read the list each time you ship a new RPC client.
Reaching for consensus when single-leader replication would do: Paxos is operationally expensive. If your workload is "writes go to one node, reads can be stale-tolerant", you do not need it. Use it when you actually need agreement.

Practice deck.

Ten cards: the questions interviewers ask, the things that bite operators in production, and the details that separate "I use distributed systems" from "I understand them".

Card 1 of 10

A node has not heard from another for 5 seconds. Is the other node dead?

Suggested sequences

Reading progressions

Three ordered paths through this material. Pick the one that matches where you are.

Path 01 · Core

Consensus & replication

The foundations: how distributed nodes agree, and how data is replicated safely.

Path 02 · Storage

Storage internals

From B-trees to LSM trees: the structures that keep databases fast and durable at scale.

Path 03 · Reliability

Failure & recovery patterns

The operational patterns that keep production systems alive: backpressure, retries, circuit breakers.

Keep going.

Distributed systems rewards reading and re-reading. The same Lamport paper, read on day 30 and again on day 300, gives you different things. So will DDIA. So will every Jepsen post. The field is not large. It is dense.

Pick one real system and read its source for an afternoon. etcd, FoundationDB, CockroachDB, TiKV, NATS, Kafka: all open. Pair what you read with the paper that inspired it. Then come back to your own service and re-read your own retry, timeout, and replication code. You will rewrite some of it.

→ Semicolony's paper index → Continue with the Semicolony Handbook → The people behind these papers → Other study paths

Distributed systems,learned properly.

Why distributed systems exist.

The twelve mental models you must build.

Day-zero — your first hour.

Week-1 to Month-3 — pick a track.

The books that matter.

Courses, labs, and where to actually learn.

The paper canon.

Talks worth your evening.

Hands-on environments — where to actually run things.

One-card cheat sheet.

Common mistakes that ship to production.

Practice deck.

Reading progressions

Keep going.

Distributed systems,
learned properly.