Day-0 → Month-3 · curriculum
Study path · Distributed systems

Distributed systems,
learned properly.

One pass through the mental models, one through the paper canon, one through the labs. Replication, partitioning, consensus, time, failure: five ideas that, once they click, explain almost every database, queue, and coordination kernel you will ever touch.


Why distributed systems exist.

One machine has finite RAM, a finite disk, a finite CPU, and a finite blast radius. Once your data won't fit, your throughput outgrows it, or your business can't survive its failure, you spread the work across many machines. The moment you do, four new problems show up: latency (the speed of light is now in your error budget), partial failure (some nodes crash while others run), time (no two clocks agree), and scale (every coordination cost grows with N).

Distributed systems is the craft of paying these costs honestly. The classic algorithms (Paxos, Raft, vector clocks, consistent hashing, quorum reads) are each a careful account of which guarantees you keep when which assumptions break. The classic papers are the receipts.

When not to distribute. Most workloads fit on one machine. Modern hardware: 1 TB of RAM, 24 TB of NVMe, 128 cores. If your data fits, your latency budget allows it, and your blast radius is acceptable, don't distribute. The operational tax of consensus, replication, partitioning, and observability is huge.

The twelve mental models you must build.

Twelve concepts cover ~95% of distributed-systems surface. Get them into your bones in the first month. Every protocol you meet (Paxos, Raft, Dynamo, Spanner, Kafka) is a recombination of them.

01 Replication Day-zero

The same data lives on multiple nodes. The price you pay for availability and durability; the source of every consistency question that follows.

02 Partitioning / sharding Day-zero

Split the dataset across nodes by key. Range, hash, or consistent-hash. Once a system has more than one machine, this is unavoidable.

03 Consensus Practitioner

Multiple nodes agree on a value despite failures. Paxos, Raft, Zab — different surfaces, same guarantee. The protocol underneath every coordinator.

04 Time & clocks Practitioner

No global clock. Lamport timestamps, vector clocks, hybrid logical clocks, Spanner TrueTime — different ways of pretending one exists.

05 Failure detection Operator

When is a node "dead"? Heartbeats, gossip, phi-accrual detectors. The honest answer is "you cannot know" — only "I have not heard from it lately".

06 Idempotence Day-zero

An operation safe to run more than once. The single most useful property in distributed code — every retry, every queue, every API depends on it.

07 CAP Practitioner

Under network partition, choose consistency or availability — not both. Important to know, often misquoted; PACELC is the better framing.

08 Eventual consistency Practitioner

All replicas converge to the same state, given no new writes and a finite delay. AP systems live here; understanding the conflict-resolution rule is the whole job.

09 Quorum Practitioner

Majority subsets that overlap. With W + R > N, every read sees the latest write. The math behind Dynamo, Cassandra, and every modern KV.

10 Leader election Operator

Exactly one leader, despite failures and partitions. ZooKeeper, etcd, Consul; under the hood, always a consensus protocol with a stable lease.

11 Write-ahead log Practitioner

Append-only durability primitive. Every database, every replication protocol, every event-driven architecture has one underneath. Read it before reading anything else.

12 CRDTs Researcher

Conflict-free replicated data types. Algebraic structures (semilattices) where merge is associative, commutative, idempotent — convergence without coordination.

Day-zero — your first hour.

One hour. Open Designing Data-Intensive Applications and read the first two chapters. Then download MIT 6.824 Lab 1 and run the test use. The bar is simple: you have read about distributed systems and run code that calls itself one.

# 1. Read DDIA chapters 1–2 (≈ 90 minutes)
#    — Reliable, scalable, maintainable applications
#    — Data models and query languages

# 2. Clone MIT 6.824 (the labs in Go)
git clone git://g.csail.mit.edu/6.5840-golabs-2024 6.824
cd 6.824/src/main

# 3. Run the smoke test for Lab 1 (sequential <a class="il" href="/papers/mapreduce">MapReduce</a>)
go build -buildmode=plugin ../mrapps/wc.go
rm mr-out*
go run mrsequential.go wc.so pg*.txt
sort -k1,1 mr-out-0 | head

# 4. Read the lab handout — about an hour
#    https://pdos.csail.mit.edu/6.824/labs/lab-mr.html

# 5. (Optional, satisfying) install TLA+ Toolbox
#    https://lamport.azurewebsites.net/tla/toolbox.html
#    Open the bundled Die Hard / Hour Clock specs
#    Run TLC. You have now model-checked something.

Done. You have read the right two chapters, run the right starter project, and seen the shape of model checking. Everything below builds on this start.

Week-1 to Month-3 — pick a track.

After the first hour you can read distributed-systems writing without bouncing off it. Spend the next three months one track at a time, depth-first. Don't try to learn consensus and TLA+ in the same two weeks. Pick the track that maps to your job and finish it.

Consistency models

Linearizable, sequential, causal, eventual. Jepsen consistency hierarchy. Why "Serializable" in your SQL database is almost certainly Read Committed.

→ Reference
Consensus protocols

Paxos (read it twice). Raft (read it once — Ongaro's thesis is the right artefact). Zab. Implement Raft in Go via MIT 6.824 Lab 2.

→ Reference
Time & ordering

Lamport clocks. Vector clocks. Hybrid logical clocks. Spanner TrueTime. Each builds a different "happens-before" out of unreliable hardware.

→ Reference
Failure semantics

Crash-stop, crash-recovery, omission, Byzantine. The fault model is the contract — never debug a system without naming yours first.

→ Reference
TLA+ & specification

Lamport's spec language. Model-check the protocol before coding it. AWS, Microsoft, MongoDB use this on real systems — not just academia.

→ Reference
Storage internals

WAL, B-trees, LSM-trees, Bloom filters, MVCC, snapshot isolation. Petrov's Database Internals is the right book; pair with Bigtable + Spanner papers.

→ Semicolony asset
Operational patterns

Circuit breakers, bulkheads, retries with backoff and jitter, hedged requests, load shedding. Read every Aphyr post; subscribe to Brendan Gregg's blog.

→ Semicolony asset

The books that matter.

2017 · O'Reilly
Martin Kleppmann — Designing Data-Intensive Applications

DDIA. The book. The narrative thread connecting storage, replication, partitioning, transactions, and consistency. Read it cover to cover.

2019 · O'Reilly
Alex Petrov — Database Internals

The companion to DDIA from the storage side. B-trees, LSM-trees, the storage-engine layer DDIA gestures at; Petrov works through it.

2013 · free online
Mikito Takada — Distributed Systems for Fun and Profit

Concise, free, and surprisingly complete. The book to hand someone before they start DDIA.

2017 · Pearson
Tanenbaum & Van Steen — Distributed Systems: Principles and Paradigms

The textbook. Drier than DDIA, broader than Petrov. Useful as a reference; the formal definitions are precise.

2018 · O'Reilly
Brendan Burns — Designing Distributed Systems

Patterns book from the Kubernetes co-creator. Sidecars, ambassadors, work queues — patterns at the deployment level rather than the algorithm level.

2016 · O'Reilly
Beyer et al — Site Reliability Engineering

Google's SRE book. Free online. Operational philosophy plus the practices that survive contact with planet-scale traffic.

2015 · Apress
Lakshman & Khrabrov — Practical Cassandra

A representative ops-focused book — pick one Dynamo-style database and read deeply. The lessons generalise.

Honourable mentions: Designing for Scalability with Erlang/OTP (Cesarini/Vinoski), Data Pipelines Pocket Reference (Densmore), Database Reliability Engineering (Campbell/Majors).

Courses, labs, and where to actually learn.

Free
Paid (worth it)

The paper canon.

The fourteen papers below trace the field from 1978 to 2014. Read them roughly in order. The later ones quote the earlier ones constantly. Most are 8–20 pages.

  1. 01
    1978 · Lamport
    Time, Clocks, and the Ordering of Events in a Distributed System

    The paper. Logical clocks, the happens-before relation, distributed mutual exclusion. Foundational.

  2. 02
    1982 · Lamport, Shostak, Pease
    The Byzantine Generals Problem

    Agreement when nodes can lie. Lower bound: 3f+1 nodes for f Byzantine failures.

  3. 03
    1985 · Fischer, Lynch, Paterson
    Impossibility of Distributed Consensus with One Faulty Process (FLP)

    In a fully asynchronous system, no deterministic protocol solves consensus with one crash failure. The hard ceiling.

  4. 04
    1998 · Lamport
    The Part-Time Parliament

    Original Paxos. Famously inscrutable. Read it once for completeness, then read Paxos Made Simple.

  5. 05
    2000 · Brewer
    Towards Robust Distributed Systems

    The PODC keynote where CAP was first stated as a conjecture. Two pages; everyone refers to it.

  6. 06
    2001 · Lamport
    Paxos Made Simple

    The teachable Paxos. Read this twice; the original once.

  7. 07
    2003 · Ghemawat, Gobioff, Leung
    The Google File System

    GFS. The cluster-scale-storage paper that started the wave. Master + chunkservers; primary–secondary chunk replication.

  8. 08
    2004 · Dean & Ghemawat
    MapReduce: Simplified Data Processing on Large Clusters

    Map and reduce, distributed across thousands of machines. Hadoop is the open-source descendant; Spark is the descendant of Hadoop.

  9. 09
    2006 · Chang et al
    Bigtable: A Distributed Storage System for Structured Data

    Wide-column store on top of GFS + Chubby. The grandparent of HBase, Cassandra, and DynamoDB.

  10. 10
    2007 · DeCandia et al
    Dynamo: Amazon's Highly Available Key-value Store

    Consistent hashing + vector clocks + sloppy quorums + hinted handoff. The paper every NoSQL system reaches back to.

  11. 11
    2007 · Helland
    Life Beyond Distributed Transactions

    An apostate's opinion. "Don't do distributed transactions. Use entities, idempotence, and message-passing instead." Aged extremely well.

  12. 12
    2010 · Hunt, Konar, Junqueira, Reed
    ZooKeeper: Wait-free coordination for Internet-scale systems

    A coordination kernel — leader election, configuration, distributed locks — built on Zab consensus. Still the reference design.

  13. 13
    2012 · Corbett et al
    Spanner: Google's Globally-Distributed Database

    Externally consistent transactions over a planet-scale database via TrueTime — bounded clock uncertainty backed by GPS + atomic clocks.

  14. 14
    2014 · Ongaro & Ousterhout
    In Search of an Understandable Consensus Algorithm (Raft)

    Consensus designed for teachability. The reason every Go database written after 2014 has a Raft library.

Going further: A Comprehensive Study of Convergent and Commutative Replicated Data Types (Shapiro et al, 2011 — the CRDT paper); Calvin: Fast Distributed Transactions for Partitioned Database Systems (Thomson et al, 2012); the Bigtable: A Decade Later retrospective; and every Aphyr Jepsen analysis — long-form reports on what real databases do (and do not) under partition.

Talks worth your evening.

Hands-on environments — where to actually run things.

Theory without runnable code is fragile. Each of these is a manageable way to make distributed systems push back when you make a mistake.

EnvironmentCostBest for
TLA+FreeSpecifying and model-checking protocols. AWS, Microsoft, MongoDB use it on real systems. Lamport's video course is the right starting point.
MIT 6.824 labsFreeImplement MapReduce, Raft, fault-tolerant KV, sharded KV — in Go. The single best graded experience of building distributed systems from scratch.
MaelstromFreeAphyr's local distributed-systems test use. Build a node binary, run it under simulated partition / latency / message loss. Perfect for solo learners.
JepsenFree, open-sourceThe real thing. Spin up a real cluster, hammer it under chaos, check for consistency violations. Steeper than Maelstrom; closer to what production failure looks like.
FoundationDB simulationFreeRead the open-source FoundationDB code. Their deterministic simulation framework is one-of-a-kind — every test runs millions of fault-injected scenarios deterministically. The reason FDB rarely loses data.

One-card cheat sheet.

Print it, tape it to the monitor. Ten lines that carry ninety percent of the daily mental load.

Quorum mathW + R > N ⇒ every read overlaps the latest write (strong consistency over a quorum store).
CAPUnder network partition, you choose Consistency or Availability — not both. PACELC adds: Else, Latency vs Consistency.
FLPIn a fully async system, no deterministic protocol solves consensus with even one crash. Real systems escape via timeouts (partial synchrony).
Paxos / Raft fault tolerance2f + 1 nodes tolerate f simultaneous failures. 3 nodes → 1 failure; 5 → 2.
Byzantine fault tolerance3f + 1 nodes tolerate f Byzantine (lying) failures. PBFT, Tendermint.
Vector clock comparisona < b iff every component a[i] ≤ b[i] AND at least one strict. Otherwise concurrent.
Raft phasesLeader election → log replication → commit on majority. One leader at a time, monotonic terms.
Spanner TrueTimeExternal consistency via bounded clock uncertainty (epsilon ≈ 7ms). Commit-wait until t + 2ε.
Lamport timestampPer-node monotonic counter. On send: increment. On receive: max(local, msg) + 1. Total order via tie-break by node ID.
Eight fallaciesNetwork is reliable; latency is zero; bandwidth is infinite; topology never changes; one admin; transport cost zero; network is homogeneous; secure by default. All wrong.

Common mistakes that ship to production.

Patterns every team writes at least once. Read these now, so you recognise the shape later, when something on-call is misbehaving and the dashboard is no help.

Wall clocks for ordering
NTP steps backwards. Leap seconds happen. VMs pause for seconds at a time. Never order events by Date.now() — use logical clocks, vector clocks, or HLCs.
Timeout = failure
Slow is not the same as dead. A partition is not the same as a crash. The kindest assumption — "node X has not responded yet" — is also the only true one.
Eventual consistency without conflict resolution
Eventually means "convergence to some value". You still need to define which value wins on a concurrent write. Last-Write-Wins on free-form data loses writes silently.
"Exactly-once" delivery
It does not exist over a network with failures. The two attainable contracts are at-most-once and at-least-once. Exactly-once is at-least-once + idempotent processing.
Distributed locks without fencing tokens
A client takes the lock, GCs for 30 seconds, the lock expires, another client takes it — and the first client wakes up still believing it holds the lock. Fence every lock-protected write.
Quorum reads but no quorum writes (or vice versa)
Asymmetric quorums break the W + R > N invariant. Either commit to strong consistency end to end or admit you are eventually consistent and design around it.
Last-write-wins on values that don't commute
LWW is fine for "current temperature". It is catastrophic for "set of friends" or "shopping cart". Use a CRDT, or a per-write merge function.
Calling SQL "Serializable" when default is Read Committed
Postgres, MySQL, Oracle all default below Serializable. Half the bugs in distributed apps are anomalies the dev assumed could not happen at the level they didn't set.
Believing the eight fallacies
You can recite them and still write code that assumes the network is reliable. Re-read the list each time you ship a new RPC client.
Reaching for consensus when single-leader replication would do
Paxos is operationally expensive. If your workload is "writes go to one node, reads can be stale-tolerant", you do not need it. Use it when you actually need agreement.

Practice deck.

Ten cards: the questions interviewers ask, the things that bite operators in production, and the details that separate "I use distributed systems" from "I understand them".

Card 1 of 10
A node has not heard from another for 5 seconds. Is the other node dead?
Suggested sequences

Reading progressions

Three ordered paths through this material. Pick the one that matches where you are.

Path 01 · Core
Consensus & replication

The foundations: how distributed nodes agree, and how data is replicated safely.

  1. Write-Ahead Logging — durability baseline
  2. ACID Transactions — isolation levels
  3. Read/Write Quorum Simulator ↗
  4. Isolation Levels Simulator ↗
  5. Kafka — the distributed log
Path 02 · Storage
Storage internals

From B-trees to LSM trees: the structures that keep databases fast and durable at scale.

  1. Database Indexing — B-trees & hash indexes
  2. WAL — crash recovery in practice
  3. Database Sharding Simulator ↗
  4. Consistent Hashing Simulator ↗
  5. Database Internals study path
Path 03 · Reliability
Failure & recovery patterns

The operational patterns that keep production systems alive: backpressure, retries, circuit breakers.

  1. Message Queues — delivery semantics
  2. Service Discovery — health & failover
  3. Circuit Breaker Simulator ↗
  4. Retry Strategy Simulator ↗
  5. Load Balancing — traffic distribution

Keep going.

Distributed systems rewards reading and re-reading. The same Lamport paper, read on day 30 and again on day 300, gives you different things. So will DDIA. So will every Jepsen post. The field is not large. It is dense.

Pick one real system and read its source for an afternoon. etcd, FoundationDB, CockroachDB, TiKV, NATS, Kafka: all open. Pair what you read with the paper that inspired it. Then come back to your own service and re-read your own retry, timeout, and replication code. You will rewrite some of it.