Consensus — Paxos and Raft
Consensus is the problem of getting N nodes to agree on a single value despite arbitrary message delays, lost messages, and crashed nodes. A 1985 result by Fischer, Lynch, and Paterson proved that no protocol can always decide. Paxos (1989) and Raft (2014) ship in production anyway: they give up "always" in exchange for staying safe whenever they do decide, and the network behaves often enough that they decide most of the time.
FLP — what's impossible
Fischer, Lynch, and Paterson, 1985: in an asynchronous system with even one crash-prone process, no deterministic protocol can always reach consensus.
The trick FLP exploits: messages can be delayed arbitrarily, and a crashed process looks identical to a slow one. The protocol can never tell "did node X die" from "is node X just taking a long time to reply", and any choice it commits to can be wrong if the message it relied on shows up later.
Production protocols (Paxos, Raft, Zab) sidestep FLP by assuming partial synchrony: the network usually behaves, and when it does, consensus terminates. When it doesn't, the protocol stalls but never decides incorrectly. Safety always; liveness when the network is well-behaved.
Paxos in two phases
Lamport's 1989 protocol decides on one value among N proposers. The roles: proposer, acceptor, learner. (Often the same nodes play more than one role.)
A proposer assigns a unique proposal number n (monotonic, e.g. (round, node_id)),
then runs two phases:
- Phase 1 — Prepare(n). The proposer sends Prepare(n) to a majority of acceptors. Each acceptor that hasn't responded to a higher-numbered Prepare promises not to accept any proposal numbered < n, and replies with the highest-numbered value it has previously accepted (if any).
- Phase 2 — Accept(n, v). If a majority promised, the proposer sends Accept(n, v) — where v is its preferred value, unless Phase 1 returned a previously accepted value, in which case v must be that value (this is the "single-decree safety" rule). An acceptor accepts unless it's promised to a higher n.
A value is decided once a majority has accepted it. Learners discover the decision by listening to the acceptors.
Multi-Paxos and replicated logs
Single-decree Paxos decides on one value. Real systems decide on a sequence: a replicated log of operations. Multi-Paxos runs Paxos once per log slot, with one optimisation: elect a stable leader and skip Phase 1 for later slots, since the leader's promise carries across slots.
This is also where Paxos in practice gets messy. The original paper is thin on the details production systems care about: leader election, log compaction, reconfiguration (membership changes), batching. Each implementation answers these differently. Google's Chubby, Spanner's Paxos, ZooKeeper's Zab: each is its own interpretation. This is the gap Raft set out to close.
Raft, the teachable version
Ongaro & Ousterhout, 2014. The thesis: Paxos is too hard to teach and to implement correctly. Raft solves the same problem with the same safety guarantees, built around three sub-problems you can reason about independently: leader election, log replication, and safety.
Every node is in one of three states:
| State | What it does |
|---|---|
| Follower | Passive. Responds to RPCs from a leader or candidate. |
| Candidate | Trying to become leader. Asks for votes. |
| Leader | Handles all client requests. Replicates log entries to followers. |
Time is divided into terms, which only increase. Each term has at most one leader. A follower that doesn't hear from the leader for the election timeout (randomised, ~150–300 ms) becomes a candidate, increments the term, and asks the cluster for votes. A candidate that wins a majority becomes leader and starts sending heartbeats.
How writes commit, step by step
- Client sends a command to the leader.
- Leader appends the command to its local log.
- Leader sends
AppendEntriesRPCs to every follower in parallel. - Once a majority of followers (including the leader itself) have written the entry to disk, the leader marks the entry committed.
- Leader applies the command to its state machine and returns the result to the client.
- Followers learn an entry is committed in the next
AppendEntriesheartbeat (the leader piggybacks the commit index) and apply it to their own state machines.
A committed entry is durable: it's on a majority of disks. Even if the leader crashes, any future leader must have a log that contains every committed entry, because elections require a majority and any majority intersects with every previous majority.
The five safety properties
Raft proves five invariants. They're worth memorising, because they tell you exactly what the protocol guarantees and what it doesn't:
- Election safety. At most one leader can be elected per term.
- Leader append-only. A leader never overwrites or deletes entries in its log.
- Log matching. If two logs contain an entry with the same index and term, all preceding entries are identical.
- Leader completeness. If an entry is committed in some term, it appears in the logs of all leaders for higher-numbered terms.
- State-machine safety. If a node has applied an entry at a given index, no other node will apply a different entry at that index.
What ships in production
| System | Protocol | Notes |
|---|---|---|
| etcd | Raft | The reference implementation; what Kubernetes uses |
| Consul | Raft | HashiCorp's reference Go implementation |
| CockroachDB | Raft (per range) | Many independent Raft groups, one per data range |
| TiDB | Raft | Per-region Raft groups in TiKV |
| ZooKeeper | Zab | Pre-Raft, similar guarantees, optimised for read throughput |
| Google Chubby | Multi-Paxos | The original production Paxos system |
| Spanner | Multi-Paxos + TrueTime | Per-shard Paxos, externally consistent across shards |
| Kafka (KRaft) | Raft variant | Replaces ZooKeeper as Kafka's metadata store |
Common misunderstandings
- "Raft survives any failure." Raft survives a minority of failures. Lose a majority of nodes at once and the cluster stops accepting writes until you fix it. By design: that's how it avoids split-brain.
- "Raft is fast." Each write is a network round-trip from leader to majority plus a disk fsync on every replica. ~5–20 ms in a single datacentre, more cross-region. Batching helps.
- "Once committed, always served." A leader can crash between committing an entry and applying it to its state machine. The next leader will eventually apply it, but reads against the old leader can serve a value before its commit was known. Linearisable reads need a confirmation round-trip.
- "You only need consensus once." Membership changes (adding/removing nodes) need consensus too. That's the joint-consensus or single-server-change machinery in Raft §6, and it's where most homegrown implementations get bitten.
Further reading
- Ongaro & Ousterhout — In Search of an Understandable Consensus Algorithm — the Raft paper. Aim for the long version, including §6 on membership changes.
- Lamport — Paxos Made Simple — Lamport's own retelling, which finally made the protocol broadly understood.
- Fischer, Lynch, Paterson (1985) — Impossibility of Distributed Consensus with One Faulty Process — the FLP paper itself.
- etcd-io/raft — production-grade Raft in ~10k lines of Go.
- The Secret Lives of Data — Raft visualization — interactive walkthrough; the fastest way to internalise leader election.