Consensus — Paxos and Raft

Consensus is the problem of getting N nodes to agree on a single value despite arbitrary message delays, lost messages, and crashed nodes. A 1985 result by Fischer, Lynch, and Paterson proved that no protocol can always decide. Paxos (1989) and Raft (2014) ship in production anyway: they give up "always" in exchange for staying safe whenever they do decide, and the network behaves often enough that they decide most of the time.

FLP — what's impossible

Fischer, Lynch, and Paterson, 1985: in an asynchronous system with even one crash-prone process, no deterministic protocol can always reach consensus.

The trick FLP exploits: messages can be delayed arbitrarily, and a crashed process looks identical to a slow one. The protocol can never tell "did node X die" from "is node X just taking a long time to reply", and any choice it commits to can be wrong if the message it relied on shows up later.

Production protocols (Paxos, Raft, Zab) sidestep FLP by assuming partial synchrony: the network usually behaves, and when it does, consensus terminates. When it doesn't, the protocol stalls but never decides incorrectly. Safety always; liveness when the network is well-behaved.

Paxos in two phases

Lamport's 1989 protocol decides on one value among N proposers. The roles: proposer, acceptor, learner. (Often the same nodes play more than one role.)

A proposer assigns a unique proposal number n (monotonic, e.g. (round, node_id)), then runs two phases:

Phase 1 — Prepare(n). The proposer sends Prepare(n) to a majority of acceptors. Each acceptor that hasn't responded to a higher-numbered Prepare promises not to accept any proposal numbered < n, and replies with the highest-numbered value it has previously accepted (if any).
Phase 2 — Accept(n, v). If a majority promised, the proposer sends Accept(n, v) — where v is its preferred value, unless Phase 1 returned a previously accepted value, in which case v must be that value (this is the "single-decree safety" rule). An acceptor accepts unless it's promised to a higher n.

A value is decided once a majority has accepted it. Learners discover the decision by listening to the acceptors.

Why this works. Any two majorities intersect in at least one acceptor. That acceptor's promise carries forward, so any later proposer that completes Phase 1 must learn the previous accepted value and propose it instead. A decided value can't be undone.

Multi-Paxos and replicated logs

Single-decree Paxos decides on one value. Real systems decide on a sequence: a replicated log of operations. Multi-Paxos runs Paxos once per log slot, with one optimisation: elect a stable leader and skip Phase 1 for later slots, since the leader's promise carries across slots.

This is also where Paxos in practice gets messy. The original paper is thin on the details production systems care about: leader election, log compaction, reconfiguration (membership changes), batching. Each implementation answers these differently. Google's Chubby, Spanner's Paxos, ZooKeeper's Zab: each is its own interpretation. This is the gap Raft set out to close.

Raft, the teachable version

Ongaro & Ousterhout, 2014. The thesis: Paxos is too hard to teach and to implement correctly. Raft solves the same problem with the same safety guarantees, built around three sub-problems you can reason about independently: leader election, log replication, and safety.

Every node is in one of three states:

State	What it does
Follower	Passive. Responds to RPCs from a leader or candidate.
Candidate	Trying to become leader. Asks for votes.
Leader	Handles all client requests. Replicates log entries to followers.

Time is divided into terms, which only increase. Each term has at most one leader. A follower that doesn't hear from the leader for the election timeout (randomised, ~150–300 ms) becomes a candidate, increments the term, and asks the cluster for votes. A candidate that wins a majority becomes leader and starts sending heartbeats.

How writes commit, step by step

Client sends a command to the leader.
Leader appends the command to its local log.
Leader sends AppendEntries RPCs to every follower in parallel.
Once a majority of followers (including the leader itself) have written the entry to disk, the leader marks the entry committed.
Leader applies the command to its state machine and returns the result to the client.
Followers learn an entry is committed in the next AppendEntries heartbeat (the leader piggybacks the commit index) and apply it to their own state machines.

A committed entry is durable: it's on a majority of disks. Even if the leader crashes, any future leader must have a log that contains every committed entry, because elections require a majority and any majority intersects with every previous majority.

The five safety properties

Raft proves five invariants. They're worth memorising, because they tell you exactly what the protocol guarantees and what it doesn't:

Election safety. At most one leader can be elected per term.
Leader append-only. A leader never overwrites or deletes entries in its log.
Log matching. If two logs contain an entry with the same index and term, all preceding entries are identical.
Leader completeness. If an entry is committed in some term, it appears in the logs of all leaders for higher-numbered terms.
State-machine safety. If a node has applied an entry at a given index, no other node will apply a different entry at that index.

What ships in production

System	Protocol	Notes
etcd	Raft	The reference implementation; what Kubernetes uses
Consul	Raft	HashiCorp's reference Go implementation
CockroachDB	Raft (per range)	Many independent Raft groups, one per data range
TiDB	Raft	Per-region Raft groups in TiKV
ZooKeeper	Zab	Pre-Raft, similar guarantees, optimised for read throughput
Google Chubby	Multi-Paxos	The original production Paxos system
Spanner	Multi-Paxos + TrueTime	Per-shard Paxos, externally consistent across shards
Kafka (KRaft)	Raft variant	Replaces ZooKeeper as Kafka's metadata store

Common misunderstandings

"Raft survives any failure." Raft survives a minority of failures. Lose a majority of nodes at once and the cluster stops accepting writes until you fix it. By design: that's how it avoids split-brain.
"Raft is fast." Each write is a network round-trip from leader to majority plus a disk fsync on every replica. ~5–20 ms in a single datacentre, more cross-region. Batching helps.
"Once committed, always served." A leader can crash between committing an entry and applying it to its state machine. The next leader will eventually apply it, but reads against the old leader can serve a value before its commit was known. Linearisable reads need a confirmation round-trip.
"You only need consensus once." Membership changes (adding/removing nodes) need consensus too. That's the joint-consensus or single-server-change machinery in Raft §6, and it's where most homegrown implementations get bitten.

Consensus — Paxos and Raft

FLP — what's impossible

Paxos in two phases

Multi-Paxos and replicated logs

Raft, the teachable version

How writes commit, step by step

The five safety properties

What ships in production

Common misunderstandings

Further reading

03 — Time & clocks