02 / 20
Topics / 02

Consensus — Paxos and Raft

Consensus is the problem of getting N nodes to agree on a single value despite arbitrary message delays, lost messages, and crashed nodes. A 1985 result by Fischer, Lynch, and Paterson proved that no protocol can always decide. Paxos (1989) and Raft (2014) ship in production anyway: they give up "always" in exchange for staying safe whenever they do decide, and the network behaves often enough that they decide most of the time.


FLP — what's impossible

Fischer, Lynch, and Paterson, 1985: in an asynchronous system with even one crash-prone process, no deterministic protocol can always reach consensus.

The trick FLP exploits: messages can be delayed arbitrarily, and a crashed process looks identical to a slow one. The protocol can never tell "did node X die" from "is node X just taking a long time to reply", and any choice it commits to can be wrong if the message it relied on shows up later.

Production protocols (Paxos, Raft, Zab) sidestep FLP by assuming partial synchrony: the network usually behaves, and when it does, consensus terminates. When it doesn't, the protocol stalls but never decides incorrectly. Safety always; liveness when the network is well-behaved.

Paxos in two phases

Lamport's 1989 protocol decides on one value among N proposers. The roles: proposer, acceptor, learner. (Often the same nodes play more than one role.)

A proposer assigns a unique proposal number n (monotonic, e.g. (round, node_id)), then runs two phases:

  1. Phase 1 — Prepare(n). The proposer sends Prepare(n) to a majority of acceptors. Each acceptor that hasn't responded to a higher-numbered Prepare promises not to accept any proposal numbered < n, and replies with the highest-numbered value it has previously accepted (if any).
  2. Phase 2 — Accept(n, v). If a majority promised, the proposer sends Accept(n, v) — where v is its preferred value, unless Phase 1 returned a previously accepted value, in which case v must be that value (this is the "single-decree safety" rule). An acceptor accepts unless it's promised to a higher n.

A value is decided once a majority has accepted it. Learners discover the decision by listening to the acceptors.

Why this works. Any two majorities intersect in at least one acceptor. That acceptor's promise carries forward, so any later proposer that completes Phase 1 must learn the previous accepted value and propose it instead. A decided value can't be undone.

Multi-Paxos and replicated logs

Single-decree Paxos decides on one value. Real systems decide on a sequence: a replicated log of operations. Multi-Paxos runs Paxos once per log slot, with one optimisation: elect a stable leader and skip Phase 1 for later slots, since the leader's promise carries across slots.

This is also where Paxos in practice gets messy. The original paper is thin on the details production systems care about: leader election, log compaction, reconfiguration (membership changes), batching. Each implementation answers these differently. Google's Chubby, Spanner's Paxos, ZooKeeper's Zab: each is its own interpretation. This is the gap Raft set out to close.

Raft, the teachable version

Ongaro & Ousterhout, 2014. The thesis: Paxos is too hard to teach and to implement correctly. Raft solves the same problem with the same safety guarantees, built around three sub-problems you can reason about independently: leader election, log replication, and safety.

Every node is in one of three states:

StateWhat it does
FollowerPassive. Responds to RPCs from a leader or candidate.
CandidateTrying to become leader. Asks for votes.
LeaderHandles all client requests. Replicates log entries to followers.

Time is divided into terms, which only increase. Each term has at most one leader. A follower that doesn't hear from the leader for the election timeout (randomised, ~150–300 ms) becomes a candidate, increments the term, and asks the cluster for votes. A candidate that wins a majority becomes leader and starts sending heartbeats.

How writes commit, step by step

  1. Client sends a command to the leader.
  2. Leader appends the command to its local log.
  3. Leader sends AppendEntries RPCs to every follower in parallel.
  4. Once a majority of followers (including the leader itself) have written the entry to disk, the leader marks the entry committed.
  5. Leader applies the command to its state machine and returns the result to the client.
  6. Followers learn an entry is committed in the next AppendEntries heartbeat (the leader piggybacks the commit index) and apply it to their own state machines.

A committed entry is durable: it's on a majority of disks. Even if the leader crashes, any future leader must have a log that contains every committed entry, because elections require a majority and any majority intersects with every previous majority.

The five safety properties

Raft proves five invariants. They're worth memorising, because they tell you exactly what the protocol guarantees and what it doesn't:

  • Election safety. At most one leader can be elected per term.
  • Leader append-only. A leader never overwrites or deletes entries in its log.
  • Log matching. If two logs contain an entry with the same index and term, all preceding entries are identical.
  • Leader completeness. If an entry is committed in some term, it appears in the logs of all leaders for higher-numbered terms.
  • State-machine safety. If a node has applied an entry at a given index, no other node will apply a different entry at that index.

What ships in production

SystemProtocolNotes
etcdRaftThe reference implementation; what Kubernetes uses
ConsulRaftHashiCorp's reference Go implementation
CockroachDBRaft (per range)Many independent Raft groups, one per data range
TiDBRaftPer-region Raft groups in TiKV
ZooKeeperZabPre-Raft, similar guarantees, optimised for read throughput
Google ChubbyMulti-PaxosThe original production Paxos system
SpannerMulti-Paxos + TrueTimePer-shard Paxos, externally consistent across shards
Kafka (KRaft)Raft variantReplaces ZooKeeper as Kafka's metadata store

Common misunderstandings

  • "Raft survives any failure." Raft survives a minority of failures. Lose a majority of nodes at once and the cluster stops accepting writes until you fix it. By design: that's how it avoids split-brain.
  • "Raft is fast." Each write is a network round-trip from leader to majority plus a disk fsync on every replica. ~5–20 ms in a single datacentre, more cross-region. Batching helps.
  • "Once committed, always served." A leader can crash between committing an entry and applying it to its state machine. The next leader will eventually apply it, but reads against the old leader can serve a value before its commit was known. Linearisable reads need a confirmation round-trip.
  • "You only need consensus once." Membership changes (adding/removing nodes) need consensus too. That's the joint-consensus or single-server-change machinery in Raft §6, and it's where most homegrown implementations get bitten.

Further reading

Found this useful?