Paxos, made simpler.

Lamport's original 1989 Paxos paper was famously impenetrable. The 2001 follow-up rewrote it in 13 pages of plain English. This is the version everyone reads first.

Authors Leslie Lamport

Year 2001

Venue SIGACT News

PDF →

TL;DR

Paxos is an algorithm that lets a group of processes agree on a single value, even when some of them crash and messages arrive late or out of order. It does it in two phases — a prepare phase that locks down a proposal number, and an accept phase that commits a value — with safety guaranteed by the fact that any two majorities of N nodes must overlap by at least one node. The 2001 paper is Lamport's attempt to rederive the algorithm from first principles, in plain prose, after a decade of complaints that the 1989 version was unreadable. It's still the canonical entry point for anyone learning consensus.

The problem

Before Paxos, there was no good answer to a deceptively simple question: how do you get a set of processes to agree on a single value when they can crash, restart, drop messages, deliver them out of order, and have no shared clock? The 1985 FLP impossibility result had already shown that no deterministic algorithm can guarantee consensus in a fully asynchronous system with even one crash failure — so any practical algorithm has to give something up. Paxos gives up liveness in the worst case: it always stays safe (never lets two different values be chosen), but in pathological scheduling it may never make progress.

What didn't exist before was an algorithm that, in the common case, made progress in two message round-trips and could be proved correct under arbitrary message loss and reordering. Two-phase commit existed, but it blocks on the coordinator. Viewstamped Replication (Oki and Liskov, 1988) covered similar ground but never escaped the database community. Paxos was the first consensus protocol that the systems community broadly adopted as the reference point — and it took Lamport a second paper, twelve years later, to make it readable enough that they would.

The key idea

Paxos splits each process into three logical roles — proposers that suggest values, acceptors that vote on them, and learners that find out the chosen value — and then runs the protocol in two phases. In phase 1, a proposer picks a new, globally unique proposal number n and sends a PREPARE(n) to a majority of acceptors. Each acceptor, if it hasn't already promised to ignore proposals numbered n or higher, promises to do so from now on and reports back the highest-numbered proposal it has already accepted, if any. In phase 2, the proposer picks a value — its own if no acceptor reported a previous accept, otherwise the value from the highest-numbered previous accept — and sends ACCEPT(n, v) to a majority. Acceptors that haven't promised to ignore n accept it. Once a majority has accepted (n, v), the value v is chosen.

The whole thing rests on one geometric fact: any two majorities of N nodes must intersect in at least one node. So if a value v was chosen at proposal number n (meaning some majority accepted it), then any later PREPARE(n') with n' > n sent to a majority will hit at least one acceptor who already accepted (n, v) — and the rules force the new proposer to propose v again, not its own value. That's the entire safety argument. Once a value is chosen, every higher-numbered proposal that succeeds must propose the same value.

This is much harder to discover than it is to verify. The 2001 paper is structured as a rederivation — Lamport states the invariant he wants ("only a single value is ever chosen") and then works backward to the protocol that maintains it. By the end, the protocol feels almost forced rather than arbitrary, which is the rhetorical move that makes the paper teachable.

The intersection guarantee. For a system of N nodes, any two subsets of size ⌈(N+1)/2⌉ must share at least one node. This is why "majority" is the magic word and why Paxos tolerates up to ⌊(N−1)/2⌋ failures — three nodes tolerate one, five tolerate two, seven tolerate three. Larger quorums buy fault-tolerance at the cost of latency.

Contributions

The first contribution is the two-phase structure itself. Phase 1 is a read — the proposer learns what, if anything, has already been accepted. Phase 2 is a write — the proposer commits a value, possibly one it inherited from phase 1. Separating these so that phase 1 can be amortised across many phase-2 rounds is the seed of Multi-Paxos: elect a stable leader, run phase 1 once, then run phase 2 over and over for each new value. Single-decree Paxos (one value) becomes the kernel; Multi-Paxos (a log of values) becomes the production system.

The second is the proposal number. Every proposal carries a totally ordered, globally unique number, and acceptors only ever advance their state to higher numbers. This gives a logical clock over proposals that doesn't depend on physical time — directly echoing Lamport's earlier work on logical clocks. The proposal number is what lets a recovering acceptor compare what it sees in a PREPARE against what it has already promised or accepted, and decide deterministically what to do.

The third is the "propose what was last accepted" rule. In phase 2, the proposer is not free to send its own value if an acceptor in its quorum has already accepted something. It must inherit that value. This sounds defensive, but it's what carries the safety invariant across leader changes: even if the previous leader crashed mid-round, a chosen value (one accepted by a majority) propagates forward because at least one acceptor in the new quorum remembers it.

The fourth is the separation of roles. By naming proposers, acceptors, and learners as distinct logical participants — even though in practice every server plays all three roles — the paper lets you reason about each role's invariants in isolation. Acceptors are simple state machines with three pieces of state (min_proposal, accepted_proposal, accepted_value); proposers run the protocol; learners just observe. The clarity of role separation is why people can implement Paxos at all.

Finally, the paper makes the state machine replication link explicit: you can build a replicated log by running one instance of Paxos per slot, and you can build a replicated service by feeding that log into a deterministic state machine. That recipe — Paxos under a log under a state machine — is the architecture of essentially every modern strongly-consistent distributed system, from Chubby to Spanner to etcd (which uses Raft, but Raft is the same recipe with a different consensus core).

Criticisms and limitations

Lamport's "simplification" simplified single-decree Paxos. Multi-Paxos — what you actually run in production — is still ambiguous in several places. The paper sketches it in three pages and leaves choices about leader election, log compaction, configuration change, and read leases to the implementer. The result is that there is no single "Multi-Paxos" — Chubby, Spanner, and Megastore each ship a different dialect, and a persistent joke in the field is that there are as many Paxos variants as there are Paxos implementations. Heidi Howard and Robbert van Renesse's 2015 survey lists more than a dozen named variants.

Safety is unconditional in Paxos; liveness is not. Two proposers racing each other can keep bumping proposal numbers without either ever completing phase 2 — a behaviour the paper acknowledges and the FLP impossibility result predicts. Real systems work around this with a leader-election layer that funnels all proposals through a single proposer at a time. That layer is not part of the paper, and getting it right (with leases, fencing tokens, and split-brain handling) is most of the work in a production implementation. This gap is part of what motivated Ongaro and Ousterhout's 2014 Raft paper, which folds leader election into the consensus protocol itself and sells the result as "Paxos but teachable."

Where it shows up today

Google's Chubby (2006) is the canonical production Paxos system — a distributed lock service that uses Multi-Paxos to replicate a tiny filesystem across five nodes per cell, and which underpins GFS, Bigtable, and effectively all of Google's early infrastructure. Spanner runs a Paxos group per shard ("tablet"), with thousands of independent Paxos groups across a single deployment. Megastore runs Paxos per entity group. Apache ZooKeeper uses Zab, a Paxos-derived but distinct protocol with stronger ordering guarantees on the leader. Microsoft's Autopilot and Amazon's internal control planes use Paxos variants too.

Outside Paxos itself, the influence is everywhere. etcd and Consul and CockroachDB and TiKV use Raft, but Raft inherits Paxos's quorum structure, its proposal-number logic (renamed "term"), and its safety argument almost wholesale. If you've ever read a system described as "strongly consistent" or "linearisable" with more than one replica, you're one or two indirections away from this paper.

Follow-up reading

Lamport — The Part-Time Parliament (1998) — the original Paxos paper, framed as an archaeological reconstruction of a fictional Greek parliament. Famously hard to read; this is the one "Made Simple" was trying to replace.
Ongaro and Ousterhout — In Search of an Understandable Consensus Algorithm (2014) — the Raft paper. Designed from the ground up as a Paxos alternative that's easier to teach. Most modern open-source consensus systems use it.
Fischer, Lynch, Paterson — Impossibility of Distributed Consensus with One Faulty Process (1985) — the result that explains why Paxos can't guarantee liveness. The theoretical bedrock for everything in this area.
Lamport — Time, Clocks, and the Ordering of Events (1978) — Lamport's earlier paper on logical clocks. The proposal number in Paxos is essentially a logical clock value; reading this first makes Paxos make more sense.
Corbett et al. — Spanner: Google's Globally-Distributed Database (2012) — Paxos in production at planet scale. One Paxos group per shard, with TrueTime layered on top for external consistency.
van Renesse and Altinbuken — Paxos Made Moderately Complex (2015) — a pseudocode-level walk-through of Multi-Paxos with all the production details the original paper leaves out. The right second reading.

More annotated papers

Back to the papers index

Foundational distributed-systems and database papers, read and annotated.

↑ Distributed systems study path ← All papers

Found this useful?