Paper · Consensus
Paper · Distributed systems · Consensus

Raft, designed to be understood.

Ongaro and Ousterhout's 2014 paper argued that consensus algorithms should be designed for understandability, not just correctness. Raft has since become the consensus algorithm of choice in production — etcd, Consul, CockroachDB, TiKV, RethinkDB, and dozens more.

→ The original PDF (raft.github.io, free)

Authors Diego Ongaro & John Ousterhout Year 2014 Venue USENIX ATC PDF raft.github.io/raft.pdf

TL;DR

Raft is a consensus algorithm for managing a replicated log, with the same correctness guarantees as Multi-Paxos but a structure that's tractable to read, teach, and implement. It decomposes the problem into three sub-problems — leader election, log replication, and safety — and constrains the algorithm so that at most one leader exists per term. Within a decade of publication it became the default consensus algorithm in production distributed systems, displacing Paxos almost everywhere outside Google. The paper's real contribution is pedagogical: it showed that a consensus algorithm can be designed for human readers without sacrificing performance.

The problem

Paxos, as introduced by Lamport in 1998, is correct but famously hard to understand. The single-decree version (one value, one round) is tractable; Multi-Paxos — the version you actually need, which agrees on a sequence of values — is not. Lamport's original paper sketched Multi-Paxos in a handful of sentences and left the rest to the reader. Practitioners who tried to implement it ran into ambiguities the paper didn't resolve, and most production "Paxos" implementations are subtly different from each other and from the paper.

Ongaro and Ousterhout taught consensus at Stanford and observed that students struggled. They surveyed 43 readers of both Paxos and Raft and found Raft was rated significantly easier to understand. That sounds like a soft result until you realise what it implies: every distributed-systems engineer who has to reason about, debug, or modify a consensus implementation pays the understandability tax forever. Raft's thesis is that understandability is a first-class design goal, not a nice-to-have.

The key idea

Raft achieves understandability through two techniques: decomposition and state-space reduction. The algorithm is split into three independent sub-problems — leader election (who's in charge), log replication (how the leader gets entries onto a majority), and safety (what invariants the algorithm must preserve across leader changes). Each can be reasoned about on its own. State-space reduction means deliberately ruling out cases — for instance, Raft forbids log entries from flowing in any direction except leader-to-follower, which eliminates a whole category of conflict-resolution logic that Paxos has to handle.

Time is divided into terms, which are monotonically increasing integers. Each term begins with an election. If a candidate wins a majority, it serves as leader for that term; if the election splits the vote, the term ends with no leader and a new term begins. At most one leader exists per term — this is the central invariant that everything else hangs off. Every RPC carries a term number; a server that sees a higher term immediately steps down to follower, and a server that sees a lower term rejects the RPC.

Servers are in one of three states: follower (passive, responds to RPCs), candidate (running an election), or leader (handles all client requests). Followers start an election when they don't hear from a leader within the election timeout — typically randomised between 150 and 300 milliseconds. The randomisation matters: synchronised timeouts cause repeated split votes, while randomised ones almost always produce a clear winner within one or two rounds.

The leader-based simplification. Raft eliminates the dueling-proposers problem by ensuring at most one leader per term. All client requests go through the leader; followers are passive. The complexity that Paxos pushed onto the algorithm, Raft pushes onto leader election. Leader election is the messy part — but it's a self-contained messy part, and once a leader is established the replication path is almost trivial.

Contributions

Leader election with terms. A follower whose election timeout expires increments its current term, transitions to candidate, votes for itself, and sends RequestVote RPCs to every other server. A server grants its vote if (a) it hasn't voted in this term and (b) the candidate's log is at least as up-to-date as its own. A candidate becomes leader when it receives votes from a majority. If another server claims leadership for the same or a higher term, the candidate steps down. If the election times out with no winner, the candidate starts a new term and tries again. With 150-300 ms randomised timeouts, an election usually completes in one round; the worst case observed in the paper's measurements is under a second.

Log replication. The leader takes every client request, appends it to its own log as a new entry, and issues AppendEntries RPCs to every follower in parallel. An entry is committed once it's been replicated to a majority, including the leader. The leader then applies it to its state machine and tells followers, in subsequent AppendEntries, what the latest commit index is. Followers apply committed entries to their state machines in order. AppendEntries also acts as the heartbeat — empty AppendEntries RPCs from leader to followers, every 50 ms or so, suppress elections.

Log matching and conflict resolution. Every AppendEntries carries the index and term of the entry immediately preceding the new ones. A follower rejects the RPC if its log doesn't contain a matching entry at that position. The leader then backs up and tries again with an earlier index, until it finds the point where the logs agree. From there it overwrites any conflicting suffix on the follower with its own entries. This is the Log Matching Property: if two logs contain an entry with the same index and term, they're identical in all preceding entries.

Safety via the election restriction. The dangerous case is a leader being elected without all previously committed entries — it could overwrite them. Raft prevents this by restricting which servers can win an election: a candidate must have a log at least as up-to-date as a majority of the cluster. "Up-to-date" is defined by comparing the term and index of the last entry. Combined with the majority-overlap property (any two majorities share at least one server), this guarantees Leader Completeness: any committed entry is present in every future leader's log. From Leader Completeness, the paper derives State Machine Safety: no two state machines apply different values for the same log index. Both are proved formally in the dissertation.

Joint consensus for membership changes. Adding or removing servers can't be a single atomic step — there's no way to switch every server's view of the configuration simultaneously. Raft transitions through a joint consensus phase where decisions require a majority of both the old and new configurations. This guarantees that no split-brain can occur during the transition: any leader elected during joint consensus has agreement from a majority of both configs and therefore inherits all committed entries.

Criticisms and limitations

The leader bottleneck is real. Every write goes through one server, so write throughput tops out at what a single leader can handle — usually tens of thousands of operations per second in well-tuned systems like etcd. Multi-region or multi-leader topologies need extra machinery: CockroachDB uses per-range Raft groups so different ranges have leaders on different nodes; YugabyteDB uses leader leases to allow follower reads; some systems layer a higher-level partitioning scheme so no single Raft group bears the whole write load. The paper's single-group design is pedagogically clean but doesn't scale horizontally on its own.

Two other footnotes. First, Raft as specified assumes every log append is fsynced to stable storage before being acked — real systems often batch fsyncs (group commit) and accept a small durability risk for an order-of-magnitude latency win. Second, the membership-change protocol in the ATC paper has known correctness bugs that Ongaro's dissertation later corrected; the dissertation's single-server-at-a-time change protocol is what most implementations actually use. If you're implementing Raft, read the dissertation, not the conference paper.

Where it shows up today

Raft is the consensus algorithm in etcd, which is the metadata store for Kubernetes — every Kubernetes cluster on Earth runs Raft underneath. Consul uses it for service catalog and KV. CockroachDB runs a separate Raft group per 64 MB range, with thousands of Raft groups per node; TiKV and TiDB take the same approach. RethinkDB used it for cluster metadata; MongoDB's replication protocol since 3.2 is Raft-influenced. AWS Aurora's storage layer uses a Raft variant for quorum durability across availability zones. The Hashicorp ecosystem (Nomad, Vault) is Raft top to bottom.

There are at least a dozen production-quality libraries — hashicorp/raft in Go, etcd-io/raft (the etcd implementation, used by both etcd and Kubernetes), tikv/raft-rs in Rust, baidu/braft in C++. The fact that a graduate-student paper produced this much industrial uptake within a decade is unusual, and is the strongest empirical evidence for the understandability thesis: a clearer algorithm gets re-implemented more often, more correctly, by more people.

Follow-up reading

Found this useful?