
How Raft keeps a cluster agreeing on one shared log.

Diego Ongaro and John Ousterhout published "In Search of an Understandable Consensus Algorithm" in 2014. A decade later, etcd, Consul, CockroachDB, MongoDB, RethinkDB, and Kafka's KRaft mode all run it. The understandability claim turned out to be correct.


Raft in motion: five nodes, one log

Submit, kill, partition. The cluster does the rest.

Picture a five-node Raft cluster. One node is leader, the rest are followers. Submit a command: it lands at the leader, replicates to followers, and commits when a majority acks. Kill the leader: an election starts, a candidate wins, and the cluster continues. Partition the network: the minority side loses quorum and stalls on writes, while the majority keeps committing.


Raft is the algorithm behind almost everything you've used

etcd, Consul, CockroachDB, MongoDB, KRaft, and dozens more.

Anyone who has touched modern infrastructure has used a Raft implementation, usually without knowing it. Kubernetes stores its cluster state in etcd, which is Raft. HashiCorp's Consul, Nomad, and Vault all use Raft via the hashicorp/raft library. CockroachDB runs thousands of Raft groups, one per ~64 MB key range. MongoDB's replica set protocol is a Raft variant. RethinkDB ships with Raft. Kafka's KRaft mode (production-ready since 3.3) replaces ZooKeeper with an embedded Raft implementation. TiKV uses Raft. Dragonboat is a high-performance Go implementation. The list keeps going.

Before 2014, the practical alternative was Paxos and its zoo of variants — Basic Paxos, Multi-Paxos, Fast Paxos, EPaxos, Generalized Paxos. Each variant fixed a real problem but added complexity. Google's Chubby (2006), Spanner's TrueTime-based replication, and ZooKeeper's Zab protocol all walked that road. The implementations were correct, but each one took years and a small team of distributed-systems specialists to get right.

After Raft, most new systems just used Raft. The exception is systems that need optimizations Paxos variants offer — lower latency on uncontested commits (Fast Paxos), multi-leader writes (EPaxos), or geo-distributed consensus where the leader bottleneck matters. For everything else — single-region or single-DC consensus over a small fixed group of nodes — Raft is the default. The understandability claim turned out to be correct: well-known production-quality libraries (hashicorp/raft, etcd-raft, dragonboat) exist that anyone can drop in. The equivalent for Paxos took 15 years to materialize, and is still rarer.


Leader election: randomized timeouts break ties

Three states, one leader per term.

Every Raft node is in one of three states: follower, candidate, or leader. Time is divided into terms — monotonically increasing integers, like an election year. At most one leader exists per term. The constraint is structural: a node only votes for one candidate per term, and a leader needs a majority. With five nodes, that's three votes; only one candidate can collect three from a pool of five.
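The counting argument can be checked by brute force. This is a minimal sketch, not taken from any Raft library; the helper names `majority` and `all_majorities_overlap` are made up for illustration.

```python
from itertools import combinations

def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1

def all_majorities_overlap(nodes: list[str]) -> bool:
    """Check by brute force that any two majority-sized vote sets share a node."""
    quorum = majority(len(nodes))
    groups = [set(g) for g in combinations(nodes, quorum)]
    return all(a & b for a, b in combinations(groups, 2))

nodes = ["n1", "n2", "n3", "n4", "n5"]
print(majority(len(nodes)))           # 3
print(all_majorities_overlap(nodes))  # True
```

Because every pair of majorities shares at least one node, and that node votes only once per term, two candidates cannot both win the same term.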

Followers expect heartbeats from the current leader within an election timeout, randomized between 150 and 300 ms in the original paper (etcd defaults to 1 s, configurable). If no heartbeat arrives in that window, the follower increments its term, votes for itself, transitions to candidate, and fires off RequestVote RPCs to every peer. Each peer that hasn't yet voted in this term grants the vote if the candidate's log is at least as up-to-date as its own (more on that in Part 04). A candidate that gathers a majority becomes leader and starts sending heartbeats at the heartbeat interval — typically 50 ms.
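The voter's side of RequestVote might look roughly like this. It is a sketch under simplifying assumptions: the `Node` class and its field names are hypothetical, log indexes are 1-based, nothing is persisted to disk, and the role change that normally accompanies a term bump is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)  # list of (term, command) tuples

    def last_log(self):
        """Return (last_term, last_index); an empty log is (0, 0)."""
        return (self.log[-1][0], len(self.log)) if self.log else (0, 0)

    def handle_request_vote(self, term, candidate_id, last_log_term, last_log_index):
        """Return (current_term, vote_granted) for an incoming RequestVote."""
        if term < self.current_term:
            return self.current_term, False   # stale candidate
        if term > self.current_term:
            self.current_term = term          # newer term: forget any old vote
            self.voted_for = None
        # Up-to-date check: compare last terms first, then log length.
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return self.current_term, True
        return self.current_term, False

voter = Node("n2", current_term=3, log=[(1, "x=1"), (3, "x=2")])
print(voter.handle_request_vote(4, "n1", 3, 2))  # (4, True)
print(voter.handle_request_vote(4, "n5", 3, 2))  # (4, False): already voted in term 4
```

A real implementation would write `current_term` and `voted_for` to stable storage before replying, so a restart cannot lead to a second vote in the same term.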

The randomization is the elegant part. If every follower had the same timeout, every follower would become a candidate simultaneously, split the vote, and no one would win. The next round would do the same thing. With each follower picking its own timeout uniformly from a 150 ms window, one follower almost always times out first, gets its election off the ground, and collects votes before others wake up. Ties do happen — when they do, the term ends with no leader, every node restarts its timer, and the next round picks a winner. Diego Ongaro's PhD thesis measured this: even with five nodes, elections almost always converge in one or two rounds.
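A rough simulation shows why the randomization helps. The model here is deliberately crude and is an assumption of this sketch, not something from the paper: four surviving followers draw timeouts from the 150 to 300 ms range, and an election counts as clean if the earliest timeout beats the next one by more than a vote round trip (5 ms, an arbitrary figure).

```python
import random

def first_timeout_margin(num_followers: int, low_ms: float, high_ms: float,
                         rng: random.Random) -> float:
    """Gap in ms between the earliest and second-earliest election timeout."""
    timeouts = sorted(rng.uniform(low_ms, high_ms) for _ in range(num_followers))
    return timeouts[1] - timeouts[0]

def clear_winner_rate(trials: int = 10_000, vote_round_trip_ms: float = 5.0,
                      seed: int = 1) -> float:
    """Fraction of trials where one follower leads the rest by more than a vote round trip."""
    rng = random.Random(seed)
    wins = sum(
        first_timeout_margin(4, 150.0, 300.0, rng) > vote_round_trip_ms
        for _ in range(trials)
    )
    return wins / trials

print(f"clear winner in {clear_winner_rate():.0%} of simulated elections")

# With identical timeouts the margin is always zero, which is the split-vote case.
print(first_timeout_margin(4, 200.0, 200.0, random.Random(0)))  # 0.0
```

A trial that fails this test does not necessarily split the vote in a real cluster, so the printed rate is a pessimistic estimate.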


Log replication: the leader is the funnel

Append, replicate, ack a majority, commit.

Once a leader is established, clients send commands to it. The leader appends the command to its local log as a new entry tagged with the current term and the next index, then sends AppendEntries RPCs to every follower with the new entry. Each follower runs a consistency check (Part 04), appends the entry, and replies with a success. Once the leader has acks from a majority — including its own copy — the entry is committed. The leader applies the command to its state machine and replies to the client. Subsequent AppendEntries RPCs carry the new commit index; each follower applies any entries up to that index to its own state machine.
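The leader's bookkeeping for that flow can be sketched as a toy class. Everything here is illustrative: the `Leader` class is hypothetical, RPCs are replaced by direct method calls, and the sketch only handles entries created in the leader's own term.

```python
class Leader:
    """Toy leader that tracks acks per log index; no networking, no persistence."""

    def __init__(self, node_id, peers, term):
        self.node_id = node_id
        self.cluster_size = len(peers) + 1
        self.term = term
        self.log = []           # entries are (term, command); index i lives at log[i - 1]
        self.acks = {}          # index -> set of node ids holding the entry
        self.commit_index = 0
        self.applied = []       # stand-in for the state machine

    def propose(self, command):
        """Append a client command locally and return its 1-based index."""
        self.log.append((self.term, command))
        index = len(self.log)
        self.acks[index] = {self.node_id}   # the leader's own copy counts
        return index

    def on_append_success(self, follower_id, index):
        """Record a follower's ack and commit once a majority holds the entry."""
        # A successful AppendEntries implies the follower holds every earlier entry too.
        for i in range(self.commit_index + 1, index + 1):
            self.acks[i].add(follower_id)
        while (self.commit_index < len(self.log)
               and len(self.acks[self.commit_index + 1]) > self.cluster_size // 2):
            self.commit_index += 1
            self.applied.append(self.log[self.commit_index - 1][1])

leader = Leader("n1", ["n2", "n3", "n4", "n5"], term=1)
idx = leader.propose("set x=1")
leader.on_append_success("n2", idx)
print(leader.commit_index)                  # 0: two of five copies is not a majority
leader.on_append_success("n3", idx)
print(leader.commit_index, leader.applied)  # 1 ['set x=1']
```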

The log is a flat sequence of (term, command) entries. Each entry is either committed (known by the leader to be stored on a majority, so it is safe to apply and will never be rolled back) or uncommitted (in the log but not yet on a majority, and liable to be rolled back if a new leader is elected without it). The leader handles all reads and writes; followers are passive in the basic protocol. This single-leader funnel is what keeps Raft sequential within a term: there is no concurrent proposing and no Multi-Paxos round structure to reason about.

Real-world latencies: a Raft commit inside a single data center, three nodes, SSD: 1–10 ms. Cross-AZ inside one region: 5–20 ms. Cross-region (US East to Europe): 80–200 ms. The cost is one round-trip from leader to majority. CockroachDB and TiKV pipeline AppendEntries to amortize the cost across many in-flight entries — the leader doesn't wait for the previous commit before sending the next batch, just like TCP doesn't wait for acks before sending the next window.
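The effect of pipelining is easy to see with back-of-the-envelope arithmetic. This sketch only computes an upper bound from the round-trip time; the function name and the window of 64 in-flight entries are arbitrary choices for illustration, and disk and CPU limits are ignored.

```python
def commits_per_second(round_trip_ms: float, entries_in_flight: int = 1) -> float:
    """Upper bound on commit rate when `entries_in_flight` entries share each round trip."""
    return entries_in_flight * 1000.0 / round_trip_ms

# Stop-and-wait: one entry per 10 ms round trip.
print(commits_per_second(10.0))       # 100.0
# Pipelined: 64 entries outstanding at once.
print(commits_per_second(10.0, 64))   # 6400.0
# Cross-region at an 80 ms round trip, no pipelining.
print(commits_per_second(80.0))       # 12.5
```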


Safety: the part that's actually hard

Five properties that make Raft correct under any failure sequence.

Raft's safety properties make it correct under any sequence of crashes, partitions, and recoveries. They are stated in the paper as five invariants, and each one is enforced by a specific mechanism:

Election safety
At most one leader per term. Guaranteed by majority voting and the rule that a node votes for at most one candidate per term — two candidates in the same term cannot both collect a majority from the same pool.
Leader append-only
A leader never overwrites or deletes entries in its own log. Followers can be forced to truncate, but the leader's log only grows.
Log matching
If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index. Maintained by the AppendEntries consistency check: the leader sends prevLogIndex and prevLogTerm; the follower rejects the RPC if its entry at that index has a different term, and the leader backs off and retries with an earlier index until they agree.
Leader completeness
If an entry is committed in a given term, that entry is present in every leader's log for every higher term. This is the trickiest. Raft enforces it through the up-to-date check during voting: a candidate's last log term must be ≥ the voter's last log term, and if equal, its log must be at least as long. A node with a shorter or older log cannot win an election, which means a newly-elected leader necessarily contains every entry that was previously committed.
State machine safety
If any node has applied an entry at log index N, no other node will ever apply a different entry at index N. Falls out of the previous four — same log entries get applied in the same order on every node.

The fifth property is what the user actually cares about. The other four are scaffolding to get there. Reading the safety proof in Ongaro's thesis is a worthwhile afternoon — it shows how each property is needed to prevent a specific failure mode that would otherwise let two state machines diverge.
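The log matching mechanism is the most mechanical of the five, so it is a good one to sketch. The two functions below are a simplified illustration with hypothetical names: logs are plain lists of (term, command) tuples with 1-based indexes, and the leader backs off one index per rejection, which is the basic strategy rather than any optimized one.

```python
def append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side AppendEntries. Returns True and mutates `log` on success,
    False if the consistency check fails."""
    if prev_log_index > len(log):
        return False                     # follower is missing the previous entry
    if prev_log_index > 0 and log[prev_log_index - 1][0] != prev_log_term:
        return False                     # same index, different term: logs diverge
    for offset, entry in enumerate(entries):
        pos = prev_log_index + offset    # 0-based slot for this entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]                # conflict: drop this entry and all after it
        if pos >= len(log):
            log.append(entry)
    return True

def replicate(leader_log, follower_log):
    """Leader-side retry loop: back off one index per rejection until the logs agree."""
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term, leader_log[prev_index:]):
            return
        next_index -= 1

leader_log = [(1, "a"), (1, "b"), (3, "c")]
follower_log = [(1, "a"), (2, "x"), (2, "y")]   # diverged after index 1
replicate(leader_log, follower_log)
print(follower_log == leader_log)  # True
```

The follower's uncommitted entries from term 2 are truncated, while the leader's own log is never modified, which is the leader append-only property in action.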

One subtle point: a leader cannot commit an entry from a previous term just because it's been replicated to a majority. The paper's Figure 8 walks through a scenario where doing that would allow a committed entry to be overwritten. The fix: a leader counts replicas only for entries from its own current term; once one of those is committed, every earlier entry in its log is committed along with it, by the log matching property. Most Raft implementations have a comment somewhere that points back at Figure 8.
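In code, the Figure 8 rule is a single term comparison inside the commit calculation. The function below is a sketch with hypothetical names, assuming 1-based indexes and a `match_index` map from follower id to the highest index known to be stored there; the example is loosely modelled on the Figure 8 situation rather than a step-by-step reproduction.

```python
def advance_commit_index(log, current_term, commit_index, match_index, cluster_size):
    """Return the leader's new commit index.

    log         -- list of (term, command); index i lives at log[i - 1]
    match_index -- follower id -> highest index known to be stored on that follower
    """
    for n in range(len(log), commit_index, -1):
        if log[n - 1][0] != current_term:
            # Terms never decrease along the log, so every lower index is older too.
            break
        copies = 1 + sum(1 for m in match_index.values() if m >= n)  # 1 = the leader
        if copies > cluster_size // 2:
            return n   # everything before n commits along with it
    return commit_index

# Leader in term 4 holds a term-2 entry at index 2 that sits on three of five nodes.
log = [(1, "a"), (2, "b")]
match = {"s2": 2, "s3": 2, "s4": 0, "s5": 0}
print(advance_commit_index(log, 4, 1, match, 5))   # 1: majority, but wrong term

# After a term-4 entry reaches a majority, indexes 2 and 3 commit together.
log.append((4, "c"))
match.update({"s2": 3, "s3": 3})
print(advance_commit_index(log, 4, 1, match, 5))   # 3
```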


Failures and corner cases: what actually goes wrong

Partitions, slow leaders, membership changes, log growth.

Network partition. The minority side can't elect a leader — no majority is reachable — so it can't commit any new entries. If the old leader is in the minority, it keeps sending AppendEntries that nobody acks, and writes stall. Whether the minority can serve reads depends on the read protocol: a stale follower can return stale data; a lease read on the old leader can return stale data until the lease expires; a ReadIndex protocol forces the leader to confirm leadership before reading, which prevents stale reads but costs a round-trip. The majority side keeps making progress as if nothing happened. When the partition heals, the minority's stale leader (if any) sees the higher term from the majority and steps down. Any uncommitted entries on the minority side get truncated.

Slow leader. Heartbeats arrive late or not at all; a follower times out and starts an election with a higher term; the new leader's first AppendEntries to the old leader carries that higher term; the old leader steps down. Raft tolerates this gracefully — the cost is one election cycle of unavailability, typically a few hundred milliseconds. Production systems tune the election timeout up (1 s in etcd, 1.5 s in Consul) to avoid spurious elections from GC pauses or transient network blips.
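The step-down rule applies to every RPC and every reply a node sees. A minimal sketch, assuming a hypothetical `TermState` record that holds just the three fields involved:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TermState:
    role: str = "follower"            # "follower", "candidate", or "leader"
    current_term: int = 0
    voted_for: Optional[str] = None

def observe_term(state: TermState, rpc_term: int) -> bool:
    """Apply the higher-term rule; return False if the message is stale and should be rejected."""
    if rpc_term > state.current_term:
        state.current_term = rpc_term
        state.voted_for = None
        state.role = "follower"       # a leader or candidate steps down immediately
    return rpc_term >= state.current_term

old_leader = TermState(role="leader", current_term=7, voted_for="n1")
print(observe_term(old_leader, 8), old_leader.role, old_leader.current_term)  # True follower 8
print(observe_term(old_leader, 7))  # False: term 7 is now stale
```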

Configuration changes. Adding or removing nodes is hard, because if nodes switch configurations at different moments you can end up with two disjoint majorities, one under the old config and one under the new. Two leaders could then be elected in the same term. The Raft paper introduces joint consensus: a transitional config where decisions require a majority in both the old and the new configuration, which closes the gap. Ongaro's thesis adds the simpler single-server-at-a-time variation that many implementations use: add or remove one node per change, in which case any old majority and any new majority provably overlap. etcd uses single-server; CockroachDB uses joint consensus.
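A small example makes the membership hazard concrete. The sketch below grows a three-node cluster to five in one step and shows two quorums that share no node; the helper names are made up for illustration.

```python
def is_quorum(voters: set, config: set) -> bool:
    """True if `voters` contains a strict majority of `config`."""
    return len(voters & config) > len(config) // 2

def is_joint_quorum(voters: set, old: set, new: set) -> bool:
    """Joint consensus: a decision needs a majority of the old AND the new config."""
    return is_quorum(voters, old) and is_quorum(voters, new)

old = {"a", "b", "c"}
new = {"a", "b", "c", "d", "e"}

# If some nodes still use `old` while others already use `new`,
# these two groups could each elect a leader in the same term.
side_1, side_2 = {"a", "b"}, {"c", "d", "e"}
print(is_quorum(side_1, old), is_quorum(side_2, new))   # True True
print(side_1 & side_2)                                   # set()

# Under joint consensus neither side is enough on its own.
print(is_joint_quorum(side_1, old, new), is_joint_quorum(side_2, old, new))  # False False
```

Adding one node at a time avoids the problem differently: going from three nodes to four, any two-node majority of the old config and any three-node majority of the new one must share a member.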

Log growth. A log that grew forever would make restarts and catching up new replicas impractically slow, and would eventually exhaust storage. Raft includes snapshotting: each node periodically writes its state machine to disk and truncates the log up to that point. New followers (or followers that fall too far behind) receive an InstallSnapshot RPC instead of an enormous batch of AppendEntries. etcd snapshots after a configurable number of applied entries (the snapshot-count setting; 10,000 by default in some releases, 100,000 in others); CockroachDB snapshots a range whenever it's replicated to a new replica.
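Log compaction mostly comes down to index bookkeeping. The class below is a toy sketch with hypothetical names, assuming a key-value state machine held in memory; a real implementation would write the snapshot to disk and only compact entries that are already applied.

```python
class CompactingLog:
    """Log with a snapshot prefix; indexes are 1-based and survive truncation."""

    def __init__(self):
        self.snapshot_index = 0     # last index covered by the snapshot
        self.snapshot_term = 0
        self.snapshot_state = {}    # state machine contents at snapshot_index
        self.entries = []           # (term, key, value) for indexes after the snapshot

    def append(self, term, key, value):
        """Append an entry and return its global index."""
        self.entries.append((term, key, value))
        return self.snapshot_index + len(self.entries)

    def compact(self, upto_index):
        """Fold entries up to `upto_index` into the snapshot and drop them from the log."""
        upto_index = min(upto_index, self.snapshot_index + len(self.entries))
        count = upto_index - self.snapshot_index
        if count <= 0:
            return
        for term, key, value in self.entries[:count]:
            self.snapshot_state[key] = value
            self.snapshot_term = term
        self.entries = self.entries[count:]
        self.snapshot_index = upto_index

    def needs_snapshot(self, follower_next_index):
        """True if the entry a follower needs next was already compacted away."""
        return follower_next_index <= self.snapshot_index

log = CompactingLog()
for i in range(1, 6):
    log.append(1, "x", i)
log.compact(3)
print(log.snapshot_index, log.snapshot_state, len(log.entries))  # 3 {'x': 3} 2
print(log.needs_snapshot(2), log.needs_snapshot(4))              # True False
```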


What Raft doesn't solve

Sharding, cross-shard transactions, read scaling, Byzantine faults.

Raft is consensus over a single log replicated to a fixed set of nodes. That bounds what it can do, and what's left for the surrounding system to solve.

Sharding. A single Raft group is bounded by what one leader can sustain — typically tens of thousands of writes per second on modern hardware, hard-capped by the leader's CPU and the AppendEntries fan-out. To scale beyond that you run many Raft groups in parallel. CockroachDB runs thousands of groups, one per ~64 MB range. TiKV does the same. Choosing how to partition data across groups is now the hard problem — Raft doesn't help.
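Routing a key to its Raft group is typically a sorted lookup over range boundaries. This sketch is not how CockroachDB or TiKV implement it; the `RangeRouter` class, the split keys, and the group names are all invented for illustration.

```python
import bisect

class RangeRouter:
    """Maps a key to the Raft group that owns its key range."""

    def __init__(self, split_keys, group_ids):
        # N split keys define N + 1 contiguous ranges, one Raft group each.
        if len(group_ids) != len(split_keys) + 1:
            raise ValueError("need exactly one more group than split keys")
        self.split_keys = sorted(split_keys)
        self.group_ids = list(group_ids)

    def group_for(self, key: str) -> str:
        # bisect_right places a key equal to a split key into the range that starts there.
        return self.group_ids[bisect.bisect_right(self.split_keys, key)]

router = RangeRouter(["g", "p"], ["group-1", "group-2", "group-3"])
print(router.group_for("apple"))  # group-1
print(router.group_for("g"))      # group-2
print(router.group_for("zebra"))  # group-3
```

The hard part the article refers to is everything this sketch leaves out: deciding where the split keys go, moving them as data grows, and keeping every client's copy of the routing table current.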

Cross-shard transactions. Two writes that need to atomically modify keys in different Raft groups need a coordination protocol layered on top, typically two-phase commit across the groups. Spanner does this over Paxos groups and uses TrueTime to assign commit timestamps that order transactions globally. Raft gives you consensus inside a group; it gives you nothing across groups.
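Two-phase commit across groups can be sketched in a few lines. In this toy version each `Participant` stands in for a whole Raft group, and the class and function names are hypothetical. In a real system every prepare, commit, and abort would itself be replicated through that group's log, and the coordinator's decision would need to survive a crash, none of which is modelled here. The sketch also assumes at most one write per participant per transaction.

```python
class Participant:
    """Stand-in for one Raft group taking part in a transaction."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}
        self.staged = {}    # txn_id -> (key, value) awaiting the decision

    def prepare(self, txn_id, key, value):
        if not self.healthy:
            return False    # vote no
        self.staged[txn_id] = (key, value)
        return True         # vote yes

    def commit(self, txn_id):
        key, value = self.staged.pop(txn_id)
        self.data[key] = value

    def abort(self, txn_id):
        self.staged.pop(txn_id, None)

def two_phase_commit(txn_id, writes):
    """writes: list of (participant, key, value). Returns True if the transaction committed."""
    prepared = []
    for participant, key, value in writes:      # phase 1: collect votes
        if not participant.prepare(txn_id, key, value):
            for p in prepared:
                p.abort(txn_id)
            return False
        prepared.append(participant)
    for p in prepared:                          # phase 2: everyone voted yes
        p.commit(txn_id)
    return True

a, b = Participant("group-a"), Participant("group-b")
print(two_phase_commit("t1", [(a, "x", 1), (b, "y", 2)]))  # True
b.healthy = False
print(two_phase_commit("t2", [(a, "x", 9), (b, "y", 9)]))  # False
print(a.data, a.staged)                                    # {'x': 1} {}
```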

Read scaling. By default, only the leader can serve linearizable reads. Followers learn what's committed but may lag behind. The fixes: follower reads (acceptable when the application tolerates bounded staleness), lease reads (leader holds a time-bounded lease and can serve reads without consulting followers, which is safe only while clock drift stays bounded), and ReadIndex (leader confirms it's still leader by sending a heartbeat round before serving the read, then waits until its applied index catches up). Each trades staleness risk against latency differently.
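The ReadIndex steps translate into a short function. This is a sketch only: the function signature and the callbacks are assumptions made for illustration, the heartbeat round is reduced to a count of acks, and the requirement that the leader has already committed an entry in its current term is left out.

```python
def read_index_read(commit_index, heartbeat_acks, cluster_size, apply_up_to, read_fn):
    """Serve one linearizable read on the leader, or return None if leadership is unconfirmed.

    heartbeat_acks -- followers that acked a heartbeat sent after the read arrived
    apply_up_to    -- callback that blocks until the state machine has applied an index
    read_fn        -- callback that reads from the state machine
    """
    read_index = commit_index                     # 1. remember the commit index
    if heartbeat_acks + 1 <= cluster_size // 2:   # 2. leader plus acks must be a majority
        return None                               #    possibly deposed: refuse the read
    apply_up_to(read_index)                       # 3. wait for the state machine to catch up
    return read_fn()                              # 4. read locally; nothing is appended to the log

state = {"x": 42}
waited_for = []
print(read_index_read(7, 2, 5, waited_for.append, lambda: state["x"]))  # 42
print(waited_for)                                                        # [7]
print(read_index_read(7, 1, 5, waited_for.append, lambda: state["x"]))  # None
```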

Byzantine failures. Raft assumes nodes fail by crashing: they can die, restart, or be slow, but they don't lie about their state. For Byzantine fault tolerance (nodes that might be malicious or corrupted), you need PBFT, Tendermint, HotStuff, or similar protocols. The blockchain world lives there. Raft does not.
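The difference in failure model shows up directly in cluster size. The arithmetic below uses the standard bounds: a majority-quorum protocol such as Raft needs 2f + 1 nodes to survive f crashed nodes, while the classic Byzantine bound used by PBFT is 3f + 1. The function names are made up for illustration.

```python
def crash_tolerant_cluster(f: int) -> int:
    """Nodes a majority-quorum protocol such as Raft needs to survive f crashed nodes."""
    return 2 * f + 1

def byzantine_tolerant_cluster(f: int) -> int:
    """Nodes the classic BFT bound (as in PBFT) requires to survive f arbitrary faults."""
    return 3 * f + 1

for f in (1, 2, 3):
    print(f, crash_tolerant_cluster(f), byzantine_tolerant_cluster(f))
# 1 3 4
# 2 5 7
# 3 7 10
```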

None of this diminishes Raft. It solves the consensus problem cleanly and teachably, which is the prerequisite. Everything else is layered on top — and most distributed systems engineers spend their time on the layers, not on Raft itself. That is, in a sense, the whole point.



A closing note

Lamport spent a career proving that consensus was possible. Liskov and her students at MIT built Viewstamped Replication (1988) on similar ideas at almost the same time, and it never broke out. Ongaro and Ousterhout proved that consensus was teachable. The difference between those two things has shaped a decade of infrastructure. Elect a leader. Replicate the log. Commit on majority. Step down on a higher term. That is enough.
