Leader election
Many distributed designs get simpler if exactly one process is the leader. Writes go through it in order, cron jobs run once, distributed locks have a clear owner. The hard part is picking exactly one, and only one, even when nodes crash and the network splits the cluster in two. The good algorithms are the ones that stay safe when the world stops cooperating.
The setup
Many distributed designs get a lot simpler if you can assume exactly one node is the leader. The leader orders writes into a single line. The leader hands out work so two workers don't grab the same task. The leader runs the nightly cron so the report doesn't get emailed twice. The leader holds the lock that guards a piece of shared state.
The hard word is "exactly". One leader is easy when nothing fails. Zero leaders is annoying but recoverable. Two leaders is a disaster. They each think they're in charge, each accept writes, and the system splits into incompatible histories that have to be reconciled later, often by hand. This is called split-brain, and most of the work on leader election is really work on preventing it.
The Bully algorithm
Garcia-Molina, 1982. Every node has a known unique ID, usually an integer, and the rule is dead simple: the highest-numbered live node is the leader. When a node suspects the leader is down (a missed heartbeat, a failed RPC), it starts an election by sending an ELECTION message to every node with a higher ID than itself.
If nobody higher answers within a timeout, the suspecting node declares itself leader and broadcasts a COORDINATOR message. If a higher node does answer, that node takes over the election and the original suspector backs off. The highest live node always wins because nobody above it will respond.
Bully is simple, terminates cleanly when failure detection is perfect, and is taught in every distributed-systems course. It is also fragile in practice. Failure detection isn't perfect, a slow node looks dead, and the protocol assumes the network lets every node reach every higher-numbered peer. Under a partition, both sides can run a Bully election and each elect a different leader.
The Ring algorithm
Chang and Roberts, 1979. Arrange the N nodes in a logical ring — each one knows its successor. To start an election, a node sends an ELECTION message containing its own ID around the ring. Each receiving node compares the carried ID with its own, replaces it if its own is higher, then forwards. When the message has gone all the way around, the surviving ID belongs to the elected leader, and a COORDINATOR message is passed around once more to announce the result.
Message complexity is O(N) per election in the best case and O(N²) when several nodes start at once. Latency is also O(N), since each step waits on the previous hop. For small clusters of half a dozen nodes that's a few milliseconds; for fifty nodes it's a noticeable pause. That's why the Ring algorithm shows up in textbooks more than in production.
Raft-style election
Raft (Ongaro & Ousterhout, 2014) replaced these textbook protocols in most modern
systems. Every node is a follower, candidate, or leader. Followers run a randomised
election timeout, typically picked uniformly from 150–300 ms. If a follower's timeout
fires without word from a leader, it bumps its term number,
becomes a candidate, votes for itself, and sends RequestVote RPCs to every peer.
A candidate that wins votes from a majority of the cluster (including its own)
becomes leader for that term.
Two safety rules close the door on split-brain. First, each node votes for at most one candidate per term. Second, a voter grants a vote only if the candidate's log is at least as up-to-date as its own. Combined with the majority-quorum requirement from consensus, these mean at most one leader can be elected per term, ever.
The randomised timeout is the clever bit. Without it, every node would time out at the same moment after a leader crash, vote-splits would be common, and the election would loop. With randomisation, one node almost always times out first and wins the term before anyone else even starts. etcd, Consul, CockroachDB, TiKV, MongoDB's replica set protocol v1, and Kafka's KRaft mode all use this pattern.
Bully and Ring failure modes
Under a network partition, Bully and Ring both happily elect a leader on each side. From inside either partition the algorithm looks like it finished correctly: the highest live node won, the COORDINATOR message reached everyone reachable, the invariant holds locally. The catch is that "everyone reachable" is now two separate sets, and each set has its own leader.
Raft prevents this by requiring a majority of the configured cluster, not a majority of the reachable nodes. A 5-node Raft cluster split 2/3 by a partition can't elect a leader on the 2-node side, because two votes fall short of the three-vote majority. The 3-node side can elect cleanly. When the partition heals, the 2-node side rejoins the existing term and catches up.
Lease-based election
A second family of approaches hands the hard part to an external lock service. Google's Chubby (Burrows, 2006) pioneered the pattern: candidates race to acquire a lease on a well-known key. The lease has a TTL, typically 10 to 60 seconds. The lease-holder is the leader. To stay leader it renews the lease before expiry; to step down (or if it crashes) the lease just expires and any other candidate can grab it.
ZooKeeper, etcd, and Consul all expose this primitive. Kubernetes' kube-controller-manager
and kube-scheduler use it through the coordination.k8s.io/Lease API,
with a 15-second default lease duration and 10-second renewal deadline. GitHub's
MySQL Orchestrator uses Raft for leadership coordination across its own nodes.
Cassandra's coordinator selection is gossip-based and not strictly an election, but
it relies on similar TTL semantics for failure detection.
The appeal of lease-based election is that the hard consensus problem gets solved once, inside the lock service, and a hundred different applications get to reuse it. The catch is that you've made the lock service a single point of failure for your leadership, which is why production lock services are themselves Raft or Paxos clusters.
Fencing
Election alone isn't enough. The classic failure: a leader holds a lease, gets paused by a stop-the-world GC for 20 seconds, the lease expires, a new leader is elected, then the old leader wakes up still sure it's the leader. It issues a write. Whatever it writes to gets corrupted, because the storage layer can't tell that this writer is no longer allowed.
Fencing tokens fix this. Every time a new leader is elected, the lock service hands out a token that only ever increases. The leader includes this token in every write it makes. The storage layer remembers the highest token it has ever seen and rejects any write carrying a smaller one. A paused old leader that wakes up holds an old token, so its write is rejected and correctness holds.
zxid, etcd's revision, and
Raft's term all play this role. See leases & fencing
for the full mechanism.Production patterns
etcd runs Raft internally and exposes its own leader to clients via the
etcdctl endpoint status command. ZooKeeper uses a leader-election
protocol called FastLeaderElection, a variant of Paxos with leader-based recovery.
Kafka's controller — the broker responsible for partition reassignment and broker
membership — is elected through a ZooKeeper znode in the classic deployment, and through
the built-in KRaft (Raft) protocol in newer versions.
Kubernetes' control-plane components — kube-controller-manager, kube-scheduler, the
cluster-autoscaler, and many CRD operators — all run with multiple replicas for HA,
but only one of each is active at a time. They coordinate through the
coordination.k8s.io/Lease API. The active replica updates the lease
object every few seconds; the standbys watch and take over when the lease goes
stale. Cassandra's "seed nodes" aren't elected at all. They're a hand-configured list
used to bootstrap gossip, a reminder that not every coordination role needs an election.
# Kubernetes lease object that kube-scheduler renews
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: kube-scheduler
namespace: kube-system
spec:
holderIdentity: kube-scheduler-master-1_a3f9... # leader's identity
leaseDurationSeconds: 15 # TTL
renewTime: "2026-05-19T14:32:08.123456Z" # last heartbeat
acquireTime: "2026-05-19T11:04:22.000000Z"
leaseTransitions: 7 # election count
# The active scheduler PATCHes renewTime every ~2s.
# Standbys watch the Lease; if renewTime + 15s passes, they race
# to PATCH holderIdentity to themselves and bump leaseTransitions.
# The apiserver's optimistic-concurrency check (resourceVersion)
# ensures only one PATCH wins per round — that one is the new leader.Cost
In steady state, election costs nothing. The existing leader keeps renewing its lease or sending heartbeats, and followers stay quiet. The cost shows up when the leader dies or a partition hides it. Raft's default heartbeat interval is around 50–150 ms; the election timeout is randomised between 150 ms and 300 ms; an election itself completes in another round-trip. End to end, recovery from a clean leader crash is usually 200–500 ms.
Partitions are worse. A new leader can't be elected until enough followers have given up on the old one, which means the election timeout has to fire. During a flaky partition, intermittent connectivity can keep resetting it. Real-world failover times of 5–30 seconds are common in lease-based systems with 15–30 second TTLs. For systems where leader recovery is the main outage window (synchronous replication, distributed locks, exactly-once schedulers) the lease TTL is the single knob that trades availability for split-brain safety.
Real failures
GitHub's October 2018 outage is the canonical lesson. A 43-second network partition between data centres caused MySQL Orchestrator to fail over to a write-master in a different region; when the network returned, both sides had accepted writes for half a minute, and the resulting inconsistency took 21 hours to repair by hand. See the GitHub 2018 postmortem for the full sequence. It remains one of the clearest published examples of why fencing and majority-quorum election matter together.
Pre-fencing ZooKeeper deployments produced a long quiet tail of similar incidents: a worker would acquire a ZooKeeper lock, get GC-paused or VM-paused for longer than the session timeout, lose the session, a new worker would acquire the lock, and both would proceed to write. Kleppmann's "How to do distributed locking" essay walks through one such case in HBase and is the standard reference for why locks without fencing tokens are not safe locks. They're safe most of the time, and that is not the same thing.
Election strategies compared
| Strategy | Messages | Failure mode | Partition behaviour |
|---|---|---|---|
| Bully (Garcia-Molina 1982) | O(N²) worst case | Fragile under imperfect failure detection | Split-brain — each partition elects its own leader |
| Ring (Chang & Roberts 1979) | O(N) best, O(N²) concurrent | Slow; breaks if ring is disrupted | Split-brain on partition; ring fragments |
| Raft / Paxos majority | O(N) per election | No leader if no majority reachable | Safe — only majority side elects; minority side blocks |
| Lease-based (Chubby, ZK, etcd) | O(1) per renewal, O(N) inside lock service | Lease expiry hands election to anyone | Safe only if lock service has majority; needs fencing tokens |
Further reading
- Ongaro & Ousterhout — In Search of an Understandable Consensus Algorithm (Raft) — the 2014 paper that made consensus and leader election teachable. Section 5.2 covers election in detail.
- Burrows — The Chubby Lock Service for Loosely-Coupled Distributed Systems — Google's 2006 paper introducing lease-based election as a reusable building block.
- Kleppmann — How to do distributed locking — the essay that made fencing tokens part of the standard vocabulary. Argues that locks without fencing aren't locks.
- Codex — Leases & fencing — the companion deep dive on time-bounded locks and the fencing-token mechanism in detail.
- Codex — Consensus — Paxos and Raft from the agreement-protocol side; leader election is a special case of consensus on a single value.
- Postmortem — GitHub 2018 21h outage — the 43-second partition that caused 21 hours of inconsistency, and the lessons about fencing and quorum-aware failover.