10 / 20

Topics / 10

Leader election

Many distributed designs get simpler if exactly one process is the leader. Writes go through it in order, cron jobs run once, distributed locks have a clear owner. The hard part is picking exactly one, and only one, even when nodes crash and the network splits the cluster in two. The good algorithms are the ones that stay safe when the world stops cooperating.

The setup

Many distributed designs get a lot simpler if you can assume exactly one node is the leader. The leader orders writes into a single line. The leader hands out work so two workers don't grab the same task. The leader runs the nightly cron so the report doesn't get emailed twice. The leader holds the lock that guards a piece of shared state.

The hard word is "exactly". One leader is easy when nothing fails. Zero leaders is annoying but recoverable. Two leaders is a disaster. They each think they're in charge, each accept writes, and the system splits into incompatible histories that have to be reconciled later, often by hand. This is called split-brain, and most of the work on leader election is really work on preventing it.

Why this matters. A system that picks one leader cleanly can act like a single machine for the write path. A system that sometimes picks two leaders is worse than one with no leader at all. Silent inconsistency is harder to spot and harder to fix than plain downtime.

The Bully algorithm

Garcia-Molina, 1982. Every node has a known unique ID, usually an integer, and the rule is dead simple: the highest-numbered live node is the leader. When a node suspects the leader is down (a missed heartbeat, a failed RPC), it starts an election by sending an ELECTION message to every node with a higher ID than itself.

If nobody higher answers within a timeout, the suspecting node declares itself leader and broadcasts a COORDINATOR message. If a higher node does answer, that node takes over the election and the original suspector backs off. The highest live node always wins because nobody above it will respond.

Bully is simple, terminates cleanly when failure detection is perfect, and is taught in every distributed-systems course. It is also fragile in practice. Failure detection isn't perfect, a slow node looks dead, and the protocol assumes the network lets every node reach every higher-numbered peer. Under a partition, both sides can run a Bully election and each elect a different leader.

The Ring algorithm

Chang and Roberts, 1979. Arrange the N nodes in a logical ring — each one knows its successor. To start an election, a node sends an ELECTION message containing its own ID around the ring. Each receiving node compares the carried ID with its own, replaces it if its own is higher, then forwards. When the message has gone all the way around, the surviving ID belongs to the elected leader, and a COORDINATOR message is passed around once more to announce the result.

Message complexity is O(N) per election in the best case and O(N²) when several nodes start at once. Latency is also O(N), since each step waits on the previous hop. For small clusters of half a dozen nodes that's a few milliseconds; for fifty nodes it's a noticeable pause. That's why the Ring algorithm shows up in textbooks more than in production.

Raft-style election

Raft (Ongaro & Ousterhout, 2014) replaced these textbook protocols in most modern systems. Every node is a follower, candidate, or leader. Followers run a randomised election timeout, typically picked uniformly from 150–300 ms. If a follower's timeout fires without word from a leader, it bumps its term number, becomes a candidate, votes for itself, and sends RequestVote RPCs to every peer. A candidate that wins votes from a majority of the cluster (including its own) becomes leader for that term.

Two safety rules close the door on split-brain. First, each node votes for at most one candidate per term. Second, a voter grants a vote only if the candidate's log is at least as up-to-date as its own. Combined with the majority-quorum requirement from consensus, these mean at most one leader can be elected per term, ever.

The randomised timeout is the clever bit. Without it, every node would time out at the same moment after a leader crash, vote-splits would be common, and the election would loop. With randomisation, one node almost always times out first and wins the term before anyone else even starts. etcd, Consul, CockroachDB, TiKV, MongoDB's replica set protocol v1, and Kafka's KRaft mode all use this pattern.

Bully and Ring failure modes

Under a network partition, Bully and Ring both happily elect a leader on each side. From inside either partition the algorithm looks like it finished correctly: the highest live node won, the COORDINATOR message reached everyone reachable, the invariant holds locally. The catch is that "everyone reachable" is now two separate sets, and each set has its own leader.

Raft prevents this by requiring a majority of the configured cluster, not a majority of the reachable nodes. A 5-node Raft cluster split 2/3 by a partition can't elect a leader on the 2-node side, because two votes fall short of the three-vote majority. The 3-node side can elect cleanly. When the partition heals, the 2-node side rejoins the existing term and catches up.

Why Raft can't split-brain under partitions. Each term has at most one leader because election needs a majority, and any two majorities of N share at least one node. That shared node votes only once per term, so two candidates in the same term can't both win. Across terms, the higher term always wins. A partitioned old leader steps down the moment it sees a heartbeat from a higher term.

Lease-based election

A second family of approaches hands the hard part to an external lock service. Google's Chubby (Burrows, 2006) pioneered the pattern: candidates race to acquire a lease on a well-known key. The lease has a TTL, typically 10 to 60 seconds. The lease-holder is the leader. To stay leader it renews the lease before expiry; to step down (or if it crashes) the lease just expires and any other candidate can grab it.

ZooKeeper, etcd, and Consul all expose this primitive. Kubernetes' kube-controller-manager and kube-scheduler use it through the coordination.k8s.io/Lease API, with a 15-second default lease duration and 10-second renewal deadline. GitHub's MySQL Orchestrator uses Raft for leadership coordination across its own nodes. Cassandra's coordinator selection is gossip-based and not strictly an election, but it relies on similar TTL semantics for failure detection.

The appeal of lease-based election is that the hard consensus problem gets solved once, inside the lock service, and a hundred different applications get to reuse it. The catch is that you've made the lock service a single point of failure for your leadership, which is why production lock services are themselves Raft or Paxos clusters.

Fencing

Election alone isn't enough. The classic failure: a leader holds a lease, gets paused by a stop-the-world GC for 20 seconds, the lease expires, a new leader is elected, then the old leader wakes up still sure it's the leader. It issues a write. Whatever it writes to gets corrupted, because the storage layer can't tell that this writer is no longer allowed.

Fencing tokens fix this. Every time a new leader is elected, the lock service hands out a token that only ever increases. The leader includes this token in every write it makes. The storage layer remembers the highest token it has ever seen and rejects any write carrying a smaller one. A paused old leader that wakes up holds an old token, so its write is rejected and correctness holds.

The fencing-token rule. An elected leader is not safe to act until every resource it writes to enforces the token. "I am the leader" without "and the storage rejects writes from earlier leaders" is just wishful thinking. Chubby's sequence numbers, ZooKeeper's zxid, etcd's revision, and Raft's term all play this role. See leases & fencing for the full mechanism.

Production patterns

etcd runs Raft internally and exposes its own leader to clients via the etcdctl endpoint status command. ZooKeeper uses a leader-election protocol called FastLeaderElection, a variant of Paxos with leader-based recovery. Kafka's controller — the broker responsible for partition reassignment and broker membership — is elected through a ZooKeeper znode in the classic deployment, and through the built-in KRaft (Raft) protocol in newer versions.

Kubernetes' control-plane components — kube-controller-manager, kube-scheduler, the cluster-autoscaler, and many CRD operators — all run with multiple replicas for HA, but only one of each is active at a time. They coordinate through the coordination.k8s.io/Lease API. The active replica updates the lease object every few seconds; the standbys watch and take over when the lease goes stale. Cassandra's "seed nodes" aren't elected at all. They're a hand-configured list used to bootstrap gossip, a reminder that not every coordination role needs an election.

# Kubernetes lease object that kube-scheduler renews
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  holderIdentity: kube-scheduler-master-1_a3f9...    # leader's identity
  leaseDurationSeconds: 15                            # TTL
  renewTime: "2026-05-19T14:32:08.123456Z"            # last heartbeat
  acquireTime: "2026-05-19T11:04:22.000000Z"
  leaseTransitions: 7                                 # election count

# The active scheduler PATCHes renewTime every ~2s.
# Standbys watch the Lease; if renewTime + 15s passes, they race
# to PATCH holderIdentity to themselves and bump leaseTransitions.
# The apiserver's optimistic-concurrency check (resourceVersion)
# ensures only one PATCH wins per round — that one is the new leader.

Cost

In steady state, election costs nothing. The existing leader keeps renewing its lease or sending heartbeats, and followers stay quiet. The cost shows up when the leader dies or a partition hides it. Raft's default heartbeat interval is around 50–150 ms; the election timeout is randomised between 150 ms and 300 ms; an election itself completes in another round-trip. End to end, recovery from a clean leader crash is usually 200–500 ms.

Partitions are worse. A new leader can't be elected until enough followers have given up on the old one, which means the election timeout has to fire. During a flaky partition, intermittent connectivity can keep resetting it. Real-world failover times of 5–30 seconds are common in lease-based systems with 15–30 second TTLs. For systems where leader recovery is the main outage window (synchronous replication, distributed locks, exactly-once schedulers) the lease TTL is the single knob that trades availability for split-brain safety.

Real failures

GitHub's October 2018 outage is the canonical lesson. A 43-second network partition between data centres caused MySQL Orchestrator to fail over to a write-master in a different region; when the network returned, both sides had accepted writes for half a minute, and the resulting inconsistency took 21 hours to repair by hand. See the GitHub 2018 postmortem for the full sequence. It remains one of the clearest published examples of why fencing and majority-quorum election matter together.

Pre-fencing ZooKeeper deployments produced a long quiet tail of similar incidents: a worker would acquire a ZooKeeper lock, get GC-paused or VM-paused for longer than the session timeout, lose the session, a new worker would acquire the lock, and both would proceed to write. Kleppmann's "How to do distributed locking" essay walks through one such case in HBase and is the standard reference for why locks without fencing tokens are not safe locks. They're safe most of the time, and that is not the same thing.

Election strategies compared

Strategy	Messages	Failure mode	Partition behaviour
Bully (Garcia-Molina 1982)	O(N²) worst case	Fragile under imperfect failure detection	Split-brain — each partition elects its own leader
Ring (Chang & Roberts 1979)	O(N) best, O(N²) concurrent	Slow; breaks if ring is disrupted	Split-brain on partition; ring fragments
Raft / Paxos majority	O(N) per election	No leader if no majority reachable	Safe — only majority side elects; minority side blocks
Lease-based (Chubby, ZK, etcd)	O(1) per renewal, O(N) inside lock service	Lease expiry hands election to anyone	Safe only if lock service has majority; needs fencing tokens

Leader election

The setup

The Bully algorithm

The Ring algorithm

Raft-style election

Bully and Ring failure modes

Lease-based election

Fencing

Production patterns

Cost

Real failures

Election strategies compared

Further reading

11 — Failure detectors