09 / 20

Topics / 09

Leases & fencing

A distributed mutex is harder than it looks. A client can grab a lock, get descheduled for 30 seconds by a GC pause or a swap storm, and wake up still believing it holds the lock, while a second client has already taken over. Leases bound the damage window; fencing tokens make sure the late writer's write is rejected outright.

Why distributed mutex is broken without expiry

The naive picture: client A calls lock("widget-42"), gets back OK, does some work, calls unlock("widget-42"). The lock-server keeps a table of who holds what. This works fine right up until A pauses.

Pauses happen for boring reasons. A JVM stop-the-world GC on a 30 GB heap can take tens of seconds. A page fault on a swapped-out process can stall a thread for a minute. A noisy neighbour saturating local disk can freeze an fsync for seconds. A container hitting CPU throttling. A virtual machine being live-migrated. The application code can't detect any of these.

So A pauses for 30 seconds while holding the lock. The lock-server, with no expiry, waits forever. Or, worse, the lock-server does have an expiry but A has no way to know it expired. A wakes up, sends a write to the database believing it still holds the lock, and meanwhile B has acquired the same lock and is writing too. Two writers, one resource, data corruption.

The lease primitive

A lease is a lock with a deadline. The lock-server hands out a lease of duration D. It's yours until time now + D, and after that the server treats it as expired and free for someone else. The client knows D up front, and has to either finish its work before D elapses or renew before D elapses.

This one change bounds the damage window. If A pauses for longer than D, A's lease expires and B can safely take it. The new contract is: if you've been paused longer than D, assume your lease is gone. That's a checkable condition. The client compares its monotonic-clock reading at lease grant to its reading now.

Picking D is a trade-off. Shorter D means a smaller damage window if the lease-holder dies (failover happens sooner) but more renewal traffic and tighter assumptions about pause length. Longer D is cheaper but lets a dead leader's lease linger. Production systems pick 5–30 seconds for batch workloads and sub-second leases for latency-critical leader election.

Fencing tokens — Kleppmann's fix

Martin Kleppmann's 2016 post How to Do Distributed Locking made the argument that became the consensus position: a lease alone is not enough. A paused client that wakes up after expiry doesn't know it was paused. It happily sends its write to the database, which can't tell it apart from a current lease-holder's write.

The fix is a fencing token: every lease grant includes an integer that only ever increases. The protected resource (database, file system, message broker) remembers the highest token it has ever seen. Any write that arrives with a smaller token is rejected. The stale client's write is dropped at the storage layer, no matter what the lock-server thinks.

The fencing-token rejection example. Client A acquires the lease and gets token 31. A pauses. Client B acquires the lease and gets token 32. B writes to storage; storage now remembers 32. A wakes up and writes with token 31; storage sees 31 < 32 and rejects the write. The pause was harmless. Without the token, A's write would have silently overwritten B's.

The key property the token relies on: monotonicity. The lock-server must guarantee that no two clients ever receive the same token, and that tokens strictly increase in grant order. Any consensus-backed system gets this for free, since the token is just the log index.

The Redlock argument

Redis ships a distributed-lock algorithm called Redlock that acquires a quorum of single-node locks across N independent Redis nodes. Antirez (Redis's author) argued it was safe under reasonable assumptions; Kleppmann argued it wasn't. Their exchange in 2016 became the canonical reference on why distributed locks are subtle.

The argument, compressed. Redlock leans on bounded clock drift and bounded GC pauses to keep its safety properties. Kleppmann's response: those bounds don't hold in practice. A long GC pause or a clock jump breaks the algorithm, and crucially, Redlock returns no fencing token, so the storage layer can't reject a stale write. Antirez's response: with monotonic clocks and reasonable operational discipline the assumptions are fine for most use cases. The community settled on: if you need correctness, use a consensus-backed lock service that returns a fencing token (etcd, ZK, Chubby) and have your storage layer check it.

Real implementations

Most production lock services are Paxos- or Raft-backed and expose a lease API of some flavour:

System	Consensus	Lease TTL	Fencing tokens
Chubby (Google)	Paxos	~12 s, KeepAlive every ~5 s	yes (sequence number per acquire)
ZooKeeper	Zab	Session timeout (often 10–40 s)	yes (sequential znode counter)
etcd	Raft	Explicit Lease API, any TTL	yes (MVCC revision)
Consul	Raft	Session TTL, 10–86400 s	manual (LockIndex; user must enforce)
Redlock	none (multi-node Redis quorum)	typically 10–30 s	no

Chubby's design is the ancestor: a Paxos-replicated lock service used by GFS, Bigtable, MapReduce and a long list of Google internals. Clients hold leases of about 12 seconds and send KeepAlive RPCs every ~5 seconds. If a KeepAlive doesn't reach the Chubby master, the client enters a 45-second "grace period" before it gives up and treats the lease as lost.

ZooKeeper models leases with ephemeral nodes, session-tied znodes that disappear when the session dies. A lock is a sequential ephemeral znode; the smallest sequence number wins, and that sequence number doubles as the fencing token. etcd is structurally similar but exposes leases as a first-class object: LeaseGrant returns a lease ID, you attach keys to it, and when the lease expires the keys disappear. The MVCC revision number that comes with every key write serves as the fencing token. Consul exposes sessions and locks, but the LockIndex it returns has to be threaded through by the application. There's no automatic fencing at the storage layer.

Spanner is the odd one out. Its leader leases are bounded by TrueTime uncertainty, usually 1–7 ms, so a Paxos leader's lease window lines up with the interval of clock uncertainty. The bound is tight because TrueTime is, and the lease ends precisely when the leader can no longer prove it's still inside the lease window.

Lease APIs in two systems

etcd's lease API is the most explicit. You grant a lease, attach keys, and renew on a timer:

# grant a 10-second lease, get back a lease ID
$ etcdctl lease grant 10
lease 694d71ddacfda227 granted with TTL(10s)

# put a key tied to the lease
$ etcdctl put /lock/widget-42 "owner-A" --lease=694d71ddacfda227
OK

# renew the lease before TTL expires
$ etcdctl lease keep-alive 694d71ddacfda227
lease 694d71ddacfda227 keepalived with TTL(10)

# the MVCC revision returned with each put is the fencing token
$ etcdctl put /lock/widget-42 "owner-A" --lease=694d71ddacfda227 -w json
{"header":{"revision":42,...}}

ZooKeeper does it with ephemeral sequential znodes. The sequence number is the fencing token; the smallest one alive wins:

# create an ephemeral sequential znode under /lock/widget-42
[zk] create -e -s /lock/widget-42/req- "owner-A"
Created /lock/widget-42/req-0000000031

# list children, find the smallest sequence number
[zk] ls /lock/widget-42
[req-0000000031]

# session dies (client crash, network partition > timeout) -> znode auto-removed
# next client to create gets req-0000000032, which is its fencing token

Renewal patterns

Once you have leases, you need to keep them alive without flooding the lock-server. Three patterns show up in practice.

Continuous keep-alive. The client sends a renewal RPC every D/3 or so. Easy to reason about; it survives a single missed renewal because two more still fit in the lease window. This is what Chubby and etcd default to.
Batch renewal. One agent (a sidecar or a daemon) renews many leases on behalf of many local clients in a single RPC. Useful when a single host runs hundreds of workers each holding a lease. It collapses N renewal flows into one.
Adaptive renewal. Extend lease duration when load is low, shorten when load is high. Trades a bigger damage window for less renewal traffic during quiet periods. Less common; the constants are tricky to tune.

Leader election is just a lease

Most distributed leader elections are just a single lease. The candidate that wins the lease is leader for the life of the lease; renewals are heartbeats; lease expiry triggers re-election. etcd's concurrency.Election, ZooKeeper's recipe-style leader election, and Consul's session-based leader-locks all reduce to "one client at a time holds a particular lease key".

This is why the consensus protocols in topic 02 — consensus matter here: the lock-server itself has to be a replicated state machine to guarantee that leases are unique and tokens stay monotonic across lock-server failovers. A single-node lock-server isn't safe; a Paxos- or Raft-backed one is. The lease layer sits on top of the consensus layer.

Failure modes worth knowing

Clock drift between client and lock-server. If the client uses its wall clock to decide "is my lease still valid", an NTP step or a VM live-migration that jumps the clock can fool it. Use a monotonic clock (CLOCK_MONOTONIC on Linux, System.nanoTime() on JVM, time.Now() in Go for relative comparison) for lease-validity checks never the wall clock.
Partition between client and lock-server. The client can't reach the server to renew, but doesn't know whether the server has timed it out yet. The defensive move is: if a renewal RPC times out, assume the lease is lost, stop touching the protected resource, and re-acquire. Without this, the client will happily keep writing past lease expiry, and only the fencing token will save the storage layer.
Lock-server failover. When the lock-server primary dies, a new primary is elected. As long as the lock-server is Paxos/Raft-replicated, only one primary exists at a time and the lease table is consistent across the failover. If you're running an unreplicated lock-server, a failover can silently hand the same lease to two clients.
The grace period. Chubby's 45-second grace period is a useful idea: when the client loses contact with the lock-server it doesn't immediately drop the lease, because the lease may still be valid server-side and just unreachable. It waits long enough for the server to recover or to plausibly have failed over, then declares the lease lost. This trades latency for fewer false alarms about lost leases.

The production rule

Never use a distributed lock without fencing tokens, and always assume processes can pause arbitrarily long. The lock-server's view of who-holds-what is advisory; the only authoritative answer is the storage layer's check of the fencing token at write time.

In practice that means two things. First, pick a lock service that returns a monotonic token: etcd's MVCC revision, ZooKeeper's sequence number, Chubby's sequence number. Second, plumb that token through to the resource you're protecting and have the resource compare-and-set on it. If you can't add a check at the storage layer, you don't have a safe distributed lock, you have a hint.

Leases & fencing

Why distributed mutex is broken without expiry

The lease primitive

Fencing tokens — Kleppmann's fix

The Redlock argument

Real implementations

Lease APIs in two systems

Renewal patterns

Leader election is just a lease

Failure modes worth knowing

The production rule

Further reading

10 — Leader election