Distributed Lock Simulator: why a lease is never just a lock.

Hold a regular Mutex and you have it until you release it. Hold a distributed lock and you have it until a timer on a different machine decides you don’t. A GC pause, a swap, a partition — any of these splits the lock in two while you keep writing. Pause client A. Watch B acquire. Then watch A write anyway.

holder
token
rejected
0

backend: fencing: ttl:
etcd token = mod_revision · lease remaining —s

Grant a lease, attach a key, heartbeat it. The key’s mod_revision is your fencing token. Raft-replicated, so the service itself is consistent.

holdernone
clockt=0.0s
A idle no token
idle
B idle no token
idle
C idle no token
idle
D idle no token
idle
resource: shared-counter highest accepted token = 0 · accepted 0 · rejected 0
— start the clock, then click acquire on a client —

What you're looking at

Four clients, one lock service, and one shared resource that counts writes. Start the clock and click acquire on a client: the service hands it a lease and a monotonic token, and its card turns green. Each client can write to the resource, release, take a GC pause, or be partitioned from the service. The resource tracks the highest token it has accepted, and with fencing on it rejects any write carrying a lower one. The backend toggle swaps the lease flavour (Redlock, ZooKeeper, etcd); the TTL sets how long a lease lives before the service revokes it.

The scenario to run first: fencing off, TTL 5s. A acquires, then hit GC 8s on A. At five seconds the lease expires and B acquires with token 2. A wakes at eight seconds still believing it holds the lock and writes, and the resource takes it. Two holders, one corrupted counter. That is the bug the whole page is about. Now flip fencing on and repeat: A's late write carries the stale token 1, the resource is already at 2, and the write is rejected. The lesson is that no lease or TTL closes the gap between the service revoking a lock and the client noticing; only a token checked at the resource does.


A “distributed lock” is a contract, not a lock

You don’t hold a lock. A service somewhere believes you do, until a timer fires.

When you hold a regular in-process Mutex, you have it until you release it. There is no third party with a contradictory opinion; the CPU’s atomic CAS is the ground truth. When you hold a distributed lock, you have it until the lock service decides you don’t — and the lock service makes that decision based on a timer, not on whether you’re actually still alive. A 30-second JVM GC pause, a swap-in delay on a memory-starved host, a process suspended by the OS scheduler under heavy load, a tcp retransmit timeout, a VM live-migrated to a different rack — any of these makes the service “decide” you’re dead while you’re still alive and still about to write to the protected resource.

Kyle Kingsbury (Jepsen / Aphyr) made a career out of demonstrating this against every major distributed lock implementation. Martin Kleppmann’s 2016 essay on Redlock, written while writing Designing Data-Intensive Applications, formalised the critique: there is no algorithm involving a network and a clock that can guarantee “only one client holds the lock at a time” in a way the clients themselves can detect. The lock service can serialise its own view of who holds the lease; it cannot serialise your writes. That is a different problem, and it has a different solution.


Three real implementations, three different fictions

Redlock, ZooKeeper, etcd. They give you something to coordinate around. They do not give you a lock.

Redis Redlock, proposed by Antirez (Salvatore Sanfilippo) in 2014, acquires a lease on a majority of N independent Redis instances within a clock-bounded time window. Typical acquisition: 1–10 ms on a same-region cluster. The lease lives in each Redis as a key with PX TTL. Kleppmann’s critique is that the proof of correctness leans on assumptions about clock drift and stop-the-world pauses that aren’t actually guaranteed on commodity hardware. Antirez published a response. The argument is unresolved in theory; in practice, every Redis-backed lock library ships with at least one open issue about lost writes under load.

ZooKeeper, descended from Chubby (Google, 2006), takes a different route. You create /locks/foo as an ephemeral sequential znode: ephemeral because it’s tied to your session, so if your session times out the znode evaporates and every watcher fires; sequential because the create returns a monotonically-increasing number. That sequence number doubles as a fencing token. Typical acquisition: 50–200 ms because of the ZAB write to a majority of ensemble nodes. The session-timeout problem is the same as Redlock’s lease problem — a stop-the-world pause can outlast your session, and your old znode is already gone by the time you wake up.

etcd, the Raft-backed key-value store at the heart of Kubernetes, gives you a lease: you ask for one, you attach a key to it, and you call KeepAlive in the background to extend it. The key has a mod_revision that increases on every modification; that revision is what you fence with. Typical acquisition: 10–50 ms. The Raft layer makes the service’s own view consistent; the lease-vs-keepalive race makes the client’s view of its own holding status optimistic. All three of these give you something useful. None of them prevent the gap between “the service thinks the lease expired” and “your process notices.”


Fencing tokens close the gap

This is computer science, not magic. The token is the proof.

The pattern is simple to state and surprisingly easy to skip. The lock service returns a monotonically-increasing token on each acquire. Every write to the protected resource carries the token. The resource — a database with a version column, an object store with conditional-put, a sharded cache with CAS, a filesystem with O_EXCL temp-and-rename — accepts a write only if the token is the highest token it has ever accepted. If you got fenced (lost the lock without noticing), your token is now stale and your write is rejected, not silently applied. The corruption that would have happened becomes an error you can retry from.

ZooKeeper’s zxid serves as a fencing token. etcd’s mod_revision serves as a fencing token. A Postgres version column updated with UPDATE ... WHERE id = ? AND version = ? serves as a fencing token. Spanner’s TrueTime intervals, which bound clock uncertainty rather than pretending it doesn’t exist, can serve as a fencing token. The lock service’s job is to issue the token. The resource’s job is to refuse stale ones. Neither side can do the other’s job, and the systems that try to skip the resource-side check are the ones that hit production incidents at 3 AM.

The deepest reason this works and timer-based locks don’t: a fencing token converts the lock-holding question from a temporal one (“is my lease still valid?”) into a causal one (“was my write linearizable with respect to the resource’s history?”). Clocks lie. Networks pause. Causality — encoded as a monotonic counter on a single resource — doesn’t.


When you don’t need a distributed lock at all

Most “I need a lock” problems are actually idempotency or row-level locking problems.

The cases where a distributed lock is the right tool are narrow: leader election (and even then, use Raft/etcd/ZK directly rather than a third-party “lock library”), one-time singleton jobs across a fleet, and mutual exclusion over a resource that doesn’t support optimistic concurrency. For the “only one worker processes this message” pattern that drives most lock-shaped questions in code review, use an idempotency key: the work is designed to be safe to do twice, both workers do it, the second is a no-op. For “only one writer per record,” use database row-level locks (SELECT FOR UPDATE) inside a transaction, or optimistic version checks (UPDATE ... WHERE version = N). Both are simpler, both are safer, and neither requires the fencing-token discipline because the database is the resource — the lock and the protected state are colocated, which removes the class of bug this whole page is about.

The honest summary, four parts in: if you can colocate the lock with the data it protects, do that. If you can’t, use a real consensus system to issue a fencing token, and have the resource enforce it. If you can’t do either, you are negotiating with a timer on another machine and hoping no one ever runs a long GC.

Found this useful?