FLP, the impossibility result.
Fischer, Lynch, and Paterson proved in 1985 that no deterministic consensus protocol can guarantee termination in an asynchronous network with even one faulty node. It's the boundary every real consensus algorithm has to negotiate — and explains why every production system relies on timeouts, randomness, or partial synchrony.
TL;DR
In a purely asynchronous message-passing system — no clocks, no bounds on message delay — no deterministic protocol can solve consensus if even a single process is allowed to crash. There always exists an execution in which the algorithm runs forever without deciding.
The proof is short (about ten pages) and constructive: it builds an adversary scheduler that keeps the system stuck in a configuration where the outcome is undetermined. The result is not "consensus is impossible". It is "deterministic termination in pure asynchrony with crash failure is impossible". Every real consensus algorithm — Paxos, Raft, PBFT, Tendermint, Nakamoto — works by weakening one of those three words.
The problem
The setting is an asynchronous distributed system. n processes communicate by sending messages over a reliable channel. Messages eventually arrive, but with no upper bound on delay. There are no clocks, no timers, no way to tell elapsed time. Processes are deterministic — given the same state and the same incoming message, they make the same step.
Each process starts with an input bit (0 or 1). At most one process may crash — it stops sending and receiving for the rest of the execution, with no warning. A consensus protocol must satisfy three properties:
- Agreement. No two non-faulty processes decide differently.
- Validity. If a process decides v, then v was the input of some process. Rules out the trivial "always decide 0" protocol.
- Termination. Every non-faulty process eventually decides.
The question is whether a deterministic protocol exists that satisfies all three. FLP says: no. You can have agreement and validity, but in the worst case you cannot guarantee termination.
The key idea
The proof centres on a notion called bivalence. A configuration of the system (the joint state of every process plus all in-flight messages) is:
- 0-valent if every execution from it eventually decides 0.
- 1-valent if every execution from it eventually decides 1.
- Bivalent if both outcomes are still reachable.
Assume a protocol exists that always terminates. The authors show two things:
- Lemma 1 — a bivalent initial configuration exists. Line up initial configurations by flipping one input bit at a time, from all-0 to all-1. The all-0 config must be 0-valent (by validity); all-1 must be 1-valent. Somewhere along that line two adjacent configurations must straddle the decision. Crashing the one process whose bit differs between them makes them indistinguishable to every other process — so they must lead to the same decision. Contradiction unless at least one of them was bivalent.
- Lemma 3 — from any bivalent configuration, there's always a step to another bivalent configuration. Suppose not: every step from some bivalent C goes to a univalent configuration. By a careful case analysis on the order in which the next message is delivered, the authors construct two reachable configurations that differ only by the action of a single process — and then crashing that process collapses them to the same view, forcing the same decision. Contradiction again.
Put the two together: there's an initial bivalent configuration, and from any bivalent configuration the adversary can pick a next step that stays bivalent. The adversary repeats this forever. The protocol never terminates.
The adversary's only real power is delaying messages. There is no clock to distinguish "this node is slow" from "this node has crashed". One slow node looks identical to one crashed node from every other process's perspective, and the deterministic algorithm has no way to tell them apart. That single indistinguishability is the whole proof.
Contributions
- The formal model. A clean, minimal model of asynchronous message-passing with crash failures. The model itself — processes, configurations, schedules, events — became the standard vocabulary for the next four decades of distributed-systems theory.
- The bivalence proof technique. The valency argument has been reused to prove many other impossibility results: k-set agreement (Borowsky-Gafni, Herlihy-Shavit), wait-free shared-memory consensus, renaming, the asynchronous computability theorem. If you read distributed-systems theory, you keep meeting this proof structure.
- It drew the boundary. Subsequent work had to state, explicitly, what assumption it added on top of pure asynchrony. "We assume partial synchrony" or "we assume an eventually perfect failure detector" or "the algorithm is randomised" became required preamble for any consensus paper.
- It reframed the question. Before FLP: "can we do consensus?" After FLP: "what is the minimum extra assumption that lets us do consensus?" This shift produced the partial-synchrony hierarchy (Dwork-Lynch-Stockmeyer), the failure-detector hierarchy (Chandra-Toueg), and the randomised-consensus line of work.
Criticisms and limitations
The result is sharp and well-bounded, but it is also often misread. The honest limitations are:
- The model is pure asynchrony. Real networks don't look like this. Datacentre networks have median round trips under a millisecond and rarely exceed a few seconds even under pathological conditions. They behave like partial synchrony: bounded most of the time, occasionally not. So FLP is a worst-case statement, and real systems live nowhere near the worst case.
- The result is binary. "Impossible" doesn't quantify how impossible. Later work (Aspnes's 2003 survey of randomised consensus, the probabilistic-termination literature) showed that with randomness you can get expected-O(log n) rounds. The gap between "impossible deterministically" and "easy probabilistically" is enormous, and FLP doesn't characterise it.
- Frequently misread. "Consensus is impossible" is the wrong takeaway; the right one is "deterministic termination is impossible in pure asynchrony with crash failure". Agreement and validity are perfectly achievable. It is liveness — and only liveness — that the result targets.
- Crash failure only. The paper addresses crash-stop processes, not Byzantine ones. The Byzantine generalisation came later (Lamport, Shostak, Pease 1982 had already shown the lower bound on the number of processes; the FLP-style impossibility for Byzantine asynchrony follows by reduction).
Where it shows up today
Every production consensus protocol can be read as a specific answer to "which FLP assumption did you weaken?".
| System | How it escapes FLP |
|---|---|
| Paxos | Partial synchrony. Liveness depends on a stable leader; if the network is too asynchronous, leader election thrashes and the protocol stalls (safe but not live). Lamport says so directly in "Paxos Made Simple". |
| Raft | Same — randomised election timeouts plus partial synchrony. The randomisation is to avoid split votes, not to escape FLP; the partial-synchrony assumption is what guarantees eventual progress. |
| PBFT | Partial synchrony plus explicit view changes. The view-change mechanism is the way PBFT recovers from periods of FLP-style asynchrony by rotating to a new primary. |
| Tendermint, HotStuff | Partial synchrony with leader rotation. Both cite FLP directly and design the protocol around the partial-synchrony assumption. HotStuff's chained pipelining makes the cost of view change linear. |
| Bitcoin (Nakamoto) | Randomness. Proof-of-work is the coin flip; the longest-chain rule plus a probabilistic finality model sidesteps deterministic agreement entirely. Probability of disagreement decays exponentially in confirmation depth. |
| Ben-Or, Bracha (randomised BFT) | Explicit randomness via local coin flips. Expected termination in O(2n) rounds for Ben-Or, polynomial for later variants. The probabilistic line of FLP escape hatches. |
| etcd, Consul, ZooKeeper | Production Paxos/Raft variants. The partial-synchrony assumption is what their election-timeout configuration is encoding. If you set timeouts too aggressively for the network, you get availability problems; too loosely and failover is slow. |
A useful exercise: when reading any consensus protocol, find the place where it assumes a timeout, a coin flip, or a failure detector. That is the line where the protocol pays its FLP tax. There is no protocol — and there will never be a protocol — that does not have such a line.
Follow-up reading
- Fischer, Lynch, Paterson — Impossibility of Distributed Consensus with One Faulty Process (1985) — the paper itself. Ten pages, mostly readable, the proof is the meat.
- Dwork, Lynch, Stockmeyer — Consensus in the Presence of Partial Synchrony (1988) — defines partial synchrony and shows it's enough to escape FLP. Read after FLP.
- Ben-Or — Another Advantage of Free Choice (1983) — the original randomised-consensus paper. Predates FLP's publication; cited as the canonical randomised escape route.
- Chandra, Toueg — Unreliable Failure Detectors for Reliable Distributed Systems (1996) — the failure-detector route. Defines the ♦P, ♦S, and W hierarchies; shows exactly which failure detectors are sufficient for consensus.
- Lynch — Distributed Algorithms (1996) — the textbook. Chapter 12 reproduces the FLP proof with extra exposition; the rest of the book builds the surrounding theory.
- Aspnes — Randomized Protocols for Asynchronous Consensus (2003) — survey of the randomised line. Read this to see what's actually achievable when you allow probabilistic termination.
- Paxos Made Simple — annotated — the canonical partial-synchrony answer to FLP.
- In Search of an Understandable Consensus Algorithm (Raft) — annotated — Raft's pedagogical re-statement of the same idea.
- The Byzantine Generals Problem — annotated — the parallel impossibility line for malicious failures.