Castro & Liskov · 1999
Paper · Distributed systems · BFT

PBFT, Byzantine consensus, made practical.

Lamport's 1982 Byzantine Generals paper proved the bounds but the protocols were impractical — exponential message complexity, synchronous assumptions. PBFT was the first BFT consensus protocol that worked under realistic network conditions (partial synchrony, message loss) with reasonable performance. The paper is dense, the protocol has many cases, and the proof is delicate — but every BFT consensus algorithm since (Tendermint, HotStuff, Diem's Byzantine consensus) traces its lineage to this paper.

Authors Miguel Castro, Barbara Liskov
Year 1999
Venue OSDI

TL;DR

PBFT implements state-machine replication with up to f Byzantine faults among 3f+1 replicas. It uses a three-phase protocol per request: pre-prepare (the primary proposes an ordering), prepare (replicas vote on the ordering), commit (replicas vote on executing the ordered request). Each phase requires 2f+1 matching messages to proceed. The protocol guarantees safety under any network condition; liveness requires eventual synchrony (after some unknown time, message delays are bounded). When the primary is faulty, replicas trigger a view change to elect a new primary. The paper demonstrates ~30% latency overhead vs unreplicated state machines for typical workloads — the first time BFT consensus had been shown to be production-deployable.

The problem

By 1999, the state of practical Byzantine consensus was bad. Lamport's 1982 algorithms assumed synchrony (a known bound on message delay) and had exponential message complexity in the number of faults. Rampart (Reiter, 1994) and other protocols had similar issues. The result was that Byzantine fault tolerance was a theoretical curiosity — interesting in papers, unworkable in systems.

Castro and Liskov set out to change that. The goal was a protocol that worked under partial synchrony (the realistic assumption that message delays are bounded eventually, but not always), had polynomial message complexity, and could be deployed in production. The benchmark they targeted was within 30% of an unreplicated state machine's latency.

The key idea

Three-phase protocol per client request. The primary (designated leader) assigns each request a sequence number and broadcasts a pre-prepare message. Replicas that accept this broadcast a prepare message. When a replica has 2f+1 matching prepares (including its own), it broadcasts a commit message. When it has 2f+1 commits, the request is executed and a reply is sent to the client. The client accepts the reply when it receives 2f+1 matching ones.

Quorum certificates. Each phase requires 2f+1 matching messages because two such quorums intersect in at least f+1 replicas — that intersection contains at least one honest replica, ensuring consistency between any two quorum decisions.

View changes for faulty primaries. If the primary stalls or behaves incorrectly, replicas trigger a view change. Each replica broadcasts a view-change message naming the next primary; once 2f+1 are collected, the new primary starts the next view. The view-change protocol carries enough state (prepared sequence numbers) to ensure that any request that was prepared in the old view is preserved in the new view — safety holds across view changes.

Checkpointing for state truncation. Every K requests (e.g., K=100), replicas exchange checkpoints. Once 2f+1 matching checkpoints exist, log entries older than the checkpoint can be discarded. This keeps the message log bounded.

Cryptographic signatures sparingly. The paper uses MAC-based authentication for most messages (cheaper) and signatures only for view-change messages (which need non-repudiation across views). This was a major performance contribution — earlier BFT protocols signed everything.

The 3f+1 quorum geometry. With n=3f+1 replicas and up to f Byzantine, a quorum of 2f+1 always includes at least f+1 honest replicas. Any two quorums of 2f+1 in a system of 3f+1 must intersect in at least (2f+1)+(2f+1)−(3f+1) = f+1 replicas, of which at least one is honest. That single honest replica in the intersection is what carries decisions between phases and between views — it's the geometric reason 3f+1 is the magic number for synchronous and partially-synchronous BFT.

Contributions

The first practical BFT protocol. Polynomial message complexity (O(n²) per request in the normal case, O(n³) for view changes), realistic synchrony assumptions, measurable performance within 30% of unreplicated state machines.

The three-phase normal-case protocol. Pre-prepare / prepare / commit became the template for every PBFT-derived protocol since.

View change. A protocol for safely replacing a faulty primary while preserving in-flight requests. Subsequent protocols (HotStuff in particular) simplified this, but every BFT view change inherits from PBFT.

MAC-based authentication. Cheaper than signatures. Enabled production-grade performance.

Performance evidence. The paper benchmarks an NFS file server with PBFT replication and shows the 30% latency overhead claim. Until then, BFT had been considered too slow to use for anything outside the simplest demonstrations.

Criticisms and limitations

Message complexity. O(n²) per request is fine for n=4 but breaks down for n in the hundreds. HotStuff (2019) reduced this to O(n) per request by using threshold signatures, which is the modern preference.

View change is expensive. The view-change protocol is O(n³), and when the primary is slow but not totally failed, view changes can ping-pong. Subsequent protocols (Tendermint, HotStuff) simplified the view change but at the cost of more rotating-leader overhead in the normal case.

The protocol description is intimidating. 11 cases in the normal path, more for view changes. Proving correctness requires careful case analysis. This complexity is one reason BFT consensus was slow to deploy outside academia until Tendermint and the blockchain era rediscovered it.

Assumes a fixed replica set. No graceful protocol for adding or removing replicas. The blockchain successors add elaborate reconfiguration protocols on top.

Where it shows up today

Tendermint Core / Cosmos SDK — a PBFT descendant, simplified and adapted for blockchain. Powers Cosmos, BNB Chain, Terra (RIP), Crypto.com, and dozens of others.

HotStuff — the basis for Meta's Diem (Libra) blockchain and several derivatives. Reduces PBFT's message complexity from O(n²) to O(n) with threshold signatures.

Aptos and Sui — both use Diem-derived BFT consensus protocols.

Ripple's XRP Ledger consensus — different design but inherits PBFT-style 2f+1 quorum logic.

IBM Hyperledger Fabric's ordering service used PBFT (later switched to Raft for non-Byzantine settings).

Algorand and other PoS protocols incorporate PBFT-like committee-based consensus for finality.

Follow-up reading

More annotated papers
Back to the papers index
Foundational distributed-systems and database papers, read and annotated.
Found this useful?