2PC and sagas
A single user action touches three services: charge a card, reserve inventory, ship an order. You need all three to commit or all three to undo. The network drops messages, services crash mid-flight, and there is no shared transaction log. Two-phase commit, three-phase commit, and sagas are the three answers the industry has actually deployed. Each picks a different point on the consistency, latency, and operational-complexity curve.
The problem
A user clicks "place order". Behind the click, three services need to act: the payments
service charges the card, the inventory service decrements stock, the shipping service
schedules a pickup. Each owns its own database, so there is no single transaction log
that can wrap all three in a BEGIN ... COMMIT. Either they all commit and
the order goes out, or they all roll back and the user sees an error. Anything in
between is a half-charged customer or a phantom shipment.
The network can drop messages, reorder them, or deliver them after a 30-second pause. Any of the three services can crash between operations and restart minutes later. The coordinator that drives the workflow can crash too. Atomicity across services means making every one of those failure modes converge to the same global outcome, all-commit or all-rollback, without losing the user's money or stock.
The literature has three deployed answers: two-phase commit (a voting protocol with strong atomicity and a fatal blocking flaw), three-phase commit (a fix that works only under synchrony), and sagas (a different shape entirely, where you replace rollback with semantic compensation). Modern microservice systems lean heavily on sagas. Classic enterprise systems still run 2PC between paired databases.
2PC mechanism — prepare, then commit
Jim Gray formalised two-phase commit in 1978. It has exactly the structure the name implies. A single coordinator drives every participant through two phases in lockstep.
Phase 1 — prepare. The coordinator sends PREPARE to every
participant. Each participant locks the relevant rows, writes a prepared
record to its durable log, and replies YES if it could complete the work
or NO if it could not. The vote must be persisted before the reply leaves.
If the participant crashes after replying YES, it has to remember it
voted yes when it comes back up.
Phase 2 — commit or abort. If every participant voted YES,
the coordinator writes a commit record to its own log and sends
COMMIT to all participants. Each one applies the change, releases its
locks, and acknowledges. If any participant voted NO, or didn't respond
before the timeout, the coordinator writes abort and sends
ROLLBACK instead.
The protocol works because each participant has promised, durably, that it can commit. Once the coordinator has all the yes votes on disk, it knows global commit is feasible. The XA standard (Open Group, 1991) codifies this exact dance between a transaction manager and one or more resource managers; every enterprise message broker and JDBC driver still ships an XA implementation.
The blocking property — why 2PC's flaw is fatal
Suppose the coordinator collects every YES, writes its commit record, and
sends COMMIT to participant P1, then crashes before P2 or P3 hear from
it. P2 and P3 are now stuck in the prepared state. They voted yes, so they
can't unilaterally abort (P1 may have already committed). They never received commit,
so they can't unilaterally commit (the coordinator may have decided to abort). They
hold their locks and wait. Indefinitely.
This is the blocking property. A surviving participant in
prepared has no way, on its own, to learn the global outcome. Even talking
to the other participants doesn't help: if every survivor is also in
prepared, none of them knows. Recovery requires the coordinator to come
back up and replay its log, or an operator to manually resolve the in-doubt
transaction by reading state from all participants and forcing commit or abort. Either
way the locks stay held; every other transaction touching those rows blocks too.
prepared has surrendered
its autonomy. It has promised to obey the coordinator's eventual decision and has no
right to invent its own. Crash the coordinator at the wrong moment and you freeze
that promise into an unbreakable lock. This is why etcd and ZooKeeper don't use 2PC
internally, and why XA over a WAN is universally avoided.3PC — adding a pre-commit phase
Dale Skeen proposed three-phase commit in 1981 to fix the blocking flaw. The idea: add
a PRE-COMMIT phase between vote and commit. After collecting all yes
votes, the coordinator sends PRE-COMMIT to everyone; only after every
participant acknowledges does it send the final COMMIT. The point of the
extra round is that once a participant has received PRE-COMMIT, it knows
every other participant also voted yes, so it can safely commit on its own if the
coordinator vanishes afterwards.
The catch is the synchrony assumption. 3PC's recovery logic relies on bounded message delay. Survivors run a timeout-based election to decide the outcome, and the argument that no two survivors can decide differently depends on no message being delayed past the timeout. On a real network, a partition can deliver messages after the timeout, and you can end up with one side committing while the other aborts.
3PC is rarely deployed for that reason. The synchronous-network assumption it needs doesn't hold in practice, and the protocols people actually run when they want partition-tolerant agreement (Paxos and Raft) give stronger guarantees with similar message complexity. 3PC mostly lives in textbooks now.
Why 2PC doesn't scale
Even when 2PC works correctly, it is expensive. Every participant holds locks from the moment it votes yes until it receives commit and applies the change. That window covers two network round trips plus the coordinator's disk flush, easily 5 ms on a LAN, 50–200 ms across a WAN. Multiply by the number of concurrent transactions and the database lock manager becomes a contention bottleneck before it becomes a throughput one.
Worse, commit latency is set by the slowest participant. If P1 and P2 respond in 2 ms but P3 takes 50 ms, the whole transaction takes 50 ms. Tail latency dominates. Add more participants and the chance that at least one is slow approaches one. Google's Spanner and CockroachDB both pay this cost deliberately within a single shard but go to enormous lengths (TrueTime, parallel commit) to keep the dance fast.
2PC over a WAN is essentially unusable for interactive workloads. A coordinator in us-east-1 driving participants in eu-west-1 and ap-southeast-1 looks at ~150 ms RTT per phase, so a single commit takes 300 ms plus disk flushes, before the user sees a response. Hold the row locks for that long and concurrency collapses. This is the direct reason microservices reached for sagas: the cross-service version of 2PC is simply too slow to put behind a checkout button.
Sagas — replace locks with compensation
Hector Garcia-Molina and Kenneth Salem published "Sagas" in SIGMOD 1987. The original
motivation was long-lived transactions in a single database, but the pattern fits
cross-service workflows better than 2PC ever did. The shape: break the global
transaction into a sequence of local transactions
T1, T2, ..., Tn, each fully committing at its own service before the next
one starts. For each Ti, define a compensation Ci that semantically undoes it.
The saga runs forward: T1, then T2, and so on. If
Tk fails, the saga runs C(k-1), C(k-2), ..., C1 in reverse
order. Each compensation is itself a regular committed transaction. There is no
global lock, no prepare phase, no blocking. Every step is durable the moment it
finishes, and recovery takes only as long as it takes to run the remaining
compensations.
Concretely, the place-order example becomes: chargeCard,
reserveInventory, scheduleShipping. If
scheduleShipping fails, the saga executes releaseInventory
and then refundCard. Each of those is a normal API call that commits
locally. No participant ever sits in a prepared state holding locks
across services.
Compensation semantics — not ACID rollback
This is the part people miss. A compensation is not the same as a database
ROLLBACK. ACID rollback undoes a transaction that never committed, so the
outside world never saw it. A compensation runs after the original
transaction committed and may already be visible to other clients. It can't pretend
the original never happened. It can only run a forward-direction action that
semantically cancels the effect.
"Refund the card" is the canonical example. The original chargeCard
committed; the bank settled; the customer may have already seen the charge on their
statement. The compensation can't unwind history. It issues a fresh
refundCard transaction that nets to zero. The two appear in the ledger as
two separate events. Same for shipping: if the warehouse already started picking, the
compensation isn't "undo pick", it's "issue a return-to-stock work order".
C2 fails, the system is stuck halfway through unwinding. In practice this
means compensations are idempotent, retried indefinitely, and designed so the only
failure modes are transient (network, downstream unavailability). If you can't write a
compensation that always succeeds, you can't safely use a saga for that step.Some operations have no clean compensation. "Send the marketing email" cannot be unsent. The standard pattern is to defer such operations to the end of the saga, after every reversible step has committed. That way you only send the email once you know the whole transaction is going to commit. The other pattern, less common, is to issue an apology: a follow-up email saying the order was cancelled. Compensation as user-visible side effect.
Choreography vs orchestration
Two ways to wire a saga. In choreography each service listens for
events on a bus and decides what to do next. The payments service publishes
CardCharged; the inventory service consumes it, reserves stock, and
publishes InventoryReserved; the shipping service consumes that and so
on. No central brain. Failure events propagate the same way:
ShippingFailed triggers inventory to release stock, which triggers
payments to refund.
Choreography has the simplest topology: just services and a bus. The price is that the workflow itself is implicit. Nobody owns the document that says "here's what happens when a user places an order". Debugging means tracing events across services, and adding a new step means changing several services to react to (or emit) a new event.
In orchestration, a dedicated workflow service drives the saga
explicitly. It calls chargeCard, awaits the result, calls
reserveInventory, awaits the result, and so on. The flow lives in one
place as code or a state machine; the participating services are dumb endpoints with
no knowledge of the saga. Tracing is trivial, since the orchestrator's log is the
workflow.
Modern microservice systems lean heavily toward orchestration, largely because the workflow engines that emerged from this space (Temporal, AWS Step Functions, Netflix Conductor, Camunda Zeebe) provide durable execution, retries, timers, and visualisation out of the box. Choreography still appears, especially in event-sourced systems where the bus is already the system of record, but "orchestrated saga via Temporal" is the default modern shape.
Workflow engines — durable execution at scale
Temporal (open source, commercial spinout from Uber's Cadence) is the canonical modern example. A workflow is just code, a Go or Java or TypeScript function, but Temporal records every step's input and output to a durable log. If the worker running the workflow crashes, another worker picks up the same workflow ID, replays the recorded history, and resumes from the next un-executed step. This is the "durable execution" pattern: long-lived workflows survive process death without special checkpointing in user code.
// Temporal workflow (Go) — order saga with compensation
func OrderWorkflow(ctx workflow.Context, order Order) error {
ao := workflow.ActivityOptions{
StartToCloseTimeout: time.Minute,
RetryPolicy: &temporal.RetryPolicy{MaximumAttempts: 5},
}
ctx = workflow.WithActivityOptions(ctx, ao)
// Step 1 — charge the card
var chargeID string
if err := workflow.ExecuteActivity(ctx, ChargeCard, order).Get(ctx, &chargeID); err != nil {
return err
}
// Step 2 — reserve inventory; if it fails, refund the charge
if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, nil); err != nil {
_ = workflow.ExecuteActivity(ctx, RefundCard, chargeID).Get(ctx, nil)
return err
}
// Step 3 — schedule shipping; if it fails, release stock then refund
if err := workflow.ExecuteActivity(ctx, ScheduleShipping, order).Get(ctx, nil); err != nil {
_ = workflow.ExecuteActivity(ctx, ReleaseInventory, order).Get(ctx, nil)
_ = workflow.ExecuteActivity(ctx, RefundCard, chargeID).Get(ctx, nil)
return err
}
return nil
}AWS Step Functions takes the same idea and exposes it as Amazon States Language, a JSON state machine describing transitions, retries, and catch handlers. It runs the saga, persists state in DynamoDB, and integrates with every AWS service as a first-class step target. Netflix Conductor and Camunda Zeebe occupy similar territory with different opinions about workflow definition (Conductor uses JSON DAGs; Zeebe uses BPMN). All four solve the same problem: keep the workflow alive across worker crashes, retries, timeouts, and human-in-the-loop pauses.
Uber runs Cadence (Temporal's predecessor) for everything from driver onboarding to payouts; Airbnb's checkout flows are Temporal workflows; Stripe's payment intent state machine, Coinbase's withdrawal saga, and DoorDash's order pipeline all sit on workflow engines with saga-shaped compensation. The pattern has won at scale.
Protocols compared
| Protocol | Atomicity | Locking | Latency | Blocking? |
|---|---|---|---|---|
| 2PC (XA) | Strong — ACID across participants | Locks held across both phases | 2 RTTs + disk; bad over WAN | Yes — coordinator crash strands participants |
| 3PC | Strong under synchrony | Locks held across three phases | 3 RTTs + disk; worse than 2PC | No, but unsafe under partition |
| Saga (orchestrated) | Semantic — eventual via compensation | Local locks only, never global | Sum of step latencies; no global wait | No — every step is independently durable |
| Distributed SQL (Spanner, Calvin) | Strong — serialisable across shards | Per-shard locks + Paxos commit | Bounded by Paxos quorum RTT | No — Paxos handles coordinator failure |
When to use which
The choice usually maps to the data layout. If the transaction lives inside a single shard of a distributed database that already handles cross-node consensus (Spanner, CockroachDB, YugabyteDB, FoundationDB) use the database's built-in transactions. You get full ACID at the cost of one Paxos round per commit, which is what those systems are designed to amortise. No saga complexity, no compensation logic.
If the transaction crosses services that own independent databases, and especially if those services live in different teams or different deployments, use a saga. Choreography if the system is small and event-driven already; orchestration via Temporal or Step Functions for anything serious. Accept the lack of isolation between steps and design compensations that always succeed.
If the transaction touches exactly two databases of compatible type (two Postgres instances, or a Postgres and an Oracle via XA) and you can tolerate the latency, 2PC via XA still works. The classic enterprise pattern: an XA transaction spanning the application's main database and a JMS message queue, so the message and the row are atomic together. Don't extend this to three or more participants and don't try it over a WAN.
Real systems combine all three. Uber's payment flow uses Spanner-style strong transactions inside a single financial ledger, a Cadence saga to orchestrate the outer order-to-fulfilment pipeline, and message-queue XA where a legacy enterprise component refuses to participate any other way. The architectural decision is per transaction, not per system.
Further reading
- Garcia-Molina & Salem — Sagas (SIGMOD 1987) — the original paper introducing sagas as long-lived transactions with compensating actions.
- Pat Helland — Life Beyond Distributed Transactions — the essay that named the post-2PC world and motivated the move to eventually consistent, entity-scoped operations.
- Temporal documentation — workflows, activities, and durable execution — the reference for the most widely deployed modern saga orchestrator.
- Helland & Campbell — Building on Quicksand — the companion to "Life Beyond Distributed Transactions"; idempotent operations and apology-based recovery.
- Thomson et al. — Calvin: Fast Distributed Transactions for Partitioned Database Systems — the deterministic-ordering approach that sidesteps 2PC's blocking property for distributed-SQL workloads.
- Skeen — Nonblocking Commit Protocols (SIGMOD 1981) — the 3PC paper, useful mostly for understanding why nobody runs it.