17 / 20
Topics / 17

2PC and sagas

A single user action touches three services: charge a card, reserve inventory, ship an order. You need all three to commit or all three to undo. The network drops messages, services crash mid-flight, and there is no shared transaction log. Two-phase commit, three-phase commit, and sagas are the three answers the industry has actually deployed. Each picks a different point on the consistency, latency, and operational-complexity curve.


The problem

A user clicks "place order". Behind the click, three services need to act: the payments service charges the card, the inventory service decrements stock, the shipping service schedules a pickup. Each owns its own database, so there is no single transaction log that can wrap all three in a BEGIN ... COMMIT. Either they all commit and the order goes out, or they all roll back and the user sees an error. Anything in between is a half-charged customer or a phantom shipment.

The network can drop messages, reorder them, or deliver them after a 30-second pause. Any of the three services can crash between operations and restart minutes later. The coordinator that drives the workflow can crash too. Atomicity across services means making every one of those failure modes converge to the same global outcome, all-commit or all-rollback, without losing the user's money or stock.

The literature has three deployed answers: two-phase commit (a voting protocol with strong atomicity and a fatal blocking flaw), three-phase commit (a fix that works only under synchrony), and sagas (a different shape entirely, where you replace rollback with semantic compensation). Modern microservice systems lean heavily on sagas. Classic enterprise systems still run 2PC between paired databases.

2PC mechanism — prepare, then commit

Jim Gray formalised two-phase commit in 1978. It has exactly the structure the name implies. A single coordinator drives every participant through two phases in lockstep.

Phase 1 — prepare. The coordinator sends PREPARE to every participant. Each participant locks the relevant rows, writes a prepared record to its durable log, and replies YES if it could complete the work or NO if it could not. The vote must be persisted before the reply leaves. If the participant crashes after replying YES, it has to remember it voted yes when it comes back up.

Phase 2 — commit or abort. If every participant voted YES, the coordinator writes a commit record to its own log and sends COMMIT to all participants. Each one applies the change, releases its locks, and acknowledges. If any participant voted NO, or didn't respond before the timeout, the coordinator writes abort and sends ROLLBACK instead.

The protocol works because each participant has promised, durably, that it can commit. Once the coordinator has all the yes votes on disk, it knows global commit is feasible. The XA standard (Open Group, 1991) codifies this exact dance between a transaction manager and one or more resource managers; every enterprise message broker and JDBC driver still ships an XA implementation.

The blocking property — why 2PC's flaw is fatal

Suppose the coordinator collects every YES, writes its commit record, and sends COMMIT to participant P1, then crashes before P2 or P3 hear from it. P2 and P3 are now stuck in the prepared state. They voted yes, so they can't unilaterally abort (P1 may have already committed). They never received commit, so they can't unilaterally commit (the coordinator may have decided to abort). They hold their locks and wait. Indefinitely.

This is the blocking property. A surviving participant in prepared has no way, on its own, to learn the global outcome. Even talking to the other participants doesn't help: if every survivor is also in prepared, none of them knows. Recovery requires the coordinator to come back up and replay its log, or an operator to manually resolve the in-doubt transaction by reading state from all participants and forcing commit or abort. Either way the locks stay held; every other transaction touching those rows blocks too.

Why 2PC blocks. A participant in prepared has surrendered its autonomy. It has promised to obey the coordinator's eventual decision and has no right to invent its own. Crash the coordinator at the wrong moment and you freeze that promise into an unbreakable lock. This is why etcd and ZooKeeper don't use 2PC internally, and why XA over a WAN is universally avoided.

3PC — adding a pre-commit phase

Dale Skeen proposed three-phase commit in 1981 to fix the blocking flaw. The idea: add a PRE-COMMIT phase between vote and commit. After collecting all yes votes, the coordinator sends PRE-COMMIT to everyone; only after every participant acknowledges does it send the final COMMIT. The point of the extra round is that once a participant has received PRE-COMMIT, it knows every other participant also voted yes, so it can safely commit on its own if the coordinator vanishes afterwards.

The catch is the synchrony assumption. 3PC's recovery logic relies on bounded message delay. Survivors run a timeout-based election to decide the outcome, and the argument that no two survivors can decide differently depends on no message being delayed past the timeout. On a real network, a partition can deliver messages after the timeout, and you can end up with one side committing while the other aborts.

3PC is rarely deployed for that reason. The synchronous-network assumption it needs doesn't hold in practice, and the protocols people actually run when they want partition-tolerant agreement (Paxos and Raft) give stronger guarantees with similar message complexity. 3PC mostly lives in textbooks now.

Why 2PC doesn't scale

Even when 2PC works correctly, it is expensive. Every participant holds locks from the moment it votes yes until it receives commit and applies the change. That window covers two network round trips plus the coordinator's disk flush, easily 5 ms on a LAN, 50–200 ms across a WAN. Multiply by the number of concurrent transactions and the database lock manager becomes a contention bottleneck before it becomes a throughput one.

Worse, commit latency is set by the slowest participant. If P1 and P2 respond in 2 ms but P3 takes 50 ms, the whole transaction takes 50 ms. Tail latency dominates. Add more participants and the chance that at least one is slow approaches one. Google's Spanner and CockroachDB both pay this cost deliberately within a single shard but go to enormous lengths (TrueTime, parallel commit) to keep the dance fast.

2PC over a WAN is essentially unusable for interactive workloads. A coordinator in us-east-1 driving participants in eu-west-1 and ap-southeast-1 looks at ~150 ms RTT per phase, so a single commit takes 300 ms plus disk flushes, before the user sees a response. Hold the row locks for that long and concurrency collapses. This is the direct reason microservices reached for sagas: the cross-service version of 2PC is simply too slow to put behind a checkout button.

Sagas — replace locks with compensation

Hector Garcia-Molina and Kenneth Salem published "Sagas" in SIGMOD 1987. The original motivation was long-lived transactions in a single database, but the pattern fits cross-service workflows better than 2PC ever did. The shape: break the global transaction into a sequence of local transactions T1, T2, ..., Tn, each fully committing at its own service before the next one starts. For each Ti, define a compensation Ci that semantically undoes it.

The saga runs forward: T1, then T2, and so on. If Tk fails, the saga runs C(k-1), C(k-2), ..., C1 in reverse order. Each compensation is itself a regular committed transaction. There is no global lock, no prepare phase, no blocking. Every step is durable the moment it finishes, and recovery takes only as long as it takes to run the remaining compensations.

Concretely, the place-order example becomes: chargeCard, reserveInventory, scheduleShipping. If scheduleShipping fails, the saga executes releaseInventory and then refundCard. Each of those is a normal API call that commits locally. No participant ever sits in a prepared state holding locks across services.

Compensation semantics — not ACID rollback

This is the part people miss. A compensation is not the same as a database ROLLBACK. ACID rollback undoes a transaction that never committed, so the outside world never saw it. A compensation runs after the original transaction committed and may already be visible to other clients. It can't pretend the original never happened. It can only run a forward-direction action that semantically cancels the effect.

"Refund the card" is the canonical example. The original chargeCard committed; the bank settled; the customer may have already seen the charge on their statement. The compensation can't unwind history. It issues a fresh refundCard transaction that nets to zero. The two appear in the ledger as two separate events. Same for shipping: if the warehouse already started picking, the compensation isn't "undo pick", it's "issue a return-to-stock work order".

The forward-only compensation rule. A compensation must always succeed. Sagas have no mechanism for compensating a compensation — if C2 fails, the system is stuck halfway through unwinding. In practice this means compensations are idempotent, retried indefinitely, and designed so the only failure modes are transient (network, downstream unavailability). If you can't write a compensation that always succeeds, you can't safely use a saga for that step.

Some operations have no clean compensation. "Send the marketing email" cannot be unsent. The standard pattern is to defer such operations to the end of the saga, after every reversible step has committed. That way you only send the email once you know the whole transaction is going to commit. The other pattern, less common, is to issue an apology: a follow-up email saying the order was cancelled. Compensation as user-visible side effect.

Choreography vs orchestration

Two ways to wire a saga. In choreography each service listens for events on a bus and decides what to do next. The payments service publishes CardCharged; the inventory service consumes it, reserves stock, and publishes InventoryReserved; the shipping service consumes that and so on. No central brain. Failure events propagate the same way: ShippingFailed triggers inventory to release stock, which triggers payments to refund.

Choreography has the simplest topology: just services and a bus. The price is that the workflow itself is implicit. Nobody owns the document that says "here's what happens when a user places an order". Debugging means tracing events across services, and adding a new step means changing several services to react to (or emit) a new event.

In orchestration, a dedicated workflow service drives the saga explicitly. It calls chargeCard, awaits the result, calls reserveInventory, awaits the result, and so on. The flow lives in one place as code or a state machine; the participating services are dumb endpoints with no knowledge of the saga. Tracing is trivial, since the orchestrator's log is the workflow.

Modern microservice systems lean heavily toward orchestration, largely because the workflow engines that emerged from this space (Temporal, AWS Step Functions, Netflix Conductor, Camunda Zeebe) provide durable execution, retries, timers, and visualisation out of the box. Choreography still appears, especially in event-sourced systems where the bus is already the system of record, but "orchestrated saga via Temporal" is the default modern shape.

Workflow engines — durable execution at scale

Temporal (open source, commercial spinout from Uber's Cadence) is the canonical modern example. A workflow is just code, a Go or Java or TypeScript function, but Temporal records every step's input and output to a durable log. If the worker running the workflow crashes, another worker picks up the same workflow ID, replays the recorded history, and resumes from the next un-executed step. This is the "durable execution" pattern: long-lived workflows survive process death without special checkpointing in user code.

// Temporal workflow (Go) — order saga with compensation
func OrderWorkflow(ctx workflow.Context, order Order) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
        RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 5},
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    // Step 1 — charge the card
    var chargeID string
    if err := workflow.ExecuteActivity(ctx, ChargeCard, order).Get(ctx, &chargeID); err != nil {
        return err
    }

    // Step 2 — reserve inventory; if it fails, refund the charge
    if err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, nil); err != nil {
        _ = workflow.ExecuteActivity(ctx, RefundCard, chargeID).Get(ctx, nil)
        return err
    }

    // Step 3 — schedule shipping; if it fails, release stock then refund
    if err := workflow.ExecuteActivity(ctx, ScheduleShipping, order).Get(ctx, nil); err != nil {
        _ = workflow.ExecuteActivity(ctx, ReleaseInventory, order).Get(ctx, nil)
        _ = workflow.ExecuteActivity(ctx, RefundCard, chargeID).Get(ctx, nil)
        return err
    }
    return nil
}

AWS Step Functions takes the same idea and exposes it as Amazon States Language, a JSON state machine describing transitions, retries, and catch handlers. It runs the saga, persists state in DynamoDB, and integrates with every AWS service as a first-class step target. Netflix Conductor and Camunda Zeebe occupy similar territory with different opinions about workflow definition (Conductor uses JSON DAGs; Zeebe uses BPMN). All four solve the same problem: keep the workflow alive across worker crashes, retries, timeouts, and human-in-the-loop pauses.

Uber runs Cadence (Temporal's predecessor) for everything from driver onboarding to payouts; Airbnb's checkout flows are Temporal workflows; Stripe's payment intent state machine, Coinbase's withdrawal saga, and DoorDash's order pipeline all sit on workflow engines with saga-shaped compensation. The pattern has won at scale.

Protocols compared

ProtocolAtomicityLockingLatencyBlocking?
2PC (XA)Strong — ACID across participantsLocks held across both phases2 RTTs + disk; bad over WANYes — coordinator crash strands participants
3PCStrong under synchronyLocks held across three phases3 RTTs + disk; worse than 2PCNo, but unsafe under partition
Saga (orchestrated)Semantic — eventual via compensationLocal locks only, never globalSum of step latencies; no global waitNo — every step is independently durable
Distributed SQL (Spanner, Calvin)Strong — serialisable across shardsPer-shard locks + Paxos commitBounded by Paxos quorum RTTNo — Paxos handles coordinator failure

When to use which

The choice usually maps to the data layout. If the transaction lives inside a single shard of a distributed database that already handles cross-node consensus (Spanner, CockroachDB, YugabyteDB, FoundationDB) use the database's built-in transactions. You get full ACID at the cost of one Paxos round per commit, which is what those systems are designed to amortise. No saga complexity, no compensation logic.

If the transaction crosses services that own independent databases, and especially if those services live in different teams or different deployments, use a saga. Choreography if the system is small and event-driven already; orchestration via Temporal or Step Functions for anything serious. Accept the lack of isolation between steps and design compensations that always succeed.

If the transaction touches exactly two databases of compatible type (two Postgres instances, or a Postgres and an Oracle via XA) and you can tolerate the latency, 2PC via XA still works. The classic enterprise pattern: an XA transaction spanning the application's main database and a JMS message queue, so the message and the row are atomic together. Don't extend this to three or more participants and don't try it over a WAN.

Real systems combine all three. Uber's payment flow uses Spanner-style strong transactions inside a single financial ledger, a Cadence saga to orchestrate the outer order-to-fulfilment pipeline, and message-queue XA where a legacy enterprise component refuses to participate any other way. The architectural decision is per transaction, not per system.

Further reading

Found this useful?