Helland · 2009
Paper · Distributed systems · Patterns

Building on quicksand, nothing is exactly once.

Pat Helland's essay is short — eight pages — and stylistically informal: he writes the way he talks. But the argument has shaped how a generation of distributed-systems engineers think about consistency, retries, and the limits of distributed transactions. The core claim: exactly-once delivery is a fiction. Once you accept that, the design space simplifies.

Authors Pat Helland, Dave Campbell
Year 2009
Venue CIDR

TL;DR

In a distributed system, the sender of a message doesn't know whether the receiver got it. Resending after a timeout is the only safe behaviour, so receivers will see duplicates. The only honest design is one where every operation is idempotent (a repeated operation has no further effect) or commutative (operations can be applied in any order with the same result). Helland walks through several implications: long-lived transactions break, traditional 2PC doesn't scale, and the right pattern is sagas, long-running workflows of idempotent steps with explicit compensation actions that undo completed work. More than fifteen years later, every microservice architecture, every event-driven system, every Stripe integration runs on these ideas.
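The two properties are easy to blur, so here is a small sketch that separates them. The three operations (`add_member`, `increment`, `set_status`) are illustrative names chosen for this example, not anything taken from the paper.

```python
# Three toy operations, to separate idempotence from commutativity.

def add_member(state: frozenset, member: str) -> frozenset:
    """Set insertion: idempotent and commutative."""
    return state | {member}

def increment(balance: int, amount: int) -> int:
    """Addition: commutative, but NOT idempotent (a duplicate double-counts)."""
    return balance + amount

def set_status(_current: str, new: str) -> str:
    """Overwrite: idempotent, but NOT commutative (the last write wins)."""
    return new

# Duplicate delivery is harmless only for idempotent operations.
once = add_member(frozenset(), "alice")
twice = add_member(once, "alice")
assert once == twice
assert increment(increment(0, 10), 10) != increment(0, 10)  # duplicate charge

# Reordering is harmless only for commutative operations.
assert increment(increment(0, 10), 5) == increment(increment(0, 5), 10)
paid_then_shipped = set_status(set_status("new", "paid"), "shipped")
shipped_then_paid = set_status(set_status("new", "shipped"), "paid")
assert paid_then_shipped != shipped_then_paid
```

An operation that is only commutative still needs deduplication in front of it, and one that is only idempotent still needs an ordering rule.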

The problem

By 2009, the wave of "distributed everything" had hit operations hard. Messaging systems advertised "exactly-once" delivery; everyone discovered this was a lie in practice. 2PC across services was advertised as a transaction primitive; everyone discovered it didn't survive the kinds of failures that actually happen. Idempotency was a topic in textbook databases; nobody was reasoning about it for application code.

Helland had spent his career at Tandem, Microsoft, and Amazon building exactly these systems. The paper is partly an indictment of the industry vocabulary ("exactly-once" is a marketing claim, not an engineering property) and partly a positive proposal for how to actually build reliable distributed systems given the limits.

The key idea

Exactly-once is a property of the conversation, not the transport. The sender can't know whether a message arrived. The receiver can't know whether the sender thinks the message arrived. Any protocol that requires both sides to know the same thing is unbuildable. The honest design pattern: the sender retries until acknowledged; the receiver deduplicates on a stable key. This pushes the "exactly-once" property into the application layer.
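A minimal sketch of that pattern, assuming an in-memory receiver and a made-up transport. `Receiver`, `LossyChannel`, and `send_until_acked` are hypothetical names for this illustration; the channel always delivers the message but loses the first few acknowledgements, which is the case the sender cannot distinguish from a lost message.

```python
from dataclasses import dataclass, field

@dataclass
class Receiver:
    """Applies each message at most once by remembering message IDs."""
    seen: set = field(default_factory=set)
    balance: int = 0

    def handle(self, msg_id: str, amount: int) -> str:
        if msg_id not in self.seen:      # first delivery: apply the effect
            self.seen.add(msg_id)
            self.balance += amount
        return "ack"                     # duplicates are acked, not re-applied


class LossyChannel:
    """Hypothetical transport: delivers every message but drops the first
    `drop_acks` acknowledgements, so the sender observes a timeout."""
    def __init__(self, receiver: Receiver, drop_acks: int):
        self.receiver = receiver
        self.drop_acks = drop_acks
        self.deliveries = 0

    def send(self, msg_id: str, amount: int):
        self.deliveries += 1
        ack = self.receiver.handle(msg_id, amount)
        if self.drop_acks > 0:
            self.drop_acks -= 1
            return None                  # ack lost: sender cannot tell what happened
        return ack


def send_until_acked(channel: LossyChannel, msg_id: str, amount: int,
                     max_attempts: int = 10) -> int:
    """Retry with the SAME msg_id until an ack arrives; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if channel.send(msg_id, amount) == "ack":
            return attempt
    raise TimeoutError(f"no ack for {msg_id} after {max_attempts} attempts")


receiver = Receiver()
channel = LossyChannel(receiver, drop_acks=2)
attempts = send_until_acked(channel, "msg-1", 100)
print(attempts, channel.deliveries, receiver.balance)  # 3 3 100
```

The message was delivered three times but applied once. The sketch keeps `seen` in memory forever; a real receiver would have to persist it alongside the effect and decide how long to retain keys.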

Idempotency is the only honest contract. Every endpoint should accept duplicates and produce the same effect as a single delivery. This requires the application to maintain enough state to recognize "I've seen this request before". Stripe's API is the textbook example: every request carries an idempotency key; the server keeps a 24-hour log of responses; duplicates return the original response.

Long workflows need explicit compensation. Distributed transactions don't scale; sagas do. A saga is a sequence of local transactions, each of which has a defined compensation (an "undo") action. If step 5 fails, the system runs the compensations for steps 1-4 in reverse order. Compensation is not the same as rollback — it's a forward-only action that produces an effect that semantically undoes the previous one. (E.g., "issue a refund" compensates "charge the card", but the refund is a separate transaction that may itself need compensation.)
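The reverse-order compensation described above can be sketched as a small in-process runner. This is an illustration under simplifying assumptions, not a production saga framework: `run_saga` and the step names are hypothetical, nothing is persisted, and compensations are assumed never to fail.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done {name}")
            completed.append(step)
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            for done_name, _, compensate in reversed(completed):
                compensate()             # a new forward action, not a rollback
                log.append(f"compensated {done_name}")
            return False
    return True


ledger: List[str] = []

def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")

steps: List[Step] = [
    ("reserve_stock", lambda: ledger.append("reserve"), lambda: ledger.append("release")),
    ("charge_card", lambda: ledger.append("charge"), lambda: ledger.append("refund")),
    ("ship_order", fail_shipping, lambda: None),
]
log: List[str] = []
ok = run_saga(steps, log)
print(ok, ledger)  # False ['reserve', 'charge', 'refund', 'release']
```

Note that the ledger still contains the original `charge` entry after the saga fails. The refund is appended after it, which is the sense in which compensation differs from rollback.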

The semantics of operations change as you scale. A single-node database can guarantee linearisable, ACID transactions. A multi-region database can't, because coordination is too expensive. The paper says: accept this and design with weaker semantics. CRDTs, last-writer-wins with logical clocks, and eventual consistency are all engineered responses to this constraint.
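As one concrete instance of designing with weaker semantics, here is a sketch of a grow-only counter (G-Counter), a standard structure from the CRDT literature rather than from Helland's essay. The replica IDs `"eu"` and `"us"` are made up for the example.

```python
from typing import Dict

class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # max() makes merging commutative, associative, and idempotent.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("eu"), GCounter("us")
a.increment(3)
b.increment(2)

a.merge(b)
a.merge(b)                   # duplicate merge: no further effect
b.merge(a)                   # merge in the opposite direction
print(a.value(), b.value())  # 5 5
```

Neither replica coordinated with the other before accepting writes, yet both converge to the same value once state has been exchanged, regardless of how many times or in what order the merges arrive.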

The idempotency key. Every external-facing API that has any cost (charges, emails, fulfilment) needs an idempotency key. The client generates a unique key per logical operation; the server keeps a log of (key → response) for some window; duplicate requests with the same key return the original response without re-executing. Stripe, Twilio, AWS SDKs all do this. It's the cleanest pattern that exists for getting "exactly-once" behaviour in the presence of network retries.
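A minimal sketch of the server side of this pattern, assuming an in-memory log and a single process. `IdempotentEndpoint` and `charge` are hypothetical names, and this is not how any particular vendor implements it; a real service would persist the log, handle concurrent in-flight duplicates, and check that a reused key carries the same request body.

```python
import time
from typing import Callable, Dict, Tuple

class IdempotentEndpoint:
    """Wraps a handler so that a repeated key returns the stored response."""
    def __init__(self, handler: Callable[[dict], dict], window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.handler = handler
        self.window = window_seconds
        self.clock = clock
        # key -> (time stored, response)
        self.responses: Dict[str, Tuple[float, dict]] = {}

    def call(self, key: str, request: dict) -> dict:
        now = self.clock()
        entry = self.responses.get(key)
        if entry is not None and now - entry[0] < self.window:
            return entry[1]              # duplicate: replay, do not re-execute
        response = self.handler(request)
        self.responses[key] = (now, response)
        return response


charges = []

def charge(request: dict) -> dict:
    """Stand-in for a side-effecting operation such as charging a card."""
    charges.append(request["amount"])
    return {"charge_id": len(charges), "amount": request["amount"]}

endpoint = IdempotentEndpoint(charge, window_seconds=24 * 3600)
first = endpoint.call("order-42", {"amount": 500})
retry = endpoint.call("order-42", {"amount": 500})
print(first == retry, len(charges))  # True 1
```

The client must reuse the same key for every retry of one logical operation and a fresh key for each new operation. Once the window expires, the same key executes again, so the window has to be longer than the longest retry horizon the client can have.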

Contributions

Named the problem. Before this paper, "exactly-once delivery" was a marketing claim; afterward, the systems community treated it as a category error. The "no exactly-once" framing is now standard vocabulary.

Idempotency as a design principle. The paper pushed idempotency from a database property to an application-layer architecture decision. Every microservice tutorial since reaches for idempotency as the first reliability pattern.

Sagas as the replacement for distributed transactions. The pattern was named in 1987 (Garcia-Molina & Salem) but largely forgotten until this paper revived it. The modern microservices "saga pattern" (with compensation actions, orchestrator-based or choreography-based) is a direct descendant.

Operational realism. The paper is unusually candid about the messiness of real distributed systems — partial failures, retries, time-outs, duplicate messages, ambiguous outcomes. The honesty itself was a contribution; most prior writing about distributed transactions had been aspirational.

The CALM / CRDT connection. The argument that systems should be designed around commutative and idempotent operations connects directly to the CRDT research that followed and to Helland's later "Immutability Changes Everything" paper.

Criticisms and limitations

The paper is informal. It's an essay, not a formal protocol. Some readers want the protocols and proofs; this paper doesn't provide them. The companion academic work (Sagas, CRDTs, eventual consistency theory) is needed for rigorous treatment.

Implementation is harder than the prescription. "Make every operation idempotent" sounds simple but in practice requires deep schema changes, careful retry policies, and discipline across teams. Many production systems still get this wrong.

Sagas aren't a free lunch. The paper presents sagas as the obvious replacement for 2PC, but sagas have their own challenges: ordering, isolation (other transactions can see partial saga state), and compensation correctness. Real production saga frameworks (Temporal, Cadence, AWS Step Functions) carry significant operational weight.

Where it shows up today

Stripe's API and every "Idempotency-Key" header in HTTP APIs.

AWS SDKs' built-in retry logic with idempotency tokens.

Temporal, Cadence, Netflix Conductor — workflow orchestrators built around exactly Helland's saga model.

Kafka exactly-once semantics — which the paper would tell you is misnamed. Kafka's "EOS" is producer-level deduplication plus transactional offsets; the application still has to be idempotent for true exactly-once-ness end to end.

Microservice architecture textbooks (Newman, Richardson) reference Helland's arguments throughout.

CQRS, event sourcing, and outbox patterns all assume the world Helland described.

Follow-up reading

  • Immutability Changes Everything. Helland · 2015 · CACM. Helland's follow-up. If state is immutable, idempotency and commutativity are easier, and you get a clearer way to reason about correctness.
  • Life Beyond Distributed Transactions. Helland · 2007 · CIDR. The predecessor essay. Argues distributed transactions don't scale and points toward the patterns of this paper.
  • Sagas. Garcia-Molina & Salem · 1987 · SIGMOD. The original saga paper. Helland's 2009 essay revived these ideas for the microservice era.
  • Conflict-Free Replicated Data Types (CRDTs). Shapiro et al. · 2011 · INRIA TR. Helland's commutativity argument formalised as data structures. Annotated.
  • The Tail at Scale. Dean & Barroso · 2013 · CACM. Why retries (the source of duplicate messages) are unavoidable at scale. Annotated.