Idempotence at scale

"Exactly-once" is a marketing term. The network gives you at-most-once or at-least-once, never both. Real systems pick at-least-once delivery, give every request an idempotency key, and let the receiver dedupe. Stripe, Twilio, Kafka, and Temporal all run versions of the same trick. This page covers how to fake exactly-once well enough that nobody notices.

Why exactly-once is a fiction

TCP, RPC, and every message bus in the world give you a choice: at-most-once or at-least-once. Not both. The sender ships a message and waits for an ack. If the ack doesn't arrive, the sender can't tell whether the message was lost, the ack was lost, or the receiver crashed mid-handling. The only two options are "give up" (at-most-once, and you lose messages) or "retry" (at-least-once, and the receiver sees duplicates).
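The trade-off is easy to see in a toy simulation. The sketch below is illustrative only: the `Receiver` class, the `send` and `run` functions, and the 30% loss rate are all made up for this example and model no real transport. Each attempt can lose the message or lose the ack, and the sender either gives up after one try or keeps retrying.

```python
import random


class Receiver:
    """Counts how many times each message ID was handled."""

    def __init__(self):
        self.handled = {}

    def handle(self, msg_id):
        self.handled[msg_id] = self.handled.get(msg_id, 0) + 1


def send(receiver, msg_id, rng, retry, loss=0.3, max_attempts=50):
    """Send one message over a link that can drop the message or its ack."""
    attempts = max_attempts if retry else 1
    for _ in range(attempts):
        if rng.random() < loss:
            continue             # message lost: the receiver never sees it
        receiver.handle(msg_id)  # the receiver does the work
        if rng.random() < loss:
            continue             # ack lost: looks identical to a lost message
        return True              # ack arrived
    return False                 # sender gave up without an ack


def run(retry, n=1000, seed=1):
    """Return (messages never handled, messages handled more than once)."""
    rng = random.Random(seed)
    receiver = Receiver()
    for msg_id in range(n):
        send(receiver, msg_id, rng, retry)
    lost = n - len(receiver.handled)
    duplicated = sum(1 for count in receiver.handled.values() if count > 1)
    return lost, duplicated


if __name__ == "__main__":
    print("at-most-once  (lost, duplicated):", run(retry=False))
    print("at-least-once (lost, duplicated):", run(retry=True))
```

Without retries the receiver never sees a duplicate but some messages vanish; with retries nothing vanishes but some messages are handled more than once. Nothing the sender can do on its own gets both numbers to zero.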

No clever protocol closes this gap on its own. The Two Generals problem shows why: over a channel that can drop messages, no finite exchange of acks lets both sides be certain they agree. What looks like "exactly-once delivery" in marketing material is always one of two things: at-least-once delivery plus idempotent processing, or a closed system where the producer, the broker, and the consumer all coordinate (Kafka EOS, for example). The honest framing is that exactly-once is a property of the application layer, not the wire.

Why this matters. If you design assuming the network will deliver a message only once, the first real outage will charge a card twice, send the same email seven times, or fulfil one order as three. Every serious production API has a story like this. Plan for duplicates from day one.

Idempotency keys

The standard answer. Each request carries a unique key, usually a UUIDv4 the client generates before the first attempt. The server keeps a table mapping (key → response). The first request with a given key runs the operation and records the response. Any duplicate request with the same key skips the work and returns the cached response.
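A minimal sketch of the server side, assuming a single process and an in-memory dict as the (key → response) table. `ChargeService` and its fields are hypothetical names for this example; a real service would keep the table in a database and guard against two concurrent requests with the same key.

```python
import itertools
import uuid


class ChargeService:
    """Toy receiver that dedupes on an idempotency key."""

    def __init__(self):
        self._responses = {}            # idempotency key -> cached response
        self._ids = itertools.count(1)
        self.charges_made = 0           # how often the real work actually ran

    def charge(self, idempotency_key, amount):
        cached = self._responses.get(idempotency_key)
        if cached is not None:
            return cached               # duplicate: skip the work entirely
        self.charges_made += 1          # first time we see this key: do the work
        response = {
            "charge_id": next(self._ids),
            "amount": amount,
            "status": "succeeded",
        }
        self._responses[idempotency_key] = response
        return response


if __name__ == "__main__":
    service = ChargeService()
    key = str(uuid.uuid4())             # the client generates this once
    first = service.charge(key, 2000)
    retry = service.charge(key, 2000)   # same key, e.g. after a timeout
    print(first == retry, service.charges_made)
```

The duplicate gets the same response object back, so the client can't tell whether it hit the first attempt or a replay, which is the point.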

Stripe's API is the textbook version: every POST accepts an Idempotency-Key header, and the docs are explicit that clients should send one on every retryable request. GitHub applies the same idea in the other direction: each webhook delivery carries an X-GitHub-Delivery GUID that the receiver can dedupe on. AWS calls them "idempotency tokens" and exposes them as client-supplied request parameters, such as ClientToken on EC2's RunInstances.

POST /v1/charges HTTP/1.1
Host: api.stripe.com
Idempotency-Key: 4b9d8f2a-1c3e-4f5a-b2d7-8e1c9a4b6f30
Content-Type: application/x-www-form-urlencoded

amount=2000&currency=usd&source=tok_visa&description=Order+1834

The key has to be unique per logical operation, not per attempt. A client that generates a fresh UUID on every retry has defeated the whole mechanism, and that's the most common incident pattern. Generate the key once when the user clicks the button, keep it across retries, and throw it away only when the request finally succeeds or is explicitly abandoned.
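On the client side, the rule comes down to where the key is generated: outside the retry loop, not inside it. The wrapper below is a sketch under that assumption. `submit_with_retries`, `Timeout`, and `FlakyServer` are hypothetical names, and a real client would add backoff and a retry budget.

```python
import uuid


class Timeout(Exception):
    """Raised when no reply arrived; the outcome of the request is unknown."""


def submit_with_retries(send, payload, max_attempts=5):
    key = str(uuid.uuid4())         # generated ONCE per logical operation
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(key, payload)
        except Timeout as exc:
            last_error = exc        # outcome unknown: retry with the SAME key
    raise last_error


class FlakyServer:
    """Stand-in server: it does the work, but its first reply gets lost."""

    def __init__(self):
        self.responses = {}
        self.charges = 0
        self.keys_seen = []
        self._drop_next_reply = True

    def send(self, key, payload):
        self.keys_seen.append(key)
        if key not in self.responses:
            self.charges += 1
            self.responses[key] = {"status": "succeeded", "amount": payload["amount"]}
        if self._drop_next_reply:
            self._drop_next_reply = False
            raise Timeout("reply lost")
        return self.responses[key]


if __name__ == "__main__":
    server = FlakyServer()
    print(submit_with_retries(server.send, {"amount": 2000}))
    print("attempts:", len(server.keys_seen), "charges:", server.charges)
```

Moving the `uuid.uuid4()` call inside the loop is the bug described above: the second attempt would arrive under a new key and the server would charge again.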

The dedup window

How long do you keep the (key → response) table? Too short and a duplicate that arrives after the window expires runs the operation again. Too long and the table balloons. The rule of thumb: longer than the longest plausible retry budget.

Stripe keeps idempotency keys for 24 hours. Twilio keeps them for 7 days. Cloudflare's Workers API keeps them for 30 minutes. The right answer depends on how long your clients keep retrying. A mobile app that retries on next launch needs days; an internal RPC with a 30-second budget needs minutes. Pick a number, write it in the docs, and make sure clients give up before that window closes.
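A sketch of a dedup table with a window, assuming entries expire a fixed number of seconds after they are written. `DedupStore` is a hypothetical name, and the clock is injectable so the expiry can be exercised without waiting.

```python
import time


class DedupStore:
    """Key -> response cache whose entries expire after window_seconds."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._entries = {}              # key -> (expires_at, response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._entries[key]      # window closed: forget the key
            return None
        return response

    def put(self, key, response):
        self._entries[key] = (self.clock() + self.window, response)


if __name__ == "__main__":
    now = [0.0]                         # fake clock for the demo
    store = DedupStore(window_seconds=60, clock=lambda: now[0])
    store.put("key-1", {"status": "succeeded"})
    now[0] = 59.0
    print(store.get("key-1"))           # still inside the window
    now[0] = 60.0
    print(store.get("key-1"))           # window closed: treated as a new request
```

The last line is the failure mode to design around: a retry that lands after the window looks like a brand-new request, so the client's retry budget has to end first.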

The outbox pattern

Idempotency keys cover the inbound side. The harder problem is the outbound side: a local database write that has to trigger an external message, like "charge succeeded → send receipt email" or "order placed → publish to Kafka". You now have two writes in two different systems that must both happen or both not happen, which is a distributed transaction, and those are exactly the thing the rest of this site warns against.

The outbox pattern sidesteps it. Inside the local transaction, you write the message into a same-database table called the outbox. The transaction commits atomically: both the business state and the outbox row are saved together, or neither is. A separate worker process then polls the outbox, publishes each row to the external system with an idempotency key, and marks the row as published once the destination acks.

Why the outbox is unavoidable. Without it, you either (a) publish before committing and risk publishing a message for work that rolled back, or (b) commit before publishing and risk losing the message if the process dies between commit and publish. The outbox makes the commit and the intent-to-publish atomic, and moves the at-least-once delivery problem to a place where idempotency keys can solve it. Every event-driven architecture that survives production ends up with some version of this.
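A minimal sketch of the pattern, with SQLite standing in for the local database. The table layout and the names `create_outbox_schema`, `place_order`, and `relay_once` are assumptions for this example; `publish` is whatever sends to the external system.

```python
import json
import sqlite3
import uuid


def create_outbox_schema(conn):
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
        CREATE TABLE outbox (
            message_id TEXT PRIMARY KEY,          -- doubles as idempotency key
            payload    TEXT NOT NULL,
            published  INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn, item):
    """Write the business row and the outbox row in ONE local transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        message = {"event": "order_placed", "order_id": order_id}
        conn.execute(
            "INSERT INTO outbox (message_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), json.dumps(message)),
        )
    return order_id


def relay_once(conn, publish):
    """One polling pass: publish unpublished rows, then mark them done."""
    rows = conn.execute(
        "SELECT message_id, payload FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for message_id, payload in rows:
        # If publish raises, or the process dies before the UPDATE, the row
        # stays unpublished and is sent again on the next pass.
        publish(message_id, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE message_id = ?",
                (message_id,),
            )
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_outbox_schema(conn)
    place_order(conn, "book")
    relay_once(conn, lambda message_id, message: print(message_id, message))
```

The relay is at-least-once by construction: a crash between `publish` and the `UPDATE` resends the row. That's why `message_id` travels with the message, so the destination can dedupe on it.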

Sagas

A saga is a long-running workflow built from a sequence of idempotent steps, each with an explicit compensation action that undoes its effect. "Book flight, book hotel, charge card." If the card charge fails, the saga runs "cancel hotel" then "cancel flight" instead of trying to roll back a distributed transaction that was never really atomic in the first place.

Hector Garcia-Molina and Kenneth Salem's 1987 paper Sagas is the canonical reference; Pat Helland and Dave Campbell's Building on Quicksand is the companion read on why every step has to be idempotent. The practical implementations are Temporal (and its predecessor Cadence at Uber), AWS Step Functions, and Netflix Conductor. They all give you the same shape: define each step as an idempotent activity, define the compensation, and let the orchestrator handle retries, timeouts, and failure recovery. The orchestrator's job is to remember where it was and resume safely after a crash, which it can only do because every step is idempotent.
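The control flow is small enough to sketch in a few lines. This is a toy, not how Temporal or Step Functions work internally: `run_saga` is a hypothetical helper, and it keeps its progress in memory, whereas a real orchestrator persists each step so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If a step fails, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            for _done_name, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensation))
    return True


if __name__ == "__main__":
    log = []

    def charge_card():
        raise RuntimeError("card declined")

    ok = run_saga([
        ("flight", lambda: log.append("book flight"), lambda: log.append("cancel flight")),
        ("hotel", lambda: log.append("book hotel"), lambda: log.append("cancel hotel")),
        ("card", charge_card, lambda: log.append("refund card")),
    ])
    print(ok, log)
```

The failed step's own compensation never runs, because the step never completed. Compensations need to be idempotent too, since the orchestrator may retry them.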

Effectively-once

The honest term for what the industry actually delivers. At-least-once delivery plus idempotent processing equals exactly-once-looking semantics at the application layer. Kafka Streams uses this exact framing. Kafka's "exactly-once semantics" feature, added in 0.11 (KIP-98), is the canonical example: it combines a transactional producer that dedupes by producer ID and sequence number with transactional consumer offsets, so a read-process-write loop inside Kafka behaves as if each input were processed exactly once.

The fine print is that Kafka EOS only holds within Kafka. The moment you read from Kafka and write to Postgres, you're back to at-least-once and need idempotency keys again. The protocol gives you exactly-once within a closed system; the application has to carry it to the edges.
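One common way to carry it to the edge is to store the consumed offset in the same database transaction as the write it caused. The sketch below assumes that approach, with SQLite standing in for Postgres and plain function arguments standing in for a Kafka record. The table names and `apply_record` are hypothetical.

```python
import sqlite3


def create_sink_schema(conn):
    conn.executescript("""
        CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL);
        CREATE TABLE applied_offsets (
            topic_partition TEXT PRIMARY KEY,
            last_offset     INTEGER NOT NULL
        );
    """)


def apply_record(conn, topic_partition, offset, account, delta):
    """Apply one record unless its offset was already applied.

    The balance update and the offset bookkeeping commit together, so a
    redelivered record is detected and skipped.
    """
    with conn:
        row = conn.execute(
            "SELECT last_offset FROM applied_offsets WHERE topic_partition = ?",
            (topic_partition,),
        ).fetchone()
        if row is not None and offset <= row[0]:
            return False                      # duplicate delivery: no-op
        conn.execute(
            "INSERT OR IGNORE INTO balances (account, amount) VALUES (?, 0)",
            (account,),
        )
        conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (delta, account),
        )
        conn.execute(
            "INSERT OR REPLACE INTO applied_offsets (topic_partition, last_offset) "
            "VALUES (?, ?)",
            (topic_partition, offset),
        )
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_sink_schema(conn)
    print(apply_record(conn, "payments-0", 0, "alice", 100))  # applied
    print(apply_record(conn, "payments-0", 0, "alice", 100))  # redelivered, skipped
    print(conn.execute("SELECT amount FROM balances").fetchone())
```

Here the (partition, offset) pair plays the role of the idempotency key. This relies on records within a partition arriving in offset order, which is an assumption of the sketch.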

Where to put the idempotency key

| Location | Example | Trade-off |
| --- | --- | --- |
| HTTP header | Stripe's `Idempotency-Key`, GitHub's `X-GitHub-Delivery` | Clean separation from payload, no schema change. Easy for clients to forget. |
| Request body | Deterministic ID derived from content hash | No client cooperation required. Mistakes (wrong hash inputs) silently re-execute. |
| URL path | `PUT /orders/{order_id}` | Naturally idempotent at the HTTP level. Best when the resource ID is client-known. |
| Producer-assigned ID | Kafka producer ID + sequence number | Invisible to the application; broker handles dedup. Closed-system only. |

Most public APIs go with the header pattern because it lets clients adopt idempotency gradually without breaking the existing payload schema. Internal services often prefer URL paths with client-generated resource IDs (ULIDs work well) because PUT is idempotent by definition and needs no extra machinery.
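The content-hash option in the table is the one that fails quietly, so here is a sketch of it. `derive_key` is a hypothetical helper that hashes a canonical JSON form of the payload, and the field names are made up for the example.

```python
import hashlib
import json


def derive_key(operation, payload):
    """Derive a deterministic idempotency key from the request content."""
    # Sorted keys and fixed separators make the JSON canonical, so the
    # same logical payload always hashes to the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    data = f"{operation}\n{canonical}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    a = derive_key("charge", {"order": 1834, "amount": 2000, "currency": "usd"})
    b = derive_key("charge", {"currency": "usd", "amount": 2000, "order": 1834})
    print(a == b)   # field order does not matter

    # The pitfall: a per-attempt field in the hash inputs changes the key on
    # every retry, so the receiver sees each retry as a new request.
    c = derive_key("charge", {"order": 1834, "amount": 2000, "sent_at": "12:00:01"})
    d = derive_key("charge", {"order": 1834, "amount": 2000, "sent_at": "12:00:04"})
    print(c == d)
```

The opposite mistake exists too: leave out a field that distinguishes two legitimate requests (here, the order number) and the second one gets swallowed as a duplicate.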

Real-world failures

Duplicate Stripe charges. A client retries a POST /charges without an idempotency key after a timeout. The first request actually succeeded; the second creates a second charge. Every payment company has hit this at some point. The fix is mandatory idempotency keys in the client SDK, with a timeout-and-retry wrapper that reuses the same key.
Email duplicate-send. A transactional email worker crashes after calling the SMTP provider but before marking the job done. On restart the job re-runs and the user gets two copies. The standard fix is a dedup table keyed on (recipient, message_hash, day) at the send layer.
Job-queue double-processing. A consumer pulls a job, starts processing, then crashes before acking. The broker redelivers, a second worker picks it up, and the work runs twice. Fix: a per-job idempotency key plus a status table the worker checks before doing anything externally visible.
Webhook replay. Webhook senders redeliver events when the receiver doesn't answer with a 2xx: Stripe, for example, retries for up to three days, and GitHub and Shopify deliveries can also arrive more than once. Receivers that don't dedupe by event ID will process the same event more than once. Every webhook handler should start with "have I seen this event ID before".
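The webhook case is the simplest of the four to sketch. The handler below assumes the event ID arrives in the X-GitHub-Delivery header and that seen IDs live in an in-memory set. `handle_webhook` and `process` are hypothetical names, and a production handler would keep the IDs in a durable store.

```python
def handle_webhook(headers, body, seen_ids, process):
    """Return an HTTP status code for one webhook delivery."""
    event_id = headers.get("X-GitHub-Delivery")
    if event_id is None:
        return 400                  # no ID to dedupe on: reject
    if event_id in seen_ids:
        return 200                  # duplicate: ack so the sender stops retrying
    try:
        process(body)
    except Exception:
        return 500                  # not marked as seen, so a redelivery is processed
    seen_ids.add(event_id)
    return 200


if __name__ == "__main__":
    seen = set()
    processed = []
    headers = {"X-GitHub-Delivery": "72d3162e-cc78-11e3-81ab-4c9367dc0958"}
    print(handle_webhook(headers, {"action": "opened"}, seen, processed.append))
    print(handle_webhook(headers, {"action": "opened"}, seen, processed.append))
    print(len(processed))
```

Marking the ID only after `process` succeeds means a crash between the two can still run `process` twice. Closing that gap requires recording the ID in the same transaction as the work, or making `process` idempotent itself.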

Common patterns

The vocabulary that shows up in code reviews:

PUT for idempotent, POST for non-idempotent. The REST convention. A PUT /orders/{id} with the same body should be safe to retry; a POST /orders creates a new order on every call unless you add an idempotency key.
Unique constraint + insert-or-skip. In Postgres, a unique constraint on the idempotency key column plus INSERT ... ON CONFLICT DO NOTHING makes the database the dedup oracle. MySQL has INSERT IGNORE.
Conditional writes. DynamoDB's ConditionExpression, Spanner's Mutation.insert (vs insertOrUpdate), and any CAS primitive let you say "only write if the row doesn't already exist", which gives you per-key idempotency without a separate table.

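-- Postgres: dedup on idempotency key with a unique constraint
CREATE TABLE charge_attempts (
    idempotency_key UUID PRIMARY KEY,
    charge_id       BIGINT,                 -- NULL until the charge completes
    response_body   JSONB,                  -- NULL until the charge completes
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Step 1: claim the key before doing any externally visible work.
INSERT INTO charge_attempts (idempotency_key)
VALUES ($1)
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING idempotency_key;

-- If RETURNING is empty, the row already existed:
-- look it up and return the cached response instead of re-charging.
-- (A NULL response_body means the first attempt is still in flight.)

-- Step 2: once the charge succeeds, record the result under the same key.
UPDATE charge_attempts
SET charge_id = $2, response_body = $3
WHERE idempotency_key = $1;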

What the big systems do

| System | Mechanism | Window |
| --- | --- | --- |
| Stripe | `Idempotency-Key` header, server caches response | 24 hours |
| Twilio | Per-message `MessagingServiceSid` + dedup ID | 7 days |
| GitHub | `X-GitHub-Delivery` on retried webhooks | 3 days (delivery window) |
| AWS DynamoDB | `ConditionExpression` for CAS, idempotency tokens on writes | 10 minutes (token TTL) |
| Kafka EOS | Producer ID + sequence number, transactional offsets | Within a Kafka transaction |
| Temporal | Workflow ID dedup + activity-level idempotency keys | Workflow retention period |

The honest rule

Every external-facing API with any real-world cost (payments, emails, SMS, push notifications, fulfilment, shipping labels) needs an idempotency key. No exceptions. Designing it in from day one is cheap; retrofitting it after the first duplicate-charge incident is expensive, and by then the reputational damage is done.

Internal APIs benefit too. A microservice call that can't be safely retried is a latent outage waiting for the next network hiccup. The cheap version is "use PUT with a client-known ID where you can"; the proper version is "every mutating call carries an idempotency key". Either way, the receiver should handle the same request twice without harm.
