Webhooks

A webhook is a server-to-server callback over HTTP. Something happens on one service, and it posts a small JSON event to a URL you gave it. The idea is simple enough to explain in a sentence, but the engineering lives in the failure cases: what to do when your receiver is down, how the receiver knows the call really came from the sender, and what happens when the same event arrives twice. The honest version of webhooks is a small message-queue problem wearing a friendly HTTP face. This page walks the whole thing: push versus polling, the delivery and retry model, signing and replay defense, idempotency, asynchronous processing, and what it takes to run one as a provider.

Push instead of poll

Start with the problem webhooks solve. Your application needs to react to something that happens inside someone else's system: a payment cleared, a build finished, a file finished uploading, a user replied. The naive way to find out is to ask, over and over. You call their API every few seconds and check whether anything changed. That is polling, and it has two costs that grow as you scale. Most of your calls return "nothing new," so you burn requests, rate limit budget, and money to learn that the world is unchanged. And the thing you actually care about arrives late, because you only see it on the next poll, which means your reaction time is bounded below by your polling interval.

A webhook flips the direction. Instead of you asking, the other service tells you. You register a URL once, and from then on the provider sends a POST to that URL the moment an event occurs. The work that used to be spread across thousands of empty poll requests collapses into one delivery per actual event, and the latency drops from "up to one poll interval" to "however long the network and the sender's queue take," which is usually under a second. You traded a chatty client loop for a callback you have to be ready to receive.

Polling pays for empty checks and adds latency up to one interval. A webhook delivers once, at the moment of the event.
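
To put rough numbers on that trade, here is a small arithmetic sketch. The workload (one poll every five seconds, twenty real events a day) is a made-up assumption, and the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def polling_cost(poll_interval_s: float, events_per_day: int) -> dict:
    """Compare a polling loop with one webhook delivery per event."""
    polls = int(SECONDS_PER_DAY / poll_interval_s)
    # At most one poll per event can be "useful"; the rest learn nothing new.
    useful = min(events_per_day, polls)
    return {
        "polls_per_day": polls,
        "empty_fraction": 1 - useful / polls,
        "worst_case_latency_s": poll_interval_s,
        "webhook_deliveries_per_day": events_per_day,
    }

# Hypothetical workload: poll every 5 seconds, 20 real events per day.
cost = polling_cost(poll_interval_s=5, events_per_day=20)
print(cost)
```

Under these assumptions the poller makes 17,280 requests a day to learn about 20 events, and more than 99 percent of those requests come back empty.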

Polling is not always the wrong choice. It is simpler to operate, it needs no public endpoint, and it degrades gracefully when the other side is flaky, because you just try again on the next tick and you control the schedule. Webhooks win when events are sparse relative to your poll rate and when latency matters, which describes most integrations worth building. The catch is that the moment you accept a push, you have signed up to run a highly available, public, authenticated receiver, and the rest of this page is about doing that without getting burned. For the bidirectional, low-latency cousin where the connection stays open both ways, see WebSockets and SSE; webhooks are the fire-and-forget, server-to-server end of the same family.

A typical event payload

Before the mechanics, look at what actually lands on the wire. A webhook delivery is an ordinary HTTP POST. The body is JSON describing the event, and a few headers carry the metadata the receiver needs to trust and deduplicate it.

POST /webhooks/billing HTTP/1.1
Host:               app.example.com
Content-Type:       application/json
User-Agent:         ExampleHooks/1.0 (+https://example.com/docs/webhooks)
Webhook-Id:         msg_2NVXK3a8C9PqYg1HkVcXf9
Webhook-Timestamp:  1714820400
Webhook-Signature:  v1,g0hM9SsE+OTPJTGt/tmIqtSyZUEOZjTC...

{
  "id":      "evt_1Hg82C2eZvKYlo2C0yX8",
  "type":    "charge.succeeded",
  "created": 1714820400,
  "data":    { "object": { "id": "ch_001", "amount": 4200, "currency": "usd" } }
}

Three pieces matter. There is the event itself, which is just JSON with a type and a payload. There is an identifier, Webhook-Id, that the receiver uses to spot a duplicate. And there is a signature, Webhook-Signature, paired with a Webhook-Timestamp, that lets the receiver prove the request came from the real sender and is recent. Header names differ between providers, but those three jobs are always present in any serious design: identify, authenticate, and date the event. The body is small on purpose. A good event carries an id, a type, and just enough data to act on, or a reference the receiver can fetch with a normal authenticated API call if it needs more.

The delivery model: at-least-once

Here is the single fact that shapes every other decision. A webhook is delivered at least once, never exactly once. When the sender posts your event and your receiver is down, returns a 5xx, or times out, the sender has to choose between dropping the event and trying again. Dropping events quietly is the worst possible behaviour for a notification system, so every provider that takes itself seriously retries. The consequence is unavoidable: sometimes the same event arrives twice, occasionally three or four times. This is not a provider being sloppy. It is the only honest delivery guarantee a distributed system can make over an unreliable network.

The classic reason a duplicate appears is a lost acknowledgement. Your receiver gets the event, processes it, and starts to return 200. The connection drops before the sender sees the response. From the sender's point of view the delivery failed, so it retries, and now you have processed the event once but you are about to receive it again. There was nothing wrong on either side. The ack just got lost, and at-least-once delivery means the sender must assume the worst and resend.
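
The lost-ack scenario is easy to simulate. The sketch below is a toy model, not a real network: acks are dropped at random with an assumed 30 percent probability, and the sender simply re-posts until one gets through. One receiver acts on every delivery; the other remembers event ids and ignores repeats, which is the remedy the rest of this page develops. All names are illustrative.

```python
import random

def deliver_with_lost_acks(handle, event_ids, ack_loss, rng, max_attempts=50):
    """Sender side: repeat each event until an ack makes it back. Returns total POSTs."""
    posts = 0
    for event_id in event_ids:
        for _ in range(max_attempts):
            posts += 1
            handle(event_id)                  # the receiver always does its work...
            if rng.random() >= ack_loss:      # ...but the ack may die on the way back
                break
    return posts

naive_charges = []                  # naive receiver: acts on every delivery
seen, dedup_charges = set(), []     # careful receiver: acts once per event id

def naive(event_id):
    naive_charges.append(event_id)

def dedup(event_id):
    if event_id not in seen:
        seen.add(event_id)
        dedup_charges.append(event_id)

events = [f"evt_{i}" for i in range(100)]
posts = deliver_with_lost_acks(naive, events, ack_loss=0.3, rng=random.Random(1))
deliver_with_lost_acks(dedup, events, ack_loss=0.3, rng=random.Random(1))
print(posts, len(naive_charges), len(dedup_charges))
```

The naive receiver performs one side effect per POST, so every lost ack becomes a double charge. The deduplicating receiver sees exactly the same traffic and still ends up with one side effect per event.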

"Exactly-once" is achievable only if both sides cooperate, and even then it is really "at-least-once delivery plus deduplication on the receiver." The sender promises to keep trying until it hears a success; the receiver promises to recognise an event it has already handled and to no-op on the repeat. Put those two halves together and the effect is exactly-once, even though the wire saw the event more than once. That is the same trick the rest of distributed systems uses, and it has a name worth knowing: idempotence. Designing for at-least-once is not optional. Treat webhooks as exactly-once and you will ship double charges, duplicate emails, and inventory that decrements twice.

Retries, backoff, and jitter

Since the sender retries, the shape of that retry schedule is a real design choice. A tight, fixed retry hammers a struggling receiver and rarely helps, because the receiver is usually down for longer than a few seconds. The standard answer is exponential backoff: wait a little after the first failure, then roughly double the wait each time, up to a cap. A typical policy looks like try immediately, then at one minute, ten minutes, one hour, six hours, twelve hours, twenty-four hours, then give up. Five to seven attempts spread over a day or two is a sensible default. That is long enough to ride out a deployment or a brief outage, and short enough that nobody discovers their webhooks were silently catching up from two days ago.

The delivery lifecycle. A 2xx ends it. Anything else schedules a retry on a growing, jittered delay, until the attempt cap sends it to the dead-letter queue.

Plain backoff has one more flaw, and jitter fixes it. Imagine a receiver goes down for ten minutes while ten thousand events queue up behind it. Without randomness, all ten thousand retries fire on the same synchronised schedule, so the instant the receiver recovers it is hit by a thundering herd far larger than its normal load, and it falls over again. Adding a random spread, commonly something like plus or minus fifty percent, to every backoff interval smears those retries across a window so the recovering receiver sees a ramp instead of a wall. The classic short writeup is the AWS piece on exponential backoff and jitter; the headline result is that "full jitter" beats no jitter by a wide margin under contention.
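
As a minimal sketch of the two jitter styles mentioned above, assuming a one-minute base delay and a twenty-four hour cap (both values are placeholders, as are the function names):

```python
import random

def backoff(attempt: int, base: float = 60.0, cap: float = 86_400.0) -> float:
    """Un-jittered delay in seconds before retry number `attempt` (1-based)."""
    return min(base * 2 ** (attempt - 1), cap)

def spread_jitter(delay: float, rng: random.Random) -> float:
    """Plus or minus fifty percent around the computed delay."""
    return delay * rng.uniform(0.5, 1.5)

def full_jitter(delay: float, rng: random.Random) -> float:
    """'Full jitter': anywhere between zero and the computed delay."""
    return rng.uniform(0.0, delay)

rng = random.Random(0)
for attempt in range(1, 6):
    d = backoff(attempt)
    print(attempt, d, round(spread_jitter(d, rng)), round(full_jitter(d, rng)))
```

The spread variant keeps each retry near its nominal time, while full jitter trades a shorter average wait for a wider smear, which is what flattens the herd when thousands of retries share a schedule.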

Status codes tell the sender what to do. A 5xx or a network error or timeout means "try again, the failure is probably transient." A 429 means "you are going too fast," and a polite sender honours the Retry-After header and slows down rather than giving up. A 4xx that is not 429 means "this request is wrong and will stay wrong," so retrying is pointless and the sender should stop and mark a permanent failure. Encoding that distinction is the whole logic of a single delivery attempt:

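# one delivery attempt (sketch; uses the requests library)
import random
import requests

MAX_ATTEMPTS, BASE, MAX_DELAY = 7, 60, 24 * 3600    # illustrative values, in seconds

def deliver(url, event, attempt, schedule_retry):
    # schedule_retry(event, next_attempt, delay) is supplied by your job queue
    retry_after = None
    try:
        resp = requests.post(url, json=event, timeout=10)
        if 200 <= resp.status_code < 300:
            return "SUCCESS"
        if 400 <= resp.status_code < 500 and resp.status_code != 429:
            return "PERMANENT_FAILURE"      # client says "do not retry"
        if resp.status_code == 429:
            retry_after = resp.headers.get("Retry-After")
    except (requests.Timeout, requests.ConnectionError):
        pass                                # transient, fall through to retry

    if attempt >= MAX_ATTEMPTS:
        return "DEAD_LETTER"

    delay  = min(BASE * 2 ** attempt, MAX_DELAY)
    delay *= random.uniform(0.5, 1.5)       # jitter
    if retry_after and retry_after.isdigit():
        delay = max(delay, int(retry_after))    # honour Retry-After (seconds form only)
    schedule_retry(event, attempt + 1, delay)
    return "RETRY_SCHEDULED"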

Signing: proof the call came from the sender

Your webhook endpoint is a URL on the public internet. Anyone who learns it, or guesses it, can POST to it. If you act on whatever JSON arrives, an attacker can forge a charge.succeeded event and get free goods, or fire a flood of fake events to confuse your system. The endpoint being secret is not security; URLs leak through logs, proxies, and browser history. You need to authenticate the request itself, and the standard tool is a message authentication code, almost always HMAC-SHA256.

The shape is a shared secret known only to the sender and the receiver. The sender computes a keyed hash over the request, and the receiver, holding the same secret, recomputes the same hash and checks that they match. Because the attacker does not have the secret, they cannot produce a valid signature for a body they invented, and any tampering with the body in transit changes the hash and fails the check. Two details turn this from "looks right" into "actually safe," and both are easy to get wrong.

HMAC verification. Both sides hash the timestamp and the raw body with the shared secret; the receiver compares in constant time and rejects a mismatch.

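# sender (a backend job)
import hashlib, hmac, json, time

secret  = "whsec_AbCdEf1234"
ts      = str(int(time.time()))
body    = json.dumps(event).encode()            # send exactly these bytes as the request body
signed  = ts.encode() + b"." + body
# hex-encoded here for readability; many providers base64-encode the digest instead
sig     = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
headers = { "Webhook-Timestamp": ts, "Webhook-Signature": f"v1,{sig}" }

# receiver (req stands for your framework's request object)
ts           = req.headers["Webhook-Timestamp"]
received_sig = req.headers["Webhook-Signature"].removeprefix("v1,")   # drop the version tag
signed   = ts.encode() + b"." + raw_body        # raw_body = exact bytes received
expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
if not hmac.compare_digest(expected, received_sig):
    return 400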

The first detail is that the signature must cover the timestamp, not just the body. If you sign only the body, an attacker who captures one valid request, say off a misconfigured proxy log, can replay it byte for byte forever, and every replay verifies because the signature is correct. By folding the timestamp into the signed string and refusing requests whose timestamp is more than a few minutes old, you give a captured request a short shelf life. The replay window shrinks from "forever" to "five minutes," and inside that window your idempotency check handles the rest.
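
A small sketch of that freshness check, using the timestamp-dot-body scheme from the example above. The five-minute tolerance and the function name are assumptions, and `now` is passed in explicitly so the behaviour is easy to test.

```python
import hashlib
import hmac

TOLERANCE_S = 300  # five minutes; an assumed value, tune to your clock skew

def verify(secret: bytes, ts: str, raw_body: bytes, sig_hex: str, now: int) -> bool:
    """Accept only a correctly signed request whose timestamp is recent."""
    signed = ts.encode() + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig_hex):
        return False                           # forged, tampered, or timestamp altered
    return abs(now - int(ts)) <= TOLERANCE_S   # genuine, but is it fresh?

secret, body, ts = b"example-secret", b'{"id":"evt_1"}', "1714820400"
sig = hmac.new(secret, ts.encode() + b"." + body, hashlib.sha256).hexdigest()

print(verify(secret, ts, body, sig, now=1714820460))            # one minute later
print(verify(secret, ts, body, sig, now=1714824000))            # an hour later: replay
print(verify(secret, "1714824000", body, sig, now=1714824000))  # timestamp rewritten
```

The third call is the important one: because the timestamp is inside the signed string, an attacker cannot refresh a captured request by editing the header without invalidating the signature.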

The second detail is to verify over the raw body bytes, before any parsing, and to compare in constant time. JSON has many byte-level representations of the same logical value, so if you decode the body and re-encode it to hash, your whitespace and key ordering will differ from the sender's and the signature will never match. Capture the exact bytes off the wire and hash those. And use a constant-time comparison such as hmac.compare_digest rather than a plain ==; an ordinary string compare can leak how many leading bytes matched through timing, which a patient attacker can turn into a forged signature. These are small lines of code, but they are the difference between a signature that means something and one that is theatre.
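
The raw-bytes pitfall can be reproduced in a few lines. This sketch assumes the same timestamp-dot-body signing scheme as above, with a made-up secret and a body whose spacing is deliberately irregular.

```python
import hashlib
import hmac
import json

secret = b"whsec_example"            # hypothetical secret
ts = "1714820400"
raw_body = b'{"type": "charge.succeeded",  "id":"evt_1"}'   # bytes as sent on the wire

def sign(ts: str, body: bytes) -> str:
    return hmac.new(secret, ts.encode() + b"." + body, hashlib.sha256).hexdigest()

sent_sig = sign(ts, raw_body)

# Wrong: parse, re-serialise, then hash. The logical value is the same,
# but the spacing changes, so the bytes and the signature differ.
reencoded = json.dumps(json.loads(raw_body)).encode()
wrong_ok = hmac.compare_digest(sign(ts, reencoded), sent_sig)

# Right: hash the bytes exactly as received.
right_ok = hmac.compare_digest(sign(ts, raw_body), sent_sig)

print(wrong_ok, right_ok)
```

In practice this means reading the request body as bytes before any framework middleware parses it, and parsing JSON only after the signature has checked out.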

Idempotency and duplicate events

Signing proves who sent the event and that it is recent. It does nothing about duplicates, because a retried event is a perfectly genuine, perfectly signed copy of one you may have already handled. This is where the receiver has to be idempotent: processing the same event twice must produce the same result as processing it once. The simplest, most reliable way to get there is to remember the event ids you have already acted on and to short-circuit on a repeat.

def handle(req):
    if not verify_signature(req):            return 400
    if abs(now() - req.timestamp) > 300:     return 400   # stale, possible replay

    # claim the id atomically; nx = only if not already set
    key = f"webhook:{req.id}"
    first_time = redis.set(key, "1", nx=True, ex=86400)
    if first_time:
        try:
            enqueue(req.event)               # do the real work later, off the request
        except Exception:
            redis.delete(key)                # release the claim so the sender's retry is not skipped
            return 500
    return 200                               # ack either way

The nx flag makes the claim atomic: only the first caller that sets the key wins, and every subsequent delivery of the same id sees the key already present and skips the work. Set the TTL on the key at least as long as the sender's full retry window, with some margin, so you are not storing dedup state forever but a late retry still finds the key. Twenty-four hours suits a sender that gives up within a day; a schedule that keeps retrying for two days needs a longer TTL. If you would rather not run a separate store, a unique index on the event id in your database does the same job; the insert fails on a duplicate and you treat that failure as "already seen." Either way, the dedup check must run before any side effect, and the side effect itself should be written so that even a race that slips two copies through cannot do damage twice, for instance by keying the write on the event id so the second write is a harmless overwrite. Idempotency keys are the same idea generalised to ordinary APIs; the idempotence page covers the patterns in depth.
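
The database variant can be sketched with the standard library's sqlite3 module. The table and function names are illustrative, and a production system would use its own database and also prune old rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def claim(event_id: str) -> bool:
    """Return True for the first delivery of an id, False for a repeat."""
    try:
        with db:   # commits on success, rolls back if the insert raises
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False   # the primary key rejected a duplicate: already seen

print(claim("evt_1"), claim("evt_1"), claim("evt_2"))
```

If the insert and the side effect live in the same database, running them in one transaction gives a stronger guarantee than the claim alone, because the claim is rolled back whenever the work fails.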

Respond fast, process async

There is a strong temptation to do the real work inside the request handler, return 200 when it is done, and call it a day. Resist it. The sender is holding a connection open and counting down a timeout, usually around ten seconds. If your handler calls three other services, writes to a database, and sends an email before it answers, you have turned a cheap delivery into a slow one, and any hiccup in those downstream calls turns into a timeout, which the sender reads as a failure, which triggers a retry, which runs the whole slow path again. You have built a system that gets slower and more duplicated exactly when it is under stress.

The pattern that holds up is to make the handler do almost nothing: verify the signature, check for a duplicate, drop the event onto a queue, and return 200 immediately. A worker pulls from the queue and does the actual processing on its own schedule, with its own retries, fully decoupled from the sender's timeout. The receiver acks in milliseconds, the sender is happy, and a slow downstream dependency only backs up your queue instead of causing webhook retries. This is the same producer–consumer split that message queues exist to provide, and a webhook receiver is one of their cleanest use cases.

The receiver acks in milliseconds. Verification and dedup gate the queue; a worker does the expensive processing later, on its own schedule.
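
Here is a toy version of that split, using an in-process queue and one worker thread. It is only a sketch: a real receiver would use a durable queue so that accepted events survive a restart, and signature verification is omitted for brevity. All names are illustrative.

```python
import queue
import threading

jobs = queue.Queue()
processed = []

def handler(event: dict, seen: set) -> int:
    """Fast path: dedup, enqueue, ack. No slow work happens here."""
    if event["id"] not in seen:
        seen.add(event["id"])
        jobs.put(event)
    return 200

def worker() -> None:
    while True:
        event = jobs.get()
        if event is None:               # sentinel: shut down
            break
        processed.append(event["id"])   # stand-in for the slow, real work

t = threading.Thread(target=worker)
t.start()

seen = set()
statuses = [handler({"id": i}, seen) for i in ("evt_1", "evt_2", "evt_1")]

jobs.put(None)   # ask the worker to stop once the queue drains
t.join()
print(statuses, processed)
```

All three deliveries are acknowledged with 200, but only two units of work reach the worker, and nothing the worker does can delay the ack.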

Ordering is not guaranteed

A natural assumption is that events arrive in the order they happened. They do not, and building on that assumption causes subtle bugs. Retries reorder events all by themselves: a created event that failed once and is retried can land after the updated event that followed it but succeeded on the first try. Add parallel delivery workers, network reordering, and per-event backoff, and the stream you receive is only loosely related to the timeline that produced it. You can get an updated event for a resource you have not seen created yet, or a deleted event for one you think is still live.

The fix is to not depend on order. Most events carry a version, a sequence number, or at least a created timestamp, and the receiver can use that to ignore an event that is older than the state it already has. When an event tells you a resource changed but not its full new state, the safest move is often to treat the webhook as a hint and fetch the current state from the provider's API, which is authoritative and always current. Design each handler so that applying events out of order converges to the right answer rather than corrupting state, and the lack of an ordering guarantee stops being a problem.
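
One way to make a handler converge is to compare versions before applying. The sketch below assumes each event carries a `resource_id` and a monotonically increasing `version`; those field names are hypothetical, and not every provider supplies such a number.

```python
state = {}   # resource id -> {"version": int, "data": dict or None}

def apply(event: dict) -> bool:
    """Apply an event only if it is newer than what we hold; return True if applied."""
    rid, version = event["resource_id"], event["version"]
    current = state.get(rid)
    if current is not None and current["version"] >= version:
        return False                     # stale or duplicate: ignore it
    deleted = event["type"].endswith(".deleted")
    # Keep a tombstone for deletions so a late update cannot resurrect the resource.
    state[rid] = {"version": version, "data": None if deleted else event["data"]}
    return True

# Events arrive out of order: the update lands before the create.
print(apply({"resource_id": "r1", "version": 2, "type": "item.updated", "data": {"qty": 5}}))
print(apply({"resource_id": "r1", "version": 1, "type": "item.created", "data": {"qty": 1}}))
print(state["r1"])
```

The late created event is ignored, so the state ends at version 2 regardless of arrival order.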

Security on the receiving side

Signature verification protects you from forged events. It does not protect you from a subtler class of attacks that target what your receiver does with the event, and the worst of these is server-side request forgery. If your handler reads a URL out of the event and fetches it, an attacker who can influence the event content, or who controls a provider you have integrated with, can point that URL at your own internal network: http://169.254.169.254/ to steal cloud credentials from the metadata service, or http://localhost:6379/ to poke at an unauthenticated Redis. Your receiver sits inside your network, so any request it makes on an attacker's behalf comes from a trusted position.

The defenses are concrete. Do not fetch arbitrary URLs from event payloads; if you must, resolve the hostname yourself and refuse private and link-local ranges, and re-check after redirects so a public URL cannot bounce you to an internal one. Block the cloud metadata endpoint at the network layer. Keep webhook handlers on least privilege so a compromised handler cannot reach much. And on the simpler end, treat every field in the event as untrusted input even after the signature checks out, because a valid signature only proves the sender produced the bytes, not that the bytes are safe to interpolate into a query or a shell command. The signature authenticates the channel; it does not sanitise the content.
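
A sketch of the address check, using the standard ipaddress module. It assumes you have already resolved the hostname yourself and pass in the resulting IPs; the function names are illustrative, and this is a starting point rather than a complete SSRF defense.

```python
import ipaddress
from urllib.parse import urlsplit

def is_forbidden_address(addr: str) -> bool:
    """True for private, loopback, link-local, and other non-public addresses."""
    ip = ipaddress.ip_address(addr)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)

def safe_to_fetch(url: str, resolved_addrs: list[str]) -> bool:
    """`resolved_addrs` are the IPs the URL's hostname resolved to."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # Refuse if resolution failed or if any address points inward.
    return bool(resolved_addrs) and not any(
        is_forbidden_address(a) for a in resolved_addrs
    )

print(safe_to_fetch("http://169.254.169.254/latest/meta-data/", ["169.254.169.254"]))
print(safe_to_fetch("http://localhost:6379/", ["127.0.0.1"]))
print(safe_to_fetch("https://example.com/file", ["93.184.216.34"]))
```

To close the gap between check and use, connect to the address you vetted rather than resolving the name a second time, and run the same check on every redirect target.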

Dead-letter handling and observability

Eventually an event exhausts its retries. The receiver was wedged for too long, or the handler crashed on a particular payload, or a downstream system was down past the retry window. Whatever the cause, the event is now "dead," and the one thing you must not do is drop it silently. Push dead events to a dead-letter queue or a database table where they can be inspected, fixed, and replayed once the underlying problem is resolved. A dead-letter queue is also your best early-warning signal: sudden growth in it almost always means a systematic fault (a bad deploy, a malformed handler, an endpoint returning the wrong status code), so alert on its size rather than waiting for someone to notice missing data.
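
The three operations that matter on a dead-letter queue (store, alert, replay) fit in a short in-memory sketch. The class, its threshold, and the `deliver` callback are all hypothetical; a real one would sit on a durable table or queue.

```python
from dataclasses import dataclass, field

@dataclass
class DeadLetterQueue:
    alert_threshold: int = 100           # assumed; tune to your normal failure volume
    items: list = field(default_factory=list)

    def push(self, event_id: str, raw_body: bytes, last_error: str) -> None:
        """Keep the raw bytes and the last error so the event can be inspected."""
        self.items.append({"id": event_id, "raw_body": raw_body, "error": last_error})

    def should_alert(self) -> bool:
        return len(self.items) >= self.alert_threshold

    def replay(self, deliver) -> int:
        """Re-deliver every dead event; keep the ones that still fail."""
        still_dead = [e for e in self.items if not deliver(e)]
        replayed = len(self.items) - len(still_dead)
        self.items = still_dead
        return replayed

dlq = DeadLetterQueue(alert_threshold=2)
dlq.push("evt_a", b"{}", "timeout")
dlq.push("evt_b", b"{}", "HTTP 500")
print(dlq.should_alert())
print(dlq.replay(lambda e: e["id"] == "evt_a"), len(dlq.items))
```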

Webhooks are hard to debug precisely because the delivery happens out of sight, so build the observability in from the start. Log every attempt with the event id, the response status, the latency, and the attempt number, so "did this event arrive" and "why did it fail" are queries, not guesses. Good providers expose a delivery dashboard that shows recent attempts and lets you replay them by hand; if you are receiving, lean on it during an incident, and if you are providing, build one, because your customers will ask. Keep the raw signed bytes of failed deliveries around for long enough to replay them, and make replay a first-class operation rather than a database surgery.

Designing a webhook system as a provider

Most engineers meet webhooks as a consumer, but building the provider side is where the interesting tradeoffs live, and it is a common system-design interview prompt. The core is not the HTTP POST; it is a durable delivery pipeline. When something happens in your system, you write the event to a store and enqueue a delivery job rather than calling the customer's URL inline, because their endpoint might be slow or down and you cannot let that block your own request path. A pool of delivery workers picks up jobs, signs each payload, posts it, and on failure reschedules with backoff and jitter, moving to the dead-letter queue after the cap. That is the same shape a queue gives you, applied at the sending end.

A few provider responsibilities are easy to underestimate. Secret rotation has to work without downtime, which means accepting both the old and the new secret during a rollover window and letting the customer flip over on their schedule; design two active signatures into the scheme from day one rather than bolting it on later. You owe customers a way to see and replay their deliveries, both for debugging and for catching up after their own outage. You should fan out efficiently when many customers subscribe to the same event, and isolate a slow or failing endpoint so it does not starve delivery to everyone else, often by giving each destination its own concurrency budget. And the choice of what an event carries, a thin id the customer fetches against, or a fat payload they can act on directly, is a real decision: thin events are smaller and avoid leaking stale data, fat events save the customer a round trip but commit you to a payload schema you will have to version.
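
On the receiving end of a rotation, accepting both secrets can be as simple as the sketch below. It assumes the timestamp-dot-body HMAC scheme used earlier on this page; the function names and secrets are made up.

```python
import hashlib
import hmac

def sign(secret: bytes, ts: str, body: bytes) -> str:
    return hmac.new(secret, ts.encode() + b"." + body, hashlib.sha256).hexdigest()

def verify_any(active_secrets: list[bytes], ts: str, body: bytes, sig: str) -> bool:
    """Accept a signature made with any currently active secret."""
    ok = False
    for secret in active_secrets:
        # No early exit, so timing does not reveal which secret matched.
        ok |= hmac.compare_digest(sign(secret, ts, body), sig)
    return ok

old, new = b"old-secret", b"new-secret"
ts, body = "1714820400", b'{"id":"evt_1"}'

print(verify_any([old, new], ts, body, sign(old, ts, body)))   # during rollover
print(verify_any([old, new], ts, body, sign(new, ts, body)))   # during rollover
print(verify_any([new], ts, body, sign(old, ts, body)))        # after old is retired
```

Once the rollover window closes, dropping the old secret from the active list is the whole cutover.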

On schemas, treat the event format as a public API contract from the first release. Give events a type and a version, only ever add fields rather than removing or repurposing them, and document the full catalogue. Customers will write code against the exact shape you ship, and a quiet change to a field will break them in production with no warning, because there is no compiler between your event and their parser. The versioning page covers the deprecation playbook that applies just as much to events as to endpoints.

The Standard Webhooks specification

For years every provider reinvented the same primitives slightly differently: their own header names, their own signature scheme, their own retry behaviour. A consumer integrating with five services wrote five subtly different verification routines. The Standard Webhooks project (standardwebhooks.com) is a community effort to write one common spec covering payload format, the signature scheme, retry policy, and operations, so a single library can verify webhooks from any conforming provider. It launched in 2023 and adoption is growing.

The spec picks sensible defaults and is a good template even if you never adopt it formally. It standardises the headers as Webhook-Id, Webhook-Timestamp, and Webhook-Signature; it signs HMAC-SHA256 over id.timestamp.payload; and it builds key rotation into the signature format by allowing a request to carry more than one signature. If you are starting a webhook system today, reading this spec will save you from rediscovering every footgun the rest of the industry already stepped on.
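
A rough sketch of that signing scheme follows. The details beyond what is stated above are recalled assumptions to check against the published spec before relying on them: that the secret is base64 after a `whsec_` prefix, that the digest is base64-encoded with a `v1,` tag, and that multiple signatures are separated by spaces. For real integrations, prefer the project's official libraries.

```python
import base64
import hashlib
import hmac

def sw_sign(secret: str, msg_id: str, timestamp: str, payload: bytes) -> str:
    # Assumption: the secret is base64 after an optional "whsec_" prefix.
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{msg_id}.{timestamp}.".encode() + payload    # id.timestamp.payload
    digest = hmac.new(key, signed, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode()

def sw_verify(secret: str, msg_id: str, timestamp: str, payload: bytes, header: str) -> bool:
    expected = sw_sign(secret, msg_id, timestamp, payload).encode()
    # The header may carry several space-separated signatures during rotation.
    return any(
        hmac.compare_digest(expected, candidate.encode())
        for candidate in header.split()
    )

secret = "whsec_" + base64.b64encode(b"0123456789abcdef").decode()   # made-up key
payload = b'{"type":"charge.succeeded"}'
sig = sw_sign(secret, "msg_2NVXK3a8C9PqYg1HkVcXf9", "1714820400", payload)

print(sw_verify(secret, "msg_2NVXK3a8C9PqYg1HkVcXf9", "1714820400", payload, "v1,bogus " + sig))
```

A receiver would still apply the timestamp tolerance and the id-based deduplication described earlier; the signature check is only the first gate.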

Operational checklist

The whole page compresses into a short list you can hold in your head while building either side of a webhook integration.

Sign every payload, and sign over the timestamp as well as the body.
Verify over the raw body bytes, before parsing, and compare in constant time.
Reject requests whose timestamp is more than a few minutes off, to bound replay.
Deduplicate by event id before any side effect runs; make the side effect idempotent too.
Return 2xx as fast as possible. Do no expensive work in the handler; enqueue it and let a worker process it.
Retry on 5xx, network errors, and 429 (honouring Retry-After). Do not retry plain 4xx.
Use exponential backoff with jitter, and put a deadline on each attempt (10s is typical).
Send exhausted events to a dead-letter queue, alert on its growth, and make replay easy.
Never fetch attacker-influenced URLs from a payload; block private ranges and the metadata endpoint.
Support secret rotation with no downtime by accepting old and new secrets during the rollover.