Retry Strategy Simulator: retry without stampeding.

A retry strategy is the timing rule that decides how long a client waits before trying a failed request again. Plain exponential backoff still synchronises every client onto the same instant. Decorrelated jitter randomises each wait so a transient blip doesn't turn into a self-inflicted DoS.

[Interactive simulator. Controls: Strategy; Base / Max / Attempts. The backoff curve plots the delay before each attempt on a 0 to 8 s axis. Sample run with the jittered strategy:]

Delays (ms)
attempt 1: 17 ms
attempt 2: 124 ms
attempt 3: 45 ms
attempt 4: 719 ms
attempt 5: 1172 ms
attempt 6: 1374 ms
attempt 7: 1371 ms
attempt 8: 2922 ms
Sum: 7.74 s

What you're looking at

Each vertical bar in the chart is one retry attempt; its height is how long the client waits before that attempt, read off the seconds axis on the left, and the number above the dot is the exact delay. The list below repeats the same values in milliseconds, and the meta block shows which strategy is active and the total time all the waits add up to. The strategy and base/attempt buttons rebuild the whole curve.

Start on exp and hit Re-roll a few times: the bars never move, because plain exponential is deterministic. Now switch to jittered and re-roll. The same eight attempts jump to a different random height every time, and that is the entire point: ten thousand clients running this never land on the same timestamp, so a recovering service sees a smooth ramp instead of a synchronised spike. The surprise is that jitter doesn't change how many retries are sent; it only scatters when they arrive.


What is a retry strategy?

Synchronised retries are a self-DDoS.

A retry strategy is the rule a client follows when a request fails: whether to try again, how long to wait, and how to vary the delay across attempts. The single most useful rule: add jitter. Plain exponential backoff still synchronises retries; randomised (jittered) exponential backoff doesn't. AWS, Google, and Stripe all default to jittered backoff for this reason.

Imagine ten thousand clients calling a service. The service blips for 100 milliseconds. All ten thousand calls fail. With pure exponential backoff and no jitter, all ten thousand retry exactly 200 ms later. Then 400 ms. Then 800 ms. The service sees a perfect ten-thousand-call spike at every step — exactly the load profile the backoff was meant to avoid. The retry algorithm has converted a transient hiccup into a self-inflicted distributed denial-of-service against your own dependency.

The natural first instinct when a call fails is to try again. The naive form — "wait one second, then retry" — works fine for one client and one failure. With a million clients calling a single back-end service, naive retry on a transient failure is a guaranteed traffic spike. Every client retries at exactly the same offset; the dependency comes back, gets hit by a million simultaneous requests, falls over again; everyone retries again. The pattern is called a thundering herd, and the dependency frequently never recovers without manual intervention.

Jitter breaks the correlation. With delay = random(0, base × 2^n), retries spread uniformly across each window. The service sees a smooth ramp instead of a comb. The math: variance in delay produces variance in retry time, which flattens the arrival distribution. The integral (total retried requests) is the same; the peak rate falls roughly in proportion to how widely each window spreads the arrivals. A four-fold reduction in peak rate is the typical observed improvement; load that broke the service when it arrived all at once may not break it once spread.
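
To make the "same integral, lower peak" claim concrete, here is a minimal sketch that counts how many retries land in each 10 ms bucket with and without full jitter. The function name, the bucket size, and the three-attempt schedule are assumptions chosen for illustration; they are not taken from the simulator.

```javascript
// Sketch: peak retry arrivals per time bucket, with and without full jitter.
// All names and parameters are illustrative assumptions.
function peakArrivals({ clients, attempts, baseMs, bucketMs, jitter, rng = Math.random }) {
  const buckets = new Map(); // bucket index -> number of retries landing in it
  for (let c = 0; c < clients; c++) {
    let t = 0; // every client sees the failure at t = 0
    for (let n = 1; n <= attempts; n++) {
      const windowMs = baseMs * 2 ** n;              // 200, 400, 800 ms for base 100
      t += jitter ? rng() * windowMs : windowMs;     // full jitter vs fixed delay
      const bucket = Math.floor(t / bucketMs);
      buckets.set(bucket, (buckets.get(bucket) || 0) + 1);
    }
  }
  return Math.max(...buckets.values());
}

const common = { clients: 10_000, attempts: 3, baseMs: 100, bucketMs: 10 };
const noJitterPeak = peakArrivals({ ...common, jitter: false });
const fullJitterPeak = peakArrivals({ ...common, jitter: true });
console.log({ noJitterPeak, fullJitterPeak });
```

Both runs send the same 30,000 retries in total; only the tallest bucket differs. The jittered peak varies from run to run because it depends on the random draws.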

Marc Brooker, an AWS Principal Engineer, formalised the right defaults in “Exponential Backoff And Jitter” (AWS Architecture Blog, March 2015) and the longer treatment in the AWS Builders' Library article “Timeouts, retries, and backoff with jitter”. Both pieces are short, dense, and the closest thing the field has to a canonical reference. The 2015 simulation in the original post used 100 clients retrying 5 times after a synchronised failure: without jitter, peak load was 100 simultaneous calls per retry round; with full jitter, peak load was 25 (a 4× reduction); with decorrelated jitter, peak load was 18 and total time-to-recovery was shorter because some clients finished sooner and stopped retrying.

The simulator above lets you see the difference visually. Set jitter to zero, watch the spikes; raise it, watch the timeline smooth out. The total throughput is unchanged; the peak load is divided. That single change, applied to every retrying client in a fleet, is often the difference between a service that recovers in fifteen seconds and one that has to be restarted by a human at three in the morning.

[Figure: Thundering herd, no jitter vs full jitter. No jitter: synchronised spikes at 200 ms, 400 ms, 800 ms. Full jitter: retries spread evenly. Same total retries, quarter the peak rate.]

Origins — exponential backoff three decades before the cloud

A pattern three decades older.

The phrase thundering herd predates the cloud era by three decades. Operating-system papers used it for the kernel-level wakeup pattern when many threads block on the same semaphore and a single wake event releases all of them; only one can make progress, the rest spin and re-block. The 2000-era Linux 2.4 wake-one changes to accept() and the FreeBSD accept_filter mechanism both addressed it at the kernel level. The same shape appears in distributed systems whenever a shared dependency fails and a synchronised cohort of clients reacts in lockstep.

The discipline of bounded retry with jitter sits at the intersection of three older ideas. Ethernet's 1976 binary exponential backoff (Robert Metcalfe and David Boggs, “Ethernet: distributed packet switching for local computer networks”, CACM July 1976) introduced the doubling-on-conflict principle for shared media; the random component was a uniform draw from a doubling window. Vinton Cerf's 1980 work on TCP retransmission timeouts built on the same intuition for end-to-end congestion. Van Jacobson's 1988 paper “Congestion avoidance and control” (SIGCOMM 1988) refined the timeout-and-backoff feedback loop into the algorithms still running in every TCP stack today. Application-layer retry algorithms in 2025 are direct descendants; the constants moved, the shape did not.

One useful piece of intuition from the Ethernet literature: the collision domain — the set of devices contending for the same shared resource — is the right unit of analysis. On a wire, collisions are physical; in a microservice fleet, they're queued requests waiting for a shared dependency. The same pacing math applies. If you have a thousand client processes calling one downstream service, that service is the collision domain, and your retry policy is the medium-access protocol governing how those clients share its capacity.


Linear vs exponential vs jittered vs decorrelated jitter

Same delay budget, very different load.

The four strategies you can switch above all share the same goal (slow down between retries) but they impose very different load profiles on the downstream service. With one thousand concurrent clients all hitting a transient blip, the difference between decorrelated jitter and plain exponential is the difference between a flat line at 100 RPS and a comb that spikes to ten times the base load every 200 milliseconds.

Strategy | Formula | Herd | Use it when
Linear | delay = base × n | Severe | Demos. Single-client tools. Avoid for shared services.
Exponential (no jitter) | delay = base × 2^n | Severe (synchronised) | Single client only. Rare in modern code.
Equal jitter | delay = base × 2^n / 2 + rand(0, base × 2^n / 2) | Medium | When you want a guaranteed minimum wait.
Full jitter (AWS) | delay = rand(0, base × 2^n) | Low | Default. Almost always the right answer.
Decorrelated jitter | delay = rand(base, prev × 3) | Lowest | High concurrency. Brooker's recommendation.
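
The formulas in the table translate almost directly into code. The sketch below is one possible transcription; the object name `strategies` and the injectable `rng` parameter are assumptions added so the functions can be exercised deterministically. Note that the decorrelated variant takes the previous delay rather than the attempt number.

```javascript
// Sketch of the table's formulas. `rng` defaults to Math.random and can be
// replaced with a fixed function for testing. Names are illustrative.
const strategies = {
  // delay = base × n
  linear: (base, n) => base * n,
  // delay = base × 2^n
  exponential: (base, n) => base * 2 ** n,
  // delay = window / 2 + rand(0, window / 2), where window = base × 2^n
  equalJitter: (base, n, rng = Math.random) => {
    const windowMs = base * 2 ** n;
    return windowMs / 2 + rng() * (windowMs / 2);
  },
  // delay = rand(0, base × 2^n)
  fullJitter: (base, n, rng = Math.random) => rng() * base * 2 ** n,
  // delay = rand(base, prev × 3); takes the previous delay, not the attempt index
  decorrelated: (base, prev, rng = Math.random) => base + rng() * (prev * 3 - base),
};

// Example: one full-jitter schedule for six attempts at a 100 ms base.
const schedule = Array.from({ length: 6 }, (_, n) => Math.round(strategies.fullJitter(100, n)));
console.log(schedule);
```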

The reason decorrelated jitter wins on the metrics is subtle. Full jitter is memoryless: each retry chooses a new delay independently of the previous, which means a client that drew a small delay on attempt N can draw an even smaller one on attempt N+1, contributing two retries in quick succession. Decorrelated jitter ties the next delay to the previous via the prev × 3 ceiling, ensuring monotonic-on-average growth while still spreading. The cost is a small amount of state per call site (the previous delay); the benefit is a tighter distribution of arrival times.

The cap matters. maxBackoff is typically 30 to 60 seconds for retryable errors and 5 to 10 seconds for latency-sensitive paths. Set it too low and the algorithm degenerates into fixed-interval retry, because every delay after the first few hits the cap; set it too high and a single client can wait minutes between retries. AWS SDK retry mode defaults are documented in the aws-sdk source: legacy retry mode caps at 20 seconds, standard at 20 seconds with three attempts, and the adaptive mode (introduced 2020) at a self-tuning cap that responds to throttling responses.
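
A quick way to see what a cap does to the schedule is to print the un-jittered delays with two different caps. This is a minimal sketch; `cappedSchedule` is a hypothetical helper, and the 400 ms cap is deliberately too low to show the flattening.

```javascript
// Sketch: exponential delays clamped to a maximum backoff (names are illustrative).
function cappedSchedule(baseMs, capMs, attempts) {
  return Array.from({ length: attempts }, (_, n) => Math.min(capMs, baseMs * 2 ** n));
}

console.log(cappedSchedule(100, 400, 6));    // cap reached by the third attempt, flat afterwards
console.log(cappedSchedule(100, 30_000, 6)); // cap never reached within six attempts
```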

BACKOFF CURVES · DELAY VS ATTEMPTdelayattemptlinearexponentialfull jitterdecorrelated123456

What NOT to retry — non-idempotent calls and permanent errors

A retry policy is half what you don't retry.

Retrying the wrong thing is worse than not retrying. Five rules of thumb separate good retry policies from foot-guns; each has a primary citation and a production failure-mode that demonstrates the cost of getting it wrong.

  1. Retry only on transient signals.
    5xx, timeouts, connection resets, ECONNREFUSED. Do not retry 4xx (auth, validation, not-found): the outcome is the same on retry, only louder. The exception is 429, covered by the Retry-After rule below. RFC 9110 (HTTP Semantics, June 2022) section 9.2.2 lists the idempotent methods (GET, HEAD, PUT, DELETE, OPTIONS, TRACE) that are safe to retry without idempotency keys; POST and PATCH are not idempotent unless the application says so.
  2. Use an idempotency key for non-idempotent operations.
    Without it, a retried POST charges twice. Stripe popularised the pattern: client generates a UUID, sends it in the Idempotency-Key header, server stores the (key, response) pair for 24 hours and returns the same response for any retry. Stripe's 2017 engineering blog post by Brandur Leach is the canonical writeup. AWS SDK requests carry a similar ClientRequestToken; gRPC's official retry semantics include first-class idempotency hints.
  3. Always cap by deadline, not attempt count.
    An eight-attempt exponential at 100 ms base sums to 25.5 seconds, by which point the user has long given up. Stop on elapsed time instead, for example if elapsed > 5s: stop. After that long the remote service is effectively down anyway, and the client's caller (a request handler, a UI loop, a queue worker) has its own deadline that your retry budget cannot exceed without doing more harm than good.
  4. Honour Retry-After if the server sends it.
    A 429 or 503 response can include a Retry-After: 30 header (RFC 9110 section 10.2.3). The server is telling you it's overloaded. Wait at least that long before retrying; your jitter math doesn't beat the server's explicit signal. Cloudflare, GitHub, and Stripe all return Retry-After under load.
  5. Stamp every request with a request ID.
    Retried requests should be traceable end-to-end. Stripe's Stripe-Request-Id header lets the server log the entire family of retries; AWS X-Ray, OpenTelemetry, and Jaeger all propagate trace context across retry attempts. Without traceability, retry storms are nearly impossible to diagnose post-hoc.
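
Several of these rules can be folded into a single decision function. The sketch below is one way to do it, not a reference implementation: `nextRetryDelayMs` and `parseRetryAfterMs` are hypothetical names, the input object shape is invented for the example, and treating 429 as retryable alongside 5xx is an assumption based on the Retry-After rule.

```javascript
// Sketch: decide whether to retry and how long to wait. Names and the
// request/response shape are assumptions for illustration.
const IDEMPOTENT_METHODS = new Set(['GET', 'HEAD', 'PUT', 'DELETE', 'OPTIONS', 'TRACE']);

// Retry-After is either delay-seconds ("30") or an HTTP-date.
function parseRetryAfterMs(value, nowMs = Date.now()) {
  if (value == null) return null;
  const trimmed = String(value).trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? null : Math.max(0, date - nowMs);
}

// Returns the delay in ms before the next attempt, or null to stop retrying.
function nextRetryDelayMs({ method, status, retryAfter, hasIdempotencyKey, elapsedMs, deadlineMs, backoffMs }) {
  const transient = status >= 500 || status === 429;               // assumption: 429 is retryable
  if (!transient) return null;                                     // permanent error
  if (!IDEMPOTENT_METHODS.has(method) && !hasIdempotencyKey) return null; // unsafe to repeat
  const serverHintMs = parseRetryAfterMs(retryAfter);
  const delayMs = Math.max(backoffMs, serverHintMs ?? 0);          // never undercut the server's hint
  return elapsedMs + delayMs > deadlineMs ? null : delayMs;        // cap by deadline, not attempt count
}

console.log(nextRetryDelayMs({
  method: 'GET', status: 503, retryAfter: '30',
  hasIdempotencyKey: false, elapsedMs: 0, deadlineMs: 60_000, backoffMs: 200,
})); // 30000
```

The caller supplies `backoffMs` from whichever jitter strategy it uses; the function only decides whether that delay is allowed and whether the server asked for a longer one.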

When retry policies amplify outages — and how to detect it

When retry policies amplify outages.

Real-world retry storms have caused or extended several major outages. The pattern is consistent: a downstream wobble triggers retries; retries multiply load; multiplied load causes the wobble to become a full failure; the full failure means more retries; the system collapses into a self-reinforcing loop. The problem is structural, not behavioural; tuning the backoff curve is necessary but insufficient.

The architectural fix is the retry budget, articulated in chapter 22 of the Google Site Reliability Engineering book (Beyer, Jones, Petoff, Murphy, O'Reilly 2016). Each client tracks its retry rate as a percentage of its base request rate; when the percentage exceeds a threshold (typically 10 percent), retries are dropped instead of executed. The budget bounds the multiplicative amplification: a retrying client never sends more than 1.1 times its base rate, regardless of how badly the server is failing. gRPC's RetryThrottlingPolicy implements this directly with two parameters, maxTokens and tokenRatio, controlling the size of the budget.
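
A token-based budget of this kind is small enough to sketch. The class below is modelled on the maxTokens and tokenRatio scheme as described in the gRPC retry design, as far as I understand it: failures cost one token, successes earn tokenRatio tokens, and retries are allowed only while the count stays above half of maxTokens. The class and method names are assumptions; this is not the gRPC implementation.

```javascript
// Sketch of a token-based retry budget. Names are illustrative assumptions.
class RetryThrottle {
  constructor({ maxTokens = 10, tokenRatio = 0.1 } = {}) {
    this.maxTokens = maxTokens;
    this.tokenRatio = tokenRatio;
    this.tokens = maxTokens; // start with a full budget
  }
  // Each success refills a fraction of a token, up to the maximum.
  onSuccess() { this.tokens = Math.min(this.maxTokens, this.tokens + this.tokenRatio); }
  // Each failure drains a whole token.
  onFailure() { this.tokens = Math.max(0, this.tokens - 1); }
  // Retries are dropped once the budget falls to half or below.
  canRetry() { return this.tokens > this.maxTokens / 2; }
}

const throttle = new RetryThrottle({ maxTokens: 10, tokenRatio: 0.1 });
for (let i = 0; i < 5; i++) throttle.onFailure();
console.log(throttle.canRetry()); // false: five straight failures used up the retry allowance
```

With tokenRatio at 0.1, it takes ten successes to pay for one failure, which is what keeps sustained retry traffic near ten percent of the base rate.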

The complementary architectural pattern is the circuit breaker (Michael Nygard's Release It!, Pragmatic Bookshelf 2007, second edition 2018). When the failure rate over a sliding window crosses a threshold, the breaker opens and fails new requests fast for a cool-down period. After the cool-down, a half-open state lets one or two probe requests through; if they succeed, the breaker closes. Netflix's Hystrix (open-sourced 2012, archived 2018 in favour of Resilience4j and adaptive concurrency) was the canonical Java implementation; Polly (.NET), Resilience4j (Java), Sentinel (Alibaba's Java/Go offering, 2018), and Envoy's outlier_detection (C++) are the contemporary equivalents.
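
The closed, open, half-open cycle can be sketched in a few lines. This version is deliberately simplified and makes two assumptions that differ from the description above: it counts consecutive failures instead of a failure rate over a sliding window, and it lets every request through in the half-open state rather than one or two probes. The class name and options are illustrative.

```javascript
// Sketch of a circuit breaker. Simplified: consecutive-failure count, no probe limit.
class CircuitBreaker {
  constructor({ failureThreshold = 5, coolDownMs = 10_000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
    this.now = now;          // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null;    // null means the breaker is closed
  }
  state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.coolDownMs ? 'half-open' : 'open';
  }
  // Fail fast while open; allow traffic when closed or probing.
  allowRequest() { return this.state() !== 'open'; }
  recordSuccess() { this.failures = 0; this.openedAt = null; }
  recordFailure() {
    this.failures += 1;
    if (this.state() === 'half-open' || this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // open (or re-open) and restart the cool-down
    }
  }
}

// Usage: check allowRequest() before calling, then record the outcome.
const breaker = new CircuitBreaker({ failureThreshold: 2, coolDownMs: 1000 });
breaker.recordFailure();
breaker.recordFailure();
console.log(breaker.state()); // "open"
```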

The mitigation hierarchy, in production order:

  1. Bounded retries with full or decorrelated jitter.
  2. Per-client retry budgets.
  3. Circuit breakers paired with retries.
  4. Adaptive concurrency limits. Netflix's library of the same name (open-sourced 2018) implements TCP Vegas-style congestion control for in-flight RPCs, dynamically shrinking the in-flight cap when latency rises.
  5. Retries at one tier only. If client → API gateway → service A → service B all retry, a single failure produces 1 × R₁ × R₂ × R₃ total calls; allow retries at one tier and pass through at the others.

The 2017 GitLab database outage and the 2021 Slack DNS incident both contained retry-storm components; the postmortem fixes in both cases included removing aggressive retries from intermediate tiers and adding circuit breakers. Removing retries to improve reliability is counter-intuitive but right.

Retry at one tier only

If the client, the gateway, and service A each make three attempts on the path client → gateway → service A → service B, a single B-side failure produces 3 × 3 × 3 = 27 calls to B. Pick one tier, usually the outermost client, to retry; configure every other tier to pass the failure straight through. The math is unforgiving and the postmortems are written from there.
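
The multiplication is easy to check for any topology. In this sketch, `worstCaseCalls` is a hypothetical helper that takes the number of attempts each retrying tier makes and returns the worst-case number of calls reaching the failing dependency.

```javascript
// Sketch: worst-case call amplification when stacked tiers each retry.
// attemptsPerTier[i] is the total number of attempts tier i makes (1 = no retry).
function worstCaseCalls(attemptsPerTier) {
  return attemptsPerTier.reduce((total, attempts) => total * attempts, 1);
}

console.log(worstCaseCalls([4, 1, 1])); // 4: only the outermost tier retries
console.log(worstCaseCalls([4, 4, 4])); // 64: every tier retries
```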


Hedged requests — sending two, taking the first

Sending two, taking the first.

Retries are reactive: send the request, wait for failure, send again. Hedged requests are proactive: send the request, wait a short time (typically the 95th-percentile latency), and if no response has arrived, send a duplicate to a second backend. The first response wins. The technique was articulated by Jeff Dean and Luiz André Barroso in “The Tail at Scale” (CACM, February 2013), describing the latency-tail problem in Google's search infrastructure: a 1-in-100 slow response from any one backend becomes a 63-percent probability of a slow response when 100 backends contribute to a single user-visible request (1 − 0.99^100 ≈ 0.63). Reducing the worst-case latency of any one component does not help; reducing the variance does.

The implementation cost is real. Naive hedging doubles the load on the dependency under healthy conditions, because every request now sends two copies. The Dean-Barroso paper proposed two refinements: tied requests, where the second copy explicitly cancels the first if it arrives in flight, and request hedging on tail only, where the duplicate is only sent if the first has already exceeded the 95th percentile. The combined effect is a substantial latency-tail reduction at modest extra cost — Google reported 39 percent reduction in p99 latency at 5 percent extra load on internal services.
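
Hedging on the tail with cancellation of the loser can be sketched with AbortController and Promise.any. This is an illustration only: `hedged` is a hypothetical helper, both call functions are assumed to accept an AbortSignal and to reject when it fires, and the cancellation here is client-side, which approximates tied requests without the server-to-server signalling the paper describes.

```javascript
// Sketch: send to the primary, hedge to a backup after hedgeAfterMs, first success wins.
// callPrimary / callBackup are assumed to take an AbortSignal and return a promise.
async function hedged(callPrimary, callBackup, hedgeAfterMs) {
  const primaryCtl = new AbortController();
  const backupCtl = new AbortController();
  let timer;

  // When one side succeeds, cancel the other so no duplicate work continues.
  const winThenCancel = (promise, loserCtl) =>
    promise.then((value) => { loserCtl.abort(); return value; });

  const primary = winThenCancel(callPrimary(primaryCtl.signal), backupCtl);
  const backup = new Promise((resolve, reject) => {
    // The duplicate is only sent if the primary is still pending at the threshold.
    timer = setTimeout(() => {
      winThenCancel(callBackup(backupCtl.signal), primaryCtl).then(resolve, reject);
    }, hedgeAfterMs);
  });

  try {
    return await Promise.any([primary, backup]); // first fulfilled response wins
  } finally {
    clearTimeout(timer); // primary won early: the hedge is never sent
  }
}
```

In practice `hedgeAfterMs` would be fed from a measured latency percentile, and each hedge would be charged against the retry budget.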

Modern adoptions: gRPC's hedging policy is documented in the gRPC retry design (2018, Mark D. Roth). YugabyteDB's docs describe hedging for read replicas. Cassandra has supported hedged reads since 2.1 (2014). Apache Pulsar's broker hedges reads on a configurable threshold. The pattern is most useful where the backend is read-only, the server is large enough that occasional duplicate processing is cheap, and the latency tail is the user-visible problem — which is to say, every modern read-mostly distributed system.

Hedging has to be accounted for in the retry budget. The hedged duplicate counts as a retry against the budget; without that accounting, a bursty failure storm can amplify under hedging the same way it does under unbounded retry. Envoy's hedging policy and the gRPC service config both expose explicit budget integration. The combination of hedged-on-tail with bounded budgets and circuit breakers is the production-grade configuration.

[Figure: Hedged request with tied cancellation. Participants: client, replica A (slow), replica B. Sequence: request → A; p95 elapsed; hedge → B (duplicate); B response wins; CANCEL → A. Tied request: A aborts, no double work.]

Retrying through a queue — at-least-once, idempotency keys

Retrying through a queue.

Synchronous-RPC retry is one shape of the problem; asynchronous queue-based retry is another. The semantics differ subtly and the failure modes are different.

SQS visibility timeout (AWS Simple Queue Service) is the canonical model. A worker pulls a message from the queue; the message becomes invisible to other consumers for the visibility timeout (default 30 seconds, max 12 hours); if the worker calls DeleteMessage within that window, the message is permanently removed; if the worker fails to delete (crash, timeout, exception), the visibility timeout expires and another worker picks up the message. The pattern gives at-least-once delivery without coordination. The trade-off: the worker is responsible for being idempotent. The same idempotency-key pattern described under “What NOT to retry” applies.
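
On the worker side, idempotency usually comes down to remembering which message IDs have already been handled. The sketch below uses an in-memory Set as a stand-in for a durable store, which is an assumption made to keep the example self-contained; `makeIdempotentHandler` and the `message.id` field are hypothetical. It also does not guard against two workers handling the same message at the same instant, which a real store would need to do atomically.

```javascript
// Sketch: wrap a handler so redelivered messages are processed at most once.
// `seen` is an in-memory stand-in for a durable dedup store.
function makeIdempotentHandler(handler, seen = new Set()) {
  return async function handle(message) {
    if (seen.has(message.id)) return 'duplicate-skipped'; // redelivery after a missed delete
    await handler(message);
    seen.add(message.id); // record only after the work succeeded, so failures are retried
    return 'processed';
  };
}

// Usage: the same message delivered twice is only charged once.
const charges = [];
const handle = makeIdempotentHandler(async (msg) => { charges.push(msg.body); });
handle({ id: 'm-1', body: 'charge $10' })
  .then(() => handle({ id: 'm-1', body: 'charge $10' }))
  .then((result) => console.log(result, charges.length)); // duplicate-skipped 1
```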

Kafka commit offsets implement a different shape of the same primitive. A consumer reads from a partition; periodically commits its offset; on crash, restarts from the last committed offset and re-processes any messages between the commit and the crash. Auto-commit (the default in older clients) commits every 5 seconds, which means a crash can cause up to 5 seconds of duplicate processing. Manual commit lets the application choose — commit after each batch (slow but precise), commit every N messages (faster, larger duplicate window). Kafka's exactly-once semantics (introduced in 0.11, June 2017) layer transactional producer writes plus consumer-side read_committed isolation on top of the commit-offset primitive, but require all participants in the transaction to be Kafka-aware.

Dead letter queues are the final stop. After N failed retries (typically 3 to 10), the message is moved to a DLQ where it sits awaiting human inspection. SQS, RabbitMQ, Pulsar, and most managed queue services support DLQs natively. The DLQ is not the failure case; it is the diagnostic case. A healthy system should produce few DLQ entries (under one percent of throughput); a sudden surge in the DLQ is a signal of a downstream change that has invalidated previously-valid messages, and should page someone.
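
The redelivery-then-DLQ flow can be modelled with a plain array. This is a toy simulation under stated assumptions: `drain` is a hypothetical function, redelivery happens immediately rather than after a visibility timeout, and a thrown exception stands in for a worker that failed to delete the message.

```javascript
// Sketch: redeliver failed messages until maxReceives, then move them to a DLQ.
function drain(queue, handler, maxReceives = 3) {
  const deadLetters = [];
  const pending = queue.map((body) => ({ body, receives: 0 }));
  while (pending.length > 0) {
    const msg = pending.shift();
    msg.receives += 1;
    try {
      handler(msg.body);                       // success: the message is deleted
    } catch (err) {
      if (msg.receives >= maxReceives) deadLetters.push(msg.body); // park for inspection
      else pending.push(msg);                  // becomes visible again and is redelivered
    }
  }
  return deadLetters;
}

const dlq = drain(['ok-1', 'bad', 'ok-2'], (body) => {
  if (body === 'bad') throw new Error('cannot process');
});
console.log(dlq); // [ 'bad' ]
```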

The pairing of synchronous retry with queue retry is where most production systems live. The synchronous client retries with bounded backoff and budget; if all retries fail, the message goes to a queue; the queue worker retries with much longer windows (minutes to hours) using its own backoff schedule; if the queue worker exhausts its retries, the message goes to the DLQ. The two layers cooperate by virtue of the user's deadline being respected at the synchronous tier and the eventual-consistency budget being respected at the queue tier. The Stripe webhooks system, the AWS Lambda async-invoke retry policy, and the GitHub Actions workflow retry config all implement variations of this layered model.


Retry strategies in production — by language and SDK

Retry in the wild, by language and SDK.

The defaults shipped by widely-used libraries determine the retry behaviour of most production systems, regardless of what the application code intends. Knowing the defaults is half the operational battle.

AWS SDK retry mode. The SDK retry-behaviour documentation describes three coexisting modes. Legacy retry mode (the SDK v1 default through 2020) uses three attempts with exponential backoff capped at 20 seconds, no jitter for some operations, full jitter for others, and varying handling of throttling. Standard retry mode (SDK v2, 2018; default in v3 since 2020) uses three attempts with full jitter, a token-bucket retry quota of 500 tokens that drains on retry and refills on success, and consistent throttling-aware behaviour. Adaptive retry mode (introduced January 2020) adds client-side rate limiting based on observed throttling responses, dynamically reducing the request rate when the service signals stress. The Builders' Library article on this is the authoritative reference.

gRPC RetryPolicy (formal stable since gRPC 1.34, December 2020) is configured in the service config JSON and supports maxAttempts, initialBackoff, maxBackoff, backoffMultiplier, and retryableStatusCodes. The throttling policy with maxTokens and tokenRatio implements a retry budget directly. Hedging policy is configured separately with hedgingPolicy. Most production gRPC clients ship with retry disabled by default and require explicit opt-in via service config.
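
A service config using the fields named above might look like the following, written here as a JavaScript object so it can be serialised to the JSON the client expects. The service name `example.Echo` and all numeric values are placeholders; the field layout follows the gRPC service config format as I understand it, so check it against the gRPC documentation before relying on it.

```javascript
// Sketch of a gRPC service config with a retry policy and a retry budget.
// Service name and values are placeholders, not recommendations.
const serviceConfig = {
  methodConfig: [
    {
      name: [{ service: 'example.Echo' }],       // hypothetical service
      retryPolicy: {
        maxAttempts: 4,                          // the original call plus up to three retries
        initialBackoff: '0.1s',
        maxBackoff: '1s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE'],   // transient failures only
      },
    },
  ],
  retryThrottling: { maxTokens: 10, tokenRatio: 0.1 }, // the retry budget
};

const serviceConfigJson = JSON.stringify(serviceConfig, null, 2);
console.log(serviceConfigJson);
```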

Cloudflare workers retry the origin fetch on configurable error codes via the Worker's fetch() options or the Cloudflare Load Balancing health-check policy. Cloudflare's Load Balancing documentation emphasises 502/504 as retryable and explicitly warns against retrying 5xx responses with bodies that include Retry-After: 0.

HTTP libraries. Node.js's native fetch does not retry; the got library defaults to two retries with exponential backoff including jitter. Python's requests library does not retry by default; urllib3's Retry object exposes total, backoff_factor, and status_forcelist. Go's net/http does not retry; HashiCorp's go-retryablehttp library is the standard third-party option. Rust's reqwest does not retry; the reqwest-retry middleware adds it. Java's HttpClient does not retry; Resilience4j, Spring Retry, and Failsafe are the dominant options.

The pattern across languages is the same: the underlying HTTP client almost never retries by default, and the application or middleware has to opt in. This is correct — retries are a policy decision that depends on idempotency, latency budget, and downstream behaviour, none of which the HTTP library can know on its own — but it means most systems have some retry library somewhere, often configured in haste, and the policy is rarely audited end-to-end.

# ~/.aws/config — adaptive retry mode
[default]
region = us-east-1
retry_mode = adaptive
max_attempts = 3

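// JS · decorrelated jitter loop
// Helpers the loop relies on. isTransient is an illustrative policy only;
// adapt it to the error shape your HTTP client actually produces.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const isTransient = (e) =>
  e.status >= 500 || ['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT'].includes(e.code);

async function withRetry(call, opts = {}) {
  const { base = 100, cap = 20_000, max = 6 } = opts;
  let prev = base; // previous delay: the only state decorrelated jitter needs
  for (let attempt = 0; attempt < max; attempt++) {
    try { return await call(); }
    catch (e) {
      // Give up on permanent errors, or once the attempt budget is spent.
      if (!isTransient(e) || attempt === max - 1) throw e;
      // Next delay: uniform in [base, min(cap, prev × 3)].
      const ceiling = Math.min(cap, prev * 3);
      const delay   = base + Math.random() * (ceiling - base);
      await sleep(delay);
      prev = delay;
    }
  }
}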

