Exponential backoff.
Retry the way you re-text a silent friend: once soon, then with longer and longer gaps.
A friend has not answered your text. You give it a minute and send one more; still nothing, so you give it ten; after that, an hour. Forty messages in a row was never on the table — partly dignity, partly the knowledge that it would not help. Code calling a flaky service needs the same instinct, and by default it has none: a naive retry loop is forty texts a second.
Exponential backoff writes the instinct down. Retry quickly once, because most blips are over in milliseconds; then double the pause after every failure. A brief glitch is fixed on the second attempt, while a real outage meets callers who are, by the eighth attempt, mostly waiting quietly.
- Hello? Still me.1
Most failures are weather, not climate: a restart, a blip, a dropped packet. Weather deserves a retry.
- 2
The first pause is tiny on purpose — a glitch that lasts milliseconds should cost about that much.
- 3
Doubling stretches ten failures into minutes of waiting without anyone having to tune a schedule.
- 4
Instant retries are how a service’s own clients keep it on the floor after the original problem has passed.
- 5
Clients that failed in the same instant would retry in the same instant — the randomness breaks up the wave.
- 6
Patience needs an exit: cap the pause, budget the attempts, and report failure while the report still helps.
The storm you cause yourself
Take a service that hiccups for two seconds under load. Every caller fails at once, so every caller retries at once — and the retry wave is bigger than the traffic that caused the hiccup, so it fails too, and breeds a bigger wave still. The dependency has recovered by now; its clients will not let it back up. Exponential backoff thins each successive wave, and jitter smears the survivors across time so no wave forms at all.
Knowing when to stop
Endless patience is its own bug. A caller still retrying after five minutes is holding a user’s request hostage to a dependency that may be down for the afternoon. Real implementations cap the pause so nothing waits absurdly long, cap the attempts or the total time, and then fail cleanly — often handing the decision to a circuit breaker that stops new calls from even trying for a while. Giving up on time is a feature.