Thundering Herd Simulator, and the four fixes

A thundering herd is when one event wakes a large set of waiters at once and they all rush the same resource, turning a single trigger into N simultaneous reactions for one unit of useful work. The wake is free for the producer; the cost lands wherever the N woken waiters contend. Pick a flavor, set N, trigger naive, then trigger again with the mitigation on.

Try this

Set N=1000 on accept(), trigger naive, then trigger mitigated. The naive p99 climbs into the milliseconds; the mitigated stays at ~80 µs — one syscall worth.
Switch to cache. At N>200 the origin saturates and p99 explodes quadratically — that's the cache stampede. Flip on mitigation: singleflight.Group collapses N requests into one.
Try condvar at N=500. Naive: all 500 threads wake and 499 of them immediately block on the mutex, so every one of those futex round trips is wasted. Mitigated: cond_signal wakes just one.
Try cron at N=2000. Naive is the synchronized fleet; mitigation adds sleep $((RANDOM % 60)) so the load smears across 60 seconds.
Slide N to 5000. Watch the wasted column. Same shape every time.

Why this hits at exactly the wrong time

Thundering herds are not a steady-state problem. They are a failure-cascade problem. They show up when a cache key expires, when a process restarts and all clients reconnect at once, when a new pod comes up at deploy and gets hit by warm traffic immediately, when certificates renew at the same hour across the fleet.

The cure in every case is one of four moves: wake fewer (signal instead of broadcast), coalesce duplicates (single-flight), spread in time (jitter, stale-while-revalidate), or spread in space (sharded keys, per-worker listen sockets).

What you're looking at

The grid is your pool of waiters — each tile one of the N processes, threads, or requests sleeping on the same resource (only the first 240 are drawn; the maths still uses the full N). Trigger the event and watch them flip: yellow is woken, green is the one that did useful work, rust-red is a wasted wake that woke, lost the race, and went back to sleep. The four counters underneath tally wakes, useful work (always one), wasted wakes, and load arriving at the origin. The two bars compare p99 latency for the naive path against the mitigated one. Pick a scenario, set N, choose naive or mitigated, then trigger.

Run accept() naive at N=1000, then flip to mitigated and trigger again — naive p99 climbs into milliseconds while mitigated holds near 80 µs, one syscall's worth. Then switch to cache and push N past 200: the origin saturates and p99 grows quadratically, the cache-stampede shape. What should surprise you is how the picture never changes — one event, N reactions, one unit of real work, N−1 wasted — whether the resource is a socket, a mutex, a database, or a cron backend.

All thundering herds are the same shape

One event. N waiters. N reactions. One useful result. The wake is free for the producer; the cost is bottlenecked at the resource the wakes contend for.

Four scenarios; one bug. In accept() herds, N worker processes sleep on a shared listening socket; one connection arrives and the kernel wakes all of them, of which N−1 immediately fail their accept and go back to sleep. In a cache stampede, N concurrent requesters miss the same key at the same instant and each one calls the origin. In a condvar broadcast, N threads wait on a pthread_cond_t, the producer calls pthread_cond_broadcast, all N wake, all N race for the same mutex, N−1 block. In cron synchronization, 800 servers all run 0 * * * * and at xx:00:00.000 every one of them hits the same backend.

The variants look taxonomically different (sockets, caches, mutexes, schedulers) but the mechanics are identical. One producer triggers N consumers; the consumers contend on one downstream resource; only one request was ever needed. The fix space is also constant: wake fewer (cond_signal, EPOLLEXCLUSIVE), coalesce duplicates (singleflight), spread in time (jitter, stale-while-revalidate, XFetch), or spread in space (sharded cache keys, SO_REUSEPORT with one listen socket per worker, shuffled wake order). For why each wasted wake is expensive in the first place, see Mogul & Borg's 1991 paper The effect of context switches on cache performance.

Cache stampedes deserve a long entry

The most common flavor in modern infrastructure. And the one most likely to take down your origin database during a traffic spike.

A popular item's TTL expires. N concurrent requests for it all miss. Each issues the same expensive query against the origin — typically the database the cache exists to protect. The origin saturates, latency climbs, requests time out. Most fail. The cache stays empty because the upstream queries never completed. The next wave of requests sees another miss. The system is now stuck in a stampede until traffic drops or a human intervenes. Stack Overflow ran a postmortem in 2016 on exactly this pattern; Discord has written about per-key locks for the same reason.

The mitigation menu, ranked roughly by simplicity:

Single-flight (a.k.a. request coalescing): the first miss takes a lock on the key; everyone else waits for that one origin call to complete and shares the result. Go's singleflight.Group is the canonical implementation; Discord's cache layer uses per-key mutexes.
Stale-while-revalidate: serve the expired value while one background task refreshes it. Mark Nottingham's RFC 5861 codified this for HTTP; Vercel and Next.js use it by default.
Probabilistic early refresh (Vattani et al. 2015, the XFetch paper): each requester computes a small probability of refreshing before the TTL hits, weighted by how close the value is to expiry and how long it took to compute. Akamai and Cloudflare use variants of this in production.
Distributed locks per key: these exist, but you should not reach for them unless the lock acquisition is cheaper than the origin query. Usually it isn't, and you've just moved the herd from the cache backend to the lock service.
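Two items from this menu, single-flight and probabilistic early refresh, are small enough to sketch. The Python below is an illustration under stated assumptions, not Go's singleflight.Group or anyone's production code: the names SingleFlight, do, and should_refresh_early are hypothetical. The early-refresh check follows the rule from the XFetch paper, which refreshes when now − Δ·β·ln(rand) ≥ expiry, where Δ is the last recompute time and β tunes how eager the refresh is.

```python
import math
import random
import threading


class SingleFlight:
    """Collapse concurrent calls for the same key into one call to fn."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> record of the call currently in flight

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._calls[key] = call
        if leader:
            try:
                call["value"] = fn()  # the only call that reaches the origin
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call["done"].set()
        else:
            call["done"].wait()  # follower: share the leader's outcome
        if call["error"] is not None:
            raise call["error"]
        return call["value"]


def should_refresh_early(now, expiry, compute_seconds, beta=1.0, rng=random.random):
    """XFetch-style check: True when this requester should recompute now."""
    u = 1.0 - rng()  # in (0, 1], so log(u) is defined and <= 0
    return now - compute_seconds * beta * math.log(u) >= expiry
```

With this sketch, N threads calling sf.do("hot-key", load_from_origin) while a load is in flight produce one origin call and N identical results. A requester that arrives after the leader has finished starts a fresh call, which is why single-flight is usually paired with a cache write.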

accept() and the kernel's eternal struggle

Linux has been fighting wake-all on shared file descriptors for thirty years.

The accept() herd is the original. Before Linux 2.6, multiple processes sleeping in accept() on the same listening socket all woke when a connection arrived; only one could succeed. The 2.6 kernel made accept itself exclusive, but the problem returned the moment people put the listening fd in an epoll set and slept on epoll_wait instead: epoll's wake semantics defaulted to wake-all again. Linux 4.5 (2016) added EPOLLEXCLUSIVE, which restores exclusive wake on a per-epoll basis. Servers such as NGINX (with the reuseport option on its listen directive) and HAProxy can skip the whole question by using SO_REUSEPORT: each worker creates its own listening socket bound to the same port, and the kernel hashes the incoming connection's source/dest IP/port 4-tuple to pick exactly one of the sockets. Wake fan-out is one, by construction.
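The per-worker listening socket is easy to see in a few lines. This is a minimal sketch, assuming Linux 3.9 or later (where SO_REUSEPORT is available) and a loopback interface. The function names are hypothetical, and for brevity all sockets live in one process, whereas a real server would create one per worker process.

```python
import select
import socket


def make_worker_listener(host, port):
    """One listening socket per worker, all bound to the same port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set this before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def listeners_woken_by_one_connection(n_workers=4):
    """Open n_workers listeners, connect once, count the readable listeners."""
    first = make_worker_listener("127.0.0.1", 0)  # port 0: kernel picks a free port
    port = first.getsockname()[1]
    listeners = [first]
    listeners += [make_worker_listener("127.0.0.1", port) for _ in range(n_workers - 1)]
    client = socket.create_connection(("127.0.0.1", port))
    try:
        # Only the socket the kernel selected has a connection queued.
        readable, _, _ = select.select(listeners, [], [], 1.0)
        return len(readable)
    finally:
        client.close()
        for sock in listeners:
            sock.close()
```

Calling listeners_woken_by_one_connection(4) should report a single readable listener: the connection is queued on one socket, so only the worker that owns it has anything to wake for.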

The same dichotomy lives inside pthread_cond_t. pthread_cond_signal wakes at least one waiter (in practice, one); pthread_cond_broadcast wakes all of them. Beginners reach for broadcast because it's easier to reason about ("everyone who cares gets the message"). In practice, broadcast is only correct when the state change matters to every waiter, for example "the queue is closing, all of you exit." When there's one new work item and any waiter can handle it, signal is the right call. The same logic applies to Go's sync.Cond (Signal vs Broadcast) and Java's Object.notify vs notifyAll. The futex layer underneath all of these has the same distinction: FUTEX_WAKE with val=1 versus val=INT_MAX.
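The signal-versus-broadcast difference can be counted directly. The sketch below uses Python's threading.Condition, where notify() and notify_all() play the roles of signal and broadcast. It is an illustration rather than a benchmark: run_wake is a hypothetical helper, and it assumes CPython's current behavior, in which notify() releases one waiter.

```python
import threading
import time
from collections import deque


def run_wake(n_waiters, broadcast):
    """Park n_waiters on one condition, publish one item, count the wakes."""
    cond = threading.Condition()
    queue = deque()
    s = {"parked": 0, "wakes": 0, "useful": 0, "wasted": 0, "closing": False}

    def waiter():
        with cond:
            while True:
                s["parked"] += 1
                cond.wait()
                s["parked"] -= 1
                if s["closing"]:
                    return
                s["wakes"] += 1
                if queue:
                    queue.popleft()
                    s["useful"] += 1
                else:
                    s["wasted"] += 1  # woke, found nothing, goes back to sleep

    threads = [threading.Thread(target=waiter) for _ in range(n_waiters)]
    for t in threads:
        t.start()

    # Publish exactly one work item once every waiter is parked.
    published = False
    while not published:
        with cond:
            if s["parked"] == n_waiters:
                queue.append("job")
                if broadcast:
                    cond.notify_all()
                else:
                    cond.notify()
                published = True
        if not published:
            time.sleep(0.001)

    # Let the wakes settle: the item is taken and the count stops changing.
    last = -1
    while True:
        time.sleep(0.1)
        with cond:
            if s["useful"] == 1 and s["wakes"] == last:
                result = (s["wakes"], s["useful"], s["wasted"])
                break
            last = s["wakes"]

    with cond:
        s["closing"] = True
        cond.notify_all()  # broadcast is correct here: every waiter must exit
    for t in threads:
        t.join()
    return result  # (wakes, useful, wasted)
```

run_wake(20, broadcast=True) reports twenty wakes for one useful result, while run_wake(20, broadcast=False) reports a single wake. The shutdown path is the one place the sketch broadcasts, because that state change concerns every waiter.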

The cron problem nobody catches in code review

Every team eventually writes a cron at 0 * * * * because it's the obvious thing. Then 800 servers fire at the same millisecond.

The cron bug is invisible until it isn't. A handful of servers all running 0 * * * * /usr/local/bin/sync-thing is fine; the backend handles eight concurrent requests without breaking a sweat. At eight hundred servers, that same hour boundary becomes an instant DDoS against whatever the cron talks to: a metrics endpoint, a license server, a config-fetcher. The cure is one line: 0 * * * * sleep $((RANDOM \% 60)); /usr/local/bin/sync-thing spreads the fleet across a minute. Two crontab details matter here: cron turns an unescaped % into a newline, hence the backslash, and RANDOM is a bash feature, so the crontab needs SHELL=/bin/bash wherever /bin/sh is not bash. For longer-running jobs you want a wider spread; for latency-sensitive ones a tighter one. Kubernetes CronJob has startingDeadlineSeconds but no native jitter, so you add it inside the container.
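To see what the one-line sleep buys, here is a small simulation of the fleet. It is a sketch with hypothetical helper names; the 800 servers and the 60-second window are the figures from the text, and the fixed seed is only there to make the run repeatable.

```python
import random
from collections import Counter


def start_offsets(n_servers, window_seconds, rng):
    """Delay, in whole seconds, that each server sleeps before its job."""
    if window_seconds <= 0:
        return [0] * n_servers  # naive: everyone fires on the boundary
    return [rng.randrange(window_seconds) for _ in range(n_servers)]


def peak_per_second(offsets):
    """Largest number of servers that start within the same second."""
    return max(Counter(offsets).values())


rng = random.Random(42)  # fixed seed, an arbitrary choice
naive = start_offsets(800, 0, rng)
jittered = start_offsets(800, 60, rng)
print("naive peak per second:   ", peak_per_second(naive))
print("jittered peak per second:", peak_per_second(jittered))
```

The naive peak is the whole fleet in one second. With a 60-second window the average is 800 / 60, roughly 13 starts per second, and the busiest second lands somewhat above that average because random placement is uneven.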

The same logic applies to anything that fleet-wide clocks synchronize on: TLS certificate renewal at the same expiry hour, scheduled deploys, fleet-wide log rotation, periodic health-check intervals that all start at the same millisecond after restart. AWS publishes guidance on jittered exponential backoff for retries (the 2015 Marc Brooker post, Exponential Backoff and Jitter) for the same reason — without jitter, every retry storm becomes a synchronized retry storm. The lesson generalizes: any time you have N independent actors choosing the same moment by some deterministic rule, you are one traffic threshold away from an incident. Add jitter, even when you don't think you need it. Especially when you don't think you need it.
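The retry case fits in a few lines. This sketch follows the "full jitter" variant from the Brooker post, where each client sleeps for a uniformly random time between zero and the capped exponential delay. The function names and the base and cap defaults are assumptions chosen for illustration.

```python
import random


def backoff_no_jitter(attempt, base=0.1, cap=10.0):
    """Deterministic delay: every client retrying attempt k waits the same time."""
    return min(cap, base * (2 ** attempt))


def backoff_full_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, backoff_no_jitter(attempt, base, cap))


# 100 clients that all failed at the same instant, on their third retry.
plain = {backoff_no_jitter(3) for _ in range(100)}
spread = {backoff_full_jitter(3) for _ in range(100)}
print("distinct delays without jitter:", len(plain))
print("distinct delays with jitter:   ", len(spread))
```

Without jitter the hundred clients share one delay and return as a herd. With jitter their retries scatter across the whole interval, at the cost of some clients retrying sooner than the deterministic schedule would allow.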
