Thundering Herd Simulator, and the four fixes
A thundering herd happens when one event wakes a large set of waiters at once and they all rush the same resource, turning a single trigger into N simultaneous reactions for one unit of useful work. The wake is free for the producer; the cost lands wherever the N woken waiters contend. Pick a flavor, set N, trigger the naive path, then trigger again with the mitigation on.
The grid is your pool of waiters — each tile one of the N processes, threads, or requests sleeping on the same resource (only the first 240 are drawn; the maths still uses the full N). Trigger the event and watch them flip: yellow is woken, green is the one that did useful work, rust-red is a wasted wake that woke, lost the race, and went back to sleep. The four counters underneath tally wakes, useful work (always one), wasted wakes, and load arriving at the origin. The two bars compare p99 latency for the naive path against the mitigated one. Pick a scenario, set N, choose naive or mitigated, then trigger.
Run accept() naive at N=1000, then flip to mitigated and trigger again — naive p99 climbs into milliseconds while mitigated holds near 80 µs, one syscall's worth. Then switch to cache and push N past 200: the origin saturates and p99 grows quadratically, the cache-stampede shape. What should surprise you is how the picture never changes — one event, N reactions, one unit of real work, N−1 wasted — whether the resource is a socket, a mutex, a database, or a cron backend.
All thundering herds are the same shape
One event. N waiters. N reactions. One useful result. The wake is free for the producer; the cost lands on the resource the woken waiters contend for.
Four scenarios; one bug. In accept() herds, N worker processes sleep on a
shared listening socket; one connection arrives and the kernel wakes all of them, of which
N−1 immediately fail their accept and go back to sleep. In a cache stampede,
N concurrent requesters miss the same key at the same instant and each one calls the
origin. In a condvar broadcast, N threads wait on a pthread_cond_t, the
producer calls pthread_cond_broadcast, all N wake, all N race for the same
mutex, N−1 block. In cron synchronization, 800 servers all run 0 * * * *
and at xx:00:00.000 every one of them hits the same backend.
The variants look taxonomically different — sockets, caches, mutexes, schedulers — but
the mechanics are identical. One producer triggers N consumers; the consumers contend on
one downstream resource; only one request was ever needed. The fix space is also
constant: wake fewer (cond_signal, EPOLLEXCLUSIVE), coalesce
duplicates (singleflight), spread in time (jitter, stale-while-revalidate,
XFetch), or spread in space (sharded cache keys, SO_REUSEPORT with one
listen socket per worker, shuffled wake order). A useful reference for why the wasted
wakes are expensive is Mogul & Borg's 1991 The effect of context switches on cache
performance, which measures the cache penalty each context switch carries.
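Of the four fixes, coalescing is the one that fits in a few lines. The sketch below is a minimal Python take on the idea, not a port of Go's singleflight package: the names SingleFlight, do, and dups are illustrative, and it assumes the duplicate callers are threads inside one process.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """One in-flight origin call that any number of callers can share."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None
        self.dups = 0  # followers that joined this call instead of starting their own


class SingleFlight:
    """Hypothetical helper: coalesce concurrent calls for the same key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
            else:
                call.dups += 1

        if leader:
            try:
                call.value = fn()  # the only trip to the origin for this key
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # release every follower at once
        else:
            call.done.wait()  # park until the leader finishes

        if call.error is not None:
            raise call.error
        return call.value
```

Calling sf.do("user:42", load_user) from N threads at once runs load_user a single time (load_user stands in for whatever origin query you have); the other N-1 callers block on the event and return the same value. The herd still wakes when the result lands, but it wakes onto a finished answer rather than onto the origin.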
Cache stampedes deserve a long entry
The most common flavor in modern infrastructure, and the one most likely to take down your origin database during a traffic spike.
A popular item's TTL expires. N concurrent requests for it all miss. Each issues the same expensive query against the origin — typically the database the cache exists to protect. The origin saturates, latency climbs, requests time out. Most fail. The cache stays empty because the upstream queries never completed. The next wave of requests sees another miss. The system is now stuck in a stampede until traffic drops or a human intervenes. Stack Overflow ran a postmortem in 2016 on exactly this pattern; Discord has written about per-key locks for the same reason.
The mitigation menu, ranked roughly by simplicity. Single-flight
(a.k.a. request coalescing): the first miss takes a lock on the key; everyone else
waits for that one origin call to complete and shares the result. Go's
singleflight.Group is the canonical implementation; Discord's cache layer
uses per-key mutexes. Stale-while-revalidate: serve the expired value
while one background task refreshes it. Mark Nottingham's RFC 5861 codified this for
HTTP; Vercel and Next.js use it by default. Probabilistic early refresh
(Vattani et al. 2015, the XFetch paper): each requester computes a small probability
of refreshing before the TTL hits, weighted by how close the value is to expiry and how
long it took to compute. Akamai and Cloudflare use variants of this in production.
Distributed locks per key exist but you should not reach for them
unless the lock acquisition is cheaper than the origin query — usually it isn't, and
you've just moved the herd from the cache backend to the lock service.
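The probabilistic early refresh rule is easier to see as arithmetic. The sketch below follows the test from the XFetch paper: refresh when now - delta * beta * ln(u) >= expiry, with u drawn uniformly from (0, 1]. The in-memory dict, the function names, and the injectable clock and rng parameters are assumptions made so the example is self-contained and testable; a real cache client would replace them.

```python
import math
import random
import time
from typing import Any, Callable, Dict, Tuple

# key -> (value, delta, expiry); delta is how long the last recompute took, in seconds.
_cache: Dict[str, Tuple[Any, float, float]] = {}


def should_refresh_early(
    now: float,
    expiry: float,
    delta: float,
    beta: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """XFetch test: True when now - delta * beta * ln(u) >= expiry."""
    u = 1.0 - rng()  # random.random() is in [0, 1), so u is in (0, 1] and log(u) is defined
    return now - delta * beta * math.log(u) >= expiry


def xfetch(
    key: str,
    ttl: float,
    recompute: Callable[[], Any],
    beta: float = 1.0,
    clock: Callable[[], float] = time.time,
) -> Any:
    """Read through the cache, occasionally refreshing before the TTL runs out."""
    now = clock()
    entry = _cache.get(key)
    if entry is None or should_refresh_early(now, entry[2], entry[1], beta):
        start = clock()
        value = recompute()
        delta = clock() - start  # remember how expensive this value was
        _cache[key] = (value, delta, clock() + ttl)
        return value
    return entry[0]
```

Because -ln(u) is exponentially distributed with mean 1, each requester effectively subtracts a random head start averaging delta * beta seconds from the expiry. Expensive values and larger beta refresh earlier, and the chance that many requesters all pick the same instant stays small.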
accept() and the kernel's eternal struggle
Linux has been fighting wake-all on shared file descriptors for thirty years.
The accept() herd is the original. Before Linux 2.6, multiple processes sleeping in
accept() on the same listening socket all woke when a connection arrived;
only one could succeed. The 2.6 kernel made accept itself exclusive, but
the problem returned the moment people put the listening fd in an epoll
set and slept on epoll_wait instead — epoll's wake semantics defaulted to
wake-all again. Linux 4.5 (2016) added EPOLLEXCLUSIVE, which restores
exclusive wake on a per-epoll basis. NGINX (with the reuseport listen option) and HAProxy
can skip the whole question by using SO_REUSEPORT: each worker creates its own listening
socket bound to the same port, and the kernel hashes the incoming connection's
source/dest IP/port 4-tuple to pick exactly one of the sockets. Wake fan-out is one,
by construction.
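The one-socket-per-worker arrangement can be sketched in a few lines. This assumes a platform that exposes socket.SO_REUSEPORT (Linux 3.9 or later, for example); make_worker_listener is a hypothetical helper name, and in a real server each worker process would call it once at startup.

```python
import socket


def make_worker_listener(port: int, host: str = "127.0.0.1", backlog: int = 128) -> socket.socket:
    """Create one worker's private listening socket on a shared port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(), and on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock
```

Two workers that each call make_worker_listener(8080) end up with separate accept queues. An incoming connection lands in exactly one of them, so only that worker's accept() or epoll_wait() returns.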
The same dichotomy lives inside pthread_cond_t. pthread_cond_signal
wakes exactly one waiter; pthread_cond_broadcast wakes all of them.
Beginners reach for broadcast because it's easier to reason about ("everyone who cares
gets the message"). In practice, broadcast is only correct when the state change
matters to every waiter — for example, "the queue is closing, all of you exit." When
there's one new work item and any waiter can handle it, signal is the right call. The
same logic applies to Go's sync.Cond (Signal vs
Broadcast) and Java's Object.notify vs
notifyAll. The futex layer underneath all of these has the same
distinction — FUTEX_WAKE with val=1 versus
val=INT_MAX.
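The signal-versus-broadcast difference can be counted directly. The sketch below uses Python's threading.Condition, whose notify and notify_all play the roles of signal and broadcast. The trigger function and its counters are illustrative, and each waiter waits only once with a timeout so the experiment can end on its own; production code would loop on the predicate instead.

```python
import threading
import time
from typing import Dict


def trigger(n_waiters: int, broadcast: bool, park_timeout: float = 1.0) -> Dict[str, int]:
    """Park n_waiters on one condition, publish one item, count who woke."""
    cond = threading.Condition()
    state = {"parked": 0, "items": 0, "wakes": 0, "useful": 0}

    def waiter() -> None:
        with cond:
            state["parked"] += 1
            # wait() returns True only when a notify woke us, False on timeout.
            if not cond.wait(timeout=park_timeout):
                return  # never woken: this waiter cost nothing
            state["wakes"] += 1
            if state["items"] > 0:  # won the race for the single item
                state["items"] -= 1
                state["useful"] += 1
            # otherwise this was a wasted wake

    threads = [threading.Thread(target=waiter) for _ in range(n_waiters)]
    for t in threads:
        t.start()

    published = False
    while not published:
        with cond:
            if state["parked"] == n_waiters:  # every waiter is asleep in wait()
                state["items"] = 1
                if broadcast:
                    cond.notify_all()  # wake everyone for one item
                else:
                    cond.notify()  # wake exactly one waiter
                published = True
        if not published:
            time.sleep(0.001)

    for t in threads:
        t.join()
    wakes, useful = state["wakes"], state["useful"]
    return {"wakes": wakes, "useful": useful, "wasted": wakes - useful}
```

With 50 waiters, trigger(50, broadcast=True) reports 50 wakes for one useful result, while trigger(50, broadcast=False) reports a single wake. The unwoken waiters in the second run simply time out, which is why that call takes about park_timeout seconds to return.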
The cron problem nobody catches in code review
Every team eventually writes a cron at 0 * * * * because it's the obvious thing. Then 800 servers fire at the same millisecond.
The cron bug is invisible until it isn't. A handful of servers all running
0 * * * * /usr/local/bin/sync-thing is fine; the backend handles eight
concurrent requests without breaking a sweat. At eight hundred servers, that same hour
boundary becomes an instant DDoS against whatever the cron talks to — a metrics
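endpoint, a license server, a config-fetcher. The cure is one line:
0 * * * * sleep $((RANDOM \% 60)); /usr/local/bin/sync-thing spreads the
fleet across a minute. The backslash matters because cron turns a bare % into a
newline, and RANDOM needs SHELL=/bin/bash set in the crontab, since the default sh
may not define it. For longer-running jobs you want a wider spread; for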
latency-sensitive ones a tighter one. Kubernetes CronJob has
startingDeadlineSeconds but no native jitter — you add it inside the
container.
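Where the jitter has to live inside the job itself, as in a container entrypoint, a small wrapper does the same thing as the sleep in the crontab line. This is a sketch: jitter_delay, run_with_jitter, and peak_per_second are hypothetical names, and the simulation assumes every server draws its offset independently and uniformly.

```python
import random
import subprocess
import time
from collections import Counter
from typing import List, Optional


def jitter_delay(window_s: float, rng: Optional[random.Random] = None) -> float:
    """Pick a uniform start offset in [0, window_s) seconds."""
    rng = rng or random.Random()
    return rng.random() * window_s


def run_with_jitter(command: List[str], window_s: float) -> int:
    """Sleep for a random offset, then run the real job and return its exit code."""
    time.sleep(jitter_delay(window_s))
    return subprocess.call(command)


def peak_per_second(n_servers: int, window_s: float, seed: int = 0) -> int:
    """Simulate a fleet and return the size of its busiest one-second bucket."""
    rng = random.Random(seed)
    buckets = Counter(int(jitter_delay(window_s, rng)) for _ in range(n_servers))
    return max(buckets.values())
```

A job would call run_with_jitter(["/usr/local/bin/sync-thing"], 60.0) as its first step. Comparing peak_per_second(800, 1.0) with peak_per_second(800, 60.0) shows the effect: with a one-second window all 800 servers land in the same second, while a sixty-second window averages roughly 13 per second.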
The same logic applies to anything a fleet schedules off the wall clock: TLS certificate renewal at the same expiry hour, scheduled deploys, fleet-wide log rotation, periodic health-check intervals that all start at the same millisecond after restart. AWS publishes guidance on jittered exponential backoff for retries (the 2015 Marc Brooker post, Exponential Backoff and Jitter) for the same reason: without jitter, clients that failed together retry together, and every retry wave arrives as a new herd. The lesson generalizes: any time you have N independent actors choosing the same moment by some deterministic rule, you are one traffic threshold away from an incident. Add jitter, even when you don't think you need it. Especially when you don't think you need it.
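For retries, the "full jitter" variant described in the Brooker post draws each delay uniformly between zero and a capped exponential ceiling. The sketch below assumes illustrative defaults (a 0.1 s base, a 10 s cap, five attempts) and hypothetical function names; tune the numbers to the service being called.

```python
import random
import time
from typing import Any, Callable, Optional


def full_jitter_delay(
    attempt: int,
    base_s: float = 0.1,
    cap_s: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before retry number `attempt` (0-based): uniform in [0, min(cap, base * 2^attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(
    fn: Callable[[], Any],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, retrying on any exception with full-jitter backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt))
```

The randomness is the point: two clients that fail at the same instant compute different delays, so their second attempts no longer arrive together.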