Distributed Cache Simulator: end to end

Four cache nodes, one origin database, a hashing function in the middle. Issue GETs and SETs and watch the cache route, hit, miss, and evict. Then kill a node, fire a thousand requests at a hot key, or expire a popular entry and see the origin buckle. This is the system every production app puts in front of its database.

hit rate
0.0%
requests
0
dirty
0

policy hash protect ttl scenarios
Ready. 4 nodes online. Origin DB empty.
CLIENT
coordinator
hashes key, routes request, retries on miss per policy
node-0 live
0/8 slots
node-1 live
0/8 slots
node-2 live
0/8 slots
node-3 live
0/8 slots
ORIGIN DB
5–50 ms per query
0 rows · 0 queries
event log · newest first
— quiet —

What you're looking at

The diagram is the whole path a request takes: a client that hashes the key, four cache nodes that each hold a few slots, and the origin database behind them. Type a key and hit GET or SET to watch it route to one node and either hit, miss and pull from origin, or evict the least-recently-used slot when full. The toggles set the read and write policy, the hashing scheme, TTL, and two stampede protections, while the counters and event log track hit rate and how often the origin actually gets touched.

Run the Stampede scenario with no protection first: one expired key fires a thousand concurrent misses and the origin gets hit roughly a thousand times. Turn on single-flight and run it again, and that drops to one. The other moment worth seeing is Node failure. Under hash mod N, killing a node scatters most keys to new homes because the modulus changed; switch to consistent hashing and only about a quarter move. Those two surprises, the herd and the reshuffle, are the reasons distributed caches are harder than putting a map in front of a database.


Cache is a contract about latency, not data

The cache is a separate system that lies. Its job is to serve recent reads fast, not to be the source of truth.

A Redis GET in the same region returns in 0.5 ms at p50; a cold Postgres read against a modest table runs 5–50 ms depending on indexes and buffer-pool state. The cache exists because of this two-order-of-magnitude gap and nothing else. It is not durable. It is not consistent. It is not authoritative. It is the system you put in front of the system that is all those things, so that 99% of reads never bother the authoritative one.

That framing forces two design choices, and every named cache pattern is just a combination of the two. On miss, who fetches from origin — the application (cache-aside) or the cache itself (read-through)? On write, what happens to the data — both cache and DB synchronously (write-through), cache now and DB eventually (write-back), or DB only with the cached copy invalidated (write-around)? Six combinations. Each has a name, a sweet spot, and a failure mode. Redis cluster behind a typed ORM is almost always cache-aside + write-through. A CDN like CloudFront in front of S3 is read-through + write-around. A write-heavy analytics ingest with a Redis buffer in front of ClickHouse is cache-aside + write-back, and someone is paid to watch the dirty queue.

Pick the pattern whose failure mode you can live with. Write-back loses dirty keys if the cache dies before flush. Write-through doubles write latency. Write-around guarantees cold reads on every freshly written key. The contract is what you owe the app on miss, on write, on crash; everything else is implementation.


Consistent hashing solves the wrong problem first

Modulo hashing is fine on a static cluster. It's ruinous the first time you add a node.

With four nodes and hash(key) % 4, every key has a stable home. Add a fifth node and the modulus becomes 5; suddenly hash(k) % 4 ≠ hash(k) % 5 for roughly 80% of keys. Almost every key now hashes to a different node than it lives on. The cache hit rate collapses to noise until traffic re-warms every replica. Production caches did this in the late 1990s and lost. The fix arrived in 1997 with Karger et al.'s consistent-hashing paper from MIT, written for Akamai's CDN.

Consistent hashing maps both keys and nodes onto the same ring (a 32-bit or 128-bit hash space wrapped end to end). A key belongs to the first node clockwise from its hash. Adding a node only steals keys from its clockwise neighbour. Removing a node only gives keys to its clockwise neighbour. Expected disruption is exactly 1/N of the keyspace per change instead of nearly all of it. Virtual nodes — each physical node occupying many points on the ring — fix the load-balancing imbalance you'd otherwise see with only N points.

This is the algorithm under Memcached's Ketama distributor, Cassandra's vnodes (256 tokens per node by default), DynamoDB's partition placement, Riak's preference lists, and DiscordDB. Redis Cluster takes the related-but-different slot-based approach: a fixed 16,384 slots are distributed across nodes, and resharding moves slot ownership rather than rebalancing a continuous ring. The motivation is the same — bound how much state moves when membership changes — but the implementation buys explicit control over slot mapping at the cost of a fixed slot count.


Stampedes are a coordination problem

A popular key's TTL expires. A thousand concurrent requests all miss and all hit the origin at once. Welcome to the thundering herd.

A typical Postgres or MySQL instance handles 5–10× its steady-state QPS before it collapses; the moment a cache-miss surge pushes traffic past that line, query latency goes vertical, lock contention climbs, and the database starts shedding connections. Now every other request that depended on that database is failing too. The cache, which is fine, gets blamed. The actual cause is that a single synchronisation point — the moment a hot key's TTL crosses zero — turned an implicit cooperation problem into a stampede.

Four standard mitigations, all in production somewhere. Single-flight (Go's singleflight package, dating to roughly 2013) deduplicates in-flight requests for the same key — the second request waits on the first's result. Per-key locks in Redis or via etcd do the same thing across processes. Stale-while-revalidate (Mark Nottingham's RFC 5861, 2010) serves the expired value to all comers while a single background request refreshes; the application never blocks on cache miss. Probabilistic early refresh (XFetch, Vattani et al., 2015) refreshes a key before TTL with probability proportional to (time_to_expiry / fetch_duration); the closer to expiry, the more likely any individual request triggers the refresh, so the actual expiry moment never has a synchronised cliff. Discord wrote up exactly this pattern in their cache postmortem after a 2020 outage on the "trending channels" key.

The fourth mitigation is to not have a single hot key. Sometimes that means fan-out — sharding by user-segment so the load spreads across many cache keys — and sometimes it means doing the work upfront. Stack Overflow's home-page data, for years, was generated by a background job that refreshed every 30 seconds regardless of whether anyone asked. The cache served the latest version. There was no miss path.


When the cache is wrong

"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton, attributed.

TTL-only invalidation is the easy choice and the lazy one. The cache holds a copy for n seconds; after that it expires and the next reader pays the miss. It works because most data tolerates being a few seconds stale. It fails when "a few seconds" is too long — pricing, permissions, balance checks. Memcached and Varnish are usually configured with short TTLs for exactly this reason; both lose nothing by expiring early because they're cheap to repopulate.

Explicit invalidation on write is correct but couples the writer to the cache. The service updating Postgres now has to know the cache exists, what key shape it uses, and how to reach it. In a microservice world that knowledge propagates through your codebase like ink in water; one team renames a key prefix and three downstream services stop invalidating. The pattern works fine inside a monolith, and breaks the moment your write paths fragment across teams.

The clean version is change-data-capture: Debezium reads the Postgres WAL, publishes row changes to Kafka, and a small consumer service translates them into cache invalidations. The writer doesn't know the cache exists. The cache stays current within the consumer's lag (usually under a second). The cost is operational — you now run Debezium, Kafka, and an invalidation worker — but the architecture is honest about who owns what. Netflix's EVCache, Shopify's cache layer, and LinkedIn's Couchbase deployment all use a variant of this pattern. The original quote about cache invalidation being one of the two hard things is mostly a joke about how often "just invalidate it" turns out to be the wrong sentence.

Found this useful?