Cache Eviction Simulator: when the cache is full, someone gets evicted.
When a fixed-size cache fills up, the eviction policy decides which entry to drop. Run one trace through LRU, LFU, FIFO, and more, and watch the hit rates diverge. Hit rate is a property of the policy plus your workload, not the policy alone.
The row of slots is the cache, capped at the capacity you pick; each filled slot shows its key plus two numbers — f for how many times it has been accessed, t for the tick it was last touched. Those are the signals each policy reads. Hit the lettered keys to access them by hand, or use Run trace to replay a fixed 16-access stream. The counters up top tally hits, misses, and hit rate, and the recent log marks each access HIT or MISS and names whatever got evicted to make room.
Run the same trace under each policy and the hit rate lands somewhere different every time. That's the point worth absorbing: the trace never changes, only the eviction rule does. LRU evicts by the t column, LFU by the f column, FIFO by insertion order, RANDOM by coin flip. What should surprise you is how close RANDOM stays to LRU on this trace, and how a key with a high frequency count can survive under LFU long after LRU would have thrown it out. No single policy wins — the workload decides.
What is cache eviction?
Memory is small, disk is far.
Cache eviction is the policy a cache uses to decide which item to discard when it is full and a new item arrives. The classical answer — Least Recently Used (LRU) — was studied by László Bélády at IBM in 1966 and is implemented in every major cache today. The simulator above replays a workload trace through LRU, LFU, FIFO, ARC (Megiddo & Modha 2003), and TinyLFU (Einziger et al. 2017) side by side; watch the hit-rate curves diverge as the workload skews.
Imagine your application reads from a database. Each query takes 5 milliseconds. With ten thousand reads per second across the fleet, the database is doing fifty seconds of work every second — impossible. The first instinct is to keep recently-fetched values in memory: when the next request asks for the same key, return it from RAM in ten microseconds instead of going to disk. That memory store is a cache, and the speedup is enormous: a hit avoids the 500-fold slower disk path entirely.
The question is which entries to keep. RAM is finite — say, ten thousand slots. The application has a million keys. You'll keep at most 1% of them. When the cache fills and a new entry needs to come in, something else has to leave. Picking the right victim is the entire problem of cache eviction, and it has surprisingly deep implications: a 1% improvement in hit rate, at Facebook's scale, frees enough origin bandwidth to fill a small fleet of datacentre hosts (Berg et al, OSDI 2020).
The naive choices fail in characteristic ways. FIFO (evict oldest insertion) is simple but ignores access patterns — a hot key inserted long ago gets evicted while a cold key inserted yesterday lingers. Random is a cheap baseline that's easy to reason about but worse than thoughtful policies on every real workload. LRU (least recently used) is the textbook answer: keep an ordered list, move every accessed entry to the front, evict from the back. LRU exploits the empirical fact that recently-used keys are likely to be used again — it works because real workloads have temporal locality.
LRU has one famous failure mode that production caches all bake in defences against: scan invalidation. A nightly analytics job scans every row in a million-row table. Each access pushes a hot entry out of the cache and inserts a cold one that will never be touched again. By the time the scan finishes, the cache is full of one-time visitors, the hot working set has been entirely evicted, and the next morning's user traffic hits a cold cache. Hit rate craters until the working set re-warms. The fix is policies that filter admission — only accept new entries that prove themselves popular — or that segment the cache so a scan can only pollute a small portion. Modern policies (2Q, ARC, W-TinyLFU, S3-FIFO) are all variations on this defence.
The simulator above lets you compare LRU, LFU, and FIFO on a stream of access keys you choose. Type a sequence (some hot keys, some cold, maybe a scan), watch which policy keeps the right entries, and read the hit-rate counter. The mathematical upper bound on any cache's hit rate is Belady's MIN, an offline algorithm from 1966 that evicts the entry whose next reference is furthest in the future — uncomputable in practice but the universal benchmark for what a perfect cache could achieve.
Origins — Bélády's 1966 optimum and its approximations
A 1966 optimum, and the policies that approximate it.
The mathematical foundation of cache replacement appeared in 1966, when László Bélády at IBM published A Study of Replacement Algorithms for a Virtual-Storage Computer in the IBM Systems Journal. The paper proved that the optimal page-replacement policy — later called Belady's MIN — is to evict the page whose next reference is furthest in the future. Given perfect knowledge of the access trace, MIN minimises miss rate; no online algorithm can do better. The proof is short, the algorithm is uncomputable in practice (you don't know the future), but every cache policy since has been judged against the gap between its hit rate and MIN's on a recorded trace. Twenty years later Belady noticed that adding more memory to FIFO sometimes increased the miss rate — Belady's anomaly — which became one of the founding examples in the literature on why cache policies need theoretical analysis, not just engineering intuition.
Two years later Fernando Corbató — later a Turing Award winner for CTSS and Multics — described CLOCK in A Paging Experiment with the Multics System (MIT Project MAC, 1968). CLOCK is FIFO with a single reference bit: the eviction pointer sweeps the cache like a clock hand; if a page's reference bit is set, the bit is cleared and the hand advances; if the bit is clear, the page is evicted. CLOCK approximates LRU at constant time and constant space per slot — exactly the constraints a kernel page replacement subsystem needs. Linux, FreeBSD, and Solaris all use CLOCK variants in their virtual memory managers. Jiang & Zhang's CLOCK-Pro (USENIX 2005) added a second-chance queue to handle scan-resistant patterns where pure CLOCK degrades to FIFO.
LRU — least recently used — is the default mental model for cache replacement and the algorithm most teaching examples lead with. Implemented as a doubly-linked list plus hash map, LRU is O(1) per access; the cost is a write to the link structure on every read, which on shared caches becomes a hot-cacheline bottleneck. LFU — least frequently used — tracks an access counter per entry and evicts the lowest. LFU shines on Zipf-distributed workloads where popularity is stable, but suffers frequency pollution: an item that was hot for an hour and is now cold lingers forever because its counter is high. LFU-DA (with decay) and ARC fix this. FIFO evicts in insertion order regardless of usage; it's the simplest, almost always worse than LRU in practice, but used as a baseline and as a building block in newer hybrid policies.
Why caches work — access patterns aren't uniform
Caches work because access patterns aren't uniform.
If every key were equally likely to be accessed next, no policy could beat random — there's nothing to predict. Real workloads don't look like that. They follow a Zipf distribution: the k-th most popular item is accessed in proportion to 1/ks, with the exponent s typically between 0.7 and 1.2 for web traffic, 1.0 for word frequencies in natural-language text (the original Zipf observation from 1935), and higher (= more skewed) for popular video and music. Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker characterised web access as Zipf-like in their 1999 INFOCOM paper Web Caching and Zipf-like Distributions: Evidence and Implications; the result has been replicated for two decades on every kind of cacheable workload.
Under heavy Zipf, a tiny fraction of keys absorbs most of the traffic. A cache that holds the top 1% of keys captures something like 50% of the requests — even if total memory is a hundred times smaller than the full set. This is why caching works at all; it's why CDNs are sized as a small constant fraction of origin, not a copy. Pei Cao's 1996 dissertation at Princeton, Application-Controlled File Caching and Prefetching, formalised the working-set characterisation that lets a cache designer estimate the necessary capacity from a measured access pattern; the working-set framework had been introduced earlier by Peter Denning (1968) for virtual memory but extended naturally to disk and CDN caches.
Different layers of the storage hierarchy show different shapes. Stephen Wires, Stephen Ingram, Zachary Drudi, Nicholas Harvey and Andrew Warfield's 2014 OSDI paper Characterizing Storage Workloads with Counter Stacks showed disk-backed enterprise traces with miss-ratio curves that knee sharply — doubling cache size halves miss rate up to a threshold, after which adding capacity is wasted. Twitter's 2020 SOSP paper A Large Scale Analysis of Hundreds of In-memory Cache Clusters at Twitter (Yang, Yue, Vigfusson, Mitzenmacher, Rashmi) released production traces from 153 in-memory cache clusters and showed that LRU and LFU are noticeably worse than the best modern policy on those traces — an empirical justification for the 2010s wave of hybrid policies.
The job of an eviction policy is to track and keep the top fraction. LRU does that by recency; LFU does that by frequency. Real workloads have a mix — some items are bursty, some are steadily popular — and modern policies try to be good at both, dynamically tuning the recency-vs-frequency split based on what's working.
Beyond LRU — LFU, ARC, 2Q, and the first wave of replacements
The first wave past plain LRU.
2Q, by Theodore Johnson and Dennis Shasha (VLDB 1994), splits the cache into two queues: A1, a small FIFO for new items, and Am, a longer LRU for items promoted on second access. New keys land in A1; if accessed again, they're promoted to Am. Items evicted from A1 are remembered in a small "out" queue (A1out) so a re-arriving key skips A1 and goes straight to Am. The mechanism filters out scan-once traffic that would otherwise pollute LRU with items that will never be touched again. 2Q was the first practical demonstration that two-list segmented designs beat plain LRU on real database traces; many subsequent algorithms inherit its layered shape.
Segmented LRU (Karedla, Love & Wherry, 1994) is a similar idea: a "probationary" LRU and a "protected" LRU, with promotion on second access and demotion when protected fills. Used by FreeBSD's buffer cache and by Sun's ZFS in its primary ARC. Variations of segmented LRU appear in CDN edge caches and in CPU L2/L3 cache replacement (Intel's Pseudo-LRU is a tree-structured approximation that fits in hardware).
ARC (Adaptive Replacement Cache, Nimrod Megiddo & Dharmendra Modha at IBM, FAST 2003) generalises 2Q with self-tuning. It keeps two LRU lists — T1 for items seen once, T2 for items seen twice or more — and two ghost lists, B1 and B2, recording metadata for items recently evicted from T1 and T2. When a hit occurs in B1, ARC concludes "I should have kept that recency-favouring item longer" and shifts the cache budget toward T1; when a hit lands in B2, it shifts toward T2. The split is rebalanced continuously, with no manual tuning. ARC sits within ~5% of Belady's MIN on most benchmark traces. IBM patented the algorithm; Sun's ZFS implementation routes around the patent by using a slightly different bookkeeping scheme; PostgreSQL adopted CLOCK-Pro before ARC was widely available and has stayed with it.
TinyLFU and W-TinyLFU — the admission filter that closed the LRU gap
The admission filter that closed the LRU gap.
The 2010s wave of cache research realised that admission is as important as eviction. Why admit a stranger that just arrived, evicting an entry with proven popularity? TinyLFU, by Gil Einziger, Roy Friedman and Ben Manes (ACM Transactions on Storage, 2017; original arXiv 2015), pairs a small count-min sketch — a compact probabilistic frequency estimator using only a few bits per entry — with whatever cache structure sits behind it. When a new entry arrives and would evict an existing one, the cache looks up both entries' estimated frequencies in the sketch; only if the newcomer's frequency is higher does the swap happen. The sketch ages entries via periodic halving so a temporarily-hot key doesn't pollute the cache forever.
W-TinyLFU — the windowed variant — adds a small admission window (about 1% of the cache) that always accepts new arrivals; only when an entry ages out of the window does the TinyLFU filter decide whether to admit it to the main cache. This handles bursty traffic patterns and short-lived popularity spikes that pure TinyLFU would reject too aggressively. Ben Manes's Caffeine Java library (2014, Google open-source) implemented W-TinyLFU and became the de-facto Java cache; it's the engine behind Spring Cache, Cassandra's row cache (since 4.0), Solr, Druid, Apache Ignite's near-cache, and dozens of microservices that previously used Google Guava's LRU. Manes's blog (Design of a Modern Cache) is the readable engineering treatment. Ristretto, by Karl Mc Guinness and the Dgraph team (2019, written in Go), brought the same algorithm to Go; it powers Dgraph and several large-scale Go services.
Benchmarks at Caffeine's own simulator project show W-TinyLFU within 2–3% of Belady's MIN on web search, OLTP, and database traces — and 5–15% better than LRU on the same traces, with comparable CPU cost. The admission filter idea has spread: S3-FIFO (Yang, Qiu, Yue, Vigfusson & Rashmi, SOSP 2023) uses three FIFO queues with frequency-based promotion to match W-TinyLFU performance with simpler, lock-free data structures; SIEVE (Zhang, Yang, Wang, Lou, Yue & Rashmi, NSDI 2024) is an even simpler "lazy promotion" variant. The state of the art keeps moving, and modern Java caches like Caffeine track the literature closely.
| Policy | Tracks | Scan-resistant? | Used by |
|---|---|---|---|
| LRU | recency | no | Memcached classic, Guava |
| LFU | frequency | yes | Redis (allkeys-lfu since 4.0) |
| ARC (2003) | recency + frequency, adaptive | yes | ZFS, IBM patents |
| 2Q (1994) | two queues, second-access promotion | yes | Berkeley DB, Fastly Varnish |
| CLOCK / CLOCK-Pro | approximate LRU via reference bit | yes (CLOCK-Pro) | Postgres buffer pool, Linux page cache |
| W-TinyLFU (2017) | window + count-min sketch admission | yes | Caffeine, Cassandra row cache, CacheLib |
Cache eviction in production — Redis, Memcached, Caffeine, Postgres
What real caches actually run.
Memcached, the canonical in-memory cache (Brad Fitzpatrick, 2003, originally for LiveJournal), uses a slab allocator: memory is partitioned into fixed-size chunks per object-size class to reduce fragmentation, and each slab class runs its own LRU. Eviction operates per slab class — a key consequence is that one hot slab class can fully evict its members while another sits with cold but unevicted items, an effect Twitter and Facebook have written about extensively. Memcached's 1.4.27 (2016) introduced "modern LRU" with a hot/warm/cold queue split (essentially segmented LRU) to mitigate the effect. The protocol-level interface is intentionally bare-bones; the cache strategy hides behind get, set, and cas.
Redis (Salvatore Sanfilippo, 2009) exposes its eviction policy as a configuration knob. The maxmemory-policy options are: noeviction (return errors when full), allkeys-lru and volatile-lru (LRU over all or expired-flag keys), allkeys-lfu and volatile-lfu (LFU since 4.0, 2017), allkeys-random and volatile-random, and volatile-ttl (evict the soonest-to-expire). Redis approximates LRU and LFU rather than implementing them exactly: each eviction round samples N candidates (default 5, tunable via maxmemory-samples) and evicts the worst. The approximation is cheap to maintain and within a few percent of true LRU on real workloads.
CacheLib, Facebook's C++ caching library (Berg et al, OSDI 2020 — The CacheLib Caching Engine: Design and Experiences at Scale) consolidates the cache logic of CDN, social-graph, key-value and ML-feature services into one library running on tens of thousands of servers. CacheLib uses W-TinyLFU as the default eviction policy, integrates SSD as a second-tier cache, and supports millions of operations per second per node. The paper's headline number: a 1% improvement in hit rate at Facebook scale saves enough origin bandwidth to fill a small fleet of datacentre hosts. Twitter's Pelikan (an in-house Memcached/Redis fork) uses similar segmented-LRU designs.
CDN edges run their own variants. Cloudflare's edge uses a tiered LRU with admission filtering on the second tier (described in their 2018 blog post on The Curious Case of Cache Hit Ratio). Akamai publishes less detail but the public talks describe Belady-style offline analysis used to tune live policies. Fastly exposes Varnish-style configurable VCL and uses a 2Q-derived structure on its in-process cache. The convergence: nobody runs plain LRU in production at scale anymore. The lesson took twenty years to land but it's now uncontroversial.
// Caffeine: Window-TinyLFU with count-min sketch admission.
// Engine behind Spring Cache, Cassandra row cache, Druid, Solr.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
Cache<Long, User> users = Caffeine.newBuilder()
.maximumSize(10_000) // bounded; W-TinyLFU picks evictees
.expireAfterWrite(Duration.ofMinutes(5))
.recordStats() // hit/miss/loadTime metrics
.build();
User u = users.get(42L, this::loadFromDb); // single-flight on miss
users.invalidate(42L);
double hitRate = users.stats().hitRate(); // observe & tune sizeWhere caches go wrong in production — scan pollution, hot keys, ghost hits
Where caches go wrong in production.
Caches fail in distinctive ways. Thundering herd: a popular key expires, every requester misses simultaneously, and N concurrent requests all stampede the origin to refill the same value. Mitigations: probabilistic early expiration (Vattani, Charikar & Lattanzi, RecSys 2012 — Optimal Probabilistic Cache Stampede Prevention), single-flight coalescing (one origin request per key, all other requesters wait for the result — the standard Go pattern, also implemented by Redis Cluster client libraries), and stale-while-revalidate semantics in HTTP caches (RFC 5861) that serve the stale value while a single async refresh runs.
Hot key contention: a single key dominates traffic and saturates the cache shard responsible for it. Memcached's sharding via consistent hashing places each key on one node; a celebrity user's profile can pin a single Memcached node at 100% CPU. Mitigations: client-side replication of hot keys across multiple shards, request coalescing at the load balancer, or a small in-process cache in front of the distributed cache. Twitter's 2014 article Caching at Scale describes their approach.
Frequency pollution in pure LFU: an entry that was hot during a launch event lingers forever because its counter is high. Decay schemes — LFU-DA, the periodic halving in TinyLFU's sketch, the time-windowed counters in CacheLib — all aim to forget popularity that's no longer fresh. Get the decay rate wrong and you either pollute the cache or churn it unnecessarily.
Scan-resistance failure: a one-time analytical scan reads every row in a table; under plain LRU, the scan's ephemeral pages evict the hot working set, and post-scan hit rate craters until the working set re-warms. 2Q, ARC, segmented LRU, and W-TinyLFU all defend against this, each in their own way; plain LRU does not. Postgres's buffer cache uses CLOCK-Pro precisely to handle this case in a database where periodic VACUUM and SELECT * queries are routine.
Negative caching: caching that "this key does not exist" prevents repeated origin lookups for missing data. Done badly it produces consistency bugs — the absence is cached, the resource is later created, but the negative cache hides it for the TTL. The fix is short TTLs on negative entries, explicit invalidation on create, or bloom-filter front-ends that distinguish "definitely absent" from "may exist; check origin". The scaling consequences are large: Stack Overflow has written about negative cache TTL tuning saving them tens of thousands of database queries per second.
Berg et al, The CacheLib Caching Engine (OSDI 2020), report that at Facebook scale a one-percent improvement in hit rate saves enough origin bandwidth to fill a small fleet of datacentre hosts. The corollary: spend the engineering time picking the right policy. The gap between LRU and W-TinyLFU on real traces is 5–15 %.
Choosing a cache eviction policy — a bet about your workload
A policy is a bet about your workload.
The right cache policy depends on access pattern, working-set size, latency budget, and implementation constraint. The wrong question is "which policy is best?"; the right one is "what does my trace look like, and what gap to Belady do I tolerate?". Three concrete patterns:
For an API response cache with bursty popularity (a tweet goes viral, an article gets shared) and TTLs in the seconds-to-minutes range: W-TinyLFU or S3-FIFO via Caffeine/Ristretto. The window absorbs bursts without polluting; the frequency filter rejects one-off requests cheaply. Don't reach for plain LRU here — it under-performs by 5–15% and wastes admission decisions on items that'll never be re-read.
For a database buffer pool: CLOCK-Pro (Postgres), 2Q (BerkeleyDB historically), or LRU-K (Oracle). The dominant pattern is heterogeneous — some pages are sequentially scanned for analytics, some are randomly hit by OLTP, some sit in the working set for months. The policy must be scan-resistant and tolerant of long-tail accesses. Plain LRU degrades catastrophically on full-table-scan workloads. The choice is rarely user-tunable; the database picks it based on decades of operational experience with that engine.
For a CDN edge with very large object sizes and strong popularity skew: tiered LRU with admission filtering (Cloudflare-style), or slab-based LFU (Akamai-style). The cost of admitting a 100 MB video is high; admit only when you've seen its popularity proven. Object-size weighting often matters as much as request frequency; many CDN policies admit small objects readily and gate larger ones behind multi-request thresholds.
For a process-local memoisation cache with thousands of small entries: plain LRU via a doubly-linked list and hash map is fine. The scale is too small for policy choice to matter; the implementation cost of W-TinyLFU outweighs its hit-rate gain. Caffeine and Guava's LoadingCache cover the Java case; Python's functools.lru_cache, Go's groupcache singleflight, and Rust's moka crate (a Caffeine port) cover the others. Reach for the simplest tool that meets the latency and hit-rate budget; profile before reaching for anything fancier.
Further reading on cache eviction
Primary sources, in order.
- Semicolony · deep diveHow CPU caches work — L1/L2/L3, MESI, false sharingThe hardware-level companion to this simulator. Set associativity, write-back, MESI state machine, false-sharing animation, and 70+ pages of context on the silicon underneath.
- Semicolony · simulatorCPU cache & MESI coherence simulatorMulti-core L1/L3 with line-granular MESI states, manual ops, and a false-sharing scenario. The hardware sibling to this software-cache simulator.
- Bélády · 1966A Study of Replacement Algorithms for a Virtual-Storage ComputerIBM Systems Journal. The paper that defined optimal replacement (MIN) and named what every cache policy is benchmarked against.
- Megiddo & Modha · 2003ARC: A Self-Tuning, Low Overhead Replacement CacheFAST 2003. IBM Research's adaptive replacement cache; ZFS uses it, FreeBSD uses it, the policy that taught the field self-tuning works.
- Johnson & Shasha · 19942Q: A Low Overhead High Performance Buffer Management Replacement AlgorithmVLDB 1994. The two-list design that anticipated ARC and segmented LRU; cited by every later admission-filter paper.
- Einziger, Friedman & Manes · 2017TinyLFU: A Highly Efficient Cache Admission PolicyACM Transactions on Storage (originally arXiv 2015). The count-min-sketch admission filter behind Caffeine and Ristretto.
- Berg et al · 2020The CacheLib Caching Engine: Design and Experiences at ScaleOSDI 2020. Facebook's consolidated caching library, running tens of thousands of nodes; concrete numbers on hit-rate-vs-cost trade-offs.
- Jiang & Zhang · 2005CLOCK-Pro: An Effective Improvement of the CLOCK ReplacementUSENIX 2005. The scan-resistant CLOCK variant Postgres adopted for its buffer cache.
- Yang et al · 2020A Large Scale Analysis of Hundreds of In-memory Cache Clusters at TwitterSOSP 2020. Production traces from 153 cache clusters, with the recommendation that LRU is rarely the right default at scale.
- Yang, Qiu, Yue, Vigfusson & Rashmi · 2023FIFO can be Better than LRU: The Power of Lazy Promotion and Quick DemotionSOSP 2023, the S3-FIFO paper. As simple as FIFO, beats LRU and matches W-TinyLFU on real traces.
- Semicolony guideCaching, layeredWhere each layer earns its keep — browser, CDN, app, database.
- Semicolony guideRedis internalsThe data structures, the persistence story, and the maxmemory-policy menu.