LRU Cache Simulator: a list that reorders itself.
An LRU cache keeps a fixed number of entries and evicts whichever was used least recently. Linked list + hash map = O(1) get, O(1) put, with eviction at the tail. The data structure behind virtually every cache.
The horizontal strip is the cache itself, ordered most-recently-used on the left to least-recently-used on the right; the rightmost cell wears a "next out" tag once the list is full. Each cell shows a key and its value. Put inserts or updates a key at the front; Get looks one up and, on a hit, slides that cell back to the left. The ops log records HIT, MISS, inserted, updated, or which key got evicted, and the meter up top tracks the running hit rate.
Hit Seed, then Get a key that's near the right end and watch it jump to the front — that splice is the whole policy. Now set capacity to 3 and Put four fresh keys in a row: every Put past the third drops the tail. The surprise is how a single scan of cold, one-off keys can flush items you still care about, which is exactly the access pattern that makes plain LRU look bad and motivates the LFU and ARC variants discussed below.
What is an LRU cache?
Why a fixed-size cache has to throw something out.
An LRU cache is a cache that evicts the least recently used item when full. The classical implementation pairs a hash map for O(1) lookup with a doubly-linked list for O(1) recency updates — the two structures are kept in sync. László Bélády studied the theoretical optimum (MIN) at IBM in 1966; LRU is the practical approximation that ships in every production cache.
Imagine you run a website with a million users. Each request needs the user's profile, which lives in a database query that takes about 8 ms. You decide to cache profiles in memory: the first time someone shows up you query the database; on subsequent requests you hand back the cached copy in 80 μs — a hundred-times speedup. So far so good.
Now you discover the problem. A million profiles, at 1 KB each, is a gigabyte of memory. Multiplied across the dozen worker processes you actually run in production, you would need twelve gigabytes just for the cache. Your machine has eight. You have to bound the cache to some smaller size — say, a hundred thousand entries — and that means when the cache is full and a brand-new user arrives, you have to throw something out to make room. The question is: which one?
Pick badly and the cache stops paying for itself. Throw out the wrong entry — say, the entry for the user who is about to make their tenth request in a row — and the next request misses, taking the slow database path again. Now you are paying both the storage cost of caching and the latency cost of not having cached the right thing. Pick well and the cache hit rate stays high; almost every request finds what it needs in memory. The difference between the two pickers, at scale, is whether your site costs a thousand dollars a month or a hundred thousand.
The everyday heuristic is to bet on recency. If a user has just been active, they are probably about to be active again. If a profile has not been touched in an hour, it is probably cold for the rest of the day. So when you have to evict, throw out the entry whose last access is furthest in the past. This policy has a name: Least Recently Used, or LRU. It is the default in Redis, in Memcached, in your operating system's page cache, and in the CPU's L1 cache. The simulator above shows the textbook version: get a key and it slides to the front of the list; insert a new key when the list is full and the rightmost entry — the one nobody has touched in the longest — falls off the end.
The interesting engineering question is not whether to use a recency-based policy. It is how to implement one so that both "find this key" and "mark this key as most-recently-used" run in constant time. That puzzle is the next part.
Origins — Bélády's clairvoyant oracle (1966)
Bélády's clairvoyant page-replacement oracle.
The theory of cache replacement starts with Lazlo Belady's 1966 paper A Study of Replacement Algorithms for a Virtual-Storage Computer at IBM. The setting was paged virtual memory on the 360 system: a small main memory backed by a much larger paging drum, and a stream of page references generated by running programs. Belady asked the obvious counterfactual question. If you could see the entire reference stream in advance, what would the optimal replacement decision look like? His answer, sometimes called MIN or simply Belady's algorithm, is to evict the page whose next reference is furthest in the future. Optimal in the sense that no policy can produce fewer faults on the same trace.
MIN is unimplementable in practice — you cannot see the future — but it is enormously useful as a benchmark. The hit-ratio gap between MIN and any real-world policy is the irreducible cost of not knowing the workload in advance. Decades of research on cache policies amount to chasing that gap. LRU, the policy most production caches default to, evicts the page whose last reference is furthest in the past — recency as a proxy for future reference. The substitution is principled: under workloads with strong temporal locality, recently used pages are likely to be used again soon. Under workloads without that property, LRU degrades smoothly toward random replacement rather than catastrophically.
Belady's same paper produced a second result that became famous in its own right. Belady's anomaly: for the FIFO replacement policy, increasing the cache size can sometimes increase the miss count on a particular reference trace. The result was disturbing because it violated the intuition that more cache should never hurt. The anomaly does not occur for LRU — LRU has the stack property, meaning that the contents of an LRU cache of size k are always a subset of an LRU cache of size k+1 on the same trace, so a larger cache can never miss where a smaller one hits. The stack property is one of the deeper reasons LRU is a sensible default.
The line of work from Belady runs through Aho, Denning, and Sleator-Tarjan into the modern competitive-analysis literature. Sleator and Tarjan in their 1985 paper showed that LRU is k-competitive against the optimal offline policy on a cache of size k — the worst-case ratio of LRU's misses to MIN's misses, for any input, is at most k. That bound is tight in the worst case but conservative in practice. On real workloads LRU often falls within a small factor of MIN, which is why it remains the dominant policy half a century later.
How LRU is implemented — hash map plus doubly-linked list
Hash map plus doubly-linked list, working in concert.
A working LRU implementation needs two operations to be cheap: looking up a key, and moving its entry to the front of the recency order. Either operation alone is easy. Both at once is the small puzzle that makes LRU more than a textbook curiosity. A hash table answers the lookup in expected constant time but knows nothing about order. A doubly-linked list maintains order and supports constant-time splicing but cannot find an entry without a linear scan. The standard implementation uses both, joined at the hip.
The hash table maps each key to a pointer into the linked list. Lookup is a hash table probe, returning the node directly. Splicing the node to the front is a constant-time pointer rewire, doable because the list is doubly linked — each node knows both its predecessor and its successor. Eviction is constant time too, because the tail of the list is always immediately reachable through the list's tail sentinel. The combined data structure supports get and put both in expected O(1), and the constants are small.
The trick has a name in the data-structures literature — it is sometimes called an indexed list — but in practical engineering it is usually written from scratch, because the language's standard library either provides it directly or makes it easy to assemble. Python's collections.OrderedDict exposes move_to_end and popitem, both O(1); functools.lru_cache is implemented on top of an OrderedDict with a small lock around updates. JavaScript's Map preserves insertion order, so a delete followed by a reinsert is the equivalent of moving an entry to the most-recently-used position. Java's LinkedHashMap exposes a constructor flag for access-order traversal that turns the structure into an LRU cache outright.
In lower-level languages the structure is written explicitly. The Linux kernel's list.h provides intrusive doubly-linked-list macros that cooperate with arbitrary structures; the page cache, the dentry cache, and the inode cache all use them. Rust's linked_hash_map and indexmap crates offer the same shape. C++'s standard library does not provide one directly, but a std::list plus an std::unordered_map from key to list iterator gets there in a few dozen lines, with the small subtlety that std::list iterators are stable across insertions and deletions of other elements — a property the implementation depends on.
LRU variants — LFU, ARC, 2Q, and adaptive blends
Recency, frequency, and adaptive blends.
LRU's failure mode is the sequential scan. A single pass over more bytes than the cache holds evicts the entire working set, even if the previous contents were the right ones to keep. LFU — Least Frequently Used — replaces the per-entry recency timestamp with a per-entry counter; on eviction the lowest counter goes. LFU resists the scan because a one-shot scan touches each page exactly once, which is not enough to displace hot pages with high counters. The trade-off runs in the other direction: LFU adapts slowly to phase changes, because old hot keys keep their counters and crowd out newly hot ones. Pure LFU is rarely used; the variants are.
The first widely deployed adaptive policy was ARC (Adaptive Replacement Cache), described by Megiddo and Modha at IBM in their 2003 paper ARC: A Self-Tuning, Low Overhead Replacement Cache. ARC maintains two LRU lists, one for entries seen once (recency) and one for entries seen at least twice (frequency), plus two ghost lists that record recently evicted entries from each. The size split between the two real lists adapts dynamically based on which ghost list a re-reference hits. The implementation is more involved than LRU but the hit-ratio improvements on real workloads are real, sometimes ten or fifteen percentage points. ARC was the default cache policy in ZFS and was used in IBM DB2 until the patents expired and the algorithm propagated.
Postgres uses a different blend called 2Q, due to Johnson and Shasha (1994). 2Q maintains a small FIFO queue for recently-seen pages and a larger LRU queue for pages seen at least twice; pages are promoted from FIFO to LRU on the second access. This is scan-resistant in the same way ARC is — a one-shot scan churns through the FIFO without disturbing the main LRU — but the implementation is much simpler. PostgreSQL's buffer manager and several embedded databases use variants of 2Q.
The current state of the art in software caching is W-TinyLFU, introduced by Einziger and Friedman in their 2017 paper. The structure combines a small admission filter (a Count-Min Sketch tracking approximate frequencies of recently-seen keys) with a tiered LRU cache. New arrivals enter a small "window" segment; survivors are promoted to a "main" segment governed by an LFU eviction rule, but only if their estimated frequency exceeds that of the main segment's eviction candidate. The result is scan-resistant, adapts quickly to phase changes, and consistently approaches Belady's optimum on real-world traces. Caffeine, the Java caching library by Ben Manes, is the canonical implementation; ScyllaDB and several CDN edge caches use the same idea.
| Policy | Signal | Scan-resistant | Where |
|---|---|---|---|
| LRU | recency | no | Redis (allkeys-lru), Memcached |
| LFU | frequency counter | yes | rare in pure form |
| ARC | recency + frequency, ghost lists | yes | ZFS, IBM DB2 |
| 2Q | FIFO probation + LRU main | yes | PostgreSQL buffer manager |
| W-TinyLFU | Count-Min sketch admission | yes | Caffeine, ScyllaDB |
| CLOCK | single referenced bit | approximate | Linux page cache, FreeBSD |
An LRU cache of size k always contains a subset of an LRU cache of size k+1 on the same trace — which means a bigger cache can never miss where a smaller one hits. FIFO doesn't have this property, which is why Belady's anomaly bites it but not LRU.
Why no production OS runs textbook LRU — clock and approximations
Why no production OS runs textbook LRU.
Strict LRU is exact. Every access updates a global recency order. In an operating-system page cache that property is unaffordable: every memory reference in the system would have to lock and modify a single shared list. The contention destroys throughput long before the hit-ratio benefit pays off. Production page replacement uses approximations of LRU that trade a small amount of accuracy for an enormous reduction in synchronisation cost.
The classic approximation is CLOCK, sometimes called the second-chance algorithm. Pages are arranged in a circular array; each carries a single "referenced" bit set by hardware whenever the page is touched. The replacement hand sweeps the ring; when it finds a page with the bit set, it clears the bit and moves on; when it finds a page with the bit clear, that page is evicted. The policy approximates LRU because pages that are referenced often are skipped repeatedly and rarely chosen for eviction. The implementation is lock-free for the common case and scales to billions of pages.
Linux uses a two-list refinement called the active/inactive list design. Pages live on either the inactive list or the active list; reference bits move them between lists in a controlled way. Every time the kernel needs free pages it scans the inactive list with a CLOCK-style sweep. The mm/swap.c and mm/vmscan.c files contain the algorithm, which has accumulated dozens of patches over the years to handle workloads — sequential reads, large mmap regions, tmpfs pressure — that the basic version mishandles. The Linux multi-generational LRU work, merged in kernel 6.1, replaces the active/inactive split with multiple generations and dramatically improves behaviour on memory-pressured systems.
Memcached and Redis face similar pressures and reach for similar tricks. Memcached's eviction is segmented LRU plus a background "crawler" thread that periodically sweeps for expired entries. Redis 4 introduced an LFU eviction policy alongside LRU; both are implemented as approximations — Redis samples a small number of keys at eviction time and removes the worst of the sample, rather than maintaining a true global order. The sample size is configurable through maxmemory-samples; raising it improves accuracy at modest CPU cost.
# Textbook LRU in 8 lines — hash map + ordered dict
class LRUCache:
def __init__(self, cap):
self.cap, self.od = cap, OrderedDict()
def get(self, k):
if k not in self.od: return -1
self.od.move_to_end(k) # O(1) splice to MRU
return self.od[k]
def put(self, k, v):
if k in self.od: self.od.move_to_end(k)
self.od[k] = v
if len(self.od) > self.cap:
self.od.popitem(last=False) # evict LRU tailWhen the cache stops paying for itself — eviction-rate signals
When the cache stops paying for itself.
The first failure mode is cache stampede. A popular key expires; many concurrent readers all miss simultaneously; all of them race to recompute or refetch the value; the backend is hit with a thundering herd. The classical mitigations are request coalescing (only one in-flight refetch per key), probabilistic early refresh (start refreshing before the expiry deadline with probability that grows as the deadline approaches), and lazy revalidation through stale-while-revalidate semantics. Anyone running a cache in front of a slow origin will eventually meet this failure mode and need to choose between coalescing and pre-warming.
The second is scan pollution. A backup job, a one-off ETL, or an analytics query reads the entire dataset once. With pure LRU the read displaces every hot key in the cache; the hit ratio collapses for several minutes afterward as the working set repopulates. The mitigation is a scan-resistant policy (2Q, ARC, W-TinyLFU) or a separate cache region for cold workloads that does not interfere with the hot region. Postgres' choice of 2Q is a direct response to this failure; analytics queries hitting the buffer pool would otherwise crater OLTP latency.
The third is hot-key contention. A small number of keys see overwhelming traffic; the lock or shard that owns those keys becomes a serialisation point. The cache hit ratio is fine but throughput plateaus because every concurrent reader is waiting on the same critical section. Mitigations include sharding the cache by key hash, using lock-free hash maps, replicating hot keys across shards, and serving hot keys from per-thread caches. The last technique is what Memcached's slab allocator implicitly enables; busy threads keep their own working set in CPU-local caches.
The fourth is working-set drift. The cache size was sized for last quarter's workload; this quarter's distribution has shifted; the hit ratio has dropped without any visible incident. Without monitoring you do not notice. The instrument that catches this is a continuous hit-rate-vs-cache-size curve estimated online from sampled accesses; CacheLib (Meta's cache library) and several CDN providers run continuous estimation and alert when the curve crosses a threshold. The fix is usually to grow the cache or to shed some traffic to a lower tier.
Where LRU and friends actually run — Linux, Postgres, Redis, Caffeine
Where these algorithms actually run.
Linux's page cache is the largest LRU-style cache in everyday computing. It uses an active/inactive two-list approximation with refault tracking; the kernel can detect when an evicted page is re-referenced shortly after eviction (a refault) and grow the corresponding list at the expense of the other. Multi-generational LRU, merged in kernel 6.1 after Google deployed it internally, replaces the two lists with several generations and integrates with reclaim throttling more cleanly. The eviction work itself runs on the kswapd kernel threads.
Memcached uses segmented LRU per slab class. Each slab class is sized for objects in a fixed size range; within a class entries are LRU-ordered. The reasons for slabbing are really about fragmentation rather than caching, but the per-slab LRU is a real LRU, with the nice property that a flood of one size class does not displace another. The data structure supporting it is the doubly-linked list plus hash map of textbook LRU, scaled to handle hundreds of thousands of operations per second per core.
Redis defaults to noeviction when memory fills and refuses writes; the maxmemory-policy option enables eviction. The choices are LRU, LFU, random, and TTL-based, each in two flavours — over all keys or over keys with an explicit expiry. The implementation samples a small number of keys at eviction time and evicts the worst of the sample. Caffeine in the Java ecosystem is the W-TinyLFU implementation that replaced Guava's LRU caches in many JVM applications; it consistently achieves higher hit ratios than the structures it displaced.
CDNs care about cache replacement more than almost anyone else. The economics of an edge POP are dominated by hit ratio: a hit costs nothing, a miss costs a backbone fetch. Akamai, Cloudflare, and Fastly each run their own multi-tier caching stack with policies tuned to web-traffic skew. Cloudflare published in their engineering blog the journey from LRU to a TinyLFU-flavoured policy on their edge tier; the gains were measured in tens of basis points of hit ratio, which translates to substantial bandwidth savings at their scale. Meta's CacheLib, open-sourced in 2021, is a general-purpose caching engine used inside dozens of Meta services; it ships with multiple policies and lets the operator pick.
Hit-rate curves and the working-set knee
Hit-rate curves and the working-set knee.
The fundamental cache-sizing question is how hit ratio depends on cache size. Plot the hit ratio against cache size and you almost always get a curve with a sharp knee: below some threshold the cache is too small to hold the working set and the hit ratio is very low; above the threshold the curve flattens because you are catching diminishing-returns long-tail accesses. The knee is the working-set size; sizing the cache there is usually the right answer. Sizing below the knee throws money away on a cache that does not work; sizing far above it throws money away on memory that is not pulling its weight.
Estimating the curve online without re-running the workload at multiple sizes is the trick that distinguishes a sophisticated cache from a basic one. Mattson's stack-distance algorithm, published in 1970, does it analytically: maintain a single LRU stack, record the depth at which each access hits, and you can read off the hit ratio for any cache size up to the stack's height. Modern estimators (counter-stacks, SHARDS, KOSMIK) make the technique cheap enough to run in production on real traffic.
Two further knobs matter in practice. The first is negative caching: caching not only successful lookups but also misses, so that a key known to be absent does not require a backend hit on every request. DNS resolvers do this aggressively; HTTP caches do it via short-lived 404 cache entries. The risk is that a transient absence becomes a long-lived false negative if the negative cache TTL is too long. The second is cache key normalisation: making sure that semantically equivalent requests share a cache entry. Query-string ordering, header capitalisation, and locale variants all cause cache fragmentation if not handled.
Finally, observability matters more than the algorithm. A cache without hit-ratio metrics, eviction counters, refault rates, and key-space size estimates is a black box. A cache with those metrics is debuggable; you can see scan pollution as a sudden eviction spike, hot-key contention as latency tail, and working-set drift as a slow downward trend in hit ratio. Most production cache outages are not caused by the choice of replacement policy. They are caused by a slow drift in the workload combined with the absence of an instrument that would have shown the drift.
Two related questions show up at scale and deserve mention. The first is whether the cache should sit in-process, side-car, or remote. In-process caches have the lowest latency and the worst sharing properties; the working set lives in one address space and other processes pay full miss cost. Remote caches reverse the trade. A side-car or shared-memory cache splits the difference; this is what tools like mcrouter in front of Memcached, or redis-cluster, provide. The second is whether to use a write-through, write-back, or write-around policy. Write-through is simplest and safest; write-back hides write latency at the cost of durability complexity; write-around keeps the cache focused on reads and avoids polluting it with values that may never be read again. The right choice depends on the read-to-write ratio and on how much you trust the persistence layer behind the cache.
Further reading on LRU caching
Primary sources, in order.
- Belady · 1966A Study of Replacement Algorithms for a Virtual-Storage ComputerThe original optimal-replacement paper. Defines MIN; proves it is unbeatable; introduces the anomaly that bears Belady's name.
- Megiddo & Modha · 2003ARC: A Self-Tuning, Low Overhead Replacement CacheThe IBM paper that introduced adaptive replacement; later deployed in ZFS and DB2.
- Einziger & Friedman · 2017TinyLFU: A Highly Efficient Cache Admission PolicyThe sketch-based admission filter behind W-TinyLFU; the structure used by Caffeine.
- Johnson & Shasha · 19942Q: A Low Overhead High Performance Buffer Management Replacement AlgorithmThe two-queue scan-resistant policy that PostgreSQL uses in its buffer manager.
- Caffeine wikiEfficiency — W-TinyLFU in productionHit-ratio comparisons across LRU, ARC, and W-TinyLFU on real workloads, by the library's author.
- Meta engineering · 2021CacheLib — the general-purpose caching engineAn open-source caching engine with multiple policies, used inside dozens of Meta services.
- LWN · 2021Multi-generational LRUThe redesign of Linux's page-replacement that landed in kernel 6.1.
- Kleppmann · 2017Designing Data-Intensive Applications — chapter on Storage and RetrievalCache, buffer pool, and page-replacement strategies for systems engineers. The standard practitioner's reference.
- Cloudflare blogWhy we started putting unpopular assets in memoryA practical look at how cache admission and tiered storage interact at CDN scale; ties directly to W-TinyLFU.
- MIT OCW · 6.006Lecture 15 — Caches and AmortizationFree undergrad lecture covering LRU, paging, and amortised analysis; the student-friendly entry point.
- ByteByteGoTop 5 Caching Strategies ExplainedA short whiteboard intro to write-through, write-back, cache-aside, read-through, and refresh-ahead. Useful as a sanity check on terminology.
- Semicolony guideCachingCache hierarchies, write policies, invalidation, and the failure modes that bite in production.
- Semicolony simulatorCache eviction policiesLRU vs LFU vs FIFO vs random under the same access trace, side by side.