CPU caches
A CPU cache is the small, fast memory between the registers and DRAM. It exists because DRAM has barely improved in latency in 25 years while CPUs got 100× faster, and the only way to keep a core fed is to keep its working set in something closer. Modern chips have three or four cache levels, ~64 bytes per line, set-associative, write-back, and coherent across cores via a protocol called MESI. Almost every "why is this code so slow?" question on modern hardware is a cache question.
Eight orders of magnitude
This is the gap. Each bar is one layer of the hierarchy, drawn on a log scale. A register read and an HDD seek differ by 10⁷ — seven zeros. The whole point of a cache is to keep the memory traffic on the left half of this chart.
The hierarchy on a modern chip
Three levels is now standard. L1 is split between instructions and data so the fetcher and the load/store unit don't fight; L2 is unified per core (Intel, AMD) or per cluster (Apple); L3 is one big slab shared across the whole socket. Sizes and latencies, measured on three current chips:
| Level | Apple M4 (P-core) | Intel Raptor Lake (P-core) | AMD Zen 5 |
|---|---|---|---|
| L1 instruction | 192 KB | 32 KB | 32 KB |
| L1 data | 128 KB | 48 KB | 48 KB |
| L2 | 16 MB shared / cluster | 2 MB / core | 1 MB / core |
| L3 | none (system cache: 24 MB) | 36 MB shared | 32 MB / CCX |
| Cache line | 128 bytes | 64 bytes | 64 bytes |
| L1 hit latency | ~3 cycles | ~5 cycles | ~4 cycles |
| L2 hit latency | ~16 cycles | ~14 cycles | ~14 cycles |
| L3 / SLC latency | ~50 cycles | ~45 cycles | ~50 cycles |
| DRAM | ~250 cycles | ~250 cycles | ~270 cycles |
Apple's L1 data cache is 4× larger than Intel's (physical-address-tagged plus a 16 KB
page size lets it stay fast at that capacity), and Apple's cache line is 128 bytes —
twice the x86 norm. Code padded to 64-byte boundaries on x86 still
false-shares on Apple silicon.
A line, a set, an index
Caches don't store individual bytes — they store fixed-size lines, 64 bytes on x86, the unit of transfer to and from memory and the unit of coherence between cores. A 32 KB L1 with 64-byte lines holds exactly 512 lines. Where each line can live depends on the cache's associativity:
- Direct-mapped (1-way). Each address has exactly one slot. Cheap but pathological — two hot variables that map to the same slot evict each other on every access.
- Fully associative. Any address can live in any slot. Best hit rate but the lookup compares the tag against every slot in parallel — only used in tiny caches like the TLB.
- N-way set-associative. The middle ground every real L1/L2/L3 picks. The cache is divided into
Ssets, each holdingNlines (the "ways"). An address picks its set by bits[6 .. 6 + log₂ S − 1], then any of N ways inside that set will do.
Skylake-era L1: 32 KB, 8-way, 64-byte lines → 64 sets. Bits 0–5 are the byte offset, bits 6–11 are the set index, the remaining bits are the tag. Drag the slider to see how the bit decomposition shifts:
Replacement — choosing a victim
When all N ways in a set are full and you need to bring in a new line, one of them has to leave. The choice — the replacement policy — is implemented in metadata bits attached to each set and updated on every access.
- LRU. Tracks usage order across the N ways. Optimal-ish, metadata cost grows fast — perfect LRU on an 8-way set needs ~16 bits of state per set.
- NRU / pseudo-LRU. One bit per way, periodically reset. Cheaper than LRU and within ~2% on hit rate.
- RRIP (Re-Reference Interval Prediction). What modern Intel L3 uses. Each line gets a 2-bit "re-reference prediction"; streaming loads are inserted high so they don't pollute the cache. Significantly better than LRU on workloads exceeding capacity.
- Random. Pick any way. Surprisingly competitive on large caches with random access. ARM Cortex-A series uses pseudo-random in some configurations.
Write policy — through, back, and allocate
A cache that holds reads is half a cache. The other half is what happens on a store. Two orthogonal choices:
- Write-through updates the cache and writes to the next level on every store. Simple, eats memory bandwidth. Used in some L1-instruction caches and tiny embedded designs.
- Write-back updates only the cache and marks the line dirty. The dirty data is written out lazily when the line is evicted. Every modern data cache is write-back.
- Write-allocate / no-write-allocate. When a store misses, do you bring the line in (allocate, then write) or write straight through? Modern CPUs are write-allocate at L1/L2, often no-allocate at L3 for streaming-store instructions like x86's
MOVNTDQA.
Coherence — MESI
One core would be easy. Eight cores all caching the same DRAM is the hard part. Coherence is the property that a load from one core sees a sane value given what another core's store did. Each cache line carries a state and the cores exchange messages to keep states consistent.
Modern Intel uses MESIF (adds Forward — one of the S-state copies is designated as the responder so DRAM doesn't have to step in). AMD uses MOESI (adds Owned — a dirty copy can be shared without first writing back to memory; the owner is responsible for eventual writeback). Both are bandwidth optimisations on top of the same idea.
False sharing — the production-grade footgun
MESI works at the granularity of a cache line. Two variables on the same line are coherent as if they were one variable, even if logically they have nothing to do with each other. Two threads on different cores each writing their "own" counter, where those counters happen to be 16 bytes apart, will MESI-invalidate each other on every store.
// Looks fine. Each thread updates its own counter.
struct Counters {
long a; // thread 1 increments this
long b; // thread 2 increments this ← same 64-byte line as 'a'
};
// Throughput: ~5 million ops/sec/thread (10 M total)
// Single-thread baseline: 800 million ops/sec.
// MESI bouncing the line between cores: 80× slowdown.
// Fix: pad each counter onto its own line.
struct CountersPadded {
long a;
char _pad[56]; // round up to 64 bytes
long b;
char _pad2[56];
};
// Throughput: ~750 M ops/sec/thread. Restored.The Linux kernel has ____cacheline_aligned for this. You'll find it
throughout any data structure accessed by per-CPU code. The same pattern shows up in
Java's @Contended, .NET's StructLayout, Go's manual padding
in runtime/sched.go, and Rust's crossbeam_utils::CachePadded.
On Apple silicon every one of those needs to be aware of the 128-byte line — a
64-byte pad is half the protection you think it is.
perf c2c on Linux records
the cache-line address of every coherence miss and shows you the contended lines and
the call sites that touched them. A three-line struct with two unrelated atomics in
it will show up as the hottest line in the binary.Cache-friendly access patterns
Hardware does most of the work — prefetchers detect strided patterns and pull lines ahead, the speculative core hides latency by issuing loads early. Two patterns defeat all that engineering: random access into large structures, and pointer chasing. The classic exhibit:
Other patterns worth knowing:
- Array of structs vs struct of arrays. If the hot loop only touches one field, AoS wastes ~80% of every cache line on data the loop ignores. SoA brings the wasted bytes to zero. Particle simulators that update
positionin one loop andvelocityin another are the canonical case. - Pointer chasing. A linked list scattered across the heap can't be prefetched — the address of node N+1 is only known after N loads. Throughput drops to one node per L2/L3 latency. Replacing with a flat array is often a 5–20× win.
- Hot/cold splitting. If a struct has 200 bytes but the hot loop reads only 16, splitting hot/cold doubles density on the hot half. Used in Postgres tuples, RocksDB indices, ECS layouts in game engines.
Cache misses, classified
A taxonomy from Mark Hill's 1989 thesis, still useful: every miss is one of four kinds, and the fix depends on which.
| Kind | Cause | Fix |
|---|---|---|
| Compulsory (cold) | First-touch miss for any line | Software prefetch hints; warm-up loops |
| Capacity | Working set exceeds cache | Tiling, blocking, smaller types; level up |
| Conflict | Hot lines map to same set | Spread addresses across sets; raise associativity |
| Coherence | Another core invalidated my line | False-sharing pad; rethink sharing pattern |
On Linux, perf stat -e LLC-load-misses gives you the rough miss count;
perf c2c separates the coherence misses; pmu-tools toplev.py
gives a top-down breakdown attributing stalls to the right level. On Apple silicon,
Instruments' "CPU Counters" template surfaces the same data through the Apple PMU.
Inclusive, exclusive, and the L3
An L3 can hold a copy of every line in any L1 / L2 (inclusive), only the lines that aren't in any L1 / L2 (exclusive), or anything (non-inclusive non-exclusive — Intel's L3 since Skylake-X). The choice has subtle but real consequences:
- Inclusive L3 — easy snoop filtering. A request from another socket only checks the L3. Cost: a 1 MB L2 effectively shrinks the usable L3 by 1 MB.
- Exclusive L3 (AMD's design through Zen 1) — full L1+L2+L3 capacity is usable. Cost: every L2-to-L1 promotion has to check L3 too.
- NINE (Intel post-Skylake-X) — middle ground. L3 holds whatever was last evicted from L2; a snoop maintains a separate snoop filter to know what's in upper levels.
Apple's "system-level cache" isn't really an L3 — it's a unified cache shared between CPU, GPU, Neural Engine, and the display controller, used as a coherence backstop for everything that touches DRAM.
Prefetching
Modern CPUs aggressively prefetch lines into cache before the program asks for them. The hardware watches recent miss addresses, identifies a pattern, and issues loads ahead. There are usually two or three prefetchers in parallel:
- Next-line. On a miss to line N, prefetch line N+1. Catches sequential reads. Almost always on.
- Stride. Detects fixed-stride patterns ("read 0, 32, 64, 96, …") and continues. Catches array traversal with non-unit element size.
- Stream. A more general version of stride that tracks multiple concurrent streams, used at L2/L3.
Prefetching is often the difference between memory bandwidth being the limit and
memory latency being the limit. A loop reading 8 GB sequentially can hit ~50 GB/s on
DDR5; the same 8 GB read randomly hits ~1 GB/s because each access pays full DRAM
latency. Software prefetch hints (__builtin_prefetch,
_mm_prefetch) help when the pattern is too irregular for the hardware
but the program knows the future — the canonical case is binary-tree traversal where
you prefetch both children while processing the parent.
Where caches stop helping
Not every workload is cache-shaped. The cases where the hierarchy actively hurts:
- Streaming writes. Writing 10 GB sequentially with no plan to re-read pollutes the cache for nothing. Fix: non-temporal stores (
movnti,vmovntdqon x86,STNPon ARM) bypass the cache entirely. - One-shot huge reads.
mmap+ a single-pass scan fills the page cache with data you'll never see again. Linux exposesposix_fadvise(POSIX_FADV_NOREUSE)andO_DIRECT. - Heavily contended atomics. A counter every thread increments will always be a coherence miss. Architectural fix: per-thread counters reduced occasionally; the cache hierarchy can't help.
- Workloads that exceed L3. Once your working set is bigger than ~30 MB on x86 or ~24 MB on Apple silicon, the cache hierarchy has given up and you're DRAM-bound. The fix is algorithmic — tile the computation so the inner loop fits in L1 or L2, sweep through data in blocks.
Numbers worth remembering
| Operation | Time (3 GHz core) | Cycles |
|---|---|---|
| Register read | ~0.3 ns | 1 |
| L1 hit | ~1 ns | 3–5 |
| L2 hit | ~3–5 ns | 10–16 |
| L3 hit (same socket) | ~12–15 ns | 40–50 |
| DRAM (same socket) | ~80 ns | ~250 |
| DRAM (remote socket, NUMA) | ~140 ns | ~420 |
| Branch mispredict | ~5 ns | ~16–20 |
| Atomic CAS (uncontended, in L1) | ~5 ns | ~15 |
| Atomic CAS (contended, MESI bouncing) | ~50–100 ns | ~150–300 |
| NVMe Gen5 random read (4 KB) | ~10 µs | ~30,000 |
Try it yourself
A live multi-core MESI simulator with a false-sharing scenario built in:
CPU cache & MESI coherence
Eight cores, three cache levels, line-granular MESI states. Click an address to issue a load; watch the protocol invalidate, fetch, or hit. False-sharing demo with padded vs unpadded throughput numbers.
Open simulatorFurther reading
- Ulrich Drepper — What Every Programmer Should Know About Memory (2007) — the canonical 100-page walkthrough. Slightly dated on numbers; the structure is timeless.
- Sorin, Hill & Wood — A Primer on Memory Consistency and Cache Coherence — free PDF, the textbook on coherence protocols.
- Agner Fog — The microarchitecture of Intel, AMD, and VIA CPUs — exact pipeline and cache behaviour for every consumer chip back to Pentium.
- Chips and Cheese — modern microarchitecture deep dives, fresh measurements of L1/L2/L3 latency and bandwidth on newly-released chips.
- CMU 15-213 — Computer Systems: A Programmer's Perspective — the cache lectures (Lecture 11 onward) are the most engineer-facing introduction in print.
- Wikipedia — MESI protocol — fast lookup for the state-transition table.
- man perf-c2c — the tool for finding cache-line contention in production binaries.