09 / 15
Internals / 09

CPU caches

A CPU cache is the small, fast memory between the registers and DRAM. It exists because DRAM has barely improved in latency in 25 years while CPUs got 100× faster, and the only way to keep a core fed is to keep its working set in something closer. Modern chips have three or four cache levels, ~64 bytes per line, set-associative, write-back, and coherent across cores via a protocol called MESI. Almost every "why is this code so slow?" question on modern hardware is a cache question.


Eight orders of magnitude

This is the gap. Each bar is one layer of the hierarchy, drawn on a log scale. A register read and an HDD seek differ by 10⁷ — seven zeros. The whole point of a cache is to keep the memory traffic on the left half of this chart.

Register~0.3 nsL1 cache~1 nsL2 cache~3–5 nsL3 cache~12–15 nsDRAM (local)~80 nsDRAM (NUMA)~140 nsNVMe Gen5~10 µsDatacenter RTT~250 µsHDD seek~5 msIntercontinental RTT~150 ms
The number that drives everything: a 3 GHz CPU executes ~3 instructions per cycle, so an L3-miss stall of 80 ns wastes 720 instruction slots. A 1% DRAM-miss rate on hot loads costs ~7 of every 10 cycles to memory. Tuning a hot loop is mostly tuning the cache miss rate.

The hierarchy on a modern chip

Three levels is now standard. L1 is split between instructions and data so the fetcher and the load/store unit don't fight; L2 is unified per core (Intel, AMD) or per cluster (Apple); L3 is one big slab shared across the whole socket. Sizes and latencies, measured on three current chips:

LevelApple M4 (P-core)Intel Raptor Lake (P-core)AMD Zen 5
L1 instruction192 KB32 KB32 KB
L1 data128 KB48 KB48 KB
L216 MB shared / cluster2 MB / core1 MB / core
L3none (system cache: 24 MB)36 MB shared32 MB / CCX
Cache line128 bytes64 bytes64 bytes
L1 hit latency~3 cycles~5 cycles~4 cycles
L2 hit latency~16 cycles~14 cycles~14 cycles
L3 / SLC latency~50 cycles~45 cycles~50 cycles
DRAM~250 cycles~250 cycles~270 cycles

Apple's L1 data cache is 4× larger than Intel's (physical-address-tagged plus a 16 KB page size lets it stay fast at that capacity), and Apple's cache line is 128 bytes — twice the x86 norm. Code padded to 64-byte boundaries on x86 still false-shares on Apple silicon.

A line, a set, an index

Caches don't store individual bytes — they store fixed-size lines, 64 bytes on x86, the unit of transfer to and from memory and the unit of coherence between cores. A 32 KB L1 with 64-byte lines holds exactly 512 lines. Where each line can live depends on the cache's associativity:

  • Direct-mapped (1-way). Each address has exactly one slot. Cheap but pathological — two hot variables that map to the same slot evict each other on every access.
  • Fully associative. Any address can live in any slot. Best hit rate but the lookup compares the tag against every slot in parallel — only used in tiny caches like the TLB.
  • N-way set-associative. The middle ground every real L1/L2/L3 picks. The cache is divided into S sets, each holding N lines (the "ways"). An address picks its set by bits [6 .. 6 + log₂ S − 1], then any of N ways inside that set will do.

Skylake-era L1: 32 KB, 8-way, 64-byte lines → 64 sets. Bits 0–5 are the byte offset, bits 6–11 are the set index, the remaining bits are the tag. Drag the slider to see how the bit decomposition shifts:

011111111111111110101011110011010001001001000000
tag · 36 bits · which line set index · 6 bits · set #9 byte offset · 6 bits · byte +0
way 0
way 1
way 2
way 3
way 4
way 5
way 6
way 7
set 6
set 7
set 8
set 9
set 10
set 11
set 12
set 13
A window onto eight rows of an 8-way 64-set L1. The hit row is highlighted, way 0 stands in for "your line lives here". The hardware compares all 8 tags in parallel in ~1 cycle.

Replacement — choosing a victim

When all N ways in a set are full and you need to bring in a new line, one of them has to leave. The choice — the replacement policy — is implemented in metadata bits attached to each set and updated on every access.

  • LRU. Tracks usage order across the N ways. Optimal-ish, metadata cost grows fast — perfect LRU on an 8-way set needs ~16 bits of state per set.
  • NRU / pseudo-LRU. One bit per way, periodically reset. Cheaper than LRU and within ~2% on hit rate.
  • RRIP (Re-Reference Interval Prediction). What modern Intel L3 uses. Each line gets a 2-bit "re-reference prediction"; streaming loads are inserted high so they don't pollute the cache. Significantly better than LRU on workloads exceeding capacity.
  • Random. Pick any way. Surprisingly competitive on large caches with random access. ARM Cortex-A series uses pseudo-random in some configurations.

Write policy — through, back, and allocate

A cache that holds reads is half a cache. The other half is what happens on a store. Two orthogonal choices:

  • Write-through updates the cache and writes to the next level on every store. Simple, eats memory bandwidth. Used in some L1-instruction caches and tiny embedded designs.
  • Write-back updates only the cache and marks the line dirty. The dirty data is written out lazily when the line is evicted. Every modern data cache is write-back.
  • Write-allocate / no-write-allocate. When a store misses, do you bring the line in (allocate, then write) or write straight through? Modern CPUs are write-allocate at L1/L2, often no-allocate at L3 for streaming-store instructions like x86's MOVNTDQA.
Why this matters: the write-back miss is more expensive than the read miss. A store that misses brings the line into L1, the old occupant of that L1 slot (if dirty) goes to L2, L2's evicted line (if dirty) goes to L3, and so on. A single dirty cache eviction can cascade three writebacks deep before anything new lands in L1.

Coherence — MESI

One core would be easy. Eight cores all caching the same DRAM is the hard part. Coherence is the property that a load from one core sees a sane value given what another core's store did. Each cache line carries a state and the cores exchange messages to keep states consistent.

core A I
Invalid — no valid copy here
core B I
Invalid — no valid copy here
bus log
Both cores start Invalid (I) — no cached copy.

Modern Intel uses MESIF (adds Forward — one of the S-state copies is designated as the responder so DRAM doesn't have to step in). AMD uses MOESI (adds Owned — a dirty copy can be shared without first writing back to memory; the owner is responsible for eventual writeback). Both are bandwidth optimisations on top of the same idea.

False sharing — the production-grade footgun

MESI works at the granularity of a cache line. Two variables on the same line are coherent as if they were one variable, even if logically they have nothing to do with each other. Two threads on different cores each writing their "own" counter, where those counters happen to be 16 bytes apart, will MESI-invalidate each other on every store.

core A · counters[0]++
cache line
core B · counters[1]++
cache line
unpadded · same line
0
invalidations
padded · separate lines
0
invalidations (≈ 50× fewer)
// Looks fine. Each thread updates its own counter.
struct Counters {
    long a;   // thread 1 increments this
    long b;   // thread 2 increments this  ← same 64-byte line as 'a'
};

// Throughput: ~5 million ops/sec/thread (10 M total)
// Single-thread baseline: 800 million ops/sec.
// MESI bouncing the line between cores: 80× slowdown.

// Fix: pad each counter onto its own line.
struct CountersPadded {
    long a;
    char _pad[56];   // round up to 64 bytes
    long b;
    char _pad2[56];
};

// Throughput: ~750 M ops/sec/thread. Restored.

The Linux kernel has ____cacheline_aligned for this. You'll find it throughout any data structure accessed by per-CPU code. The same pattern shows up in Java's @Contended, .NET's StructLayout, Go's manual padding in runtime/sched.go, and Rust's crossbeam_utils::CachePadded. On Apple silicon every one of those needs to be aware of the 128-byte line — a 64-byte pad is half the protection you think it is.

How to spot it in production. perf c2c on Linux records the cache-line address of every coherence miss and shows you the contended lines and the call sites that touched them. A three-line struct with two unrelated atomics in it will show up as the hottest line in the binary.

Cache-friendly access patterns

Hardware does most of the work — prefetchers detect strided patterns and pull lines ahead, the speculative core hides latency by issuing loads early. Two patterns defeat all that engineering: random access into large structures, and pointer chasing. The classic exhibit:

hits 0
misses 0
hit rate
8×8 matrix in row-major memory layout, simulated with a 4-line cache where each line holds 4 cells. Row-major traversal: each line serves 4 hits. Column-major: every access misses because the stride defeats the prefetcher.

Other patterns worth knowing:

  • Array of structs vs struct of arrays. If the hot loop only touches one field, AoS wastes ~80% of every cache line on data the loop ignores. SoA brings the wasted bytes to zero. Particle simulators that update position in one loop and velocity in another are the canonical case.
  • Pointer chasing. A linked list scattered across the heap can't be prefetched — the address of node N+1 is only known after N loads. Throughput drops to one node per L2/L3 latency. Replacing with a flat array is often a 5–20× win.
  • Hot/cold splitting. If a struct has 200 bytes but the hot loop reads only 16, splitting hot/cold doubles density on the hot half. Used in Postgres tuples, RocksDB indices, ECS layouts in game engines.

Cache misses, classified

A taxonomy from Mark Hill's 1989 thesis, still useful: every miss is one of four kinds, and the fix depends on which.

KindCauseFix
Compulsory (cold)First-touch miss for any lineSoftware prefetch hints; warm-up loops
CapacityWorking set exceeds cacheTiling, blocking, smaller types; level up
ConflictHot lines map to same setSpread addresses across sets; raise associativity
CoherenceAnother core invalidated my lineFalse-sharing pad; rethink sharing pattern

On Linux, perf stat -e LLC-load-misses gives you the rough miss count; perf c2c separates the coherence misses; pmu-tools toplev.py gives a top-down breakdown attributing stalls to the right level. On Apple silicon, Instruments' "CPU Counters" template surfaces the same data through the Apple PMU.

Inclusive, exclusive, and the L3

An L3 can hold a copy of every line in any L1 / L2 (inclusive), only the lines that aren't in any L1 / L2 (exclusive), or anything (non-inclusive non-exclusive — Intel's L3 since Skylake-X). The choice has subtle but real consequences:

  • Inclusive L3 — easy snoop filtering. A request from another socket only checks the L3. Cost: a 1 MB L2 effectively shrinks the usable L3 by 1 MB.
  • Exclusive L3 (AMD's design through Zen 1) — full L1+L2+L3 capacity is usable. Cost: every L2-to-L1 promotion has to check L3 too.
  • NINE (Intel post-Skylake-X) — middle ground. L3 holds whatever was last evicted from L2; a snoop maintains a separate snoop filter to know what's in upper levels.

Apple's "system-level cache" isn't really an L3 — it's a unified cache shared between CPU, GPU, Neural Engine, and the display controller, used as a coherence backstop for everything that touches DRAM.

Prefetching

Modern CPUs aggressively prefetch lines into cache before the program asks for them. The hardware watches recent miss addresses, identifies a pattern, and issues loads ahead. There are usually two or three prefetchers in parallel:

  • Next-line. On a miss to line N, prefetch line N+1. Catches sequential reads. Almost always on.
  • Stride. Detects fixed-stride patterns ("read 0, 32, 64, 96, …") and continues. Catches array traversal with non-unit element size.
  • Stream. A more general version of stride that tracks multiple concurrent streams, used at L2/L3.

Prefetching is often the difference between memory bandwidth being the limit and memory latency being the limit. A loop reading 8 GB sequentially can hit ~50 GB/s on DDR5; the same 8 GB read randomly hits ~1 GB/s because each access pays full DRAM latency. Software prefetch hints (__builtin_prefetch, _mm_prefetch) help when the pattern is too irregular for the hardware but the program knows the future — the canonical case is binary-tree traversal where you prefetch both children while processing the parent.

Where caches stop helping

Not every workload is cache-shaped. The cases where the hierarchy actively hurts:

  • Streaming writes. Writing 10 GB sequentially with no plan to re-read pollutes the cache for nothing. Fix: non-temporal stores (movnti, vmovntdq on x86, STNP on ARM) bypass the cache entirely.
  • One-shot huge reads. mmap + a single-pass scan fills the page cache with data you'll never see again. Linux exposes posix_fadvise(POSIX_FADV_NOREUSE) and O_DIRECT.
  • Heavily contended atomics. A counter every thread increments will always be a coherence miss. Architectural fix: per-thread counters reduced occasionally; the cache hierarchy can't help.
  • Workloads that exceed L3. Once your working set is bigger than ~30 MB on x86 or ~24 MB on Apple silicon, the cache hierarchy has given up and you're DRAM-bound. The fix is algorithmic — tile the computation so the inner loop fits in L1 or L2, sweep through data in blocks.

Numbers worth remembering

OperationTime (3 GHz core)Cycles
Register read~0.3 ns1
L1 hit~1 ns3–5
L2 hit~3–5 ns10–16
L3 hit (same socket)~12–15 ns40–50
DRAM (same socket)~80 ns~250
DRAM (remote socket, NUMA)~140 ns~420
Branch mispredict~5 ns~16–20
Atomic CAS (uncontended, in L1)~5 ns~15
Atomic CAS (contended, MESI bouncing)~50–100 ns~150–300
NVMe Gen5 random read (4 KB)~10 µs~30,000

Try it yourself

A live multi-core MESI simulator with a false-sharing scenario built in:

Simulator

CPU cache & MESI coherence

Eight cores, three cache levels, line-granular MESI states. Click an address to issue a load; watch the protocol invalidate, fetch, or hit. False-sharing demo with padded vs unpadded throughput numbers.

Open simulator

Further reading

Found this useful?