The memory hierarchy
A register read takes 0.3 nanoseconds. An HDD seek takes 5 milliseconds. The same basic operation — "fetch a byte" — spans seven orders of magnitude of latency depending on where the byte lives. The memory hierarchy is the layered stack of storage that bridges this gap. Each layer is faster, smaller, and more expensive than the one below it. Almost every performance question on modern hardware is really a question about this hierarchy.
Why a hierarchy
You can build memory that's fast, you can build memory that's large, and you can build memory that's cheap. You can't build one that's all three. SRAM (the technology used for caches) is fast — under a nanosecond — but each cell uses six transistors, which is why a 36 MB L3 takes ~3 billion transistors. DRAM (used for main memory) packs ~10× more capacity per transistor but takes 80 ns to access. NAND flash (SSDs) packs ~1000× more capacity but takes microseconds. The hierarchy stitches all of these together so the CPU gets fast access most of the time and large capacity when needed.
The physics is unforgiving in a specific way. The speed of a memory cell is set by how few transistors it uses, how close it sits to the core, and how much power you're willing to spend keeping it ready. SRAM wins on speed because its six transistors hold a bit actively, with no refresh and no sensing of a tiny charge, but six transistors per bit is expensive in die area, and die area is the most precious resource on a chip. DRAM trades that away: one transistor and one capacitor per bit, packed densely, but the capacitor leaks and has to be refreshed thousands of times a second, and reading it means amplifying a charge so small it takes tens of nanoseconds to settle. NAND flash goes further still, storing multiple bits per cell in a 3D stack, which is how you get terabytes on a thumb-sized board, but writing means pushing electrons through an insulator, which is slow and wears the cell out. Each step down the stack buys density and cost at the direct expense of latency. No clever engineering escapes the trade; you can only pick where on the curve a given layer sits.
So instead of picking one point, the machine uses all of them at once and lets the access pattern decide which layer answers. The bet the whole design rests on is simple: most accesses hit something you touched recently, or something next to it. When that bet pays off, and on real code it pays off the large majority of the time, the average access lands near the top of the pyramid even though the data physically lives near the bottom. The hierarchy doesn't make any single layer faster. It makes the average fast, which is the only thing that shows up in a wall-clock measurement.
Ten layers, eight orders of magnitude
| Layer | Typical size | Latency | Bandwidth | Cost / GB |
|---|---|---|---|---|
| Register | ~256 B / core | 0.3 ns | 1000 GB/s | — |
| L1 cache | 32–192 KB / core | 1.0 ns | 1000 GB/s | — |
| L2 cache | 1–16 MB / core | 4.0 ns | 500 GB/s | — |
| L3 cache | 16–256 MB | 14.0 ns | 200 GB/s | — |
| DRAM (DDR5) | 8–512 GB | 80.0 ns | 50 GB/s | $4/GB |
| NVMe Gen5 | 500 GB – 30 TB | 10.0 µs | 14 GB/s | $0.07/GB |
| SATA SSD | 256 GB – 8 TB | 100.0 µs | 550 MB/s | $0.05/GB |
| HDD | 1–24 TB | 5.0 ms | 200 MB/s | $0.01/GB |
| Datacenter network | effectively unlimited | 250.0 µs | 25 GB/s | — |
| Tape (LTO-9) | 18 TB / cartridge | 60 s | 400 MB/s | $3/TB |
Latency, on a log scale
Drawn linearly, registers and L1 are invisible specks next to HDD and tape. On a log scale, every layer earns its space:
Latency and bandwidth are different axes
Latency is "how long for one access". Bandwidth is "how much data per second". Modern memory subsystems often have high bandwidth and also high latency — DDR5 sustains ~50 GB/s but each random access pays 80 ns. The two interact via Little's Law:
bandwidth = concurrency / latency
To saturate 50 GB/s of DDR5 with 64-byte cache-line accesses:
concurrency = 50 GB/s × 80 ns = 4000 bytes in flight
= 4000 / 64 = 62 outstanding cache-line requests
If the CPU can have ≤16 outstanding loads (typical), it can only saturate
~16 × 64 = 1024 bytes / 80 ns ≈ 12 GB/s — a quarter of peak.This is why memory-level parallelism matters so much. The bandwidth on the spec sheet is achievable only if you can keep enough requests in flight. Out-of-order execution, software prefetching, and SIMD gather all exist partly to keep the memory pipeline full.
Sequential beats random by 50×
Memory hardware loves predictable, contiguous access. Prefetchers detect sequential patterns and pull lines into cache before the CPU asks. Random access gets none of that — every load pays the full DRAM latency. The throughput delta between the two is roughly 50×. Drag the slider:
The memory wall
In 1980 a CPU instruction took ~10 ns and a DRAM access took ~250 ns — a 25× gap. In 2026 an instruction takes ~0.3 ns and a DRAM access takes ~80 ns — a 250× gap. Memory got faster; CPUs got faster much faster. This is the memory wall: the divergence between CPU performance and DRAM latency, growing for forty years. Wulf and McKee named it in 1995; it has been the dominant constraint on system design since.
The two lines diverge from the start and never reconverge. By 2026 the CPU line is roughly 1000× higher than where it started; the DRAM line is roughly 5× higher. The cumulative gap is the reason every performance technique on modern CPUs — caches, OOO, SIMD, prefetching, speculation — is fundamentally about hiding memory latency.
Locality — temporal and spatial
The hierarchy works because real programs have locality. Two kinds:
- Temporal locality. An address accessed once is likely to be accessed again soon. Loop variables, function-local stack, hash-table buckets being probed multiple times. Caches keep recently-used lines around so re-access is fast.
- Spatial locality. Addresses close to a recent access are likely to be accessed soon. Arrays traversed sequentially, struct fields read together. Caches transfer 64-byte lines (instead of single bytes) so neighbouring data comes along for free.
| Pattern | Locality | Notes |
|---|---|---|
| Sequential array scan | Spatial | Adjacent addresses accessed consecutively. Prefetchers turn this into bandwidth-limited. |
| Hot loop variables | Temporal | Same variable touched repeatedly. Stays in L1 for the duration of the loop. |
| Function call return | Temporal | Return address and stack frame fields touched repeatedly during the call. |
| Hash table lookup | Neither | Random buckets, single touch each. Bandwidth- and latency-limited. |
| Linked list traversal | Neither | Each node is at a different address; pointer chasing defeats prefetching. |
| Matrix transpose (column-major) | Spatial | Bad locality on output if rows are stored column-by-column. Tiling helps. |
Code with both kinds of locality runs at L1 speeds. Code with neither — pointer chasing through a graph in a hash table that doesn't fit in L3 — runs at DRAM speeds. The 100× gap is real and shows up in benchmarks.
How the hierarchy makes the average fast
The hierarchy works on averages, and the average is dominated by the hit rate, not
by how fast any single layer is. The standard way to see this is average memory
access time: the time you actually wait, per access, blended across hits and misses.
Each layer adds its own access time plus its miss rate times the cost of going one
level deeper. Two layers make the idea concrete: if L1 answers in 1 ns and DRAM in
80 ns, then a 95% L1 hit rate gives an average of about
1 + 0.05 × 80 ≈ 5 ns. Push the hit rate to 99% and the average drops to
1 + 0.01 × 80 ≈ 1.8 ns. The same DRAM, the same L1, but a four-point
swing in hit rate nearly tripled or halved the effective speed of memory.
That sensitivity is the whole reason cache-aware code matters so much. You are not making memory faster — DRAM is DRAM — you are moving the hit rate, and the average follows it out of all proportion. It also explains why misses are described as tail events that dominate the mean: the rare 80 ns access outweighs dozens of 1 ns hits, so a small fraction of misses can set the wall-clock time. A loop that misses 5% of the time spends most of its memory budget on that 5%. Cutting the miss rate in half is usually a bigger win than any micro-optimisation of the instructions in between.
This is also why bigger working sets degrade so suddenly rather than gradually. As long as the hot data fits in a layer, the hit rate stays high and the average stays low. Grow the working set past the capacity of that layer and the hit rate falls off a cliff, accesses start falling through to the next level down, and the average access time jumps by the full latency ratio between layers. Performance graphs of a workload against its data size show this as a staircase: flat while you fit, a sharp step at each layer boundary, flat again at the new, slower plateau. Knowing roughly where your working set sits relative to L1, L2, L3, and DRAM sizes tells you which step you're standing on and what it would take to climb back up.
The cache line: the real unit of memory
Here is the fact that reorganises how you think about data layout: the CPU never
reads one byte from memory. It reads a whole cache line, almost
always 64 bytes, aligned to a 64-byte boundary. Ask for a single int
and the hardware fetches the 64-byte block it lives in and parks the entire block
in cache. The 63 other bytes came along whether you wanted them or not, and they
cost nothing extra — the latency is paid per line, not per byte. If your next
access falls inside that same line, it's effectively free. If it falls in a
different line, you pay the full miss again.
This is the hardware mechanism underneath spatial locality. The prefetcher and the cache don't reward you for touching nearby bytes out of politeness; they reward you because nearby bytes were already dragged in for free. Walk an array of 64-byte structs front to back and you touch each line exactly once, with every byte in it used — peak efficiency. Walk the same data by jumping to random indices and you pull a fresh 64-byte line for every access while using only a handful of its bytes, throwing away most of the bandwidth you paid for.
Two consequences follow immediately, and both show up in profilers. The first is that data you use together should live together. If a hot loop reads two fields out of a large struct, packing those two fields side by side means one line fetch serves both; scattering them across a 200-byte struct means two line fetches and a lot of wasted transfer. The second is false sharing: if two threads write to two different variables that happen to share one cache line, the line ping-pongs between the cores' caches on every write, and you get a mysterious slowdown that has nothing to do with the logic and everything to do with the 64-byte granularity. The fix is to pad the variables apart so each owns its own line. The full mechanics of how lines are tagged, indexed, and evicted live on the caches page; here the point is just that the line, not the byte, is the currency.
What this does to your data structures
The textbook teaches data structures by their big-O cost in comparisons or pointer follows. The machine charges you in cache misses, and the two bills can disagree by a wide margin. A structure with a worse asymptotic count can win handily if its memory access pattern is friendlier, because each miss it avoids is worth roughly a hundred comparisons it has to do.
The cleanest example is the array versus the linked list. Both can hold a sequence;
both let you iterate it. But an array stores its elements in one contiguous block,
so a forward scan touches consecutive cache lines, the prefetcher sees the pattern
and runs ahead, and every byte fetched is used. A linked list stores each node
wherever the allocator happened to put it, so traversal is pointer chasing: load a
node, read its next pointer, jump to an unrelated address, stall on the
miss, repeat. There is no pattern for the prefetcher to find, so each hop pays close
to the full DRAM latency. On a list that doesn't fit in cache, iteration can run
10–50× slower than the array holding the same data, even though both are O(n). This
is why, in practice, an array list outperforms a linked list for almost every
workload that touches the elements, and why the linked list's textbook advantage —
O(1) insertion in the middle — rarely pays for the traversal tax it imposes.
The same logic reshapes the rest of the toolkit. A B-tree beats a binary search tree on disk and in memory for the same reason: it packs many keys into one node, so one line fetch makes progress on many comparisons instead of one. Open-addressing hash tables (probing within a contiguous array) beat chained hash tables (a linked list per bucket) because the probe sequence stays inside cache lines you've already pulled in. Flat, packed representations beat pointer-rich ones almost everywhere the access pattern is a scan. Where the elements come from also matters: an allocator that hands back contiguous, well-aligned blocks preserves these patterns, which is one reason memory allocation strategy and cache behaviour are tangled together — fragmented allocations scatter your data across lines and quietly erase the locality you were counting on.
The most direct lever is hot/cold splitting. Suppose each record has a few fields read on every pass (an id, a status flag, a key) and many fields read rarely (timestamps, audit data, free-text notes). If you store them together, every hot scan drags the cold fields through cache line by line, wasting most of each fetch. Split the record into a hot part and a cold part, store the hot parts in their own packed array, and a scan now touches only hot bytes — several records per line instead of a fraction of one. This is the core idea behind columnar storage and struct-of-arrays layouts: group by access frequency and access pattern, not by what feels logically tidy. The same restructuring that makes a database analytics engine fast makes a hot loop in application code fast, for exactly the same reason.
When you want to know whether a given loop is limited by this traffic or by the arithmetic on top of it, the roofline model is the tool: it plots arithmetic intensity against achieved performance and tells you at a glance whether you should be cutting memory traffic or cutting instructions. For most data-structure-heavy code the answer is memory, which is why layout work pays off so reliably.
Roofline — when bandwidth bounds you, when compute does
The roofline model (Williams, Waterman, Patterson, 2009) maps any kernel to a 2D plot: arithmetic intensity (FLOPs per byte loaded) on the x-axis, achieved performance (GFLOPs/s) on the y-axis. The "roof" has two slopes: a rising slope from the memory bandwidth ceiling (low intensity = memory-bound), and a flat ceiling at peak compute (high intensity = compute- bound). Where your kernel sits tells you what to optimise.
Examples on a 50 GB/s DDR5 + 1 TFLOP CPU:
Vector add (a + b → c): 1 FLOP / 12 bytes → 4.2 GFLOPs ceiling — bandwidth-bound
Matrix-vector multiply: 2 FLOP / 12 bytes → 8.3 GFLOPs ceiling — bandwidth-bound
Dense matrix multiply: ~50 FLOP / byte → 1000 GFLOPs ceiling — compute-bound
Bandwidth-bound kernels: optimise data layout, fuse loops, prefetch, reduce loads.
Compute-bound kernels: optimise vector instructions, unroll, use FMA.Most "scientific computing" workloads are bandwidth-bound. Most "AI inference" workloads are mixed (matrix multiply is compute-bound on the inner kernel but the overall pipeline is often bandwidth-bound). The roofline plot is the single most useful first-pass analysis for any performance-sensitive workload.
Persistent memory and storage-class memory
Two technologies tried to fill the gap between DRAM and SSD in the 2010s:
- Intel Optane. 3D XPoint media. Latency ~300 ns (4× DRAM, 30× faster than NVMe). Persistent. Sold both as DIMMs (memory-mapped) and as block devices. Discontinued 2022 — couldn't justify the price-per-GB given DRAM and NAND both got cheaper faster.
- NVDIMM-N / NVDIMM-P. DRAM with battery backup; on power loss, contents copy to onboard flash. Used in some database servers (Aerospike, Oracle Exadata) where commit-on-write needed to be cheap. Mostly displaced by software solutions on commodity DRAM + NVMe.
The conclusion in 2026: there's no broadly-used "between DRAM and SSD" tier. The gap is filled by aggressive caching of NVMe in DRAM, by RDMA across clusters (remote DRAM at ~5 µs is faster than local NVMe at ~10 µs), and by smart software (mmap with hot-set pinning, RocksDB block caches, etc.).
Norvig's table, modernized
Peter Norvig's "Latency Numbers Every Programmer Should Know" became canonical in the early 2010s. The format is so useful that updating it for 2026 silicon is its own exercise. The numbers below assume a 3 GHz x86 with DDR5, NVMe Gen5, and a same-region 25 Gbps datacenter fabric.
| Operation | Time | Cycles |
|---|---|---|
| L1 cache reference | 1 ns | 3 |
| Branch mispredict | 5 ns | 15 |
| L2 cache reference | 4 ns | 12 |
| Mutex lock / unlock (uncontended) | 17 ns | 50 |
| L3 cache reference | 14 ns | 42 |
| Main memory reference | 80 ns | 240 |
| Compress 1 KB with zstd | 1 µs | 3,000 |
| Send 1 KB over 25 Gbps network | 0.3 µs | 900 |
| Read 1 MB sequentially from DRAM | 20 µs | 60,000 |
| NVMe Gen5 random read (4 KB) | 10 µs | 30,000 |
| Datacenter network round trip | 250 µs | 750,000 |
| Read 1 MB sequentially from NVMe | 70 µs | 210,000 |
| HDD seek | 5 ms | 15,000,000 |
| Read 1 MB sequentially from HDD | 10 ms | 30,000,000 |
| Send packet from US to EU | 80 ms | 240,000,000 |
| Send packet from US to AU | 180 ms | 540,000,000 |
Common misconceptions
- "DRAM is fast." Compared to a hard drive, sure. Compared to L1, it's 80× slower. Most performance work is about avoiding DRAM accesses, not about whether DRAM is "fast enough".
- "More RAM means faster." Only if your working set was already exceeding it. Once your hot data fits, doubling RAM does almost nothing for performance — you'd need to grow the L3 to see further gains.
- "SSDs replaced DRAM." They're three orders of magnitude apart in latency. SSDs replaced disk for hot persistent data; DRAM remains the only thing fast enough to feed the CPU.
- "NVMe is fast enough that you don't need to cache." Sequential bandwidth is huge — Gen5 hits 14 GB/s. Random latency is still 10 µs (~30,000 cycles). A query that does 1000 random NVMe reads takes 10 ms; the same query against an in-memory cache takes 80 µs.
Numbers worth remembering
| Quantity | Value |
|---|---|
| L1 cache size, mainstream | 32–192 KB / core |
| L1 latency | ~1 ns (3 cycles) |
| DRAM (DDR5) latency | ~80 ns |
| DRAM bandwidth, single channel DDR5 | ~38 GB/s |
| NVMe Gen5 sequential read | ~14 GB/s |
| NVMe Gen5 random read latency | ~10 µs |
| Datacenter RTT, same region | ~250 µs |
| Intercontinental RTT | ~150 ms |
| Memory-wall growth rate | CPU 2× / 18 mo, DRAM ~7%/yr |
| Cost ratio, register : DRAM : SSD : HDD : tape | —:1:0.02:0.003:0.0008 |
Further reading
- Peter Norvig — Latency Numbers Every Programmer Should Know — the original list. Most numbers in the table above descend from this.
- Drepper — What Every Programmer Should Know About Memory — 100-page deep dive on the entire memory system.
- Williams, Waterman, Patterson — Roofline: An Insightful Visual Performance Model — the original 2009 paper.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Chapter 2 covers memory hierarchy in depth, including bandwidth-vs-latency analysis and cache organisation.
- Wikipedia — Memory hierarchy — fast lookup for layer specs.
- Chips and Cheese — measured DRAM latency and bandwidth across recent chips.
- Wikipedia — Locality of reference — the principle that makes the hierarchy work.