08 / 15
Internals / 08

The memory hierarchy

A register read takes 0.3 nanoseconds. An HDD seek takes 5 milliseconds. The same basic operation — "fetch a byte" — spans seven orders of magnitude of latency depending on where the byte lives. The memory hierarchy is the layered stack of storage that bridges this gap. Each layer is faster, smaller, and more expensive than the one below it. Almost every performance question on modern hardware is really a question about this hierarchy.


Why a hierarchy

You can build memory that's fast, you can build memory that's large, and you can build memory that's cheap. You can't build one that's all three. SRAM (the technology used for caches) is fast — under a nanosecond — but each cell uses six transistors, which is why a 36 MB L3 takes ~3 billion transistors. DRAM (used for main memory) packs ~10× more capacity per transistor but takes 80 ns to access. NAND flash (SSDs) packs ~1000× more capacity but takes microseconds. The hierarchy stitches all of these together so the CPU gets fast access most of the time and large capacity when needed.

The physics is unforgiving in a specific way. The speed of a memory cell is set by how few transistors it uses, how close it sits to the core, and how much power you're willing to spend keeping it ready. SRAM wins on speed because its six transistors hold a bit actively, with no refresh and no sensing of a tiny charge, but six transistors per bit is expensive in die area, and die area is the most precious resource on a chip. DRAM trades that away: one transistor and one capacitor per bit, packed densely, but the capacitor leaks and has to be refreshed thousands of times a second, and reading it means amplifying a charge so small it takes tens of nanoseconds to settle. NAND flash goes further still, storing multiple bits per cell in a 3D stack, which is how you get terabytes on a thumb-sized board, but writing means pushing electrons through an insulator, which is slow and wears the cell out. Each step down the stack buys density and cost at the direct expense of latency. No clever engineering escapes the trade; you can only pick where on the curve a given layer sits.

So instead of picking one point, the machine uses all of them at once and lets the access pattern decide which layer answers. The bet the whole design rests on is simple: most accesses hit something you touched recently, or something next to it. When that bet pays off, and on real code it pays off the large majority of the time, the average access lands near the top of the pyramid even though the data physically lives near the bottom. The hierarchy doesn't make any single layer faster. It makes the average fast, which is the only thing that shows up in a wall-clock measurement.

REGISTERSL1 / L2 / L3 CACHEDRAMSSD (NVMe)HDD · NETWORK · TAPE0.3 ns · ~256 B1–14 ns · KB–MB80 ns · GB10 µs · TB5 ms+ · vastfaster · smallercheaper · larger
Up the pyramid: faster, smaller, costlier per byte. Down it: slower, larger, cheaper. The CPU asks the top first and falls through only on a miss.

Ten layers, eight orders of magnitude

LayerTypical sizeLatencyBandwidthCost / GB
Register ~256 B / core 0.3 ns 1000 GB/s
L1 cache 32–192 KB / core 1.0 ns 1000 GB/s
L2 cache 1–16 MB / core 4.0 ns 500 GB/s
L3 cache 16–256 MB 14.0 ns 200 GB/s
DRAM (DDR5) 8–512 GB 80.0 ns 50 GB/s $4/GB
NVMe Gen5 500 GB – 30 TB 10.0 µs 14 GB/s $0.07/GB
SATA SSD 256 GB – 8 TB 100.0 µs 550 MB/s $0.05/GB
HDD 1–24 TB 5.0 ms 200 MB/s $0.01/GB
Datacenter network effectively unlimited 250.0 µs 25 GB/s
Tape (LTO-9) 18 TB / cartridge 60 s 400 MB/s $3/TB
The number that matters: latency from register to HDD spans 1.7 × 10⁷ — seven orders of magnitude. Bandwidth spans 5 × 10³. Cost per GB spans 3 × 10⁵. There is no single technology that wins on all three; the hierarchy makes you not have to pick.

Latency, on a log scale

Drawn linearly, registers and L1 are invisible specks next to HDD and tape. On a log scale, every layer earns its space:

Register0.3 nsL1 cache1.0 nsL2 cache4.0 nsL3 cache14.0 nsDRAM (DDR5)80.0 nsNVMe Gen510.0 µsSATA SSD100.0 µsHDD5.0 msDatacenter network250.0 µsTape (LTO-9)60 s

Latency and bandwidth are different axes

Latency is "how long for one access". Bandwidth is "how much data per second". Modern memory subsystems often have high bandwidth and also high latency — DDR5 sustains ~50 GB/s but each random access pays 80 ns. The two interact via Little's Law:

bandwidth = concurrency / latency

To saturate 50 GB/s of DDR5 with 64-byte cache-line accesses:
  concurrency = 50 GB/s × 80 ns = 4000 bytes in flight
              = 4000 / 64 = 62 outstanding cache-line requests

If the CPU can have ≤16 outstanding loads (typical), it can only saturate
~16 × 64 = 1024 bytes / 80 ns ≈ 12 GB/s — a quarter of peak.

This is why memory-level parallelism matters so much. The bandwidth on the spec sheet is achievable only if you can keep enough requests in flight. Out-of-order execution, software prefetching, and SIMD gather all exist partly to keep the memory pipeline full.

Sequential beats random by 50×

Memory hardware loves predictable, contiguous access. Prefetchers detect sequential patterns and pull lines into cache before the CPU asks. Random access gets none of that — every load pays the full DRAM latency. The throughput delta between the two is roughly 50×. Drag the slider:

64 MB
throughput
50 GB/s
scan time
1.3 ms
vs sequential
A 64 MB sequential scan takes about 1 ms on DDR5. The same 64 MB read at random addresses takes ~80 ms. The ratio is the prefetcher doing its job. Algorithms that preserve sequential access patterns (radix sort, columnar scans, packed arrays) win this gap; algorithms that don't (hash probes, linked-list traversal, B-tree lookups on cold caches) pay for it.

The memory wall

In 1980 a CPU instruction took ~10 ns and a DRAM access took ~250 ns — a 25× gap. In 2026 an instruction takes ~0.3 ns and a DRAM access takes ~80 ns — a 250× gap. Memory got faster; CPUs got faster much faster. This is the memory wall: the divergence between CPU performance and DRAM latency, growing for forty years. Wulf and McKee named it in 1995; it has been the dominant constraint on system design since.

10⁵10⁴10³10² 1980 1985 1990 1995 2000 2005 2010 2015 2020 2024CPU performanceDRAM access rate

The two lines diverge from the start and never reconverge. By 2026 the CPU line is roughly 1000× higher than where it started; the DRAM line is roughly 5× higher. The cumulative gap is the reason every performance technique on modern CPUs — caches, OOO, SIMD, prefetching, speculation — is fundamentally about hiding memory latency.

Locality — temporal and spatial

The hierarchy works because real programs have locality. Two kinds:

  • Temporal locality. An address accessed once is likely to be accessed again soon. Loop variables, function-local stack, hash-table buckets being probed multiple times. Caches keep recently-used lines around so re-access is fast.
  • Spatial locality. Addresses close to a recent access are likely to be accessed soon. Arrays traversed sequentially, struct fields read together. Caches transfer 64-byte lines (instead of single bytes) so neighbouring data comes along for free.
PatternLocalityNotes
Sequential array scanSpatialAdjacent addresses accessed consecutively. Prefetchers turn this into bandwidth-limited.
Hot loop variablesTemporalSame variable touched repeatedly. Stays in L1 for the duration of the loop.
Function call returnTemporalReturn address and stack frame fields touched repeatedly during the call.
Hash table lookupNeitherRandom buckets, single touch each. Bandwidth- and latency-limited.
Linked list traversalNeitherEach node is at a different address; pointer chasing defeats prefetching.
Matrix transpose (column-major)SpatialBad locality on output if rows are stored column-by-column. Tiling helps.

Code with both kinds of locality runs at L1 speeds. Code with neither — pointer chasing through a graph in a hash table that doesn't fit in L3 — runs at DRAM speeds. The 100× gap is real and shows up in benchmarks.

How the hierarchy makes the average fast

The hierarchy works on averages, and the average is dominated by the hit rate, not by how fast any single layer is. The standard way to see this is average memory access time: the time you actually wait, per access, blended across hits and misses. Each layer adds its own access time plus its miss rate times the cost of going one level deeper. Two layers make the idea concrete: if L1 answers in 1 ns and DRAM in 80 ns, then a 95% L1 hit rate gives an average of about 1 + 0.05 × 80 ≈ 5 ns. Push the hit rate to 99% and the average drops to 1 + 0.01 × 80 ≈ 1.8 ns. The same DRAM, the same L1, but a four-point swing in hit rate nearly tripled or halved the effective speed of memory.

That sensitivity is the whole reason cache-aware code matters so much. You are not making memory faster — DRAM is DRAM — you are moving the hit rate, and the average follows it out of all proportion. It also explains why misses are described as tail events that dominate the mean: the rare 80 ns access outweighs dozens of 1 ns hits, so a small fraction of misses can set the wall-clock time. A loop that misses 5% of the time spends most of its memory budget on that 5%. Cutting the miss rate in half is usually a bigger win than any micro-optimisation of the instructions in between.

This is also why bigger working sets degrade so suddenly rather than gradually. As long as the hot data fits in a layer, the hit rate stays high and the average stays low. Grow the working set past the capacity of that layer and the hit rate falls off a cliff, accesses start falling through to the next level down, and the average access time jumps by the full latency ratio between layers. Performance graphs of a workload against its data size show this as a staircase: flat while you fit, a sharp step at each layer boundary, flat again at the new, slower plateau. Knowing roughly where your working set sits relative to L1, L2, L3, and DRAM sizes tells you which step you're standing on and what it would take to climb back up.

The cache line: the real unit of memory

Here is the fact that reorganises how you think about data layout: the CPU never reads one byte from memory. It reads a whole cache line, almost always 64 bytes, aligned to a 64-byte boundary. Ask for a single int and the hardware fetches the 64-byte block it lives in and parks the entire block in cache. The 63 other bytes came along whether you wanted them or not, and they cost nothing extra — the latency is paid per line, not per byte. If your next access falls inside that same line, it's effectively free. If it falls in a different line, you pay the full miss again.

This is the hardware mechanism underneath spatial locality. The prefetcher and the cache don't reward you for touching nearby bytes out of politeness; they reward you because nearby bytes were already dragged in for free. Walk an array of 64-byte structs front to back and you touch each line exactly once, with every byte in it used — peak efficiency. Walk the same data by jumping to random indices and you pull a fresh 64-byte line for every access while using only a handful of its bytes, throwing away most of the bandwidth you paid for.

CPU: load 1 byte at address 0x40bthe byte you asked forhardware fetches the whole 64-byte line containing itDRAM → cache: one 64-byte line, addresses 0x40–0x7F0x40 +4 +8 +12 +16 +20 +24 +28 +32 +36 +40 +44 +48 +52 +56 +60one DRAM round trip (~80 ns) brings in 64 bytes; the next 63 accesses inside this line are nearly free
A one-byte load pulls a full 64-byte line into cache. Spatial locality is just reuse of bytes you already paid to fetch.

Two consequences follow immediately, and both show up in profilers. The first is that data you use together should live together. If a hot loop reads two fields out of a large struct, packing those two fields side by side means one line fetch serves both; scattering them across a 200-byte struct means two line fetches and a lot of wasted transfer. The second is false sharing: if two threads write to two different variables that happen to share one cache line, the line ping-pongs between the cores' caches on every write, and you get a mysterious slowdown that has nothing to do with the logic and everything to do with the 64-byte granularity. The fix is to pad the variables apart so each owns its own line. The full mechanics of how lines are tagged, indexed, and evicted live on the caches page; here the point is just that the line, not the byte, is the currency.

What this does to your data structures

The textbook teaches data structures by their big-O cost in comparisons or pointer follows. The machine charges you in cache misses, and the two bills can disagree by a wide margin. A structure with a worse asymptotic count can win handily if its memory access pattern is friendlier, because each miss it avoids is worth roughly a hundred comparisons it has to do.

The cleanest example is the array versus the linked list. Both can hold a sequence; both let you iterate it. But an array stores its elements in one contiguous block, so a forward scan touches consecutive cache lines, the prefetcher sees the pattern and runs ahead, and every byte fetched is used. A linked list stores each node wherever the allocator happened to put it, so traversal is pointer chasing: load a node, read its next pointer, jump to an unrelated address, stall on the miss, repeat. There is no pattern for the prefetcher to find, so each hop pays close to the full DRAM latency. On a list that doesn't fit in cache, iteration can run 10–50× slower than the array holding the same data, even though both are O(n). This is why, in practice, an array list outperforms a linked list for almost every workload that touches the elements, and why the linked list's textbook advantage — O(1) insertion in the middle — rarely pays for the traversal tax it imposes.

The same logic reshapes the rest of the toolkit. A B-tree beats a binary search tree on disk and in memory for the same reason: it packs many keys into one node, so one line fetch makes progress on many comparisons instead of one. Open-addressing hash tables (probing within a contiguous array) beat chained hash tables (a linked list per bucket) because the probe sequence stays inside cache lines you've already pulled in. Flat, packed representations beat pointer-rich ones almost everywhere the access pattern is a scan. Where the elements come from also matters: an allocator that hands back contiguous, well-aligned blocks preserves these patterns, which is one reason memory allocation strategy and cache behaviour are tangled together — fragmented allocations scatter your data across lines and quietly erase the locality you were counting on.

The most direct lever is hot/cold splitting. Suppose each record has a few fields read on every pass (an id, a status flag, a key) and many fields read rarely (timestamps, audit data, free-text notes). If you store them together, every hot scan drags the cold fields through cache line by line, wasting most of each fetch. Split the record into a hot part and a cold part, store the hot parts in their own packed array, and a scan now touches only hot bytes — several records per line instead of a fraction of one. This is the core idea behind columnar storage and struct-of-arrays layouts: group by access frequency and access pattern, not by what feels logically tidy. The same restructuring that makes a database analytics engine fast makes a hot loop in application code fast, for exactly the same reason.

When you want to know whether a given loop is limited by this traffic or by the arithmetic on top of it, the roofline model is the tool: it plots arithmetic intensity against achieved performance and tells you at a glance whether you should be cutting memory traffic or cutting instructions. For most data-structure-heavy code the answer is memory, which is why layout work pays off so reliably.

Roofline — when bandwidth bounds you, when compute does

The roofline model (Williams, Waterman, Patterson, 2009) maps any kernel to a 2D plot: arithmetic intensity (FLOPs per byte loaded) on the x-axis, achieved performance (GFLOPs/s) on the y-axis. The "roof" has two slopes: a rising slope from the memory bandwidth ceiling (low intensity = memory-bound), and a flat ceiling at peak compute (high intensity = compute- bound). Where your kernel sits tells you what to optimise.

Examples on a 50 GB/s DDR5 + 1 TFLOP CPU:

Vector add (a + b → c):    1 FLOP / 12 bytes  →  4.2 GFLOPs ceiling — bandwidth-bound
Matrix-vector multiply:    2 FLOP / 12 bytes  →  8.3 GFLOPs ceiling — bandwidth-bound
Dense matrix multiply:    ~50 FLOP / byte    → 1000 GFLOPs ceiling — compute-bound

Bandwidth-bound kernels: optimise data layout, fuse loops, prefetch, reduce loads.
Compute-bound kernels: optimise vector instructions, unroll, use FMA.

Most "scientific computing" workloads are bandwidth-bound. Most "AI inference" workloads are mixed (matrix multiply is compute-bound on the inner kernel but the overall pipeline is often bandwidth-bound). The roofline plot is the single most useful first-pass analysis for any performance-sensitive workload.

Persistent memory and storage-class memory

Two technologies tried to fill the gap between DRAM and SSD in the 2010s:

  • Intel Optane. 3D XPoint media. Latency ~300 ns (4× DRAM, 30× faster than NVMe). Persistent. Sold both as DIMMs (memory-mapped) and as block devices. Discontinued 2022 — couldn't justify the price-per-GB given DRAM and NAND both got cheaper faster.
  • NVDIMM-N / NVDIMM-P. DRAM with battery backup; on power loss, contents copy to onboard flash. Used in some database servers (Aerospike, Oracle Exadata) where commit-on-write needed to be cheap. Mostly displaced by software solutions on commodity DRAM + NVMe.

The conclusion in 2026: there's no broadly-used "between DRAM and SSD" tier. The gap is filled by aggressive caching of NVMe in DRAM, by RDMA across clusters (remote DRAM at ~5 µs is faster than local NVMe at ~10 µs), and by smart software (mmap with hot-set pinning, RocksDB block caches, etc.).

Norvig's table, modernized

Peter Norvig's "Latency Numbers Every Programmer Should Know" became canonical in the early 2010s. The format is so useful that updating it for 2026 silicon is its own exercise. The numbers below assume a 3 GHz x86 with DDR5, NVMe Gen5, and a same-region 25 Gbps datacenter fabric.

OperationTimeCycles
L1 cache reference1 ns3
Branch mispredict5 ns15
L2 cache reference4 ns12
Mutex lock / unlock (uncontended)17 ns50
L3 cache reference14 ns42
Main memory reference80 ns240
Compress 1 KB with zstd1 µs3,000
Send 1 KB over 25 Gbps network0.3 µs900
Read 1 MB sequentially from DRAM20 µs60,000
NVMe Gen5 random read (4 KB)10 µs30,000
Datacenter network round trip250 µs750,000
Read 1 MB sequentially from NVMe70 µs210,000
HDD seek5 ms15,000,000
Read 1 MB sequentially from HDD10 ms30,000,000
Send packet from US to EU80 ms240,000,000
Send packet from US to AU180 ms540,000,000

Common misconceptions

  • "DRAM is fast." Compared to a hard drive, sure. Compared to L1, it's 80× slower. Most performance work is about avoiding DRAM accesses, not about whether DRAM is "fast enough".
  • "More RAM means faster." Only if your working set was already exceeding it. Once your hot data fits, doubling RAM does almost nothing for performance — you'd need to grow the L3 to see further gains.
  • "SSDs replaced DRAM." They're three orders of magnitude apart in latency. SSDs replaced disk for hot persistent data; DRAM remains the only thing fast enough to feed the CPU.
  • "NVMe is fast enough that you don't need to cache." Sequential bandwidth is huge — Gen5 hits 14 GB/s. Random latency is still 10 µs (~30,000 cycles). A query that does 1000 random NVMe reads takes 10 ms; the same query against an in-memory cache takes 80 µs.

Numbers worth remembering

QuantityValue
L1 cache size, mainstream32–192 KB / core
L1 latency~1 ns (3 cycles)
DRAM (DDR5) latency~80 ns
DRAM bandwidth, single channel DDR5~38 GB/s
NVMe Gen5 sequential read~14 GB/s
NVMe Gen5 random read latency~10 µs
Datacenter RTT, same region~250 µs
Intercontinental RTT~150 ms
Memory-wall growth rateCPU 2× / 18 mo, DRAM ~7%/yr
Cost ratio, register : DRAM : SSD : HDD : tape—:1:0.02:0.003:0.0008

Further reading

Found this useful?