04 · 37 steps
Visualize / 04

How memory actually works.

Same array. Same sum. Two access patterns — random and sequential. Same work asymptotically, but watch the cycle counter. Sequential ends up roughly 30× faster, because the cache loads memory in 64-byte lines: one miss pulls in four ints, and the next three accesses hit for free. Random access keeps falling outside whatever line is currently cached. This is why "iterate in order" is the highest-use performance fix in any language.


step 1 / 37 · idle
program sum = 0; for (i in order) sum += arr[i]; array of 16 ints · cache line = 4 ints · 1 miss pulls 4 in
random 0c
sequential 0c
CPUrequests arr[i]idleRUNNING SUM0ACCESSES0 / 16L1 CACHE · 4 lines · 64 B each~3 cycles on hit · empty starts black line 0 · idx 0–3○ empty arr[0]7 arr[1]14 arr[2]21 arr[3]28 line 1 · idx 4–7○ empty arr[4]35 arr[5]42 arr[6]49 arr[7]56 line 2 · idx 8–11○ empty arr[8]63 arr[9]70 arr[10]77 arr[11]84 line 3 · idx 12–15○ empty arr[12]91 arr[13]98 arr[14]105 arr[15]112RAM · 16 ints~200 cycles per cache lineline 0 arr[0]7 arr[1]14 arr[2]21 arr[3]28line 1 arr[4]35 arr[5]42 arr[6]49 arr[7]56line 2 arr[8]63 arr[9]70 arr[10]77 arr[11]84line 3 arr[12]91 arr[13]98 arr[14]105 arr[15]112CPUL1 CACHERAM

A 16-element int array. We'll sum every element — same work, twice — first in random order, then sequential. Watch what the cache does differently.

Cache line
A block of memory the cache loads as a unit. Always 64 bytes on x86 — that's 16 ints, or 8 doubles, or 4 of our 16-byte cells.
L1 cache
The CPU's innermost data cache. ~64 KB. ~3 cycles to hit.
Cache miss
The data you asked for isn't in this cache. Pay the latency to fetch it from the next level down — here, RAM at ~200 cycles.

The math, on a napkin

Random pass: every access falls in whichever cache line happens to be there. Early accesses miss because nothing's cached; later accesses hit because by then all four lines are loaded. With this particular order, the cost is ~830 cycles for 16 accesses — average ~52 per access.

Sequential pass: the first access to each line misses; the next three hit (because the whole line came in together). That's 4 misses + 12 hits = 4×200 + 12×3 = 836 cycles? No — wait, sequential is faster than that. Read the running counter again: ~109 cycles. The model in the visualization assumes a perfect 200c on miss; in real hardware, the prefetcher recognises the sequential pattern and starts loading line N+1 while you're still reading line N, so most misses cost ~5 cycles instead of 200. That's the second mechanism — hardware prefetching — and it's what makes sequential access cheap in practice.

Why this matters

This is the single most common reason "the same algorithm runs 10× faster in C than in Python" or "rewriting the loop made it 30× faster." Languages and libraries that lay data out contiguously and iterate in order get the cache for free. Object graphs in Java or Python that scatter related fields across the heap pay the random-access cost on every traversal.

Concrete examples: NumPy arrays vs Python lists, struct-of-arrays vs array-of-structs, row-major vs column-major matrix iteration, linked lists vs vectors. The data layout — not the algorithm — decides whether the cache helps.

What this visualization simplifies

  • One cache level. Real CPUs have L1, L2, L3. We collapse all "not in cache" to RAM at 200 cycles.
  • No prefetching. The hardware prefetcher recognises sequential patterns and pre-loads. We don't model it; the real sequential cost would be closer to L1 latency on most misses.
  • Direct-mapped, capacity 4 lines. Real L1 is set-associative with hundreds of lines.
  • One thread. Multi-core caches have a coherence protocol (MESI) that adds wrinkles to writes.
  • 4 ints per line. Real cache lines hold 16 ints (64 B / 4 B = 16). The 4-per-line scaling makes the visualization readable.
Go deeper

Caches — the full chapter →

Associativity, write policies, MESI coherence, prefetching, the cache-friendly data layout patterns that buy you 10×.

Open the Codex
Found this useful?