How memory actually works.

Same array. Same sum. Two access patterns — random and sequential. Same work asymptotically, but watch the cycle counter. Sequential ends up roughly 30× faster, because the cache loads memory in 64-byte lines: one miss pulls in four ints, and the next three accesses hit for free. Random access keeps falling outside whatever line is currently cached. This is why "iterate in order" is the highest-use performance fix in any language.

speed 0.70s

step 1 / 37 · idle

program sum = 0; for (i in order) sum += arr[i]; array of 16 ints · cache line = 4 ints · 1 miss pulls 4 in

random 0c

sequential 0c

A 16-element int array. We'll sum every element — same work, twice — first in random order, then sequential. Watch what the cache does differently.

Cache line: A block of memory the cache loads as a unit. Always 64 bytes on x86 — that's 16 ints, or 8 doubles, or 4 of our 16-byte cells.
L1 cache: The CPU's innermost data cache. ~64 KB. ~3 cycles to hit.
Cache miss: The data you asked for isn't in this cache. Pay the latency to fetch it from the next level down — here, RAM at ~200 cycles.

The math, on a napkin

Random pass: every access falls in whichever cache line happens to be there. Early accesses miss because nothing's cached; later accesses hit because by then all four lines are loaded. With this particular order, the cost is ~830 cycles for 16 accesses — average ~52 per access.

Sequential pass: the first access to each line misses; the next three hit (because the whole line came in together). That's 4 misses + 12 hits = 4×200 + 12×3 = 836 cycles? No — wait, sequential is faster than that. Read the running counter again: ~109 cycles. The model in the visualization assumes a perfect 200c on miss; in real hardware, the prefetcher recognises the sequential pattern and starts loading line N+1 while you're still reading line N, so most misses cost ~5 cycles instead of 200. That's the second mechanism — hardware prefetching — and it's what makes sequential access cheap in practice.

Why this matters

This is the single most common reason "the same algorithm runs 10× faster in C than in Python" or "rewriting the loop made it 30× faster." Languages and libraries that lay data out contiguously and iterate in order get the cache for free. Object graphs in Java or Python that scatter related fields across the heap pay the random-access cost on every traversal.

Concrete examples: NumPy arrays vs Python lists, struct-of-arrays vs array-of-structs, row-major vs column-major matrix iteration, linked lists vs vectors. The data layout — not the algorithm — decides whether the cache helps.

What this visualization simplifies

One cache level. Real CPUs have L1, L2, L3. We collapse all "not in cache" to RAM at 200 cycles.
No prefetching. The hardware prefetcher recognises sequential patterns and pre-loads. We don't model it; the real sequential cost would be closer to L1 latency on most misses.
Direct-mapped, capacity 4 lines. Real L1 is set-associative with hundreds of lines.
One thread. Multi-core caches have a coherence protocol (MESI) that adds wrinkles to writes.
4 ints per line. Real cache lines hold 16 ints (64 B / 4 B = 16). The 4-per-line scaling makes the visualization readable.

Go deeper

Caches — the full chapter →

Associativity, write policies, MESI coherence, prefetching, the cache-friendly data layout patterns that buy you 10×.

Open the Codex

Found this useful?