Computer architecture
Modern CPUs are time-machines. They fetch dozens of instructions ahead, execute them out of order, and quietly speculate past branches along the predicted path. The memory hierarchy that feeds them spans eight orders of magnitude of latency. Understanding this layer is the difference between code that runs at 3 GHz and code that effectively runs at 30 MHz. This study path is the bare-metal foundation underneath everything else on the site: operating systems, languages, databases and networking are all abstractions over the silicon described here.
Twelve mental models
The set of intuitions that, once internalised, make the rest of computer architecture fall into place. Tagged Day-zero (start here), Practitioner (you write code that lives or dies on this), or Operator (you debug production performance regressions with this).
- 01 · Day-zero
The four parts of a computer
CPU, memory, storage, I/O. The split that has held for fifty years and the three places where it leaks: DMA, MMIO, persistent memory.
- 02 · Day-zero
Transistors → gates → ALU
The abstraction stack. NMOS to NAND to flip-flop to adder to ALU. Apple M3 Max ships ~92 billion transistors.
- 03 · Day-zero
The clock as the heartbeat
Synchronous logic. Why frequency stalled near 5 GHz (dynamic power grows roughly with frequency³ once voltage has to rise with clock speed) and what we did instead: cores, vectors, accelerators.
- 04 · Practitioner
ISA versus microarchitecture
Same x86 ISA, vastly different chips. Why ARM-on-laptop in 2026 changed nothing for most developers and everything for some.
- 05 · Day-zero
The instruction cycle
Fetch, decode, execute, memory access, write-back. The five-stage textbook pipeline and why your laptop's pipeline is closer to 14–20 stages.
- 06 · Practitioner
Pipelining is throughput, not latency
Each instruction takes longer in a pipelined CPU. A billion of them are much faster.
- 07 · Practitioner
Speculation and branch prediction
A 95%-accurate predictor on a 20-stage pipeline still gives you a usable machine. Spectre is what happens when speculation crosses security boundaries.
- 08 · Practitioner
Out-of-order execution
The reorder buffer is a tiny database transaction inside your CPU. ~512 entries on Apple M4, ~448 on Intel Raptor Lake.
- 09 · Practitioner
SIMD and vector throughput
One AVX-512 instruction can do what 8–64 scalar instructions used to. The hidden cost on older Skylake-X: sustained AVX-512 use triggers a frequency license that lowers the clock.
- 10 · Operator
The memory hierarchy is eight orders of magnitude
Register ~0.3 ns to HDD seek ~5 ms. Norvig's table updated for 2026.
- 11 · Operator
Cache coherence is per-line, not per-variable
Two threads writing different fields of the same 64-byte line can be 100× slower. False sharing.
- 12 · Operator
Every load is two loads
The TLB is the most precious silicon you've never heard of. Huge pages take the second load away.
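The huge-page claim in model 12 can be made concrete with a back-of-envelope "TLB reach" calculation. This is a minimal sketch, not a model of any real core: the 1536-entry TLB size is a hypothetical figure chosen for illustration, and the code assumes the same entries can hold either page size, which real TLBs often do not.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(entries, page_bytes):
    """Memory that can be addressed without a TLB miss, if every entry is in use."""
    return entries * page_bytes


def pages_needed(working_set_bytes, page_bytes):
    """Number of pages (and so translations) needed to cover a working set."""
    return -(-working_set_bytes // page_bytes)  # ceiling division


if __name__ == "__main__":
    ENTRIES = 1536  # hypothetical TLB size, for illustration only
    for name, page in (("4 KiB pages", 4 * KIB), ("2 MiB pages", 2 * MIB)):
        reach_mib = tlb_reach_bytes(ENTRIES, page) / MIB
        translations = pages_needed(1 * GIB, page)
        print(f"{name}: reach {reach_mib:.0f} MiB, "
              f"{translations} translations for a 1 GiB working set")
```

Under these assumptions a 1 GiB working set needs far more translations than the TLB holds with 4 KiB pages and fits comfortably with 2 MiB pages, which is the sense in which huge pages remove most of the "second load".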
Featured starting points
Six sub-pages where most readers start. Each one anchored to a real microarchitecture (Apple M4, Intel Raptor Lake, AMD Zen 5, RISC-V SiFive U74) with real cycle counts and named systems.
CPU caches — L1/L2/L3, MESI, false sharing
The most-clicked deep dive on the path. Cache lines, associativity, replacement, the eight orders of magnitude between register and HDD seek.
Out-of-order execution & the reorder buffer
How a 3 GHz core executes ~3 instructions per cycle without you having to ask. Apple M4 has ~512 ROB entries; Raptor Lake ~448.
Branch prediction & speculative execution
A 95%-accurate predictor on a 20-stage pipeline. Spectre is what happens when speculation crosses security boundaries.
Virtual memory & the TLB
Every load is two loads. Page tables, the TLB, huge pages, and why mmap is faster than read until it isn't.
SIMD — AVX, NEON, SVE
One vector instruction can do what 8–64 scalar instructions used to. The hidden cost on older Skylake-X: license-based down-clocking.
PCIe, DMA & MSI-X
How packets actually move from the wire to memory. Lanes, root complex, IOMMU, RSS interrupt steering.
Open the full deep-dive directory
Every chapter in order — transistors, ALU, clocks, pipelines, ISAs, caches, virtual memory, NUMA, PCIe, SSDs, GPUs, boot.
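The pipelining and branch-prediction entries above reduce to simple arithmetic, sketched here under loud assumptions: an ideal stall-free pipeline, a stage time of 0.25 ns against a 1 ns unpipelined instruction, one branch in every five instructions, and a misprediction that costs a full 20-cycle flush. None of these figures are measurements of a real chip, and the function names are illustrative.

```python
def pipelined_time_ns(n_instructions, stages, stage_ns):
    """Time for n instructions on an ideal, stall-free pipeline."""
    # The first instruction takes `stages` steps; each later one
    # completes one step behind its predecessor.
    return (stages + n_instructions - 1) * stage_ns


def unpipelined_time_ns(n_instructions, instruction_ns):
    """Time for n instructions when each must finish before the next starts."""
    return n_instructions * instruction_ns


def effective_cpi(base_cpi, branch_fraction, accuracy, flush_cycles):
    """Average cycles per instruction after charging each misprediction a flush."""
    return base_cpi + branch_fraction * (1.0 - accuracy) * flush_cycles


if __name__ == "__main__":
    n = 1_000_000_000
    # One instruction alone is slower when pipelined (1.25 ns versus 1.0 ns here)...
    print(pipelined_time_ns(1, stages=5, stage_ns=0.25), unpipelined_time_ns(1, 1.0))
    # ...but a billion of them finish roughly four times sooner.
    print(pipelined_time_ns(n, stages=5, stage_ns=0.25), unpipelined_time_ns(n, 1.0))
    # Cost of a 95% versus a 99% accurate predictor on a 20-stage pipeline.
    for accuracy in (0.95, 0.99):
        print(accuracy, effective_cpi(1 / 3, 0.2, accuracy, 20))
```

With these assumed inputs, the 95% predictor raises cycles per instruction from about 0.33 to about 0.53: still a usable machine, but one that spends a large share of its time refilling the pipeline.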
Books, courses, papers, talks
The references this study path leans on. The order matters: Patterson & Hennessy as the textbook spine, Bryant & O'Hallaron for the engineer-facing introduction, Agner Fog for the practitioner-grade specifics.
- Patterson & Hennessy — Computer Organization and Design (RISC-V Edition). The undergraduate textbook. Concrete ISA, end-to-end coverage from gates to caches.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. The graduate sequel. Heavier on cache coherence, NUMA, multicore, and the post-Moore turn.
- Bryant & O'Hallaron — Computer Systems: A Programmer's Perspective. The most engineer-facing of the textbooks. Used in CMU 15-213. If you read one book, read this.
- Sorin, Hill & Wood — A Primer on Memory Consistency and Cache Coherence. Free PDF from Synthesis Lectures. The textbook on coherence protocols.
- Drepper — What Every Programmer Should Know About Memory (2007). 100-page paper, dated on numbers but timeless on structure.
- Shen & Lipasti — Modern Processor Design. The reference for out-of-order execution and modern microarchitecture.
- Courses: MIT 6.004 / 6.191; CMU 15-213/15-513; Berkeley CS61C (RISC-V); nand2tetris (free, builds a working computer from NAND gates up).
- Manuals: Agner Fog's microarchitecture/optimization/instruction tables (free); Intel Optimization Reference Manual; AMD Software Optimization Guide; Arm Architecture Reference Manual.
- Talks: Andrei Alexandrescu — "Speed Is Found in the Minds of People" (CppCon); Mike Acton — "Data-Oriented Design" (CppCon 2014); Dick Sites — "Datacenter Computers" (USENIX).
Hands-on tools
What to install and what to point it at. perf on Linux, Instruments on macOS, VTune on Intel, uProf on AMD. godbolt.org (Compiler Explorer) for seeing what the compiler emitted. llvm-mca for static cycle prediction. perf c2c for finding cache-line contention in production binaries. numastat and numactl for NUMA. Apple's CPU Counters template in Instruments for Apple silicon. pmu-tools toplev.py for the top-down methodology that attributes stalls to the right level of the hierarchy.
Latency numbers
Norvig's "latency numbers every programmer should know", updated for 2026 silicon. A 3 GHz core can retire ~3 instructions per cycle, so multiplying the cycle column by about three gives an "instruction-slots lost" estimate.
| Operation | Time | Cycles @ 3 GHz |
|---|---|---|
| Register read | ~0.3 ns | 1 |
| L1 cache hit | ~1 ns | 3–5 |
| L2 cache hit | ~3–5 ns | 10–16 |
| L3 cache hit | ~12–15 ns | 40–50 |
| DRAM (same socket) | ~80 ns | ~250 |
| DRAM (remote NUMA socket) | ~140 ns | ~420 |
| NVMe Gen5 random read 4 KB | ~10 µs | ~30,000 |
| Datacenter network RTT | ~250 µs | ~750,000 |
| HDD seek | ~5 ms | ~15,000,000 |
| Intercontinental RTT | ~150 ms | ~450,000,000 |
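The cycle column in the table is just the time column multiplied by the clock rate. The sketch below shows that conversion, assuming the same 3 GHz clock and ~3 instructions per cycle used in the text; the dictionary holds a subset of the table's rows, and the function names are illustrative. The table rounds some results (80 ns works out to 240 cycles, listed as ~250).

```python
CLOCK_GHZ = 3.0   # 1 GHz is one cycle per nanosecond
IPC = 3.0         # assumed sustained instructions per cycle

# Subset of the table above, in nanoseconds.
LATENCY_NS = {
    "L1 cache hit": 1,
    "DRAM (same socket)": 80,
    "DRAM (remote NUMA socket)": 140,
    "NVMe Gen5 random read 4 KB": 10_000,
    "HDD seek": 5_000_000,
}


def cycles(latency_ns, clock_ghz=CLOCK_GHZ):
    """Core cycles that elapse while waiting latency_ns."""
    return latency_ns * clock_ghz


def instruction_slots_lost(latency_ns, clock_ghz=CLOCK_GHZ, ipc=IPC):
    """Instructions the core could have retired in that time at the assumed IPC."""
    return cycles(latency_ns, clock_ghz) * ipc


if __name__ == "__main__":
    for name, ns in LATENCY_NS.items():
        print(f"{name}: {cycles(ns):,.0f} cycles, "
              f"{instruction_slots_lost(ns):,.0f} instruction slots")
```

Read this way, a single miss to same-socket DRAM costs on the order of several hundred instructions' worth of work, which is why counting cache misses usually predicts run time better than counting instructions.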
2026 microarchitectures, side by side
Same year, three very different design points. Apple silicon optimises for perf-per-watt and ships a wider front-end; Intel and AMD push frequency and out-of-order depth. RISC-V is the open ISA used in research and embedded, catching up year over year.
| Spec | Apple M4 P-core | Intel Raptor Lake P-core | AMD Zen 5 | SiFive U74 |
|---|---|---|---|---|
| ISA | ARMv9 | x86-64 | x86-64 | RISC-V (RV64GC) |
| Decode width | 10-wide | 6-wide | 4+4 wide | 2-wide in-order |
| Reorder buffer | ~512 entries | ~448 entries | ~448 entries | n/a (in-order) |
| L1d / L1i | 128 KB / 192 KB | 48 KB / 32 KB | 48 KB / 32 KB | 32 KB / 32 KB |
| L2 | 16 MB shared | 2 MB per core | 1 MB per core | 2 MB shared |
| L3 | n/a (SLC) | up to 36 MB | up to 32 MB | n/a |
| Cache line | 128 B | 64 B | 64 B | 64 B |
| Vector width | NEON 128 b | AVX2 256 b (AVX-512 disabled) | AVX-512 (full 512 b) | none (RV64GC has no vector extension) |
| Peak clock | 4.4 GHz | 5.7 GHz | 5.7 GHz | 1.5 GHz |
Decode width is a quick proxy for how much instruction-level parallelism a core is built to extract; the reorder buffer sets how far ahead it can look. Apple goes wider; Intel and AMD go faster; SiFive prioritises area and energy.
Six common mistakes
- Reasoning about CPU performance from instruction count rather than cache-miss count. A "fast" instruction sequence that misses to DRAM ten times is slower than a "slow" sequence that stays in L1.
- Assuming hyperthreading doubles throughput. It adds ~15–30% on integer workloads and can hurt FP-heavy workloads where the sibling thread starves the vector units.
- Treating volatile as a synchronisation primitive. It isn't. It defeats compiler caching of values into registers, but says nothing about CPU-level reordering or coherence.
- Forgetting that mmap'd files are backed by the same DRAM as everything else. Page-cache eviction pressure from a large file scan can trash the working set of every other process on the box.
- Designing for one socket and scaling to four without testing. NUMA effects are not subtle; remote-socket DRAM is nearly 2× the latency (~140 ns versus ~80 ns). A profile on a single socket tells you nothing about what happens at four.
- Padding to 64 bytes on Apple silicon. Apple's cache line is 128. Half your false-sharing fix isn't a fix.
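The last mistake in the list can be checked with integer arithmetic. The sketch below assumes a structure whose base address is aligned to the cache-line size and asks whether two byte offsets fall on the same line; the helper names are illustrative and it measures nothing on real hardware.

```python
def cache_line_index(byte_offset, line_bytes):
    """Index of the cache line containing byte_offset (base assumed line-aligned)."""
    return byte_offset // line_bytes


def share_a_line(offset_a, offset_b, line_bytes):
    """True if the two offsets land on the same cache line, so writes can false-share."""
    return cache_line_index(offset_a, line_bytes) == cache_line_index(offset_b, line_bytes)


def padded_stride(field_bytes, line_bytes):
    """Smallest multiple of line_bytes that holds one field."""
    return -(-field_bytes // line_bytes) * line_bytes  # ceiling division


if __name__ == "__main__":
    # Two per-thread counters padded to a 64-byte stride sit at offsets 0 and 64.
    print(share_a_line(0, 64, line_bytes=64))    # separate lines on a 64 B machine
    print(share_a_line(0, 64, line_bytes=128))   # same line on a 128 B machine
    # Padding to the real line size separates them again.
    stride = padded_stride(8, line_bytes=128)
    print(stride, share_a_line(0, stride, line_bytes=128))
```

In practice the stride would come from the target's actual line size rather than a hard-coded 64, which is the point of the bullet above.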
Adjacent paths
- Operating systems. What the kernel does on top of the silicon. Threads, scheduling, virtual memory from the OS side.
- Databases. The most cache-and-storage-aware software written. B-trees, LSM trees, page caches.
- Computer networking. The wire side of the four-parts split, plus where DMA and the NIC offload meet.
- Go internals. A concrete language runtime sitting on top of all of this — scheduler, garbage collector, channels.