Study path / 07

Computer architecture

Modern CPUs are time-machines. They fetch dozens of instructions ahead, execute them out of order, and quietly speculate down the predicted side of every branch. The memory hierarchy that feeds them spans eight orders of magnitude of latency. Understanding this layer is the difference between code that runs at 3 GHz and code that effectively runs at 30 MHz. This study path is the bare-metal foundation underneath everything else on the site: operating systems, languages, databases, and networking are all abstractions over the silicon described here.

Deep dives

Inside the silicon

Fifteen sub-pages — transistors, gates, the ALU, the clock, pipelining, branch prediction, out-of-order execution, SIMD, caches, virtual memory, NUMA, PCIe, SSDs, GPUs, and boot. Each chapter is anchored to a real microarchitecture (Apple M4, Intel Raptor Lake, AMD Zen 5, RISC-V SiFive U74) with the cycle counts, pipeline depths, and operational details that almost never make it into textbooks.

Twelve mental models

The set of intuitions that, once internalised, make the rest of computer architecture fall into place. Tagged Day-zero (start here), Practitioner (you write code that lives or dies on this), or Operator (you debug production performance regressions with this).

01 · Day-zero
The four parts of a computer
CPU, memory, storage, I/O. The split that has held for fifty years and the three places where it leaks: DMA, MMIO, persistent memory.
02 · Day-zero
Transistors → gates → ALU
The abstraction stack. NMOS to NAND to flip-flop to adder to ALU. Apple M3 Max ships ~92 billion transistors.
03 · Day-zero
The clock as the heartbeat
Synchronous logic. Why frequency stalled near 5 GHz (dynamic power grows roughly with frequency³ once voltage has to rise with clock speed) and what we did instead: cores, vectors, accelerators.
04 · Practitioner
ISA versus microarchitecture
Same x86 ISA, vastly different chips. Why ARM-on-laptop in 2026 changed nothing for most developers and everything for some.
05 · Day-zero
The instruction cycle
Fetch, decode, execute, memory access, write-back. The five-stage textbook pipeline and why your laptop's pipeline is closer to 14–20 stages.
06 · Practitioner
Pipelining is throughput, not latency
Each instruction takes longer in a pipelined CPU. A billion of them are much faster.
07 · Practitioner
Speculation and branch prediction
A 95%-accurate predictor on a 20-stage pipeline still gives you a usable machine. Spectre is what happens when speculation crosses security boundaries.
08 · Practitioner
Out-of-order execution
The reorder buffer is a tiny database transaction inside your CPU. ~512 entries on Apple M4, ~448 on Intel Raptor Lake.
09 · Practitioner
SIMD and vector throughput
One AVX-512 instruction can do what 8–64 scalar instructions used to. The hidden cost on older Skylake-X: heavy AVX-512 use drops the core into a lower frequency license, so clocks fall while the vector code runs.
10 · Operator
The memory hierarchy is eight orders of magnitude
Register ~0.3 ns to HDD seek ~5 ms is seven orders; a ~150 ms intercontinental round trip takes the full table past eight. Norvig's table updated for 2026.
11 · Operator
Cache coherence is per-line, not per-variable
Two threads writing different fields of the same 64-byte line can be 100× slower. False sharing.
12 · Operator
Every load is two loads
The TLB is the most precious silicon you've never heard of. Huge pages make the second load (the page-table lookup behind a TLB miss) far rarer.
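
Model 06 is easy to check with arithmetic. What follows is a minimal sketch of an idealised pipeline with no stalls or hazards; the 1.0 ns of logic, the five stages, and the 0.05 ns of latch overhead per stage are illustrative assumptions, not figures for any real core.

```python
def pipeline_times(n_instructions, logic_ns, stages, latch_overhead_ns):
    """Return (latency of one instruction, time for the whole run) in ns."""
    # Each stage does 1/stages of the logic plus a fixed latch overhead.
    cycle_ns = logic_ns / stages + latch_overhead_ns
    latency_ns = cycle_ns * stages
    # The first instruction fills the pipe; after that one completes per cycle.
    total_ns = (stages + n_instructions - 1) * cycle_ns
    return latency_ns, total_ns

N = 1_000_000_000
unpiped_latency, unpiped_total = pipeline_times(N, 1.0, 1, 0.05)
piped_latency, piped_total = pipeline_times(N, 1.0, 5, 0.05)

print(f"latency per instruction: {unpiped_latency:.2f} ns -> {piped_latency:.2f} ns")
print(f"speedup over a billion instructions: {unpiped_total / piped_total:.1f}x")
```

Under these assumptions a single instruction gets slower (about 1.05 ns becomes about 1.25 ns) while a billion of them finish roughly four times sooner.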

Featured starting points

Six sub-pages where most readers start. Each one anchored to a real microarchitecture (Apple M4, Intel Raptor Lake, AMD Zen 5, RISC-V SiFive U74) with real cycle counts and named systems.

Memory hierarchy

CPU caches — L1/L2/L3, MESI, false sharing

The most-clicked deep dive on the path. Cache lines, associativity, replacement, the eight orders of magnitude between register and HDD seek.
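
Whether two fields can falsely share comes down to integer division by the line size. A minimal sketch, assuming a struct that starts at a line-aligned address and holds two 8-byte per-thread counters; the offsets are hypothetical, and the 64 B and 128 B line sizes are the ones quoted on this page.

```python
def cache_line_index(byte_offset, line_size):
    """Which cache line, counted from an aligned base, holds this byte."""
    return byte_offset // line_size

def share_a_line(offset_a, offset_b, line_size):
    """True if both offsets fall inside the same cache line."""
    return cache_line_index(offset_a, line_size) == cache_line_index(offset_b, line_size)

adjacent = (0, 8)     # two 8-byte counters back to back
padded_64 = (0, 64)   # second counter pushed out to a 64-byte boundary

for line_size in (64, 128):
    print(
        f"{line_size:>3} B line: adjacent share={share_a_line(*adjacent, line_size)}, "
        f"padded-to-64 share={share_a_line(*padded_64, line_size)}"
    )
```

With 64-byte lines the padding separates the counters. With 128-byte lines they still land on the same line, so the padding buys nothing.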

Microarchitecture

Out-of-order execution & the reorder buffer

How a 3 GHz core executes ~3 instructions per cycle without you having to ask. Apple M4 has ~512 ROB entries; Raptor Lake ~448.

Speculation

Branch prediction & speculative execution

A 95%-accurate predictor on a 20-stage pipeline. Spectre is what happens when speculation crosses security boundaries.
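
A back-of-envelope model shows what "usable" means here. This is a sketch, not a simulator: it assumes one instruction in five is a branch, that a mispredict costs about one pipeline depth (20 cycles), and that the core would otherwise sustain ~3 instructions per cycle. All three values are assumptions chosen for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, accuracy, flush_penalty_cycles):
    """Base CPI plus the average flush cost paid per instruction."""
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredicts_per_instruction * flush_penalty_cycles

BASE_CPI = 1 / 3         # assumption: ~3 instructions per cycle with no stalls
BRANCH_FRACTION = 0.20   # assumption: one instruction in five is a branch
FLUSH_PENALTY = 20       # assumption: a mispredict costs one pipeline depth

for accuracy in (0.95, 0.99):
    cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, accuracy, FLUSH_PENALTY)
    print(f"accuracy {accuracy:.0%}: CPI {cpi:.3f}, IPC {1 / cpi:.2f}")
```

In this model the 95% predictor still delivers a bit under two instructions per cycle, and each extra point of accuracy recovers a visible share of the ideal three.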

Memory

Virtual memory & the TLB

Every load is two loads. Page tables, the TLB, huge pages, and why mmap is faster than read until it isn't.
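
Why huge pages help is mostly a matter of TLB reach: entries multiplied by page size. A minimal sketch, assuming a hypothetical 1024-entry TLB in which every entry can map either page size; real TLBs often keep separate entry counts per page size, so treat the numbers as illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def tlb_reach_bytes(entries, page_size_bytes):
    """Memory addressable without a page walk if every entry maps one page."""
    return entries * page_size_bytes

TLB_ENTRIES = 1024  # hypothetical entry count, chosen for round numbers

reach_4k = tlb_reach_bytes(TLB_ENTRIES, 4 * KIB)
reach_2m = tlb_reach_bytes(TLB_ENTRIES, 2 * MIB)

print(f"4 KiB pages: {reach_4k // MIB} MiB of reach")
print(f"2 MiB pages: {reach_2m // GIB} GiB of reach")
```

A working set larger than the reach pays for page walks; moving from 4 KiB to 2 MiB pages multiplies the reach by 512.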

Throughput

SIMD — AVX, NEON, SVE

One vector instruction can do what 8–64 scalar instructions used to. The hidden cost on older Skylake-X: wide vector code triggers a lower frequency license and the clocks drop.

I/O

PCIe, DMA & MSI-X

How packets actually move from the wire to memory. Lanes, root complex, IOMMU, RSS interrupt steering.

Internals · 15 sub-pages

Open the full deep-dive directory

Every chapter in order — transistors, ALU, clocks, pipelines, ISAs, caches, virtual memory, NUMA, PCIe, SSDs, GPUs, boot.

Continue

Books, courses, papers, talks

The references this study path leans on. The order matters: Patterson & Hennessy as the textbook spine, Bryant & O'Hallaron for the engineer-facing introduction, Agner Fog for the practitioner-grade specifics.

Patterson & Hennessy — Computer Organization and Design (RISC-V Edition). The undergraduate textbook. Concrete ISA, end-to-end coverage from gates to caches.
Hennessy & Patterson — Computer Architecture: A Quantitative Approach. The graduate sequel. Heavier on cache coherence, NUMA, multicore, and the post-Moore turn.
Bryant & O'Hallaron — Computer Systems: A Programmer's Perspective. The most engineer-facing of the textbooks. Used in CMU 15-213. If you read one book, read this.
Sorin, Hill & Wood — A Primer on Memory Consistency and Cache Coherence. Free PDF from Synthesis Lectures. The textbook on coherence protocols.
Drepper — What Every Programmer Should Know About Memory (2007). 100-page paper, dated on numbers but timeless on structure.
Shen & Lipasti — Modern Processor Design. The reference for out-of-order execution and modern microarchitecture.
Courses: MIT 6.004 / 6.191; CMU 15-213/15-513; Berkeley CS61C (RISC-V); nand2tetris (free, builds a working computer from NAND gates up).
Manuals: Agner Fog's microarchitecture/optimization/instruction tables (free); Intel Optimization Reference Manual; AMD Software Optimization Guide; Arm Architecture Reference Manual.
Talks: Andrei Alexandrescu — "Speed Is Found in the Minds of People" (CppCon); Mike Acton — "Data-Oriented Design" (CppCon 2014); Dick Sites — "Datacenter Computers" (USENIX).

Hands-on tools

What to install and what to point it at. perf on Linux, Instruments on macOS, VTune on Intel, uProf on AMD. godbolt.org (Compiler Explorer) for seeing what the compiler emitted. llvm-mca for static cycle prediction. perf c2c for finding cache-line contention in production binaries. numastat and numactl for NUMA. Apple's CPU Counters template in Instruments for Apple silicon. pmu-tools toplev.py for the top-down methodology that attributes stalls to the right level of the hierarchy.

Latency numbers

Norvig's "latency numbers every programmer should know", updated for 2026 silicon. A 3 GHz core executes ~3 instructions per cycle, so multiplying the cycle column by three gives a rough "instruction-slots lost" estimate.

Operation	Time	Cycles @ 3 GHz
Register read	~0.3 ns	1
L1 cache hit	~1 ns	3–5
L2 cache hit	~3–5 ns	10–16
L3 cache hit	~12–15 ns	40–50
DRAM (same socket)	~80 ns	~250
DRAM (remote NUMA socket)	~140 ns	~420
NVMe Gen5 random read 4 KB	~10 µs	~30,000
Datacenter network RTT	~250 µs	~750,000
HDD seek	~5 ms	~15,000,000
Intercontinental RTT	~150 ms	~450,000,000
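
The cycle column is just latency multiplied by clock rate. A minimal sketch that recomputes a few rows, assuming the 3 GHz clock and the ~3 instructions per cycle used above; the helper name and the choice of rows are illustrative.

```python
CLOCK_GHZ = 3.0
ISSUE_WIDTH = 3  # the ~3 instructions per cycle assumed in the text

def ns_to_cycles(latency_ns, clock_ghz=CLOCK_GHZ):
    """GHz is cycles per nanosecond, so cycles = ns * GHz."""
    return latency_ns * clock_ghz

latencies_ns = {
    "DRAM (same socket)": 80,
    "DRAM (remote NUMA socket)": 140,
    "NVMe Gen5 random read 4 KB": 10_000,
    "HDD seek": 5_000_000,
}

for name, ns in latencies_ns.items():
    cycles = ns_to_cycles(ns)
    slots = cycles * ISSUE_WIDTH
    print(f"{name:28s} {cycles:>12,.0f} cycles {slots:>12,.0f} issue slots")
```

Same-socket DRAM comes out at 240 cycles, which the table rounds to ~250; the other rows match exactly.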

2026 microarchitectures, side by side

Four cores, three very different design points. Apple silicon optimises for perf-per-watt and ships a wider front-end; Intel and AMD push frequency and out-of-order depth. RISC-V is the open ISA used in research and embedded, catching up year over year.

Spec	Apple M4 P-core	Intel Raptor Lake P-core	AMD Zen 5	SiFive U74
ISA	ARMv9	x86-64	x86-64	RISC-V (RV64GC)
Decode width	10-wide	6-wide	4+4 wide	2-wide in-order
Reorder buffer	~512 entries	~448 entries	~448 entries	n/a (in-order)
L1d / L1i	128 KB / 192 KB	48 KB / 32 KB	48 KB / 32 KB	32 KB / 32 KB
L2	16 MB shared	2 MB per core	1 MB per core	2 MB shared
L3	n/a (SLC)	up to 36 MB	up to 32 MB	n/a
Cache line	128 B	64 B	64 B	64 B
Vector width	NEON 128 b	AVX-512 disabled	AVX-512 (full)	none (RV64GC has no V extension)
Peak clock	4.4 GHz	5.7 GHz	5.7 GHz	1.5 GHz

Decode width is a quick proxy for how much parallelism a core is built to extract per cycle; reorder-buffer size tells you how far ahead it can look. Apple goes wider; Intel and AMD go faster; SiFive prioritises area and energy.

Six common mistakes

Reasoning about CPU performance from instruction count rather than cache-miss count. A "fast" instruction sequence that misses to DRAM ten times is slower than a "slow" sequence that stays in L1.
Assuming hyperthreading doubles throughput. It adds ~15–30% on integer workloads and can hurt FP-heavy workloads where the sibling thread starves the vector units.
Treating volatile as a synchronisation primitive. It isn't. It defeats compiler caching of values into registers, but says nothing about CPU-level reordering or coherence.
Forgetting that mmap'd files live in the same DRAM as everything else. Page-cache pressure from a large file scan can evict the working set of every other process on the box.
Designing for one socket and scaling to four without testing. NUMA effects are not subtle; remote-socket DRAM is roughly 1.75× the latency (~140 ns against ~80 ns in the table above). A profile on a single socket tells you nothing about what happens at four.
Padding to 64 bytes on Apple silicon. Apple's cache line is 128. Half your false-sharing fix isn't a fix.
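
The first mistake is worth putting numbers on. A crude sketch, assuming every DRAM miss is fully exposed (no overlap between misses, no prefetching) and using the ~250-cycle DRAM latency and ~3 instructions per cycle quoted on this page; the instruction and miss counts are made up for illustration.

```python
DRAM_CYCLES = 250   # same-socket DRAM, from the latency table
IPC = 3             # the ~3 instructions per cycle figure used on this page

def estimated_cycles(instructions, dram_misses):
    """Issue time plus the cost of misses that stall the core completely."""
    return instructions / IPC + dram_misses * DRAM_CYCLES

short_but_missing = estimated_cycles(instructions=30, dram_misses=10)
long_but_cached = estimated_cycles(instructions=300, dram_misses=0)

print(f"30 instructions, 10 DRAM misses: {short_but_missing:,.0f} cycles")
print(f"300 instructions, all in cache:  {long_but_cached:,.0f} cycles")
```

Ten times the instructions still finishes about 25 times sooner in this model. Real cores overlap some misses, so the gap narrows, but the ordering rarely flips.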

Adjacent paths

Operating systems. What the kernel does on top of the silicon. Threads, scheduling, virtual memory from the OS side.
Databases. The most cache-and-storage-aware software written. B-trees, LSM trees, page caches.
Computer networking. The wire side of the four-parts split, plus where DMA and the NIC offload meet.
Go internals. A concrete language runtime sitting on top of all of this — scheduler, garbage collector, channels.

Continue

Open the internals directory

Fifteen deep dives, one per layer of the bare metal — from transistors and the ALU through pipelining, caches, virtual memory, NUMA, PCIe, SSDs, GPUs, and boot.
