Day-0 → Month-N
Study path / 07

Computer architecture

Modern CPUs are time-machines. They fetch dozens of instructions ahead, execute them out of order, and quietly speculate down the predicted side of every branch, throwing the work away when the prediction turns out wrong. The memory hierarchy that feeds them spans eight orders of magnitude of latency. Understanding this layer is the difference between code that runs at 3 GHz and code that effectively runs at 30 MHz. This study path is the bare-metal foundation underneath everything else on the site: operating systems, languages, databases, and networking are all abstractions over the silicon described here.


Twelve mental models

The set of intuitions that, once internalised, make the rest of computer architecture fall into place. Each is tagged Day-zero (start here), Practitioner (you write code that lives or dies on this), or Operator (you debug production performance regressions with this).

  1. 01 · Day-zero

    The four parts of a computer

    CPU, memory, storage, I/O. The split that has held for fifty years and the three places where it leaks: DMA, MMIO, persistent memory.

  2. 02 · Day-zero

    Transistors → gates → ALU

    The abstraction stack. NMOS to NAND to flip-flop to adder to ALU. Apple M3 Max ships ~92 billion transistors.

  3. 03 · Day-zero

    The clock as the heartbeat

    Synchronous logic. Why frequency stalled near 5 GHz (dynamic power grows roughly with frequency³, because voltage has to rise with the clock) and what we did instead: cores, vectors, accelerators.

  4. 04 · Practitioner

    ISA versus microarchitecture

    Same x86 ISA, vastly different chips. Why ARM-on-laptop in 2026 changed nothing for most developers and everything for some.

  5. 05 · Day-zero

    The instruction cycle

    Fetch, decode, execute, memory access, write-back. The five-stage textbook pipeline and why your laptop's pipeline is closer to 14–20 stages.

  6. 06 · Practitioner

    Pipelining is throughput, not latency

    Each instruction takes longer in a pipelined CPU. A billion of them are much faster.

  7. 07 · Practitioner

    Speculation and branch prediction

    A 95%-accurate predictor on a 20-stage pipeline still gives you a usable machine. Spectre is what happens when speculation crosses security boundaries.

  8. 08 · Practitioner

    Out-of-order execution

    The reorder buffer is a tiny database transaction inside your CPU. ~512 entries on Apple M4, ~448 on Intel Raptor Lake.

  9. 09 · Practitioner

    SIMD and vector throughput

    One AVX-512 instruction can do what 8–64 scalar instructions used to. The hidden cost on older Skylake-X: heavy 512-bit instructions push the chip into a lower frequency licence, so the scalar code running around them slows down too.

  10. 10 · Operator

    The memory hierarchy is eight orders of magnitude

    Register ~0.3 ns to HDD seek ~5 ms. Norvig's table updated for 2026.

  11. 11 · Operator

    Cache coherence is per-line, not per-variable

    Two threads writing different fields of the same 64-byte line can be 100× slower. False sharing.

  12. 12 · Operator

    Every load is two loads

    Address translation first, then the data. The TLB is the most precious silicon you've never heard of: a hit skips the translation load, and huge pages make hits far more likely.
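
Models 06 and 07 come down to a few lines of arithmetic, sketched here in Python. The 1.0 ns unpipelined operation, the 0.25 ns stage time, the 20% branch fraction, and the idea that a misprediction costs about one pipeline depth (20 cycles) are illustrative assumptions, not measurements of any real core. The function names are hypothetical.

```python
def unpipelined_time_ns(n_instructions, op_ns):
    """Each instruction runs start to finish before the next one begins."""
    return n_instructions * op_ns


def pipelined_time_ns(n_instructions, stages, stage_ns):
    """A new instruction enters the pipeline every stage_ns.

    The first instruction needs `stages` steps to drain; every later one
    completes one step after its predecessor.
    """
    return (stages + n_instructions - 1) * stage_ns


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are counted."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Assumed numbers: splitting a 1.0 ns operation into 5 stages of 0.25 ns each
# (latch overhead keeps a stage above 1.0 / 5 = 0.2 ns).
one_unpipelined = unpipelined_time_ns(1, op_ns=1.0)
one_pipelined = pipelined_time_ns(1, stages=5, stage_ns=0.25)
billion_unpipelined = unpipelined_time_ns(10**9, op_ns=1.0)
billion_pipelined = pipelined_time_ns(10**9, stages=5, stage_ns=0.25)

print(f"one instruction:  {one_unpipelined:.2f} ns vs {one_pipelined:.2f} ns pipelined")
print(f"a billion of them: {billion_unpipelined / 1e9:.2f} s vs {billion_pipelined / 1e9:.2f} s pipelined")

# Assumed workload: 1 in 5 instructions is a branch, the predictor is 95%
# accurate, and a flush costs about the depth of a 20-stage pipeline.
cpi = effective_cpi(base_cpi=1.0, branch_fraction=0.2,
                    mispredict_rate=0.05, flush_penalty_cycles=20)
print(f"effective CPI with mispredictions: {cpi:.2f}")
```

Under these assumptions a single instruction takes 1.25 ns through the pipeline instead of 1.0 ns, yet a billion of them finish in about a quarter of the time, and the 95%-accurate predictor adds roughly 0.2 cycles to every instruction.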

Featured starting points

Six sub-pages where most readers start. Each one is anchored to a real microarchitecture (Apple M4, Intel Raptor Lake, AMD Zen 5, RISC-V SiFive U74) with real cycle counts and named systems.

Internals · 15 sub-pages

Open the full deep-dive directory

Every chapter in order — transistors, ALU, clocks, pipelines, ISAs, caches, virtual memory, NUMA, PCIe, SSDs, GPUs, boot.

Continue

Books, courses, papers, talks

The references this study path leans on. The order matters: Patterson & Hennessy as the textbook spine, Bryant & O'Hallaron for the engineer-facing introduction, Agner Fog for the practitioner-grade specifics.

  • Patterson & Hennessy — Computer Organization and Design (RISC-V Edition). The undergraduate textbook. Concrete ISA, end-to-end coverage from gates to caches.
  • Hennessy & Patterson — Computer Architecture: A Quantitative Approach. The graduate sequel. Heavier on cache coherence, NUMA, multicore, and the post-Moore turn.
  • Bryant & O'Hallaron — Computer Systems: A Programmer's Perspective. The most engineer-facing of the textbooks. Used in CMU 15-213. If you read one book, read this.
  • Sorin, Hill & Wood — A Primer on Memory Consistency and Cache Coherence. Free PDF from Synthesis Lectures. The textbook on coherence protocols.
  • Drepper — What Every Programmer Should Know About Memory (2007). 100-page paper, dated on numbers but timeless on structure.
  • Shen & Lipasti — Modern Processor Design. The reference for out-of-order execution and modern microarchitecture.
  • Courses: MIT 6.004 / 6.191; CMU 15-213/15-513; Berkeley CS61C (RISC-V); nand2tetris (free, builds a working computer from NAND gates up).
  • Manuals: Agner Fog's microarchitecture/optimization/instruction tables (free); Intel Optimization Reference Manual; AMD Software Optimization Guide; Arm Architecture Reference Manual.
  • Talks: Andrei Alexandrescu — "Speed Is Found in the Minds of People" (CppCon); Mike Acton — "Data-Oriented Design" (CppCon 2014); Dick Sites — "Datacenter Computers" (USENIX).

Hands-on tools

What to install and what to point it at.

  • Profilers: perf on Linux, Instruments on macOS, VTune on Intel, uProf on AMD.
  • godbolt.org (Compiler Explorer) for seeing what the compiler emitted.
  • llvm-mca for static cycle prediction.
  • perf c2c for finding cache-line contention in production binaries.
  • numastat and numactl for NUMA.
  • Apple's CPU Counters template in Instruments for Apple silicon.
  • pmu-tools toplev.py for the top-down methodology that attributes stalls to the right level of the hierarchy.

Latency numbers

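Norvig's "latency numbers every programmer should know", updated for 2026 silicon. Cycle counts assume a 3 GHz clock. A core that sustains ~3 instructions per cycle gives up about three instruction slots for every cycle it stalls, so multiply the cycle column by three for an "instruction-slots lost" estimate.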

| Operation | Time | Cycles @ 3 GHz |
| --- | --- | --- |
| Register read | ~0.3 ns | 1 |
| L1 cache hit | ~1 ns | 3–5 |
| L2 cache hit | ~3–5 ns | 10–16 |
| L3 cache hit | ~12–15 ns | 40–50 |
| DRAM (same socket) | ~80 ns | ~250 |
| DRAM (remote NUMA socket) | ~140 ns | ~420 |
| NVMe Gen5 random read 4 KB | ~10 µs | ~30,000 |
| Datacenter network RTT | ~250 µs | ~750,000 |
| HDD seek | ~5 ms | ~15,000,000 |
| Intercontinental RTT | ~150 ms | ~450,000,000 |
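
The cycle column is plain arithmetic on the time column, and a short Python sketch makes the conversion explicit. It assumes the table's 3 GHz clock and the ~3 instructions per cycle mentioned above. The latencies are copied from the table, and the function names are hypothetical.

```python
CLOCK_GHZ = 3.0    # clock frequency assumed by the table
ASSUMED_IPC = 3.0  # sustained instructions per cycle (an assumption)

# A few rows from the table, in nanoseconds.
LATENCY_NS = {
    "L1 cache hit": 1,
    "DRAM (same socket)": 80,
    "DRAM (remote NUMA socket)": 140,
    "NVMe Gen5 random read 4 KB": 10_000,
    "HDD seek": 5_000_000,
}


def cycles(latency_ns, clock_ghz=CLOCK_GHZ):
    """Convert a latency to clock cycles: 1 GHz is one cycle per nanosecond."""
    return latency_ns * clock_ghz


def instruction_slots_lost(latency_ns, clock_ghz=CLOCK_GHZ, ipc=ASSUMED_IPC):
    """Instructions the core could have issued had it not been waiting."""
    return cycles(latency_ns, clock_ghz) * ipc


for name, ns in LATENCY_NS.items():
    print(f"{name:28s} {cycles(ns):>14,.0f} cycles "
          f"{instruction_slots_lost(ns):>14,.0f} slots")
```

With these inputs a same-socket DRAM miss works out to 240 cycles (the table rounds to ~250), or about 720 instruction slots in which the core issued nothing.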

2026 microarchitectures, side by side

Same year, three very different design points. Apple silicon optimises for perf-per-watt and ships a wider front-end; Intel and AMD push frequency and out-of-order depth. RISC-V is the open ISA used in research and embedded, catching up year over year.

| Spec | Apple M4 P-core | Intel Raptor Lake P-core | AMD Zen 5 | SiFive U74 |
| --- | --- | --- | --- | --- |
| ISA | ARMv9 | x86-64 | x86-64 | RISC-V (RV64GC) |
| Decode width | 10-wide | 6-wide | 4+4 wide | 2-wide in-order |
| Reorder buffer | ~512 entries | ~448 entries | ~448 entries | n/a (in-order) |
| L1d / L1i | 128 KB / 192 KB | 48 KB / 32 KB | 48 KB / 32 KB | 32 KB / 32 KB |
| L2 | 16 MB shared | 2 MB per core | 1 MB per core | 2 MB shared |
| L3 | n/a (SLC) | up to 36 MB | up to 32 MB | n/a |
| Cache line | 128 B | 64 B | 64 B | 64 B |
| Vector width | NEON 128 b | AVX2 256 b (AVX-512 disabled) | AVX-512 (full) | none (no vector extension) |
| Peak clock | 4.4 GHz | 5.7 GHz | 5.7 GHz | 1.5 GHz |

Decode width is the single best proxy for "how out-of-order is this core". Apple goes wider; Intel and AMD go faster; SiFive prioritises area and energy.
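
One rough way to read the reorder-buffer row is to ask how long a core can keep dispatching behind a stalled load before the window fills. The Python sketch below is a back-of-envelope model, not a description of any vendor's pipeline. The dispatch rate of 3 per cycle is an assumption, the 250-cycle miss comes from the latency table, and the function names are hypothetical.

```python
def cycles_until_rob_full(rob_entries, dispatch_per_cycle):
    """Cycles of younger work the core can accept behind one stalled load."""
    return rob_entries / dispatch_per_cycle


def exposed_stall_cycles(miss_latency_cycles, rob_entries, dispatch_per_cycle):
    """Cycles of the miss that the window cannot cover (never negative)."""
    covered = cycles_until_rob_full(rob_entries, dispatch_per_cycle)
    return max(0.0, miss_latency_cycles - covered)


DISPATCH_PER_CYCLE = 3   # assumed sustained dispatch rate
DRAM_MISS_CYCLES = 250   # same-socket DRAM, from the latency table
L1_HIT_CYCLES = 4        # within the 3-5 cycle range in the latency table

for rob_entries in (512, 448):   # the two window sizes quoted in the table
    covered = cycles_until_rob_full(rob_entries, DISPATCH_PER_CYCLE)
    exposed = exposed_stall_cycles(DRAM_MISS_CYCLES, rob_entries, DISPATCH_PER_CYCLE)
    print(f"ROB {rob_entries}: covers ~{covered:.0f} cycles, "
          f"~{exposed:.0f} cycles of a DRAM miss stay exposed")

print("L1 hit exposed cycles:",
      exposed_stall_cycles(L1_HIT_CYCLES, 512, DISPATCH_PER_CYCLE))
```

Under these assumptions even a 512-entry window fills well before a DRAM miss returns, while an L1 hit is hidden completely. Real cores overlap several misses at once and rarely dispatch at a constant rate, both of which this sketch ignores.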

Six common mistakes

  • Reasoning about CPU performance from instruction count rather than cache-miss count. A "fast" instruction sequence that misses to DRAM ten times is slower than a "slow" sequence that stays in L1.
  • Assuming hyperthreading doubles throughput. It adds ~15–30% on integer workloads and can hurt FP-heavy workloads where the sibling thread starves the vector units.
  • Treating volatile as a synchronisation primitive in C or C++. It isn't one. It stops the compiler from caching a value in a register, but says nothing about CPU-level reordering or coherence.
  • Forgetting that mmap'd files live in the page cache, which is the same DRAM everything else uses. Page-cache pressure from a large file scan can evict the working set of every other process on the box.
  • Designing for one socket and scaling to four without testing. NUMA effects are not subtle; remote-socket DRAM has nearly twice the latency (~140 ns versus ~80 ns in the table above). A profile on a single socket tells you nothing about what happens at four.
  • Padding to 64 bytes on Apple silicon. Apple's cache line is 128 bytes, so two fields padded 64 bytes apart can still land on the same line. Half your false-sharing fix isn't a fix.
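
The last mistake is easy to check with integer arithmetic. This Python sketch assumes the structure starts on a cache-line boundary and simply asks which line each byte offset falls on. The helper names are hypothetical, and the line sizes are the 64 B and 128 B values from the comparison table.

```python
def cache_line_index(offset, line_size):
    """Which cache line a byte offset falls on, counting from an aligned base."""
    return offset // line_size


def share_a_line(offset_a, offset_b, line_size):
    """True if two byte offsets land on the same cache line."""
    return cache_line_index(offset_a, line_size) == cache_line_index(offset_b, line_size)


def padded_stride(field_size, line_size):
    """Round a field size up to a whole number of cache lines."""
    return -(-field_size // line_size) * line_size  # ceiling division


COUNTER_BYTES = 8  # two per-thread 8-byte counters (illustrative)

for line_size in (64, 128):
    for pad_to in (64, 128):
        stride = padded_stride(COUNTER_BYTES, pad_to)
        shared = share_a_line(0, stride, line_size)
        print(f"line {line_size:3d} B, padded to {pad_to:3d} B: "
              f"second counter at offset {stride:3d}, shares a line: {shared}")
```

Padding to 64 bytes separates the counters on a 64-byte-line machine but leaves them on the same line when lines are 128 bytes. Padding to 128 bytes separates them in both cases, at the cost of more memory per counter.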

Adjacent paths

  • Operating systems. What the kernel does on top of the silicon. Threads, scheduling, virtual memory from the OS side.
  • Databases. The most cache-and-storage-aware software written. B-trees, LSM trees, page caches.
  • Computer networking. The wire side of the four-parts split, plus where DMA and the NIC offload meet.
  • Go internals. A concrete language runtime sitting on top of all of this — scheduler, garbage collector, channels.