How a computer runs a program.
Not just the CPU — the whole machine. Disk holds the executable. The OS loader copies it into RAM. The CPU walks the cache hierarchy looking for the next instruction. The ALU does the arithmetic. A syscall hands control to the kernel, which writes the output to the screen. Press play, step through, or scrub — every region lights up when it's active, and the running cycle counter shows exactly where the time goes.
out = a + b · printf("%d\n", out)
a is at RAM[0x10] · b is at RAM[0x11] · out lands at RAM[0x12]
Cold start. The executable is on disk. RAM is empty. The CPU is waiting for the OS to load something to run.
- Executable: A file on disk with the instructions and data needed to run a program.
- Loader: The part of the OS that reads an executable off disk and copies it into RAM.
- Cycle: One tick of the CPU clock. Modern CPUs do ~3 billion per core per second.
The cycle, in one paragraph
A computer is a loop. The CPU fetches an instruction from the nearest cache that has it. It decodes — figures out what the instruction means. It executes — does the work, which might be arithmetic, a memory read, a memory write, or a system call. It updates the program counter and goes around again. Modern CPUs run this loop ~3 billion times per second per core, with pipelining and out-of-order execution layered on top, but the underlying loop has been the same since the 1940s.
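
The loop above can be sketched as a toy interpreter. This is a minimal illustration, not how real hardware is built: the four-instruction set (LOAD, ADD, STORE, HALT), the register names, and the values of a and b are invented here. Only the addresses RAM[0x10], RAM[0x11], and RAM[0x12] come from the example program.

```python
# Toy machine: a hypothetical instruction set, invented for illustration.
RAM = {0x10: 2, 0x11: 3, 0x12: 0}  # a, b, out (the values of a and b are made up)

PROGRAM = [
    ("LOAD", "r0", 0x10),        # r0 = RAM[0x10]  (a)
    ("LOAD", "r1", 0x11),        # r1 = RAM[0x11]  (b)
    ("ADD", "r2", "r0", "r1"),   # r2 = r0 + r1
    ("STORE", "r2", 0x12),       # RAM[0x12] = r2  (out)
    ("HALT",),
]


def run(program, ram):
    """Run the fetch-decode-execute loop; return how many instructions ran."""
    regs = {"r0": 0, "r1": 0, "r2": 0}
    pc = 0      # program counter
    steps = 0
    while True:
        instr = program[pc]      # fetch
        op, *args = instr        # decode
        steps += 1
        if op == "LOAD":         # execute
            regs[args[0]] = ram[args[1]]
        elif op == "STORE":
            ram[args[1]] = regs[args[0]]
        elif op == "ADD":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "HALT":
            return steps
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1                  # update the program counter, go around again


steps = run(PROGRAM, RAM)
print(RAM[0x12], steps)  # prints "5 5"
```

Each pass through the while loop is one trip around the cycle. Caches, pipelining, and the syscall for printf are left out on purpose.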
Why the cycle counts matter
The visualization shows running cycle counts because the most surprising thing about computers, once you learn how they work, is the gap between the fast parts and the slow parts. Register access is one cycle. L1 cache is three. L2 is ten. L3 is forty. RAM is two hundred. SSD is a hundred thousand. Network round-trip across a datacentre is about a million.
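
To put those cycle counts on a wall-clock scale, here is a small conversion sketch. It assumes a 3 GHz clock, taken from the "~3 billion per second" figure; the latencies are the approximate ones quoted above, and real hardware varies.

```python
CLOCK_HZ = 3e9  # assumption: a 3 GHz core

# Approximate latencies in cycles, copied from the paragraph above.
LATENCY_CYCLES = {
    "register": 1,
    "L1 cache": 3,
    "L2 cache": 10,
    "L3 cache": 40,
    "RAM": 200,
    "SSD": 100_000,
    "datacentre round-trip": 1_000_000,
}


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / clock_hz * 1e9


for name, cycles in LATENCY_CYCLES.items():
    slowdown = cycles / LATENCY_CYCLES["register"]
    print(f"{name:<22} {cycles:>9,} cycles {cycles_to_ns(cycles):>12,.1f} ns {slowdown:>10,.0f}x")
```

At this clock rate a RAM access costs about 67 ns and a datacentre round-trip about a third of a millisecond.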
In the program above, the actual arithmetic took roughly five cycles. The first instruction fetch — cold cache — took four hundred and fifty. The syscall took a thousand for the mode switch alone. The disk load that came before any of it took three hundred thousand. Almost all of the program's time was spent moving data around, not computing. That ratio holds for nearly every program you've ever run.
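
The ratio is easy to check with the four figures quoted above. This sketch adds up only those four numbers, so it is a rough budget rather than a full trace of the run.

```python
# Cycle counts quoted in the paragraph above (all approximate).
BUDGET_CYCLES = {
    "disk load": 300_000,
    "syscall mode switch": 1_000,
    "first instruction fetch (cold cache)": 450,
    "arithmetic": 5,
}

total = sum(BUDGET_CYCLES.values())
shares = {name: cycles / total for name, cycles in BUDGET_CYCLES.items()}

for name, share in shares.items():
    print(f"{name:<38} {share:9.4%}")
print(f"total: {total:,} cycles")
```

On these numbers the disk load is over 99% of the total and the arithmetic is under 0.002%.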
The pieces, named
| Region | What it does | How fast | How big |
|---|---|---|---|
| Disk | Holds files across power cycles. Programs live here when they're not running. | ~100,000 cycles (SSD); millions (HDD) | Hundreds of GB to TBs |
| RAM | Main memory. Where programs and their data live while running. | ~200 cycles | Gigabytes |
| L3 cache | Last-level cache, shared across cores. | ~40 cycles | ~30 MB |
| L2 cache | Per-core mid-level cache. | ~10 cycles | ~512 KB |
| L1 cache | Per-core innermost cache. Split into instruction and data. | ~3 cycles | ~64 KB |
| CPU core | Registers + ALU + control unit. Runs the actual instructions. | 1 cycle per op (most) | ~32 registers |
| OS kernel | Loader, scheduler, syscall handler, device drivers. Mediates between programs and hardware. | ~1000 cycles per syscall | — |
| Output | The terminal, the screen, the network. Where the program's effect on the world shows up. | variable | — |
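
One way to read the "How fast" column is as the cost of a memory access that is satisfied at that level. The sketch below estimates the average cost of an access under that reading. The hit rates are hypothetical, picked only to make the arithmetic concrete; real rates depend heavily on the workload.

```python
# Latency in cycles when an access is satisfied at each level (from the table).
LEVELS = [("L1", 3), ("L2", 10), ("L3", 40)]
RAM_CYCLES = 200

# Hypothetical hit rates: the chance a level has the data, given that every
# level before it missed. These are assumptions, not measurements.
HIT_RATE = {"L1": 0.95, "L2": 0.80, "L3": 0.50}


def average_access_cycles(levels=LEVELS, hit_rate=HIT_RATE, ram_cycles=RAM_CYCLES):
    """Expected cycles per access, weighting each level by how often it serves."""
    expected = 0.0
    still_missing = 1.0  # fraction of accesses not yet satisfied
    for name, cycles in levels:
        served_here = still_missing * hit_rate[name]
        expected += served_here * cycles
        still_missing -= served_here
    return expected + still_missing * ram_cycles  # the rest go to RAM


print(f"{average_access_cycles():.2f} cycles per access on average")
```

With these made-up rates only 0.5% of accesses reach RAM, yet they contribute about as much to the average as all the L2 and L3 hits combined.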
What this visualization simplifies
- One instruction at a time. Real CPUs pipeline — they fetch the next instruction while decoding the current one and executing the previous one. Modern cores have 14–20 pipeline stages and execute several instructions per cycle.
- One cache line shown. Each cache level has thousands of lines, organized as N-way set-associative. We show one line of L1 so the cache state is readable.
- No virtual memory. Every address shown is physical. In reality the CPU translates a virtual address to a physical one on every memory access, usually from the TLB and, on a TLB miss, by walking the page table.
- No branch prediction. Modern CPUs guess which way a branch will go and start executing speculatively. A wrong guess wastes ~15 cycles.
- No out-of-order execution. Real CPUs reorder instructions to keep all the execution units busy, then "retire" them in program order.
- One CPU core. Your laptop has 8–16. Each runs its own loop; cache coherence keeps them consistent.
Each of these is its own page in computer architecture internals — pipelines, caches and MESI, branch prediction, out-of-order execution, virtual memory and the TLB.
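
As a taste of the first simplification, here is a sketch of why pipelining helps. It assumes an idealized three-stage pipeline (fetch, decode, execute) where every stage takes one cycle and nothing ever stalls. Real cores have more stages and do stall.

```python
STAGES = ["fetch", "decode", "execute"]  # idealized: one cycle each, no stalls


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Each instruction finishes every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """The first instruction fills the pipeline; after that one finishes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def timeline(n_instructions):
    """One row per instruction, one column per cycle (F, D, E mark the stage)."""
    width = pipelined_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        cells = ["."] * width
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage[0].upper()  # instruction i is in stage s at cycle i + s
        rows.append(f"i{i}: " + " ".join(cells))
    return rows


print("\n".join(timeline(5)))
print(unpipelined_cycles(5), "cycles unpipelined,", pipelined_cycles(5), "pipelined")
```

Five instructions take 15 cycles one at a time and 7 when overlapped. As the instruction count grows, the pipelined cost approaches one cycle per instruction.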
What happens when you run "hello world"
Same cycle, more steps. You type ./hello. The shell calls execve; the kernel loader maps the executable into RAM. After the C runtime's startup code runs, the CPU reaches main(). printf resolves to instructions in libc, which format the string, copy it into a buffer, and call write(1, ...), a syscall on file descriptor 1 (standard output). The kernel takes the bytes, hands them to the terminal driver, which hands them to your terminal emulator (also a program, running on the same CPU). The terminal emulator parses the bytes, decides which pixels to light up, and calls into the display driver, which talks to the GPU, which talks to the monitor over a display link such as HDMI. Every pixel that lights up on your screen is the end of a chain of fetch-decode-execute cycles. Billions per second.
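
The printf step can be sketched in a few lines. The helper name tiny_printf is invented for this example, and it skips the buffering that libc's real printf does. os.write is Python's thin wrapper over the write syscall, and file descriptor 1 is standard output.

```python
import os


def tiny_printf(fd, fmt, *args):
    """Format a string into a byte buffer, then hand it to the kernel via write."""
    buffer = (fmt % args).encode("ascii")  # user-space work: format and copy
    return os.write(fd, buffer)            # the syscall; returns bytes written


tiny_printf(1, "%s\n", "hello world")  # fd 1 is standard output
```

Everything before os.write runs in the program itself. The call to os.write is where control passes to the kernel.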