Out-of-order execution
The CPU runs your instructions in whatever order it likes, as long as the visible result is the same as if it ran them in program order. Independent instructions execute in parallel. Memory loads start early so the data is ready when the consuming instruction shows up. A pile of bookkeeping — reorder buffer, register renaming, reservation stations — keeps the architectural state consistent. The technique was published in 1967 (Tomasulo, IBM) and shipped to consumers in 1995 (Pentium Pro). Almost every CPU since runs out of order.
The problem in-order can't solve
Consider this RISC-V sequence:
load x10, [x1] // L1 hit, ~5 cycles
add x11, x10, x2 // depends on x10 — must wait for load
add x12, x3, x4 // independent of everything above
sub x13, x5, x6 // also independent
add x14, x12, x13 // depends on x12 and x13 from aboveA strict in-order machine would stall after the load: it can't fetch
the next instruction until the previous one finishes (or close to it). Even with
pipelining, the dependent add stalls and freezes the pipeline behind
it. The two independent add / sub instructions sit
waiting for nothing useful.
An out-of-order machine sees that add x12 and sub x13
don't depend on the in-flight load. It launches them in parallel
while the load works through L1. By the time the load returns,
add x12 and sub x13 are already done, and the dependent
add x14 can run immediately. Five instructions complete in the time
an in-order machine would have run two.
That is the whole pitch for out-of-order execution, and it follows directly from where the time actually goes. An L1 hit is a handful of cycles; an L2 hit is a dozen or two; a trip to DRAM is a few hundred. An in-order pipeline that hits one of those misses simply freezes, and every instruction queued behind it freezes too, even the ones that have nothing to do with the stalled load. Out-of-order execution keeps the machine busy by letting the work that is ready run during that wait. The deeper question is not "how do we run instructions in a different order" but "how do we do that and still get exactly the result the program asked for." The rest of this page is the answer to the second half.
Three stages: issue in order, execute out of order, retire in order
The trick that makes this safe is a split. The pipeline reads instructions in program order, lets them execute whenever their inputs are ready, then commits their results back in program order. Three distinct boundaries, three distinct orderings. The front-end is in order so the machine always knows the original program sequence. The execution core is out of order so it can fill the gaps left by stalls. Retirement is in order so the architectural state the software sees only ever moves forward exactly as written.
This is the cleanest way to hold the whole machine in your head. The pipeline you already know handles fetch, decode, and the stages of a single instruction; out-of-order execution wraps a scheduler around the middle of that pipeline so the stages can be filled by whichever instruction is ready, not whichever one happens to be next. Issue width, the number of instructions the front-end pushes in per cycle, and retire width, the number it commits per cycle, are the two ends; the out-of-order window in between is what hides the stalls.
Register renaming
RISC-V exposes 32 architectural registers (x0…x31).
x86-64 exposes 16. Out-of-order execution needs many more — without the rename,
two unrelated instructions that happen to write the same architectural register
create a false dependency they don't deserve. The fix: map architectural
registers to a much larger pool of physical registers — typically
200–600 on modern cores. Every write allocates a fresh physical register; reads
use the most recent mapping for that architectural name.
There are three kinds of dependency to keep straight. A true (read-after-write) dependency is real: an instruction reads a value another one produced, so it has to wait. The other two are accidents of register reuse. A write-after-read (WAR) hazard happens when an instruction writes a register that an earlier instruction still needs to read; a write-after-write (WAW) hazard happens when two instructions write the same register and the later write must not clobber the earlier one out of order. Neither carries any actual data between the instructions. They exist only because the architecture has so few register names that the compiler had to reuse them. Renaming makes them vanish: give each write its own physical register and the two instructions stop colliding.
Step through this sequence and watch the rename table evolve:
mov x12, x10. It allocates
no execution unit; it just remaps x12 to point at the same
physical register as x10. This is what "mov is free" means on
modern x86. Same for add x12, x12, 0 when 0 is recognized as a
constant — Apple silicon and Intel Sapphire Rapids both do these
optimisations in the rename stage with zero downstream cost.Reorder buffer in motion
The reorder buffer (ROB) is the bookkeeping device that makes OOO safe. Every µop gets an entry on issue. The entries are filled in program order — like a queue. Execution happens whenever each µop's inputs are ready, in any order. Retirement commits each µop to architectural state, but only in program order: the head of the ROB retires when its execution is done, even if µops behind it finished long ago.
Tomasulo's algorithm — the original idea
Robert Tomasulo published the algorithm in 1967 for the IBM System/360 Model 91. The core ideas — reservation stations, register renaming via the Common Data Bus, out-of-order issue — are still what every modern CPU uses, with implementation updates. The four parts:
- Reservation stations hold µops waiting for operands. Each station has slots for the operation, the operands (or tags pointing to where they'll come from), and a "ready" flag.
- Register renaming happens at issue: instead of the architectural register number, the µop carries a physical register tag. Two µops writing the same architectural register get different tags, breaking false dependencies.
- Common Data Bus (CDB). When a functional unit produces a result, it broadcasts the (tag, value) on the CDB. Every reservation station snoops, and any whose operands match the tag mark themselves ready.
- Reorder buffer (added later — not in Tomasulo's 1967 paper, but standard since the 1980s). Holds in-flight µops in program order so retirement can commit them deterministically. Required for precise interrupts and exception handling.
Modern variants distribute the reservation stations into per-unit issue queues, use a separate physical register file (the RAT — Register Alias Table — instead of CDB tags), and integrate retirement with the ROB. The principles are unchanged.
In-order retirement is mandatory
Out-of-order execution is fine. Out-of-order retirement would be a disaster. Without in-order retirement:
- Exceptions become imprecise. A page fault at instruction N+5 might commit before instruction N's exception, so the OS can't tell which instruction faulted. Modern OSes need precise interrupts to debug, restart, and emulate.
- Branch mispredict recovery becomes impossible. If µops downstream of a branch already committed when the branch turns out to be wrong, you can't roll back.
- Interrupts can't restart cleanly. An I/O interrupt mid-program needs to know exactly what state the CPU was in.
The ROB's job is to buffer out-of-order results until they can retire in program order, deterministically. Speculative results live in the physical register file; only when a µop retires does its result become architecturally visible. If a branch turns out mispredicted, every younger ROB entry is squashed, the rename table is rolled back, and execution resumes from the correct path.
Memory is the hard part: the load/store queue
Renaming handles registers cleanly because the names are explicit in the
instruction. Memory is harder. A store writes to an address that is
only known once its operands have been computed, and a later load
may or may not read the same address. The core wants to run loads early, because
loads are the long-latency operations whose results everyone downstream is waiting
on. But it cannot let a load read stale data from before a store that was supposed
to write the same location. Sorting out which loads and stores actually overlap is
called memory disambiguation, and it is handled by the
load/store queue: two ordered structures that track every in-flight memory access
with its address and, for stores, its data.
When a load is ready to execute, the hardware checks the store queue for any older store to the same address. If one exists and its data is available, the load takes its value straight from the store queue rather than from cache, a fast path called store-to-load forwarding. If an older store's address is not yet known, the core has a choice: stall the load until the address resolves, which is safe but slow, or predict that the store and load do not overlap and let the load go. A memory dependence predictor makes that call. When it guesses wrong, the load and everything that consumed its value get squashed and replayed, the same recovery machinery used for branch mispredicts.
This is also where memory ordering rules bite. x86 has a relatively strong model, so the load/store queue has to preserve more of the program's apparent order; ARM and Apple silicon allow weaker ordering, which lets loads and stores reorder more freely and is part of why those cores extract more memory-level parallelism. The queue is the structure that enforces whatever ordering the architecture promises while still running as much memory traffic in parallel as it can get away with.
Speculation and the precise state
The out-of-order window only pays off if it stays full, and it cannot stay full if the core stops at every branch to wait for the outcome. So it does not wait. The branch predictor guesses which way each branch will go, and the front-end keeps fetching and issuing down the predicted path. Everything issued past an unresolved branch is speculative: it executes, it produces results, but those results live only in physical registers and the store queue, never in architectural state, until the branch is confirmed.
When the branch resolves correctly, the speculative results are already computed and retirement just walks past them in order, no penalty. When it resolves wrong, the core has to undo everything younger than the branch. Because nothing speculative ever retired, the undo is clean: squash every younger ROB entry, restore the rename table to its state at the branch, drop the bad store-queue entries, and restart the front-end at the correct target. The architectural state that survives is exactly the state after the branch and nothing more. That is what a precise machine means, and it is why the same mechanism handles exceptions: a page fault or divide-by-zero is detected during execution but does not take effect until that instruction reaches the head of the ROB and tries to retire, at which point everything after it is squashed and control transfers with a precise picture of where the program was.
The security fallout: Spectre and Meltdown
Speculation was assumed to be invisible because squashed instructions never touch architectural state. In 2018, Spectre and Meltdown showed that assumption was wrong. Speculative instructions do leave one trace: they change the cache. A load that ran speculatively, even one that was later squashed, can pull a line into the cache, and an attacker who times a later access can tell whether that line is present. The architectural result was rolled back; the microarchitectural side effect was not.
Meltdown used out-of-order execution to read kernel memory the process was not
allowed to see: the permission check and the dependent load were both in flight,
and the load's result leaked into the cache before the fault retired and squashed
it. Spectre is broader. It trains the branch predictor to mis-speculate into a
gadget that reads secret data and then makes a cache access keyed on that secret,
leaking it through timing even though the speculative path was thrown away. These
are not bugs in any one chip; they are consequences of speculation itself, which is
why mitigations have been costly. Fixes range from microcode and barrier
instructions (lfence to stop speculation at a sensitive point) to
kernel page-table isolation to compilers that insert speculation guards. Newer
cores add hardware defenses, but the lesson stands: a side channel can leak data
that the architecture promised was never exposed. When you reason about
out-of-order performance, remember it has a security surface too, and that the
cost of a hot path measured with a tool like
top-down analysis
may include mitigation overhead the hardware is paying on your behalf.
ROB size over the years
The ROB is the first-order knob for how much instruction-level parallelism a CPU can extract. A bigger ROB means more µops in flight, more chances for OOO to find independent work, more memory accesses outstanding. The growth has been steady:
| Chip | ROB size | Notes |
|---|---|---|
| Pentium III (1999) | 40 | 6-wide retire, ~3 IPC peak |
| Core 2 Duo (2006) | 96 | Conroe — 4-wide decode, 4-wide retire |
| Sandy Bridge (2011) | 168 | First major ROB-size jump in years |
| Skylake (2015) | 224 | TAGE branch predictor, deeper machine |
| Apple M1 (2020) | 630 | Massive ROB enabled the 8-wide front-end |
| AMD Zen 4 (2022) | 320 | Bigger than Skylake, smaller than Apple |
| Intel Raptor Lake (2022) | 512 | Catching up to Apple territory |
| AMD Zen 5 (2024) | 448 | Wider issue, similar capacity |
| Apple M3 / M4 (2023–24) | 670 | The current high-water mark |
Apple's 600+ ROB is the most aggressive in industry. It's enabled by Apple's wider decode (8 instructions/cycle from a single L1 instruction cache), much larger physical register file, and more permissive memory ordering. The cost: more silicon, more verification, more power. The gain: dramatically more memory-level parallelism, which on workloads bounded by L3/DRAM latency translates directly to more throughput.
What's in flight on a busy core
On a modern Intel or Apple core executing well-tuned code, you'll see roughly:
- ~250–500 µops in the ROB
- ~80–200 outstanding loads and stores in the load/store queue
- ~10–20 in-flight cache misses to L2 or beyond (memory-level parallelism)
- ~150–300 physical registers carrying speculative values
- 4–8 µops being retired per cycle
- 4–8 µops being dispatched per cycle
This is what "instruction-level parallelism" looks like in 2026. It's a far cry from 1995's "1 instruction per cycle" Pentium. Almost the entire gap between modern and 1990s-era IPC came from making the OOO machinery wider and deeper.
Limits of OOO
OOO can't help when the program offers no independent work. The classic case:
pointer-chasing. x = list->next; x = x->next;
— each load depends on the previous load. The ROB fills with µops that all wait,
the OOO core extracts no parallelism, the chip runs at the speed of the memory
latency. Linked-list traversal of a structure too big for L1 is essentially a
DRAM benchmark; OOO doesn't help.
OOO also can't help when control flow is unpredictable enough that branch mispredicts dominate. Every mispredict throws away tens of µops and forces a restart. A workload with 5% mispredict rate and 16-cycle penalty pays ~0.8 cycles per instruction in pure mispredict overhead, before any other cost. This is why modern CPUs invest as much silicon in branch prediction as in execution units.
Common misconceptions
- "OOO means non-deterministic results." Architecturally, no — every retirement is in program order, every visible state matches what an in-order machine would produce. The microarchitecture is non-deterministic about timing, which leaks through side channels (Spectre, MDS) — but never through architectural state.
- "Modern x86 is RISC because of the µops." The internal µops are RISC-shaped, but the architectural state (16 registers, complex addressing modes, microcode) is still CISC. Software-visible behaviour matches the x86 manual; the µops are an implementation detail.
- "More physical registers always help." Past a point, the rename table itself slows down — large CAMs (content-addressable memories) hurt cycle time. Apple, Intel, and AMD have all settled near 200–600 because that's where the curve starts to bend.
- "OOO and SMT are independent." Related. SMT (hyperthreading) shares the OOO core between two thread contexts. The ROB, register file, and execution units are partitioned dynamically. SMT helps when one thread stalls (memory miss); the other thread fills the slack with its own µops.
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| Modern ROB size, mainstream | 320–670 entries | Apple ~670, Intel ~512, AMD ~448 |
| Modern physical register file | ~200–500 entries | Roughly 60–80% of ROB size |
| Modern issue width (µops/cycle) | 4–8 | Apple 8-wide, Intel/AMD 4–6 wide |
| Modern retirement width (µops/cycle) | 4–8 | Same as issue, deeper pipelines have wider retire |
| Architectural registers, RISC-V / ARM64 | 32 | Plus 32 FP, 32 SIMD |
| Architectural registers, x86-64 | 16 GP + 16 SIMD | Plus rename pressure from FLAGS, segments |
| Tomasulo's algorithm published | 1967 | IBM System/360 Model 91 |
| First mainstream OOO consumer chip | 1995 | Pentium Pro (P6 microarchitecture) |
| Outstanding loads, modern core | ~80–200 | Memory-level parallelism |
mov latency on modern x86 | 0 cycles | Eliminated in rename, no execution unit consumed |
Further reading
- Wikipedia — Tomasulo's algorithm — the 1967 paper's algorithm in modern notation, with worked examples.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Chapter 3 (Instruction-Level Parallelism) is the canonical graduate treatment of dynamic scheduling, ROB design, and renaming.
- Shen & Lipasti — Modern Processor Design. Specifically chapters 4 and 7. The most engineer-facing treatment of OOO design choices.
- Agner Fog — Microarchitecture of Intel, AMD, and VIA CPUs — sections on renaming, ROB, issue queues for every recent x86. Specific µop counts and ROB sizes.
- Chips and Cheese — reverse-engineered ROB and physical register file sizes for every recent CPU release.
- Wikipedia — Out-of-order execution — history and overview.
- Wikipedia — Register renaming — the rename table mechanics in detail.
The ROB starts empty. Each cycle, the front-end issues new µops into the ROB; the back-end starts execution as inputs become available; retirement commits completed µops in program order from the head.