06 / 15
Internals / 06

Out-of-order execution

The CPU runs your instructions in whatever order it likes, as long as the visible result is the same as if it ran them in program order. Independent instructions execute in parallel. Memory loads start early so the data is ready when the consuming instruction shows up. A pile of bookkeeping — reorder buffer, register renaming, reservation stations — keeps the architectural state consistent. The technique was published in 1967 (Tomasulo, IBM) and shipped to consumers in 1995 (Pentium Pro). Almost every CPU since runs out of order.


The problem in-order can't solve

Consider this RISC-V sequence:

load  x10, [x1]      // L1 hit, ~5 cycles
add   x11, x10, x2   // depends on x10 — must wait for load
add   x12, x3, x4    // independent of everything above
sub   x13, x5, x6    // also independent
add   x14, x12, x13  // depends on x12 and x13 from above

A strict in-order machine would stall after the load: it can't fetch the next instruction until the previous one finishes (or close to it). Even with pipelining, the dependent add stalls and freezes the pipeline behind it. The two independent add / sub instructions sit waiting for nothing useful.

An out-of-order machine sees that add x12 and sub x13 don't depend on the in-flight load. It launches them in parallel while the load works through L1. By the time the load returns, add x12 and sub x13 are already done, and the dependent add x14 can run immediately. Five instructions complete in the time an in-order machine would have run two.

That is the whole pitch for out-of-order execution, and it follows directly from where the time actually goes. An L1 hit is a handful of cycles; an L2 hit is a dozen or two; a trip to DRAM is a few hundred. An in-order pipeline that hits one of those misses simply freezes, and every instruction queued behind it freezes too, even the ones that have nothing to do with the stalled load. Out-of-order execution keeps the machine busy by letting the work that is ready run during that wait. The deeper question is not "how do we run instructions in a different order" but "how do we do that and still get exactly the result the program asked for." The rest of this page is the answer to the second half.

Three stages: issue in order, execute out of order, retire in order

The trick that makes this safe is a split. The pipeline reads instructions in program order, lets them execute whenever their inputs are ready, then commits their results back in program order. Three distinct boundaries, three distinct orderings. The front-end is in order so the machine always knows the original program sequence. The execution core is out of order so it can fill the gaps left by stalls. Retirement is in order so the architectural state the software sees only ever moves forward exactly as written.

ISSUE · in orderEXECUTE · out of orderRETIRE · in orderload x10,[x1]add x11,x10,x2add x12,x3,x4sub x13,x5,x6add x12 readysub x13 readyload x10 longadd x11 waitsload x10add x11add x12sub x13order is scrambled in the middle and restored at the end
Same four instructions, three orderings. Execution runs the ready work first; retirement puts the program back together exactly as written.

This is the cleanest way to hold the whole machine in your head. The pipeline you already know handles fetch, decode, and the stages of a single instruction; out-of-order execution wraps a scheduler around the middle of that pipeline so the stages can be filled by whichever instruction is ready, not whichever one happens to be next. Issue width, the number of instructions the front-end pushes in per cycle, and retire width, the number it commits per cycle, are the two ends; the out-of-order window in between is what hides the stalls.

Register renaming

RISC-V exposes 32 architectural registers (x0x31). x86-64 exposes 16. Out-of-order execution needs many more — without the rename, two unrelated instructions that happen to write the same architectural register create a false dependency they don't deserve. The fix: map architectural registers to a much larger pool of physical registers — typically 200–600 on modern cores. Every write allocates a fresh physical register; reads use the most recent mapping for that architectural name.

There are three kinds of dependency to keep straight. A true (read-after-write) dependency is real: an instruction reads a value another one produced, so it has to wait. The other two are accidents of register reuse. A write-after-read (WAR) hazard happens when an instruction writes a register that an earlier instruction still needs to read; a write-after-write (WAW) hazard happens when two instructions write the same register and the later write must not clobber the earlier one out of order. Neither carries any actual data between the instructions. They exist only because the architecture has so few register names that the compiler had to reuse them. Renaming makes them vanish: give each write its own physical register and the two instructions stop colliding.

before renaming — both write x5add x5, x1, x2mul x5, x3, x4WAWfalse depforced to run in order — even though they share no dataafter renaming — separate physical registersadd p41, p1, p2mul p42, p3, p4independent →parallel
x5 written twice is a false dependency. Two physical registers (p41, p42) make the instructions independent — they can now execute in either order.

Step through this sequence and watch the rename table evolve:

0 / 5
1. add x10, x1, x2
2. add x10, x10, x3
3. add x11, x4, x5
4. add x10, x10, x6
5. mov x12, x10
rename table (architectural → physical)
x0 p0
x1 p1
x2 p2
x3 p3
x4 p4
x5 p5
x6 p6
Notice instruction #5mov x12, x10. It allocates no execution unit; it just remaps x12 to point at the same physical register as x10. This is what "mov is free" means on modern x86. Same for add x12, x12, 0 when 0 is recognized as a constant — Apple silicon and Intel Sapphire Rapids both do these optimisations in the rename stage with zero downstream cost.

Reorder buffer in motion

The reorder buffer (ROB) is the bookkeeping device that makes OOO safe. Every µop gets an entry on issue. The entries are filled in program order — like a queue. Execution happens whenever each µop's inputs are ready, in any order. Retirement commits each µop to architectural state, but only in program order: the head of the ROB retires when its execution is done, even if µops behind it finished long ago.

cycle 0
ROB · in program order
empty — click step to begin
commentary

The ROB starts empty. Each cycle, the front-end issues new µops into the ROB; the back-end starts execution as inputs become available; retirement commits completed µops in program order from the head.

wait — inputs not ready exec — issued to ALU/load unit done — result available, awaiting retire retired — committed to architectural state

Tomasulo's algorithm — the original idea

Robert Tomasulo published the algorithm in 1967 for the IBM System/360 Model 91. The core ideas — reservation stations, register renaming via the Common Data Bus, out-of-order issue — are still what every modern CPU uses, with implementation updates. The four parts:

  • Reservation stations hold µops waiting for operands. Each station has slots for the operation, the operands (or tags pointing to where they'll come from), and a "ready" flag.
  • Register renaming happens at issue: instead of the architectural register number, the µop carries a physical register tag. Two µops writing the same architectural register get different tags, breaking false dependencies.
  • Common Data Bus (CDB). When a functional unit produces a result, it broadcasts the (tag, value) on the CDB. Every reservation station snoops, and any whose operands match the tag mark themselves ready.
  • Reorder buffer (added later — not in Tomasulo's 1967 paper, but standard since the 1980s). Holds in-flight µops in program order so retirement can commit them deterministically. Required for precise interrupts and exception handling.

Modern variants distribute the reservation stations into per-unit issue queues, use a separate physical register file (the RAT — Register Alias Table — instead of CDB tags), and integrate retirement with the ROB. The principles are unchanged.

In-order retirement is mandatory

Out-of-order execution is fine. Out-of-order retirement would be a disaster. Without in-order retirement:

  • Exceptions become imprecise. A page fault at instruction N+5 might commit before instruction N's exception, so the OS can't tell which instruction faulted. Modern OSes need precise interrupts to debug, restart, and emulate.
  • Branch mispredict recovery becomes impossible. If µops downstream of a branch already committed when the branch turns out to be wrong, you can't roll back.
  • Interrupts can't restart cleanly. An I/O interrupt mid-program needs to know exactly what state the CPU was in.

The ROB's job is to buffer out-of-order results until they can retire in program order, deterministically. Speculative results live in the physical register file; only when a µop retires does its result become architecturally visible. If a branch turns out mispredicted, every younger ROB entry is squashed, the rename table is rolled back, and execution resumes from the correct path.

Memory is the hard part: the load/store queue

Renaming handles registers cleanly because the names are explicit in the instruction. Memory is harder. A store writes to an address that is only known once its operands have been computed, and a later load may or may not read the same address. The core wants to run loads early, because loads are the long-latency operations whose results everyone downstream is waiting on. But it cannot let a load read stale data from before a store that was supposed to write the same location. Sorting out which loads and stores actually overlap is called memory disambiguation, and it is handled by the load/store queue: two ordered structures that track every in-flight memory access with its address and, for stores, its data.

When a load is ready to execute, the hardware checks the store queue for any older store to the same address. If one exists and its data is available, the load takes its value straight from the store queue rather than from cache, a fast path called store-to-load forwarding. If an older store's address is not yet known, the core has a choice: stall the load until the address resolves, which is safe but slow, or predict that the store and load do not overlap and let the load go. A memory dependence predictor makes that call. When it guesses wrong, the load and everything that consumed its value get squashed and replayed, the same recovery machinery used for branch mispredicts.

store queue (older → newer)st [0x40], 99st [0x80], 7st [0xC0], 3ld x9, [0x80]address match → forward 7ld x8, [0x200]no match → read cacheL1$a load reads from the store queue when an older store hits the same address
The load to 0x80 forwards directly from the matching store; the load to 0x200 misses the queue and goes to cache. Getting this wrong means a squash.

This is also where memory ordering rules bite. x86 has a relatively strong model, so the load/store queue has to preserve more of the program's apparent order; ARM and Apple silicon allow weaker ordering, which lets loads and stores reorder more freely and is part of why those cores extract more memory-level parallelism. The queue is the structure that enforces whatever ordering the architecture promises while still running as much memory traffic in parallel as it can get away with.

Speculation and the precise state

The out-of-order window only pays off if it stays full, and it cannot stay full if the core stops at every branch to wait for the outcome. So it does not wait. The branch predictor guesses which way each branch will go, and the front-end keeps fetching and issuing down the predicted path. Everything issued past an unresolved branch is speculative: it executes, it produces results, but those results live only in physical registers and the store queue, never in architectural state, until the branch is confirmed.

When the branch resolves correctly, the speculative results are already computed and retirement just walks past them in order, no penalty. When it resolves wrong, the core has to undo everything younger than the branch. Because nothing speculative ever retired, the undo is clean: squash every younger ROB entry, restore the rename table to its state at the branch, drop the bad store-queue entries, and restart the front-end at the correct target. The architectural state that survives is exactly the state after the branch and nothing more. That is what a precise machine means, and it is why the same mechanism handles exceptions: a page fault or divide-by-zero is detected during execution but does not take effect until that instruction reaches the head of the ROB and tries to retire, at which point everything after it is squashed and control transfers with a precise picture of where the program was.

Why this hides cache-miss latency. A load that misses to DRAM takes a few hundred cycles. With a 500-entry ROB the core can keep issuing and executing independent work the whole time the miss is outstanding, including past several predicted branches. Multiple misses can be in flight at once. The visible cost of the miss shrinks to whatever work the core could not find to overlap with it. This is the single biggest reason a modern core sustains several instructions per cycle on code that an in-order machine would crawl through.

The security fallout: Spectre and Meltdown

Speculation was assumed to be invisible because squashed instructions never touch architectural state. In 2018, Spectre and Meltdown showed that assumption was wrong. Speculative instructions do leave one trace: they change the cache. A load that ran speculatively, even one that was later squashed, can pull a line into the cache, and an attacker who times a later access can tell whether that line is present. The architectural result was rolled back; the microarchitectural side effect was not.

Meltdown used out-of-order execution to read kernel memory the process was not allowed to see: the permission check and the dependent load were both in flight, and the load's result leaked into the cache before the fault retired and squashed it. Spectre is broader. It trains the branch predictor to mis-speculate into a gadget that reads secret data and then makes a cache access keyed on that secret, leaking it through timing even though the speculative path was thrown away. These are not bugs in any one chip; they are consequences of speculation itself, which is why mitigations have been costly. Fixes range from microcode and barrier instructions (lfence to stop speculation at a sensitive point) to kernel page-table isolation to compilers that insert speculation guards. Newer cores add hardware defenses, but the lesson stands: a side channel can leak data that the architecture promised was never exposed. When you reason about out-of-order performance, remember it has a security surface too, and that the cost of a hot path measured with a tool like top-down analysis may include mitigation overhead the hardware is paying on your behalf.

ROB size over the years

The ROB is the first-order knob for how much instruction-level parallelism a CPU can extract. A bigger ROB means more µops in flight, more chances for OOO to find independent work, more memory accesses outstanding. The growth has been steady:

ChipROB sizeNotes
Pentium III (1999)406-wide retire, ~3 IPC peak
Core 2 Duo (2006)96Conroe — 4-wide decode, 4-wide retire
Sandy Bridge (2011)168First major ROB-size jump in years
Skylake (2015)224TAGE branch predictor, deeper machine
Apple M1 (2020)630Massive ROB enabled the 8-wide front-end
AMD Zen 4 (2022)320Bigger than Skylake, smaller than Apple
Intel Raptor Lake (2022)512Catching up to Apple territory
AMD Zen 5 (2024)448Wider issue, similar capacity
Apple M3 / M4 (2023–24)670The current high-water mark

Apple's 600+ ROB is the most aggressive in industry. It's enabled by Apple's wider decode (8 instructions/cycle from a single L1 instruction cache), much larger physical register file, and more permissive memory ordering. The cost: more silicon, more verification, more power. The gain: dramatically more memory-level parallelism, which on workloads bounded by L3/DRAM latency translates directly to more throughput.

What's in flight on a busy core

On a modern Intel or Apple core executing well-tuned code, you'll see roughly:

  • ~250–500 µops in the ROB
  • ~80–200 outstanding loads and stores in the load/store queue
  • ~10–20 in-flight cache misses to L2 or beyond (memory-level parallelism)
  • ~150–300 physical registers carrying speculative values
  • 4–8 µops being retired per cycle
  • 4–8 µops being dispatched per cycle

This is what "instruction-level parallelism" looks like in 2026. It's a far cry from 1995's "1 instruction per cycle" Pentium. Almost the entire gap between modern and 1990s-era IPC came from making the OOO machinery wider and deeper.

Limits of OOO

OOO can't help when the program offers no independent work. The classic case: pointer-chasing. x = list->next; x = x->next; — each load depends on the previous load. The ROB fills with µops that all wait, the OOO core extracts no parallelism, the chip runs at the speed of the memory latency. Linked-list traversal of a structure too big for L1 is essentially a DRAM benchmark; OOO doesn't help.

OOO also can't help when control flow is unpredictable enough that branch mispredicts dominate. Every mispredict throws away tens of µops and forces a restart. A workload with 5% mispredict rate and 16-cycle penalty pays ~0.8 cycles per instruction in pure mispredict overhead, before any other cost. This is why modern CPUs invest as much silicon in branch prediction as in execution units.

Why this matters in code you write: if a hot loop is bounded by a long dependency chain, OOO is useless and adding more ALUs won't help. The fix is algorithmic — break the chain. Common transformations: tree reduction instead of linear accumulation, two interleaved sums in one loop instead of one, SIMD vectorization that exposes data-parallelism the OOO engine can pick up.

Common misconceptions

  • "OOO means non-deterministic results." Architecturally, no — every retirement is in program order, every visible state matches what an in-order machine would produce. The microarchitecture is non-deterministic about timing, which leaks through side channels (Spectre, MDS) — but never through architectural state.
  • "Modern x86 is RISC because of the µops." The internal µops are RISC-shaped, but the architectural state (16 registers, complex addressing modes, microcode) is still CISC. Software-visible behaviour matches the x86 manual; the µops are an implementation detail.
  • "More physical registers always help." Past a point, the rename table itself slows down — large CAMs (content-addressable memories) hurt cycle time. Apple, Intel, and AMD have all settled near 200–600 because that's where the curve starts to bend.
  • "OOO and SMT are independent." Related. SMT (hyperthreading) shares the OOO core between two thread contexts. The ROB, register file, and execution units are partitioned dynamically. SMT helps when one thread stalls (memory miss); the other thread fills the slack with its own µops.

Numbers worth remembering

QuantityValueNotes
Modern ROB size, mainstream320–670 entriesApple ~670, Intel ~512, AMD ~448
Modern physical register file~200–500 entriesRoughly 60–80% of ROB size
Modern issue width (µops/cycle)4–8Apple 8-wide, Intel/AMD 4–6 wide
Modern retirement width (µops/cycle)4–8Same as issue, deeper pipelines have wider retire
Architectural registers, RISC-V / ARM6432Plus 32 FP, 32 SIMD
Architectural registers, x86-6416 GP + 16 SIMDPlus rename pressure from FLAGS, segments
Tomasulo's algorithm published1967IBM System/360 Model 91
First mainstream OOO consumer chip1995Pentium Pro (P6 microarchitecture)
Outstanding loads, modern core~80–200Memory-level parallelism
mov latency on modern x860 cyclesEliminated in rename, no execution unit consumed

Further reading

Found this useful?