
Pipelining

Pipelining is the move that turned 1980s CPUs from one-instruction-at-a-time machines into the throughput beasts we have today. The idea is simple: while one instruction is decoding, the next one starts fetching. Each instruction still takes the same five stages to complete, but five instructions can be in flight at once. The result, on paper, is 5× more throughput at the same clock. The reality is more subtle: hazards stall the pipeline, branch mispredicts flush a dozen or more cycles of work, and the deeper the pipeline goes, the more all of this hurts. The arc since the mid-2000s has been about getting the most out of pipelining without overpaying for it.


Start with laundry

The textbook way into pipelining is a load of washing, and it earns its place because it gets the idea across before any silicon is involved. Say you have four loads to do, and the laundry has three steps: wash (30 minutes), dry (40 minutes), fold (20 minutes). The naive way is to run one load all the way through before touching the next. Wash load one, dry load one, fold load one, then start load two. Each load takes 90 minutes, so four loads take six hours. The dryer sits idle while you fold; the washer sits idle while the dryer runs. Most of your equipment is doing nothing most of the time.

The smarter way is obvious once you see it. The moment load one comes out of the washer and goes into the dryer, load two goes straight into the washer. While load two dries, you fold load one and wash load three. Now all three machines are busy at once, each working on a different load. The first load still takes 90 minutes start to finish, but once the line is full you finish a load every 40 minutes, the length of the slowest step. Four loads drop from six hours to about three and a half. You did not buy a faster dryer. You stopped letting your machines idle.

[Figure: timeline of loads 1–4 staggered through wash, dry, and fold. Three machines are busy at once, and one load finishes every "dry" length.]
Staggering the loads keeps all three machines working. A CPU pipeline does the same with the stages of an instruction.

A CPU does exactly this with the steps of an instruction. The machines are the pipeline stages, the loads are instructions, and the slowest stage sets the clock. Everything that follows is this idea plus the complications that show up when one load needs something another load has not finished yet.
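
The laundry arithmetic can be checked with a few lines of Python. This is a minimal sketch, assuming one machine per step, all four loads ready at time zero, and a load allowed to wait in a basket between machines; the function name `pipelined_finish_times` is made up for illustration.

```python
def pipelined_finish_times(stage_minutes, loads):
    """Finish time of each load when every stage is a single machine."""
    free_at = [0] * len(stage_minutes)  # when each machine is next free
    finish = []
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, free_at[s])  # wait for the load AND the machine
            t = start + minutes
            free_at[s] = t
        finish.append(t)
    return finish


STAGES = [30, 40, 20]  # wash, dry, fold (minutes)
LOADS = 4

sequential_total = LOADS * sum(STAGES)
pipelined = pipelined_finish_times(STAGES, LOADS)

print(sequential_total)  # 360 minutes = six hours
print(pipelined)         # [90, 130, 170, 210]
```

After the first load, the finish times are 40 minutes apart, the length of the dry step, and the last load is done at 210 minutes, which is the "about three and a half" hours quoted above.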

Latency stays the same; throughput goes up

Without a pipeline, every instruction takes 5 cycles end to end. The CPU finishes one, then starts the next. With a 5-stage pipeline, every instruction still takes 5 cycles end to end — but the next one can start one cycle later, not five. In steady state the chip retires one instruction per cycle. Single-instruction latency is unchanged. Aggregate throughput is 5× higher.

No pipeline:
                cy 1  cy 2  cy 3  cy 4  cy 5  cy 6  cy 7  cy 8  cy 9 cy 10
instr 1         IF    ID    EX    MEM   WB
instr 2                                       IF    ID    EX    MEM  WB
                ↑ 1 instruction every 5 cycles → 0.2 IPC

Pipelined:
                cy 1  cy 2  cy 3  cy 4  cy 5  cy 6  cy 7  cy 8
instr 1         IF    ID    EX    MEM   WB
instr 2               IF    ID    EX    MEM   WB
instr 3                     IF    ID    EX    MEM   WB
instr 4                           IF    ID    EX    MEM   WB
                                              ↑ 1 instruction every cycle → 1.0 IPC
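
The two timelines generalise to any instruction count. The sketch below assumes an ideal pipeline with no stalls; the function names are illustrative only.

```python
def cycles_unpipelined(n, stages=5):
    """Each instruction runs start to finish before the next begins."""
    return n * stages


def cycles_pipelined(n, stages=5):
    """The first instruction takes `stages` cycles; each later one
    finishes one cycle after its predecessor (no stalls assumed)."""
    return stages + (n - 1) if n > 0 else 0


for n in (1, 4, 1000):
    u, p = cycles_unpipelined(n), cycles_pipelined(n)
    print(n, u, p, round(u / p, 2))
# 1 5 5 1.0
# 4 20 8 2.5
# 1000 5000 1004 4.98
```

The speedup only approaches 5× as the instruction count grows, because the first four cycles are spent filling the pipeline.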

The price you pay: every instruction now lives in five different stages of execution at five different points in time, and any time one needs information another hasn't finished computing, the pipeline stalls. The next few sections are about what happens when that goes wrong.

The five stages, one instruction at a time

The classic RISC pipeline splits an instruction into five stages. They map onto the instruction cycle you already know, just chopped into pieces that can each finish in one clock tick. Walking a single add x10, x1, x2 through them:

  • IF — instruction fetch. Read the instruction from the instruction cache at the address in the program counter, and bump the program counter to point at the next one.
  • ID — instruction decode. Work out what the instruction is, and read its source registers (x1 and x2) out of the register file. On a load-store machine this is also where the immediate is sign-extended.
  • EX — execute. The arithmetic logic unit does the work: add x1 and x2. For a branch this is where the target and the taken/not-taken decision are computed; for a load or store, where the memory address is calculated.
  • MEM — memory access. Read or write the data cache. Pure arithmetic instructions have nothing to do here and pass straight through, which is one reason the stage exists at all — it keeps every instruction the same length so they stay in lockstep.
  • WB — write back. Write the result (the sum) back into the destination register x10, where the next instruction can read it.

Each stage hands its work to the next across a set of pipeline registers — latches that hold the partial result at the clock edge. The clock period has to be long enough for the slowest single stage to finish, which is the whole point: by cutting one long job into five short ones, the slowest piece is far shorter than the whole, so the clock can run much faster. The cost is the latch delay added at every boundary and the bookkeeping needed when instructions step on each other. For the full single-instruction version of this walk, see the instruction cycle page; pipelining is what you get when you run five of those cycles overlapped.
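
To make the clock-period argument concrete, here is a small sketch. The stage delays and the latch overhead are hypothetical numbers chosen for illustration, not measurements of any real chip.

```python
# Hypothetical combinational delay of each stage, in nanoseconds.
STAGE_DELAY_NS = {"IF": 0.8, "ID": 0.6, "EX": 1.0, "MEM": 0.9, "WB": 0.5}
LATCH_NS = 0.1  # assumed pipeline-register overhead per stage boundary

# Without pipelining, one clock must cover the whole instruction.
unpipelined_period = sum(STAGE_DELAY_NS.values())

# With pipelining, one clock must cover the slowest stage plus a latch.
pipelined_period = max(STAGE_DELAY_NS.values()) + LATCH_NS

speedup = unpipelined_period / pipelined_period
print(round(unpipelined_period, 2), round(pipelined_period, 2), round(speedup, 2))
# 3.8 1.1 3.45
```

With these numbers the clock speeds up by about 3.45×, not 5×: the stages are unequal, so the slowest one sets the pace, and every boundary pays the latch delay.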

[Figure: cycles 1–8 across the top; add, addi, sub, and, or each step through IF, ID, EX, MEM, WB, offset by one cycle. The diagonal is the pipeline filling; the vertical slice at any cycle is five instructions in flight.]
Five instructions overlapped. Read down any column to see what the hardware is doing in that cycle: in cycle 5, five different instructions occupy five different stages.

The diagonal shape is the signature of a working pipeline. The first few cycles are the fill, where the pipeline ramps up and isn't yet retiring an instruction every cycle; the last few are the drain, as it empties out. In between, in steady state, one instruction completes per cycle. The longer the run of instructions, the more the fill and drain costs amortise away, which is why pipelining pays off on real programs that execute millions of instructions in a row but does nothing for a single isolated one.
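
The diagonal can be generated rather than drawn. This sketch assumes a hazard-free 5-stage pipeline with one instruction entering per cycle; `pipeline_rows` is an illustrative helper, not part of any library.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(instructions):
    """Return (name, cells) pairs; cells[c] is the stage occupied in
    cycle c+1, or "" if the instruction is not in the pipeline then."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i enters IF in cycle i+1
        rows.append((name, cells))
    return rows


rows = pipeline_rows(["add", "addi", "sub", "and", "or"])
for name, cells in rows:
    print(f"{name:<5}" + "".join(f"{c:<5}" for c in cells))

# Reading down the column for cycle 5 (index 4):
cycle5 = [cells[4] for _, cells in rows]
print(cycle5)  # ['WB', 'MEM', 'EX', 'ID', 'IF']
```

Five instructions need 5 + 4 = 9 cycles in total: four cycles of fill before the first one retires, then one retirement per cycle.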

Pipeline in motion

Click play to step through cycles. Each row is one instruction; each column is one cycle. As cycles advance, instructions move rightward, one stage per cycle. The clean scenario shows the steady-state best case. The hazard scenarios show what stalls look like. The branch-mispredict scenario shows the cost of speculation gone wrong.

instruction       cy 1  cy 2  cy 3  cy 4  cy 5  cy 6  cy 7  cy 8  cy 9  cy 10 cy 11 cy 12
add x10, x1, x2   IF    ID    EX    MEM   WB
addi x11, x10, 4        IF    ID    EX    MEM   WB
sub x12, x3, x4               IF    ID    EX    MEM   WB
and x13, x5, x6                     IF    ID    EX    MEM   WB
or x14, x7, x8                            IF    ID    EX    MEM   WB

Legend: IF, ID, EX, MEM, WB (stages) · bubble / stall · ✗ flushed
Counters: cycle · retired · stall / flush · CPI

Hazards — three kinds

Anything that stops an instruction from advancing on schedule is a hazard. The textbook taxonomy:

  • Structural hazards. Two instructions need the same hardware unit at the same time. Example: a single memory port can't service an IF and a MEM access in the same cycle. Fixed by duplicating the unit (separate I-cache and D-cache, the Harvard split) or by stalling.
  • Data hazards. An instruction needs a value that's still being computed by an earlier in-flight instruction. The classic case is RAW (Read-After-Write): add x10, x1, x2 followed by addi x11, x10, 4. Without forwarding, the second instruction has to wait until the first completes WB before it can read x10.
  • Control hazards. A branch instruction is in flight, and the CPU doesn't yet know whether the branch is taken or not. Two options: stall (waste cycles), or speculate (predict, fetch ahead, recover if wrong).

Modern CPUs handle data hazards with forwarding (also called "bypassing") and control hazards with branch prediction + speculative execution. Both are covered in their own deep dives.
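
A RAW hazard is easy to detect mechanically. The sketch below is a simplified illustration, assuming each instruction has at most one destination register and that only producers within two instructions matter (the no-forwarding case on a 5-stage pipeline whose register file can be written and read in the same cycle). The `Instr` type and `raw_hazards` function are invented for this example.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dest: Optional[str]  # register written, or None (e.g. a store)
    srcs: tuple          # registers read


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) triples for
    read-after-write pairs at most `window` instructions apart."""
    found = []
    for j, consumer in enumerate(program):
        for reg in consumer.srcs:
            # Walk backwards: the nearest earlier writer is the true producer.
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i].dest == reg:
                    found.append((i, j, reg))
                    break
    return found


program = [
    Instr("add", "x10", ("x1", "x2")),
    Instr("addi", "x11", ("x10",)),
    Instr("sub", "x12", ("x3", "x4")),
]
print(raw_hazards(program))  # [(0, 1, 'x10')]
```

Real hazard-detection logic does the same comparison in hardware, matching the source register numbers in ID against the destination register numbers held in the later pipeline registers.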

Forwarding turns RAW into a non-event

The fix for the RAW hazard above is straightforward: the result of the add is available at the output of the EX stage at the end of cycle 3, even though it doesn't formally land in the register file until WB at cycle 5. Wire that EX-output back into the input of the next instruction's EX stage and the next instruction can use the value immediately. This is EX-to-EX forwarding.

There are several forwarding paths in a real chip: EX-to-EX, MEM-to-EX (for the cycle after that), MEM-to-MEM (for store-after-load patterns). Add them all up and most RAW hazards disappear. The cases that remain are load-use hazards — a load followed immediately by a use of the loaded value. The data isn't available until end of MEM, so the consumer's EX has to wait one cycle. This is the one unavoidable single-cycle bubble in classical 5-stage pipelines.

[Figure: cycles 1–8; add x10,… runs IF ID EX MEM WB, addi …,x10 runs IF, stall, stall, ID EX MEM WB, and sub follows behind it. x10 is ready in WB; two bubbles are inserted and every later instruction shifts right by two cycles.]
A bubble. Without forwarding, addi can't read x10 until add writes it back in cycle 5, so two stall cycles are injected and everything behind it slides right.
Try it above: select the data-hazard scenario, toggle forwarding off, and watch the second instruction stall for two cycles. Toggle it back on and the bubbles vanish. Real chips have forwarding on by design — there's no switch. Compilers schedule code knowing this, which is why a load-use sequence is often split apart by the compiler with unrelated work between them.
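
The stall counts in this section follow a simple rule that can be written down directly. This is a sketch for the textbook 5-stage pipeline only, assuming the register file is written in the first half of a cycle and read in the second half; `stall_cycles` is an illustrative name.

```python
def stall_cycles(distance, producer_is_load, forwarding):
    """Bubbles a consumer needs when it reads a register written by the
    instruction `distance` slots earlier (1 = the very next instruction)."""
    if forwarding:
        # ALU results are forwarded from the end of EX; load data only
        # exists at the end of MEM, one stage later.
        needed_gap = 2 if producer_is_load else 1
    else:
        # The consumer's ID must line up with (or follow) the producer's WB.
        needed_gap = 3
    return max(0, needed_gap - distance)


print(stall_cycles(1, producer_is_load=False, forwarding=False))  # 2
print(stall_cycles(1, producer_is_load=False, forwarding=True))   # 0
print(stall_cycles(1, producer_is_load=True, forwarding=True))    # 1
print(stall_cycles(2, producer_is_load=True, forwarding=True))    # 0
```

The last line is the case a compiler aims for: one unrelated instruction between the load and its use hides the load-use bubble completely.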

What a stall actually does to the hardware

A bubble is not the pipeline pausing. The hardware never stops clocking. A stall is the control logic deciding that the stalled instruction and everything behind it should hold their position for a cycle, while a nop — an instruction that does nothing — flows forward into the stages ahead. That nop is the bubble. It occupies EX, then MEM, then WB on successive cycles, doing no useful work, while the real instruction waits one stage back. From the outside it looks like the pipeline froze; inside, it kept running and simply pushed a hole through.

This matters because the cost of a stall is measured in cycles that retire nothing, not in any pause you could observe from outside. The metric is cycles per instruction, CPI. A perfect pipeline retires one instruction per cycle, CPI of 1.0. Every bubble pushes CPI above 1.0 because a cycle went by with nothing retired. The interactive grid above tracks this: watch CPI climb the moment a stall or flush appears, then settle back toward 1.0 as the pipeline recovers and keeps retiring. A single load-use bubble in a tight loop that runs a billion times is a billion wasted cycles, which is why this small detail is worth a compiler's attention.
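
As a worked example of the CPI arithmetic, the sketch below counts cycles for a loop with one load-use bubble per iteration. The four-instruction loop body is a hypothetical figure chosen for illustration.

```python
def cpi(instructions_retired, bubble_cycles, fill_cycles=4):
    """Cycles per instruction: every retired instruction costs one cycle,
    every bubble costs one more, plus the one-off pipeline fill."""
    cycles = instructions_retired + bubble_cycles + fill_cycles
    return cycles / instructions_retired


ITERATIONS = 1_000_000_000
BODY = 4  # assumed instructions per loop iteration

clean = cpi(ITERATIONS * BODY, bubble_cycles=0)
with_bubble = cpi(ITERATIONS * BODY, bubble_cycles=ITERATIONS)

print(round(clean, 3))        # 1.0
print(round(with_bubble, 3))  # 1.25
```

One bubble in every four-instruction iteration lifts CPI from 1.0 to 1.25, which means the loop takes 25% more cycles for the same work.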

Branch mispredicts hurt — a lot

A branch's outcome isn't known until the branch instruction reaches EX (cycle 3 in the textbook 5-stage pipeline; cycle 12 or so on a deep modern pipeline). If the CPU sat idle waiting, every branch would cost ~3 cycles in the textbook model and ~14 cycles on a modern one. Instead, modern CPUs predict the outcome and speculatively fetch down the predicted path. When the prediction is correct (most of the time), the speculation is free. When it's wrong, every speculatively fetched instruction has to be flushed and the pipeline restarted at the correct target.

On the textbook pipeline, that's 3 instructions flushed and a 3-cycle bubble. On Skylake (14 stages), it's ~16 cycles flushed. On Pentium 4 Prescott (31 stages), it was ~30 cycles per mispredict, so even a 2% mispredict rate added about 0.6 cycles to every branch on average (0.02 × 30). The mispredict penalty is one of the biggest reasons deep pipelines fell out of fashion in 2006. How modern predictors push accuracy past 97% to keep that penalty rare is the subject of the branch prediction page.

Branch mispredict cost:
penalty = pipeline depth from fetch to branch resolution
        ≈ 3 cycles (5-stage)
        ≈ 16 cycles (Skylake)
        ≈ 30 cycles (Pentium 4 Prescott)

If predictor accuracy is p and mispredict penalty is m cycles:
average overhead per branch ≈ (1 − p) × m

20-stage pipeline, 95% accuracy:
  overhead ≈ 0.05 × 20 = 1 cycle/branch
  Branches are ~15% of instructions → 0.15 × 1 = 0.15 extra cycles per instruction
  (CPI 1.0 → 1.15, roughly 13% less throughput).
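
The same formula as a few lines of Python, so other depths and accuracies can be plugged in. This is a sketch that assumes a base CPI of 1.0 and that the penalty is the same for every mispredict; the function names are illustrative.

```python
def mispredict_overhead_per_branch(accuracy, penalty_cycles):
    """Average extra cycles per branch: (1 - p) * m."""
    return (1.0 - accuracy) * penalty_cycles


def cpi_with_branches(base_cpi, branch_fraction, accuracy, penalty_cycles):
    """Base CPI plus the mispredict overhead spread over all instructions."""
    return base_cpi + branch_fraction * mispredict_overhead_per_branch(
        accuracy, penalty_cycles
    )


# 20-stage pipeline, 95% accuracy, 15% branches (the example above).
overhead = mispredict_overhead_per_branch(0.95, 20)
cpi = cpi_with_branches(1.0, 0.15, 0.95, 20)
print(round(overhead, 3))   # 1.0 cycle per branch
print(round(cpi, 3))        # 1.15
print(round(1.0 / cpi, 2))  # 0.87 of the ideal throughput

# Prescott-like case: 98% accuracy, ~30-cycle penalty.
print(round(mispredict_overhead_per_branch(0.98, 30), 3))  # 0.6
```

Raising accuracy from 95% to 99% on the same 20-cycle penalty cuts the overhead from 1.0 to 0.2 cycles per branch, which is why predictor accuracy matters more as pipelines get deeper.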

How deep modern pipelines actually go

The textbook 5-stage pipeline is a teaching device. Real pipelines are deeper to let the chip run at higher clocks (more stages = less work per stage = less combinational delay = higher achievable frequency). The ceiling is set by the mispredict penalty and the cost of forwarding wires across many stages.

Chip / model                        Approx. depth   Frequency target
Textbook RISC                       5 stages        n/a
Pentium (P5, 1993)                  5 stages        ~60–200 MHz
Pentium 4 Northwood (2002)          20 stages       ~3 GHz
Pentium 4 Prescott (2004)           31 stages       ~3.8 GHz
Intel Core (Conroe, 2006)           14 stages       ~2.4 GHz
Intel Skylake / Raptor Lake         14 stages       ~5+ GHz boost
AMD Zen 4 / Zen 5                   19 stages       ~5 GHz
Apple M1 / M3 / M4 (P-core, int)    7 stages        ~3.5–4.5 GHz
IBM POWER10                         21 stages       ~4 GHz

The Pentium 4 push to 31 stages (Prescott, 2004) was the high-water mark of "go deep, push frequency". Intel retreated to 14 stages with Core (2006) and has stayed there ever since. Apple's silicon is unusual at the other extreme — short integer pipelines (~7 stages) paired with huge reorder buffers and very wide decode. Different architectures, same compromise: figure out where in the depth-vs-mispredict-penalty trade you want to live.

Beyond the textbook 5

Real chips don't have a clean 5-stage IF/ID/EX/MEM/WB pipeline. They have something more like:

  • Front-end: branch prediction, fetch (often 2–3 cycles), instruction queue, decode (1–4 cycles for x86 because of variable-length parsing).
  • Rename / dispatch: µops are renamed onto physical registers and dispatched into the issue queue.
  • Out-of-order execute: µops wait until their inputs are ready, then any free execution unit can pick them up. This is its own deep dive.
  • Memory: loads and stores have their own pipeline, often 4–6 cycles to L1, more to L2.
  • Retire: µops commit to architectural state in program order. The reorder buffer that tracks them holds around 512 entries on Intel Raptor Lake and roughly 600 on Apple M4.

The IF/ID/EX/MEM/WB names live on as a teaching aid and as the rough shape of RISC-V's reference implementation. The principles — pipelining, hazards, forwarding, branch prediction — all transfer directly to the modern designs.

The next step: superscalar

A single pipeline tops out at one instruction retired per cycle. That was the ceiling until designers asked the obvious question: why have one pipeline? If you build two fetch slots, two decoders, and more than one execution unit, you can start two instructions in the same cycle and, in the best case, retire two per cycle. This is a superscalar processor, and it is why a CPU with no clock-speed advantage can still do far more work per cycle. The Pentium in 1993 was the first mainstream superscalar x86, with two integer pipelines. A modern core fetches and decodes four to eight instructions per cycle and has a dozen or more execution units behind them.

[Figure: two-wide pipeline over cycles 1–5, with two instructions in each of IF, ID, EX, MEM, WB per cycle, giving up to 2 retired per cycle (IPC 2.0). Depth raises clock, width raises IPC; the two are independent levers. Pipelining gets you to IPC 1; superscalar gets you past it.]
A two-wide superscalar. Pipelining and width are separate dials: depth chases a higher clock, width chases more instructions per cycle.

Going wide multiplies the hazard problem. If two instructions issue together and the second reads a register the first writes, you cannot just forward — they are in the same stage at the same time. And real code has long stretches where the next instruction depends on the previous one, so a wide machine that only ever issues consecutive instructions in program order would stall constantly and rarely fill its slots. The fix is to stop insisting on program order.
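
The pairing problem can be sketched in a few lines. This is a deliberately simplified model of an in-order two-wide machine: it pairs only adjacent instructions, checks only register dependences, and ignores every other source of stalls. `Instr`, `independent`, and `issue_cycles_two_wide` are names invented for this example.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: tuple


def independent(first, second):
    """True if `second` neither reads nor overwrites what `first` writes."""
    if first.dest is None:
        return True
    return first.dest not in second.srcs and first.dest != second.dest


def issue_cycles_two_wide(program):
    """Issue cycles when two adjacent instructions may go together
    only if they are independent."""
    cycles, i = 0, 0
    while i < len(program):
        if i + 1 < len(program) and independent(program[i], program[i + 1]):
            i += 2  # both slots filled this cycle
        else:
            i += 1  # second slot goes empty
        cycles += 1
    return cycles


chain = [
    Instr("add", "x10", ("x1", "x2")),
    Instr("addi", "x11", ("x10",)),
    Instr("addi", "x12", ("x11",)),
    Instr("addi", "x13", ("x12",)),
]
parallel = [
    Instr("add", "x10", ("x1", "x2")),
    Instr("sub", "x11", ("x3", "x4")),
    Instr("and", "x12", ("x5", "x6")),
    Instr("or", "x13", ("x7", "x8")),
]
print(issue_cycles_two_wide(chain))     # 4
print(issue_cycles_two_wide(parallel))  # 2
```

The dependent chain gets no benefit from the second slot at all, which is the behaviour that motivates looking beyond program order.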

And the step after that: out-of-order

An out-of-order core decodes instructions in program order, then lets them execute in whatever order their inputs become ready, and finally retires them back in program order so the visible state stays correct. If one instruction is stalled waiting on a slow load from memory, the core reaches past it and runs later independent instructions that are ready now, filling the execution units that pipelining and superscalar width left idle. A large structure called the reorder buffer holds instructions in flight — 300 to 600 on current chips — so the machine can look a long way ahead for work to do.

Out-of-order is what makes a wide superscalar pipeline actually wide in practice rather than just on paper. It depends on everything on this page: it needs forwarding so results move between units without round-tripping through the register file, it needs aggressive branch prediction so there is a deep, mostly-correct stream of speculative instructions to pick from, and it needs the basic pipeline so each unit stays busy cycle to cycle. The full mechanism — register renaming, reservation stations, the reorder buffer, and how retirement makes speculation safe — is its own deep dive on the out-of-order execution page. Branch prediction, which feeds the whole speculative front end, gets its own treatment on the branch prediction page. Pipelining is the foundation all of it sits on.
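
A toy model shows what reaching past a stalled instruction buys. It is a sketch under heavy assumptions: one instruction fetched per cycle, unlimited execution units, no reorder-buffer limit, and retirement not modelled. The latencies and the `schedule` helper are hypothetical.

```python
def schedule(program, out_of_order):
    """program: list of (latency, deps), deps being indices of earlier
    instructions. Returns a (start, finish) cycle pair per instruction."""
    times = []
    for i, (latency, deps) in enumerate(program):
        # Inputs are ready once every producer has finished.
        ready = max((times[d][1] for d in deps), default=0)
        if out_of_order:
            earliest = i  # limited only by in-order fetch
        else:
            # In order: may not start before the previous instruction has.
            earliest = times[i - 1][0] + 1 if i > 0 else 0
        start = max(ready, earliest)
        times.append((start, start + latency))
    return times


program = [
    (10, []),   # load x5, slow (assumed 10-cycle latency)
    (1, [0]),   # add that needs the loaded value
    (1, []),    # independent sub
    (1, []),    # independent and
]

in_order = schedule(program, out_of_order=False)
ooo = schedule(program, out_of_order=True)
print(in_order)  # [(0, 10), (10, 11), (11, 12), (12, 13)]
print(ooo)       # [(0, 10), (10, 11), (2, 3), (3, 4)]
```

In the in-order run the two independent instructions queue behind the stalled add; in the out-of-order run they finish while the load is still outstanding, and the whole sequence completes at cycle 11 instead of 13.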

Throughput vs latency duality

A pipelined CPU doesn't make a single instruction faster. It makes the next instruction start sooner. This duality is everywhere in computer systems:

  • An assembly line builds a car in 8 hours, but completes one car per minute. The single-car latency is high; the throughput is high too.
  • A network connection has 50 ms of round-trip latency but 1 GB/s of throughput. The bandwidth-delay product (50 MB) is the amount in flight at any instant.
  • A pipelined L2 cache has 14-cycle latency but 1 access per cycle of throughput. Multiple loads can be in flight simultaneously.

The pattern: pipelining leaves single-task latency where it was (or slightly worse) and raises aggregate throughput. Whenever you see one of these systems, asking "throughput or latency?" is the right first question.
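
All three examples obey the same relation: the amount in flight is throughput multiplied by latency. A short sketch using the numbers quoted above (units must match within each call; `in_flight` is an illustrative name):

```python
def in_flight(throughput, latency):
    """Items in the pipeline at once = throughput x latency."""
    return throughput * latency


cars = in_flight(throughput=1, latency=8 * 60)         # cars/minute x minutes
net_bytes = in_flight(throughput=1e9, latency=0.050)   # bytes/second x seconds
l2_loads = in_flight(throughput=1, latency=14)         # accesses/cycle x cycles

print(cars)                   # 480 cars on the line
print(round(net_bytes / 1e6)) # 50 (MB in flight)
print(l2_loads)               # 14 loads in flight
```

Keeping a pipeline full means having that many independent items available, which is the same reason a deep CPU pipeline needs a long run of instructions to pay off.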

Common misconceptions

  • "Pipelining makes individual instructions faster." It doesn't. Each instruction still takes 5 (or 14, or 20) cycles end to end. Pipelining only helps when there are many instructions to overlap. A single isolated instruction sees no benefit.
  • "Modern CPUs run at 1 IPC because of the 5-stage pipeline." Modern CPUs can retire 4–8 instructions per cycle at peak because they're superscalar; sustained IPC on real code is lower and depends on the workload. Pipelining gets you to 1 IPC; superscalar issue gets you the rest.
  • "Deeper pipelines are always better for frequency." Up to a point. The Pentium 4 Prescott showed where the point ends — 31 stages, 3.8 GHz, dismal IPC because every mispredict cost the full depth. The market chose 14-stage Core at 2.4 GHz instead.
  • "All hazards are bad." Some compiler optimizations deliberately keep values in registers and emit back-to-back dependent instructions (RAW patterns), because forwarding makes them free and the alternative is more memory traffic. The cost of a hazard depends on whether forwarding handles it.

Numbers worth remembering

Quantity                               Value              Notes
Textbook RISC pipeline depth           5 stages           IF / ID / EX / MEM / WB
Apple M4 P-core integer pipeline       ~7 stages          Short pipeline + huge ROB + 8-wide decode
Intel Raptor Lake pipeline             ~14 stages         Mainstream "wide-and-medium-deep"
AMD Zen 5 pipeline                     ~19 stages         Slightly deeper to chase frequency
Pentium 4 Prescott pipeline            31 stages          Frequency-first, IPC-last
Branch mispredict penalty (Skylake)    ~16 cycles         Roughly the depth of the front-end
Branch frequency in typical code       ~10–20%            ~1 in 5–10 instructions is a branch
Modern branch predictor accuracy       97–99%             TAGE on Intel, perceptron on Apple/AMD
Load-use bubble (textbook)             1 cycle            The one unavoidable RAW with forwarding
Modern reorder-buffer size             ~320–600 entries   Zen 5 ~320, Raptor Lake ~512, M4 ~600

Further reading

  • Patterson & Hennessy — Computer Organization and Design (RISC-V Edition). Chapter 4 builds the pipelined RISC-V CPU step by step, with hazard analysis and forwarding diagrams.
  • Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Chapter 3 (Instruction-Level Parallelism) is the graduate-level treatment, including dynamic scheduling and Tomasulo's algorithm.
  • Wikipedia — Instruction pipelining — fast lookup for the standard hazard taxonomy.
  • Wikipedia — Operand forwarding — the bypass paths spelled out diagram by diagram.
  • Agner Fog — Microarchitecture of Intel, AMD, and VIA CPUs — every modern x86 pipeline documented in detail, including branch prediction internals.
  • Chips and Cheese — measured pipeline depths and mispredict penalties on every recent CPU release.