CPU pipelining.

A single instruction takes five cycles to walk through fetch, decode, execute, memory, and writeback. A pipelined CPU keeps all five stages busy at once: five instructions in flight, one finishing per cycle, until a dependency or a branch shows up. Watch a 5-stage pipeline run six instructions, with one RAW hazard saved by forwarding and one branch misprediction that costs two cycles.

Interactive animation (not reproduced here): 6 instructions with 1 RAW dependency and 1 branch (predicted not-taken, actually taken), running on a 5-stage in-order pipeline with full forwarding and branches resolved in EX.

A 5-stage pipeline can have up to 5 instructions in flight, one per stage. At steady state it retires one instruction per cycle, so IPC = 1.

5-stage pipeline: The classic MIPS layout: IF (fetch) → ID (decode + register read) → EX (ALU) → MEM (load/store) → WB (write back to register file). Real CPUs use 10-20+ stages and issue several instructions per cycle.
IPC: Instructions per cycle. With one instruction issued per cycle and no hazards, IPC = 1. Branch mispredictions and stalls push it below 1. Superscalar machines push it above.

What pipelining buys

If a single instruction takes 5 cycles end-to-end, you might think a 200 MHz CPU tops out at 200 M cycles/s ÷ 5 = 40 MIPS. The trick is that the five stages need different parts of the chip: fetch uses the I-cache, decode reads the register file, execute uses the ALU, memory uses the D-cache, and writeback writes the register file. So the stages can all work on different instructions in parallel. Every cycle one instruction enters and another finishes, and throughput is one instruction per cycle once the pipeline fills.
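The arithmetic above can be sketched in a few lines. This is a minimal model, not a simulator: the function names are made up for illustration, and `stall_cycles` is an assumed lump sum standing in for whatever bubbles a real run would incur.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5,
                     stall_cycles: int = 0) -> int:
    """Fill the pipeline once, then retire one instruction per cycle.

    stall_cycles is an assumed total of bubbles (stalls and squashes).
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def mips(clock_mhz: float, n_instructions: int, cycles: int) -> float:
    """Millions of instructions per second for a run of the given length."""
    return clock_mhz * n_instructions / cycles


n = 1_000_000
print(mips(200, n, unpipelined_cycles(n)))  # 40.0
print(mips(200, n, pipelined_cycles(n)))    # just under 200
print(pipelined_cycles(6, stall_cycles=2))  # 12
```

The last line corresponds to the six-instruction run described at the top, under the assumption that all six instructions retire and the two-cycle misprediction is the only penalty: 12 cycles, so IPC = 6 / 12 = 0.5.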

In practice IPC is below 1. Cache misses stall MEM. Dependencies sometimes can't be forwarded. Branches mispredict. The deeper the pipeline, the more a flush costs. The history of CPU design after about 2000 is largely the story of how to hide these stalls.

Why forwarding exists

Without forwarding, every dependent instruction would stall 2-3 cycles waiting for the producer to write back. Real code is full of dependencies (virtually every assembly sequence has them), so the pipeline would spend much of its time idle. The forwarding network is a few muxes and wires inside the CPU that route results from the end of EX or MEM directly into the next EX input. It removes most RAW stalls. The exception is a load followed immediately by a use: the load's value isn't ready until after MEM, so the dependent instruction stalls one cycle. Compilers schedule around this by filling that slot with an unrelated instruction.
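To make the stall counts concrete, here is a rough timing model of RAW hazards on the 5-stage pipeline. It is a sketch under stated assumptions: the `Instr` record and `stall_cycles` function are hypothetical, the no-forwarding case assumes the register file can be written and read in the same cycle (which gives the 2-cycle end of the 2-3 range), and the four-instruction program is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read
    is_load: bool = False     # loads produce their value at the end of MEM


def stall_cycles(program: List[Instr], forwarding: bool) -> int:
    """Count bubbles caused by RAW dependencies in an in-order pipeline."""
    last_writer = {}          # register -> index of its most recent producer
    ex_cycle: List[int] = []  # cycle in which each instruction occupies EX
    for i, ins in enumerate(program):
        # In order: at best one cycle behind the previous instruction.
        earliest = ex_cycle[-1] + 1 if ex_cycle else 0
        for reg in ins.srcs:
            j = last_writer.get(reg)
            if j is None:
                continue
            if not forwarding:
                ready = ex_cycle[j] + 3   # consumer's ID must wait for producer's WB
            elif program[j].is_load:
                ready = ex_cycle[j] + 2   # value forwarded from the end of MEM
            else:
                ready = ex_cycle[j] + 1   # value forwarded from the end of EX
            earliest = max(earliest, ready)
        ex_cycle.append(earliest)
        if ins.dest is not None:
            last_writer[ins.dest] = i
    if not ex_cycle:
        return 0
    return ex_cycle[-1] - ex_cycle[0] - (len(program) - 1)


program = [
    Instr("r1", ("r2", "r3")),            # add r1, r2, r3
    Instr("r4", ("r1", "r5")),            # sub r4, r1, r5   (RAW on r1)
    Instr("r6", ("r7",), is_load=True),   # lw  r6, 0(r7)
    Instr("r8", ("r6", "r4")),            # add r8, r6, r4   (load-use on r6)
]
print(stall_cycles(program, forwarding=True))   # 1
print(stall_cycles(program, forwarding=False))  # 4
```

With forwarding only the load-use pair costs a bubble. Without it, each of the two adjacent dependencies costs two.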

Branches and the prediction problem

A branch resolves in EX in this pipeline, two cycles after it was fetched. If we wait, those two cycles are wasted every time. So the fetch unit predicts: with no information, predict not-taken (keep fetching the fall-through path). Better predictors use a history table indexed by the branch address; they reach 95-99% accuracy on real code. When they're wrong, the two instructions fetched and decoded behind the branch get squashed and we restart from the actual target. Branch penalty = number of stages between fetch and resolve (two here).
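One simple form of a history table indexed by branch address is a table of two-bit saturating counters. The sketch below is illustrative only: the class name, the table size, the indexing scheme (dropping the low two address bits, assuming 4-byte instructions), and the loop-branch trace are all assumptions, not a description of any particular CPU.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, table_bits: int = 10):
        self.mask = (1 << table_bits) - 1
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        self.counters = [1] * (1 << table_bits)   # start weakly not-taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask              # assumes 4-byte instructions

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, trace) -> float:
    """Fraction of branches in the trace that were predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical trace: a loop branch taken 9 times, then falling through,
# with the whole loop executed 10 times.
trace = ([(0x400, True)] * 9 + [(0x400, False)]) * 10
print(accuracy(TwoBitPredictor(), trace))
```

The two-bit hysteresis is the point of the design: a single loop exit weakens the counter but does not flip the prediction, so re-entering the loop is predicted correctly.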

This is why deep pipelines hurt so much when prediction breaks down. The Pentium 4's 31-stage pipeline (in its later Prescott revision) meant a misprediction cost around 30 cycles. Modern designs go shallower and wider and pour transistors into the predictor.
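A back-of-the-envelope model shows how the penalty scales with depth. It assumes a base CPI of 1 and treats branch mispredictions as the only source of lost cycles; the function name and the 20% branch fraction are assumptions chosen for illustration.

```python
def effective_ipc(branch_fraction: float, mispredict_rate: float,
                  penalty_cycles: int) -> float:
    """IPC when mispredicted branches are the only source of bubbles."""
    cpi = 1.0 + branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / cpi


# Compare the 2-cycle penalty of the 5-stage pipeline with a 30-cycle one,
# at the two ends of the 95-99% accuracy range.
for penalty in (2, 30):
    for accuracy in (0.95, 0.99):
        ipc = effective_ipc(0.2, 1.0 - accuracy, penalty)
        print(f"penalty={penalty:2d}  accuracy={accuracy:.0%}  IPC={ipc:.3f}")
```

Under these assumptions, a 5% misprediction rate barely matters at a 2-cycle penalty but removes roughly a quarter of the throughput at 30 cycles, which is the motivation for spending transistors on the predictor.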

What this simplifies

Single-issue, in-order. High-performance CPUs since the mid-1990s are superscalar (issue 2-8 instructions per cycle) and out-of-order (reorder instructions to fill stalls), so IPC can exceed 1.
Full forwarding only. The one hazard shown is fully hidden by forwarding. Some hazards still stall: load-use costs 1 cycle, and multi-cycle ops (divide, FP) stall consumers longer.
Static branch prediction. We "predict not-taken." A real CPU has a two-bit-counter predictor or a TAGE predictor with thousands of history entries.
No memory hierarchy. Every load hits the D-cache in 1 cycle. Real L1 hits are about 4 cycles, L2 ~12, L3 ~40, main memory ~200+.
No exceptions. Page faults, illegal-opcode traps, and interrupts all complicate the pipeline (squash the in-flight instructions, then jump to the handler).
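To put a number on the memory-hierarchy simplification, the latencies quoted above can be combined into an expected load latency. The hit rates below are invented for illustration, the function name is hypothetical, and the quoted latencies are treated as total load-to-use times for a load served by that level.

```python
def average_load_latency(levels, memory_latency: float) -> float:
    """Expected cycles per load.

    levels: list of (local_hit_rate, latency) pairs from L1 outward, where
    local_hit_rate is the fraction of loads reaching that level that hit.
    """
    expected = 0.0
    reach = 1.0   # probability that a load gets this far down the hierarchy
    for hit_rate, latency in levels:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected + reach * memory_latency


# Latencies from the list above; hit rates are assumptions.
levels = [(0.95, 4), (0.80, 12), (0.50, 40)]
print(round(average_load_latency(levels, memory_latency=200), 2))  # 5.48
```

Even with these fairly generous hit rates, the average load takes several times longer than the single cycle assumed in the animation.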

Why this matters when you're writing software

You won't hand-schedule instructions; the compiler does. But you can write code that the compiler can schedule well. Tight inner loops with predictable branches and few false dependencies hit close to peak IPC. Pointer-chasing through a linked list does the opposite: each load's address comes from the previous load, so every cache miss is paid in full, one after another, and the pipeline drains. Branchy code in a hot path slows down even when the predictor mostly wins: each miss costs a full pipeline refill, so on a deep pipeline even a small misprediction rate adds up.

Go deeper

Computer Architecture Codex →

Pipelines, superscalar issue, out-of-order execution, register renaming, branch predictors, and the whole stack of tricks modern CPUs use to keep the silicon busy.
