
The instruction cycle

The basic loop every CPU runs forever: fetch an instruction from memory, decode it into control signals, execute it in the ALU, maybe touch memory, write the result back. Five stages, named in textbooks IF / ID / EX / MEM / WB, and the reference model for RISC cores since the early 1980s, even though real cores vary the depth. This page traces a four-instruction RISC-V program through those stages one cycle at a time, then shows how x86 decomposes complex instructions into the same RISC-shaped µops underneath.


Five stages

RISC-V, MIPS, ARM, and pretty much every other RISC ISA are taught with the same five-stage textbook pipeline. The names matter; you'll see them on microarchitecture diagrams for the rest of your career.

Stage | What happens
IF · Instruction Fetch | Read the 32-bit instruction word at PC from instruction memory; PC ← PC + 4.
ID · Instruction Decode | Split the word into opcode, source registers, destination, immediate. Read source registers from the register file.
EX · Execute | The ALU performs the operation. Branches resolve. Address arithmetic for loads/stores happens here.
MEM · Memory access | For loads, read from data memory. For stores, write. ALU and branch instructions do nothing here.
WB · Writeback | Write the result back into the destination register.

In a single-cycle CPU, all five happen in one clock period. In a pipelined CPU (every modern one), each stage runs every cycle, on a different instruction — covered in the pipelining deep dive. This page focuses on what one instruction does as it crosses the five stages.

The datapath: where the bits actually move

Before walking the stages one by one, it helps to see the physical hardware they run on. A CPU core is mostly a few large blocks wired together by buses. The program counter holds the address of the next instruction. Instruction memory hands back the 32-bit word at that address. The register file is a small, fast array holding the 32 architectural registers. The ALU does the arithmetic and logic. Data memory holds everything that does not fit in registers. Multiplexers pick which value flows down each wire, and the control unit, fed by the opcode, throws all those switches.

[Figure: single-cycle datapath. Blocks, left to right: PC (addr) → IMEM (instruction) → register file (x0..x31) → ALU → DMEM (data) → writeback mux to rd. The control unit (dotted) turns the opcode into signals that gate every block; the result loops back to the register file.]
The single-cycle datapath. Solid lines carry data left to right; the dashed steel line is writeback; the dotted copper line is a control signal from the decoder.

Read the diagram as a river. Data starts at the program counter and flows right, picking up an instruction, reading registers, passing through the ALU, maybe touching memory, and finally looping back to the register file along the dashed writeback path. The control unit sits above the flow and decides, per instruction, which gates open. A load opens the data-memory read; an add bypasses memory entirely; a store sends a register value down into memory and writes nothing back. The five textbook stages are just five points along this river, and the latches that separate them in a pipelined design are the subject of the pipelining page.

A program traced through the datapath

Four RISC-V instructions: load 5 into x10, load 7 into x11, add them into x12, store x12 to memory address 0. Each instruction crosses IF / ID / EX / MEM / WB in turn, and the register file and data memory update as it goes.

program memory
0x00 0x00500513 addi x10, x0, 5
0x04 0x00700593 addi x11, x0, 7
0x08 0x00B50633 add x12, x10, x11
0x0c 0x00C02023 sw x12, 0(x0)
Initial state: PC = 0x00, stage IF, cycle 0. Register file: all zero. Data memory: empty.
Datapath blocks shown: IMEM (instruction memory), IR (instruction register), regfile (x0..x31, 32 × 64-bit), ALU, DMEM (data memory), WB (back to regfile), labelled by stage IF / ID / EX / MEM / WB.
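The trace above can be approximated in a few lines of Python. This is a minimal sketch, not a full RISC-V model: it understands only addi, add, and sw, and the names PROGRAM, sign_extend, and run are invented for illustration. Each comment marks which of the five stages the line stands in for.

```python
# Minimal single-cycle sketch: only addi, add, and sw are modelled.
# Assumption: 64-bit registers; sw stores the low 32 bits of the source.
XLEN_MASK = (1 << 64) - 1

PROGRAM = {
    0x00: 0x00500513,  # addi x10, x0, 5
    0x04: 0x00700593,  # addi x11, x0, 7
    0x08: 0x00B50633,  # add  x12, x10, x11
    0x0C: 0x00C02023,  # sw   x12, 0(x0)
}

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def run(program):
    regs = [0] * 32
    dmem = {}
    pc = 0
    while pc in program:
        inst = program[pc]                        # IF: fetch the word at PC
        opcode = inst & 0x7F                      # ID: slice the fixed fields
        rd = (inst >> 7) & 0x1F
        funct3 = (inst >> 12) & 0x7
        rs1 = (inst >> 15) & 0x1F
        rs2 = (inst >> 20) & 0x1F
        a, b = regs[rs1], regs[rs2]               # ID: read both sources
        result = None
        if opcode == 0x13 and funct3 == 0:        # addi
            result = (a + sign_extend(inst >> 20, 12)) & XLEN_MASK   # EX
        elif opcode == 0x33 and funct3 == 0 and (inst >> 25) == 0:   # add
            result = (a + b) & XLEN_MASK          # EX
        elif opcode == 0x23 and funct3 == 2:      # sw
            imm = sign_extend(((inst >> 25) << 5) | rd, 12)
            addr = (a + imm) & XLEN_MASK          # EX: address arithmetic
            dmem[addr] = b & 0xFFFFFFFF           # MEM: store one word
        else:
            raise ValueError(f"unsupported instruction {inst:#010x}")
        if result is not None and rd != 0:        # WB: x0 ignores writes
            regs[rd] = result
        pc += 4                                   # no branches in this sketch
    return regs, dmem

regs, dmem = run(PROGRAM)
```

After the run, x10 holds 5, x11 holds 7, x12 holds 12, and data memory holds 12 at address 0.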

A 32-bit RISC-V instruction, decoded

Every instruction in the trace above is one 32-bit word. RISC-V splits those 32 bits into a few fixed slots: opcode, destination register, source registers, function select, immediate. Decode is little more than wires routing each slot to the right place in the datapath.

0000000 01011 01010 000 01100 0110011
funct7 (7) = 0 | rs2 (5) = x11 | rs1 (5) = x10 | funct3 (3) = 0 | rd (5) = x12 | opcode (7) = 0b0110011
The seven-bit opcode tells the decoder which instruction format applies (R-type register-register, I-type immediate, S-type store, etc.). Different formats rearrange the immediate-bit layout, but rs1, rs2, and rd sit at the same bit positions in every format that uses them. That is a deliberate choice in the RISC-V design: the register file can start reading its sources before the opcode is even decoded.
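Field extraction is just shifts and masks. The sketch below slices the add x12, x10, x11 word shown above into its R-type fields; the function name decode_r_type is made up for illustration.

```python
def decode_r_type(word):
    """Slice a 32-bit R-type instruction word into its fixed fields."""
    return {
        "funct7": (word >> 25) & 0x7F,  # bits [31:25]
        "rs2":    (word >> 20) & 0x1F,  # bits [24:20]
        "rs1":    (word >> 15) & 0x1F,  # bits [19:15]
        "funct3": (word >> 12) & 0x07,  # bits [14:12]
        "rd":     (word >> 7) & 0x1F,   # bits [11:7]
        "opcode": word & 0x7F,          # bits [6:0]
    }

fields = decode_r_type(0b00000000101101010000011000110011)
```

Here fields["rs1"] is 10, fields["rs2"] is 11, fields["rd"] is 12, and fields["opcode"] is 0b0110011, matching the breakdown above.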

The program counter and how control flow moves it

The program counter is the one piece of state that makes a CPU a CPU and not a calculator. It holds the address of the next instruction, and updating it is the whole of control flow. For a normal instruction the update is boring: at the end of fetch the hardware computes PC + 4 (the instruction is four bytes wide) and that becomes the next PC. The CPU runs straight down memory, one word at a time, and the trace above does exactly this for its four instructions.

Branches and jumps are the interesting case, because they overwrite the PC with something other than PC + 4. A conditional branch like beq x10, x11, target ("branch if equal") computes its outcome in the execute stage: the ALU subtracts the two registers, the result's zero flag says whether they matched, and if the branch is taken the PC is loaded with PC + offset instead of PC + 4. An unconditional jump sets the PC to a computed target (PC + offset for jal, a register plus offset for jalr) and, for a call, also writes the return address (PC + 4) into a link register so the function knows where to come back to. A return is just a jump through that saved register. Everything you think of as if, for, function calls, and returns compiles down to arithmetic on this one register.
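The PC update rules described above fit in one small function. This is a sketch under simplifying assumptions: instructions are always four bytes, and the function name next_pc and its keyword arguments are hypothetical. It returns the new PC together with the link value a jump would write.

```python
def next_pc(pc, kind, *, offset=0, rs1_value=0, taken=False):
    """Return (new_pc, link), where link is the return address or None."""
    if kind == "branch":
        # Conditional branch: target only if the comparison succeeded.
        return (pc + offset if taken else pc + 4), None
    if kind == "jal":
        # PC-relative jump; the link register receives PC + 4.
        return pc + offset, pc + 4
    if kind == "jalr":
        # Register-indirect jump; RISC-V clears the lowest target bit.
        return (rs1_value + offset) & ~1, pc + 4
    # Every other instruction falls through to the next word.
    return pc + 4, None
```

For example, next_pc(0x10, "branch", offset=-8, taken=True) yields (0x08, None), while a call via next_pc(0x10, "jal", offset=0x40) yields (0x50, 0x14).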

This is also where the simple model starts to creak. In a single-cycle machine the branch outcome is known before the next fetch, so the PC update is free. In a pipelined machine the next instruction is already being fetched while the branch is still resolving, so the CPU has to guess which way the branch will go and pay a penalty when it guesses wrong. That guess is branch prediction, and the cleanup is a pipeline flush — both covered in the pipelining page. The system call instruction (ecall on RISC-V, syscall on x86) is a special kind of control transfer: it changes the PC and the privilege level, handing control to the kernel at a fixed entry point. That handoff is the start of the path described in how system calls work.

Registers and the ALU, up close

Two blocks in the datapath deserve a closer look because the whole cycle is organised around them. The register file is a tiny memory (32 entries on RISC-V, each 64 bits wide on a 64-bit core) built for speed rather than size. It is multi-ported: in one cycle it can read two source registers and write one destination, which is exactly what an instruction like add x12, x10, x11 needs. Reads are combinational, meaning the value appears on the output wires as soon as you present the register number; the write lands on the clock edge at the end of the cycle. That timing is why writeback is the last stage. Register x0 is special: it is hardwired to zero and ignores writes. The trace relies on this: addi x10, x0, 5 reads x0 as a constant zero to load the immediate 5, and sw x12, 0(x0) uses it as a base address of zero.
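A software stand-in for the register file makes the two-read, one-write shape and the x0 rule concrete. This is only a behavioural sketch; the class name RegisterFile and its methods are invented here, and real hardware timing (combinational reads, edge-triggered writes) is not modelled.

```python
class RegisterFile:
    """Behavioural model: two read ports, one write port, x0 fixed at zero."""

    def __init__(self, count=32, width=64):
        self._regs = [0] * count
        self._mask = (1 << width) - 1

    def read(self, rs1, rs2):
        # Both source registers are read in the same cycle.
        return self._regs[rs1], self._regs[rs2]

    def write(self, rd, value):
        # Writes to x0 are silently dropped.
        if rd != 0:
            self._regs[rd] = value & self._mask

rf = RegisterFile()
rf.write(10, 5)
rf.write(11, 7)
rf.write(0, 99)            # ignored: x0 stays zero
a, b = rf.read(10, 11)
```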

The ALU — arithmetic logic unit — is the part that actually computes. Given two inputs and an operation select from the control unit, it produces a result and a few status flags (zero, negative, carry, overflow). A single ALU handles add, subtract, AND, OR, XOR, shifts, and comparisons; multiply and divide usually live in separate, slower units. The same ALU does triple duty across the instruction set: it adds for arithmetic instructions, it computes base + offset addresses for loads and stores, and it does the comparison that resolves a branch. That reuse is why the execute stage is drawn as one block even though it serves very different instruction types. When you read in the out-of-order page that a modern core has "six integer ALUs," it means six copies of this block running in parallel on independent instructions.
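The triple duty described above is easy to see in a toy model. This sketch assumes a 64-bit datapath, reports only the zero flag (negative, carry, and overflow are left out for brevity), and uses string operation names that are purely illustrative.

```python
MASK64 = (1 << 64) - 1

def alu(op, a, b):
    """Return (result, zero_flag) for a two-input ALU operation."""
    operations = {
        "add": lambda: a + b,
        "sub": lambda: a - b,
        "and": lambda: a & b,
        "or":  lambda: a | b,
        "xor": lambda: a ^ b,
    }
    result = operations[op]() & MASK64   # wrap to the register width
    return result, result == 0

total, _ = alu("add", 5, 7)              # arithmetic: add x12, x10, x11
address, _ = alu("add", 0x1000, 8)       # load/store: base + offset
_, equal = alu("sub", 5, 5)              # beq: zero flag set means equal
```

The same function serves all three instruction types; only the operands and what the control unit does with the result differ.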

The clock, cycles, and CPI

The fetch-execute loop is the heartbeat of the machine, and the clock is the beat. A clock signal is a square wave toggling billions of times a second; every storage element in the core — the PC, the register file, the pipeline latches — updates on the rising edge. The clock period is the time between two edges, and it has to be long enough for the slowest path of logic between two storage elements to settle. That slowest path is the critical path, and it sets the maximum clock frequency. A 4 GHz core has a clock period of 250 picoseconds; everything the hardware does in one "cycle" has to finish inside that window.

In the single-cycle model an instruction takes exactly one cycle, but that cycle has to be long enough for the worst instruction (usually a load, which fetches, decodes, computes an address, reads memory, and writes back all in one period). That is wasteful: fast instructions wait for the slow ones. Pipelining fixes this by shortening the clock to one stage's worth of work and overlapping instructions, which is why every real chip pipelines. The number that captures all of this is cycles per instruction, or CPI. A perfect single-issue pipeline approaches a CPI of 1 — one instruction finished every cycle. Real code does worse because of cache misses, branch mispredictions, and dependencies. Wide out-of-order cores do better, finishing several instructions per cycle, which is usually quoted as the reciprocal, IPC (instructions per cycle). The relationship that ties hardware to wall-clock time is one worth memorising:

The iron law of performance: time = instructions × CPI × clock period. You can go faster by running fewer instructions (better compiler or algorithm), lowering CPI (better microarchitecture), or shortening the clock period (higher frequency). Every CPU design trade-off is a fight over these three terms, and pushing one often worsens another — a deeper pipeline raises frequency but also raises the misprediction penalty.
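A short calculation shows the three terms in action. The per-stage delays below are hypothetical round numbers chosen for illustration, not measurements of any real chip, and the function names are likewise made up. The single-cycle clock must cover the sum of the stage delays, while the pipelined clock only needs to cover the slowest stage.

```python
def clock_period_ps(frequency_hz):
    """Clock period in picoseconds for a given frequency in hertz."""
    return 1e12 / frequency_hz

def execution_time_ps(instructions, cpi, period_ps):
    """Iron law: time = instructions x CPI x clock period."""
    return instructions * cpi * period_ps

# Hypothetical stage delays in picoseconds (illustrative only).
STAGE_DELAY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

single_cycle_period = sum(STAGE_DELAY_PS.values())  # whole path in one cycle
pipelined_period = max(STAGE_DELAY_PS.values())     # slowest stage sets the clock

n = 1_000_000
t_single = execution_time_ps(n, 1, single_cycle_period)
t_pipelined = execution_time_ps(n, 1.5, pipelined_period)  # CPI 1.5: assumed stalls
```

With these assumed numbers the pipelined design wins even at a CPI of 1.5, because its clock period is a quarter of the single-cycle one. clock_period_ps(4e9) gives the 250 ps figure quoted earlier.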

x86 splits into RISC-shaped µops

x86 instructions are not fixed-length, not aligned, and can mix register-register and memory operands. Modern x86 cores decode each one into one or more µops (micro-operations of fixed shape, much like RISC-V instructions) before dispatching them to the execution engine. This decode cost is why the Pentium 4 (2000) added a trace cache, and why Intel's Sandy Bridge (2011) and most high-performance x86 cores since carry a µop cache: both hold the post-decode µop stream so the same instructions don't get re-decoded on every loop iteration.

add [rdi+8], rax
↓ decoded into µops
LOAD t1, [rdi+8]
ADD t2, t1, rax
STORE [rdi+8], t2
Memory-destination operand fans out to 3 µops: load, ALU, store. ~5 cycles latency.
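The fan-out above can be mimicked with a tiny "cracker". The Uop record and the function crack_add are invented for this sketch and do not reflect any vendor's internal µop format; the point is only that a memory-destination add yields three uniform operations while a register-register add yields one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Uop:
    kind: str      # LOAD, ADD, or STORE
    dest: str      # destination register, temporary, or memory operand
    srcs: tuple    # source operands

def crack_add(dest, src):
    """Split 'add dest, src' into µops; a bracketed dest means memory."""
    if dest.startswith("["):
        return [
            Uop("LOAD", "t1", (dest,)),        # read the old value
            Uop("ADD", "t2", ("t1", src)),     # do the arithmetic
            Uop("STORE", dest, ("t2",)),       # write the new value back
        ]
    return [Uop("ADD", dest, (dest, src))]     # register form: one µop

uops = crack_add("[rdi+8]", "rax")
```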

The control unit

Behind every datapath sits a control unit — a tiny lookup table that, given an opcode, drives the dozen or so control signals that route data through the datapath. ALU op-select, register-file write-enable, memory read / write, branch-taken — all of it. In a textbook RISC core, this is an actual ROM indexed by opcode bits. In Intel and AMD, it's a complicated chain — fast paths for common simple instructions, microcode ROM for the rest.

The classic Patterson & Hennessy diagram shows the control unit as a "fan-out" from the opcode bits to a vector of 1-bit signals: ALUSrc, RegWrite, MemRead, MemWrite, Branch, MemtoReg, ALUOp[1:0]. Each opcode maps to a different bit pattern, and those bits gate the datapath multiplexers. This is what "decoding" actually does, mechanically.
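As a lookup table, the fan-out looks like the sketch below. It covers only the four instruction classes of the usual textbook subset (R-type, load, store, branch) using the signal names listed above; the dictionary CONTROL and the function control_signals are names chosen for this example, and None marks a don't-care value.

```python
# Opcode -> control signals. None means the value cannot affect the outcome.
CONTROL = {
    0b0110011: dict(ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0,
                    MemWrite=0, Branch=0, ALUOp=0b10),     # R-type (add, sub, ...)
    0b0000011: dict(ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1,
                    MemWrite=0, Branch=0, ALUOp=0b00),     # load
    0b0100011: dict(ALUSrc=1, MemtoReg=None, RegWrite=0, MemRead=0,
                    MemWrite=1, Branch=0, ALUOp=0b00),     # store
    0b1100011: dict(ALUSrc=0, MemtoReg=None, RegWrite=0, MemRead=0,
                    MemWrite=0, Branch=1, ALUOp=0b01),     # conditional branch
}

def control_signals(instruction_word):
    """Look up the control vector using the low seven opcode bits."""
    return CONTROL[instruction_word & 0x7F]

add_signals = control_signals(0x00B50633)    # add x12, x10, x11
store_signals = control_signals(0x00C02023)  # sw x12, 0(x0)
```

The add enables RegWrite and leaves memory alone; the store does the opposite, asserting MemWrite with RegWrite off.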

Why CISC translates to RISC inside

Intel went to internal RISC in 1995 with the Pentium Pro (P6). AMD followed in 1996 with the K5. The reasoning was simple: variable-length, multi-operand x86 instructions are hard to pipeline directly. If you decompose them into uniform µops, you can pipeline, reorder, and rename µops the same way a RISC chip does. What you lose is some decode efficiency on the front end. What you gain is the rest of the modern microarchitecture (out-of-order, register renaming, multi-issue) without the entire ISA being a museum.

ARM and Apple silicon have it easier: ARMv8 / ARM64 instructions are a fixed 32 bits wide, most of them map to a single µop, and the front-end is correspondingly simpler. Apple M-series chips can decode 8 instructions per cycle in the front end; mainstream Intel delivers about 6 µops per cycle from its µop cache (about 4 instructions per cycle through the legacy decoders), and mainstream AMD decodes 4. Front-end width accounts for much of the gap.

The number that matters: for an x86 core, the µop cache is roughly as important as the L1 instruction cache. Hits in the µop cache deliver ~6 µops per cycle without re-decoding; misses fall back to the legacy decoder at ~4 instructions per cycle and pay the variable-length-decode penalty.

Micro-ops: the real unit of execution

The clean five-stage story is a teaching model. Inside a modern core the unit that actually flows through the pipeline is not the architectural instruction you wrote but a micro-op: a small, fixed-format operation the front-end produces by translating the instruction stream. On a RISC ISA the translation is nearly one-to-one: most instructions are already micro-op shaped, so the decoder mostly relabels them. On x86 the translation does real work, splitting a single CISC instruction into the load, ALU, and store pieces that the back-end can schedule independently, as the add [rdi+8], rax example above shows.

Why bother? Because uniform micro-ops are what make the rest of a high-performance core possible. Once every operation has the same shape — a few source registers, one destination, one ALU or memory action — the hardware can rename their registers to remove false dependencies, queue them up, and issue whichever ones have their inputs ready, regardless of program order. None of that works on raw variable-length x86. The micro-op layer is the adapter that lets a forty-year-old instruction set run on a modern engine. This is the bridge to the out-of-order page, where micro-ops are reordered, renamed, and retired.

It is also why instruction counting is a poor proxy for performance. A single rep movsb can expand into thousands of micro-ops; a single memory-destination add is three. Two programs with the same number of machine instructions can have wildly different micro-op counts and therefore different run times. The back-end schedules and retires micro-ops, and hardware performance counters typically report micro-op counts alongside instruction counts, so the two can be compared directly.

How the simple model gets complicated

Everything on this page describes one instruction crossing five stages in order. That is the mental model to keep, but it is not how a fast chip runs. Two layers of complexity sit on top of it, and both exist to keep the execution units busy instead of idle.

The first is pipelining. Rather than finish one instruction before starting the next, the core keeps all five stages working every cycle on five different instructions: while one is in writeback, the one behind it is in memory, the next in execute, and so on. Throughput jumps toward one instruction per cycle without any one instruction getting faster. The cost is hazards — an instruction that needs a result still in flight, or a branch whose direction is not yet known — and the machinery to handle them: forwarding, stalls, and branch prediction. That is the subject of the pipelining page.
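The overlap is easiest to see as a timing chart. The sketch below assumes an ideal five-stage pipeline with no hazards or stalls, so instruction i simply enters fetch in cycle i; the function name schedule is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    """Return one row per instruction: the stage it occupies in each cycle."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name            # instruction i is in stage s at cycle i + s
        rows.append(row)
    return rows

chart = schedule(4)
for i, row in enumerate(chart):
    print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Four instructions finish in 4 + 5 - 1 = 8 cycles instead of the 20 a one-at-a-time machine would need. Real pipelines add stalls and forwarding on top of this ideal picture.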

The second is out-of-order execution. A pipelined core still issues instructions in program order, so one stalled load can block everything behind it. An out-of-order core breaks that constraint: it renames registers, buffers a few hundred micro-ops in flight, and executes whichever ones have their inputs ready, then puts the results back in order at the end so the program still appears to run sequentially. This is where most of the performance of a modern P-core comes from, and it is the topic of the out-of-order page. The five stages are still in there — fetch, decode, execute, memory, writeback all still happen — they are just spread across a much wider, deeper, and more cleverly scheduled machine.

Common misconceptions

  • "One instruction equals one cycle." Single-cycle CPUs exist only in textbooks. Real chips pipeline (next deep dive). One instruction takes roughly 5 to 14 cycles to cross a modern pipeline from fetch to retire, while throughput on well-behaved code is closer to 4 instructions retired per cycle on Apple and Intel cores.
  • "x86 is slower than ARM because of CISC." The µop layer makes the actual execution similar. The real gap is in decode width, branch prediction quality, and energy-per-instruction — all of which can be (and have been) closed by spending silicon. ARM-on-laptop's 2026 dominance is more about Apple's design than about CISC vs RISC.
  • "The PC always increments by 4." On RISC-V and ARM64, yes — for non-branch instructions. On a branch / jump, PC changes to the target. On compressed RISC-V (the C extension), PC increments by 2 for 16-bit instructions. On x86, PC increments by however long the just-fetched instruction was, which is anywhere from 1 to 15 bytes.
  • "Microcode is just for old instructions." Many modern security mitigations (Spectre, MDS, Retbleed) depend on microcode updates, usually paired with OS changes. The ROM itself is fixed in silicon; the firmware or OS loads a patch into on-chip patch memory at every boot, and that patch can change how instructions behave. This is what keeps decade-old chips patchable against new vulnerabilities.
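The PC-increment rule for RISC-V with the C extension, from the list above, can be sketched from the low two bits of the instruction. This is a simplified sketch: it treats every encoding whose low bits are 11 as a 32-bit instruction and ignores the longer formats the encoding scheme reserves. Both function names are made up for illustration.

```python
def riscv_instruction_length(first_halfword):
    """Length in bytes: low bits other than 0b11 mark a 16-bit instruction."""
    return 2 if (first_halfword & 0b11) != 0b11 else 4

def sequential_next_pc(pc, first_halfword):
    """Next PC for a non-branch instruction."""
    return pc + riscv_instruction_length(first_halfword)

# Low halfword of addi x10, x0, 5 (0x00500513): a full 32-bit instruction.
full = sequential_next_pc(0x00, 0x0513)
# 0x0001 is the compressed no-op (c.nop): a 16-bit instruction.
compressed = sequential_next_pc(0x00, 0x0001)
```

Here full is 4 and compressed is 2. x86 has no such shortcut: the length is only known after parsing prefixes, opcode, and operand bytes.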

Numbers worth remembering

Quantity | Value | Notes
Classic RISC pipeline depth | 5 stages | IF / ID / EX / MEM / WB
Apple M4 pipeline depth (P-core) | ~7 stages on integer | Shorter for integer ALU; longer for memory and FP
Intel Raptor Lake pipeline depth | ~14 stages | Front-end decode + execution + retire
RISC-V instruction size | 32 bits | 16 bits with the optional C extension
x86 instruction size range | 1–15 bytes | The variable-length decode tax
x86 decode width, mainstream Intel | ~6 µops/cycle | From the µop cache; ~4 from legacy decoder
ARM64 decode width, Apple M4 | ~8 instructions/cycle | Direct from L1 instruction cache, no µop cache needed
RISC-V opcode field | 7 bits, bits [6:0] | Always lowest 7 bits, so decode can start immediately
Number of architectural registers, RISC-V | 32 | x0 (hardwired zero) through x31
Number of physical registers, modern core | ~200–400 | Used by register renaming; covered in the OOO deep dive
