Why do CPUs predict branches?

In a deep pipeline, a branch's outcome isn't known until it reaches the execute stage many cycles after fetch. If the CPU sat idle waiting, every branch would cost a dozen or more cycles. Instead it predicts the outcome and speculatively fetches down the predicted path; if the prediction is correct (most of the time), the speculation is free. Without prediction, modern deep pipelines would barely run at half their effective throughput.

What is a 2-bit saturating counter?

It's the simplest dynamic predictor. Each branch has two bits of state: Strong-Not-Taken, Weak-Not-Taken, Weak-Taken, Strong-Taken. The predictor outputs Taken if the state is in the upper half, Not-Taken otherwise. On every branch outcome, the state moves toward the actual direction. The two bits give the predictor inertia — a single anomaly doesn't flip the prediction, so loop-end branches that go the other way once don't poison the prediction.

Spectre is a class of attacks that abuse speculative execution to leak information across security boundaries. The CPU speculatively executes code past a security check (mispredicting a bounds check, or training the indirect branch predictor with attacker-controlled patterns), then leaves measurable cache-state side effects from the speculatively-executed instructions. The architectural results are discarded when the misprediction is detected, but the cache footprint remains and can be timed. Disclosed in January 2018; mitigations are still ongoing.

05 / 15

Internals / 05

Branch prediction and speculation

Branches are everywhere. Roughly 1 in 5–10 instructions is a branch. On a deep pipeline, a branch's outcome isn't known until it reaches the execute stage — fourteen cycles later, on Intel. The CPU can't sit idle for fourteen cycles every branch, so it predicts. A 99%-accurate predictor on a 14-stage pipeline gives you ~97% of peak throughput. A 95%-accurate one gives you ~85%. Predictor quality is the single biggest gap between modern microarchitectures and 1990s ones.

Why prediction is mandatory

This is a problem the moment you build a pipeline. A pipeline gets its speed by starting the next instruction before the current one finishes. That works fine for arithmetic, where the next instruction's address is just the current one plus a few bytes. A branch breaks the assumption: the address of the next instruction depends on the branch's own result, and that result is not known until the branch has travelled most of the way down the pipe. So the front end has a choice every time it meets a branch — stall until the outcome is known, or guess and keep fetching. Stalling is safe and slow. Guessing is fast and occasionally wrong. Every modern CPU guesses.

In a 5-stage pipeline, the branch resolves at EX (cycle 3). Three speculatively fetched instructions are at risk on every branch. In a 14-stage pipeline, the branch resolves around stage 12. Twelve cycles of speculative fetch hang on every prediction. If you fetch nothing while waiting, you pay the full pipeline depth on every single branch. Branches are 10–20% of instructions, so doing nothing means throughput drops by roughly 1.5×–3×. The actual loss in CPI is the fraction-of-branches × pipeline-depth:

No prediction, 14-stage pipeline, 15% branch density:
  cost per instruction = 0.15 × 14 = 2.1 cycles wasted on average

98% prediction, 14-stage pipeline, 15% branch density:
  cost per instruction = 0.15 × 0.02 × 14 = 0.042 cycles wasted

Difference: 50× less waste. The whole modern CPU depends on this.

Modern predictors hit 97–99% accuracy on real workloads. The remaining 1–3% is where the pipeline still bleeds, which is why branch prediction is the most actively-researched part of microarchitecture even in 2026.

Static prediction

The cheapest predictor is no predictor at all — just always guess one direction. Always-not-taken is the trivial baseline; it's right roughly 50% of the time on balanced code. Slightly better is the heuristic from the early MIPS days: backward branches taken, forward branches not-taken (BTFNT). Loops branch backward and are usually taken; ifs branch forward and are usually not-taken. This static rule gets you ~65% accuracy with no state at all. It's still in some embedded cores and as a fallback when the dynamic predictor doesn't have history yet.

Predict, run ahead, roll back

Prediction on its own only answers "what should I fetch next." The harder part is what the rest of the machine does with that guess. A prediction lets the front end keep fetching down the predicted path, but those instructions then go on to decode, rename, issue, and execute — and some of them write results, read memory, or feed other instructions long before the branch that justified them has resolved. The CPU has to run all of that speculatively while keeping a way to undo it if the guess turns out wrong.

This is where branch prediction meets out-of-order execution. Speculative instructions execute and produce results, but those results are held in the reorder buffer rather than committed to the registers and memory the program can see. A branch only retires once its real outcome is known. If the prediction was right, the speculative results retire in order and become visible — the speculation was free. If it was wrong, the CPU squashes every instruction younger than the branch, restores the register-rename state to the point just before the branch, and steers the front end to the correct target. The machinery that makes mispredicts cheap to undo is the same machinery that makes out-of-order execution possible, which is why the two topics are always taught together.

So a mispredict is not just a wasted fetch. It is a full flush: a dozen or more instructions, some of which had already executed, all discarded, plus the rename-table rollback, plus the refetch from the right address. That is why the penalty is roughly the pipeline depth and not just one or two cycles. The deeper and wider the machine, the more in-flight work a single bad guess throws on the floor.

The 2-bit saturating counter

The classic dynamic predictor. Each branch has two bits of state representing one of four positions on a confidence axis: Strong-Not-Taken / Weak-Not-Taken / Weak-Taken / Strong-Taken. The predictor outputs Taken if the state is in the upper half, Not-Taken otherwise. After each branch resolves, the state slides one notch toward the actual direction.

The two bits give the predictor inertia. A single anomaly doesn't flip the prediction. A loop that goes T-T-T-T-T-N (the loop-end branch flipping once) keeps predicting T because one miss only moves Strong-T to Weak-T. The next iteration starts at Strong-T again. Without inertia (a 1-bit predictor), that loop-end miss would cost two mispredicts per loop iteration — once at the end, once at the start.

The 2-bit state machine. Taken slides right, not-taken slides left, and the ends saturate. Crossing from 01 to 10 is the only place the prediction changes, so one stray outcome at a saturated end costs nothing.

Strong N

→ predict N

Weak N

→ predict N

Weak T

→ predict T

Strong T

→ predict T

accuracy

—

0 / 0 correct

current state

Weak N

step	state in	predicted	actual	state out
No history yet. Click feed T or feed N.

Three predictors on the same stream

Pick a pattern and watch three predictors race: static-not-taken, 1-bit, and 2-bit saturating. The 2-bit predictor wins on patterns with occasional deviations; the 1-bit wins nothing; the static loses everywhere except on a heavily not-taken stream.

loop · TTTT…N TTTT…N alternating · TNTN fizzbuzz · TTNTTTN random

stream

TTTTTTTTTNTTTTTTTTTN

static (always N) 10%

NNNNNNNNNNNNNNNNNNNN

1-bit 80%

NTTTTTTTTTNTTTTTTTTT

2-bit saturating 85%

NTTTTTTTTTTTTTTTTTTT

The static predictor is wrong on every Taken branch. The 1-bit predictor flips on every transition — it bleeds two mispredicts per loop iteration. The 2-bit predictor's inertia absorbs the loop-end miss without flipping its prediction for the next iteration. This is what 1992-era hardware bought for two extra bits per branch.

Two-level / gshare

The 2-bit predictor still misses anything that depends on which other branches just executed. Consider if (a) { … } if (b) { … } where the second branch is correlated with the first. A predictor that only sees its own branch's history can't pick this up. Two-level predictors add a global history register — the last k branch outcomes across the whole program — and use that to index a Pattern History Table.

Gshare (1992, McFarling) does this elegantly: XOR the program counter with the global history register, use the result to index a table of 2-bit counters. Branches at different addresses with the same recent history get different table entries. Branches at the same address with different histories also get different entries. With a 4 KB table and 12 bits of history, gshare hits ~93–95% on most workloads. Pentium Pro (1995) was the first mainstream chip to ship one.

gshare(PC, history) = PHT[PC ⊕ history]

  PC:        0x4012a8         (the branch's program counter)
  history:   0b1011 0010 1110  (last 12 outcomes; T=1, N=0)
  XOR:       0x401_a96         → table index
  PHT[idx]:  2-bit saturating counter → predict T or N

TAGE — what mainstream chips actually use

TAGE (TAgged GEometric history length) is the current state of the art for direction prediction. Multiple tables indexed by geometrically growing history lengths — say 8, 16, 32, 64, 128 bits of recent history. Each entry carries a tag derived from PC + history, so different (PC, history) combinations don't collide. On lookup, all tables check in parallel; the longest-history hit that has confidence wins. Branches with simple short-history correlation use the short table; branches with long-distance correlation use the long table.

Intel's Core architecture has used a TAGE variant since Skylake (2015). AMD's Zen-series uses a similar geometric design. Modern accuracy on integer benchmarks like SPEC CPU is 97–99%. The improvements since 2015 have been almost entirely in indirect branch prediction (jump tables, virtual calls) and in handling pathological patterns better.

Perceptron predictors

AMD shipped the first perceptron-based predictor in Phenom (2007) and refined it in Zen. Apple uses perceptrons in the M-series. The idea: instead of a counter, store a small set of weights — one per history bit. To predict, dot the recent history vector with the weights; if positive, predict T; if negative, predict N. Train by adjusting weights toward the actual outcome.

Perceptrons are good at handling long histories without exploding table size (weights scale linearly, not exponentially). They're also relatively easy to extend with hashed history (combine recent outcomes with PC bits to get richer input). Apple's M4 perceptron front-end is rumored to track ~96 bits of branch history per branch, which is why pathological loop patterns that defeat 16-bit gshare often work on Apple silicon.

Forty years of predictor evolution

Year	Predictor	Accuracy	First used in
1985	Static (always not-taken)	~50%	MIPS R2000 — simple, no hardware, hopeless on loops
1990	BTFNT — backward taken / forward not-taken	~65%	Static heuristic; loops backward, ifs forward — works without state
1992	1-bit dynamic	~80%	Pentium prefetch buffer; one bit per branch in a small table
1993	2-bit saturating	~88%	Pentium era; takes two consecutive misses to flip — handles loop ends
1996	Two-level / gshare	~94%	Pentium Pro; uses recent branch history XOR PC to index a PHT
2000	Tournament (hybrid)	~96%	Alpha 21264; meta-predictor picks between local and global predictors
2006	Perceptron	~97%	AMD Phenom, Apple silicon — simple neural-net per branch
2014	TAGE-SC-L (TAgged GEometric)	~99%	Intel since Skylake; tagged tables indexed by geometrically-growing histories
2020	Hashed perceptron + ITTAGE	99%+	Apple M-series, AMD Zen 5; combines perceptron for direction with TAGE for indirect targets

BTB and RAS — predicting where

Direction prediction (T or N) is half the problem. The other half is predicting the target address of taken branches and indirect jumps. The CPU needs the target by the next fetch cycle, but the branch target hasn't been decoded yet. Two structures handle this:

Branch Target Buffer (BTB). A cache keyed by branch PC, holding the recently-observed target for each branch. ~4–16 KB on modern chips, holding thousands of entries. Hits 95%+ of taken branches. Misses fall back to the slower decoder.
Return Address Stack (RAS). A small hardware stack — typically 16–64 entries — that mirrors the call stack. call pushes the return address; ret pops. The pop predicts the return target instantly, even if the calling pattern is impossible to capture in the BTB. Modern CPUs back this with overflow recovery so deeply nested calls don't break it.
Indirect branch predictor (ITTAGE). A separate predictor for jump tables, virtual function calls, and computed gotos. Targets vary by call site and history. Modern Intel/AMD have dedicated structures for this; mispredictions on indirect branches are roughly 5–10× more expensive than direction mispredictions.

What a mispredict actually costs

Use the sliders to model your own scenario. Pipeline depth determines the mispredict penalty — every flushed cycle is silicon you paid for and threw away. Branch density determines how often you pay it. Predictor accuracy determines what fraction of the time you pay it.

The picture below shows what those words mean on the pipeline itself. The top track is a correct prediction: the CPU guesses the branch is taken, fetches the predicted instructions, and by the time the branch resolves the speculative work is already done. The bottom track is a mispredict: every instruction fetched after the branch belongs to the wrong path, so when the branch resolves the CPU throws all of it away and refetches from the correct target. The gap between the two is the penalty, and it is exactly the number of stages the branch had to walk before its outcome was known.

Predicted-right keeps every speculative instruction. Mispredict discards all of them (dashed) and refetches from the real target, leaving a bubble the size of the pipeline depth. Deeper pipelines make the bubble bigger, which is why predictor accuracy and pipeline depth are joined at the hip.

pipeline depth (stages) 14

predictor accuracy (%) 95%

branch density (%) 15%

mispredict penalty

cycles per misprediction

cycles lost per 100 instr

10.5

≈ throughput hit

throughput hit

10.5%

of peak retired

Try (depth=31, accuracy=95%, density=15%) — that's a Pentium 4 Prescott. The throughput hit is severe enough that 1990s-era branch predictor research wasn't optional. Then try (depth=14, accuracy=99%, density=15%) — that's a modern Skylake. The cost drops by an order of magnitude.

Spectre — the cost of speculation

Speculative execution doesn't just speed up your code. It also leaks information across security boundaries. The Spectre class of attacks, disclosed in January 2018, abuses the predictor's training to make the CPU speculatively execute code past a security check (a bounds check, a permission check, a virtualization boundary). The architectural results are discarded when the misprediction is detected — but the cache footprint of the speculatively-executed instructions persists. An attacker times subsequent loads to learn what was loaded, and infers the secret.

Mitigations are still ongoing eight years later. Some categories:

Hardware fences. LFENCE on x86 stops speculation across a marker. Used inside libraries to protect bounds-check loops. Costly — every fence is ~5–20 cycles.
Indirect Branch Restricted Speculation (IBRS). Microcode patch that stops one privilege ring from training the predictor of another. Throughput hit: ~5–30% on workloads with many context switches.
Retpolines. Compiler turns indirect calls into a return-based trampoline that the CPU's RAS predicts independently of the BTB. Linux kernel-wide for years; AMD chips eventually got hardware-level IBRS that made retpolines unnecessary.
Process isolation. Some side-channel attacks become much harder if attacker and victim aren't on the same physical core. SMT can be disabled in security-sensitive deployments.

The fundamental issue isn't that branch prediction is bad. It's that speculation crosses security boundaries the architecture didn't expect to be boundaries. Every modern CPU now has explicit microarchitectural state that distinguishes "speculative" from "committed", and timing channels are an active area of research and patching.

Why a sorted array is faster

The most-asked question in this whole area comes from a Stack Overflow post: why does the same loop run several times faster over a sorted array than an unsorted one? The loop is something like this — sum the elements that pass a threshold:

for (int i = 0; i < N; i++)
    if (data[i] >= 128)        // the branch the predictor sees
        sum += data[i];

The arithmetic is identical either way. The only thing that changes is the branch. Fill data with random values and the data[i] >= 128 test comes out taken about half the time, in no pattern the predictor can learn. Accuracy collapses toward 50%, and on a deep pipeline each mispredict costs the full flush. Sort the array first and the branch becomes trivially predictable: it is not-taken for the whole first half of the array, then taken for the whole second half. The predictor learns "not-taken" instantly, eats one mispredict at the crossover point, and then learns "taken" for the rest. Accuracy jumps to ~100% and the loop runs at full speed.

Nothing about the cache, the memory traffic, or the instruction count changed. The entire speedup is the difference between a predictable branch (nearly free) and an unpredictable one (a flush roughly half the time). It is the cleanest possible demonstration that data-dependent branches on random data are where prediction falls apart, and that the cost is real and measurable.

Branchless code, and when it actually helps

When a branch is intrinsically unpredictable, the fix is sometimes to remove the branch instead of trying to predict it. The same threshold-sum can be written without a conditional jump:

for (int i = 0; i < N; i++) {
    int v = data[i];
    sum += v & -(v >= 128);   // mask: keeps v when true, 0 when false
}

The comparison still happens, but it produces a 0/1 value used as a mask rather than a jump. There is no branch for the predictor to get wrong, so the random-data case stops mispredicting. On x86 the compiler can also emit a cmov (conditional move), which selects between two values with no change in control flow. SIMD code does the same trick at width: compute both sides, build a mask, and blend.

Branchless is not a free win, though, and reaching for it everywhere is a mistake. A branch the predictor gets right is close to free, and removing it forces the CPU to compute both sides every iteration instead of skipping one. So branchless helps when the branch is unpredictable and the work on each side is cheap; it hurts when the branch is predictable or the skipped side is expensive. The honest workflow is to measure first — a top-down performance analysis will tell you whether your stall is bad-speculation (mispredicts) before you spend effort rewriting code branchless. If the top-down breakdown does not show a bad-speculation bottleneck, going branchless usually just makes the code slower and harder to read.

Rule of thumb. Predictable branch: leave it alone. Unpredictable branch with cheap both-sides: try branchless or cmov. Unpredictable branch with expensive both-sides: you may be stuck — restructure the data (like sorting) so the branch becomes predictable instead.

Common misconceptions

"Mispredicts only cost a few cycles." Only on shallow pipelines. On Skylake, ~16 cycles. On Zen 5, ~19. On Pentium 4 Prescott, ~30. A 1% mispredict rate on Skylake wastes ~16 cycles per 100 instructions if every branch were mispredicted; in practice modern code's mispredict cost shows up as ~5% of cycles lost.
"Branch prediction is the same as speculative execution." Related but distinct. Prediction tells you what to fetch next. Speculative execution actually runs the speculative instructions, accumulating side effects in the speculative core state. Both have to be undone on a mispredict.
"The compiler can hint the predictor." Limited. __builtin_expect on GCC influences code layout (so the predicted-taken side is the fall-through), but most modern predictors override static hints with their own learned data quickly. The hint helps mostly the first time a branch is seen.
"All branches are predictable." No. Random data, hash-table lookups, virtual calls on data-dependent receivers — these can be intrinsically unpredictable. The fix is sometimes to remove the branch entirely (branchless code with cmov, SIMD masks) rather than try to predict it.

Numbers worth remembering

Quantity	Value	Notes
Modern direction-predictor accuracy	97–99%	Integer code, modern x86 / ARM64
Branch density in typical code	10–20%	1 in 5–10 instructions
Skylake mispredict penalty	~16 cycles	Roughly the front-end depth
Zen 5 mispredict penalty	~19 cycles	Slightly deeper to chase frequency
Apple M4 P-core mispredict penalty	~9 cycles	Shorter pipeline, slightly higher recovery cost amortised against bigger ROB
Pentium 4 Prescott mispredict penalty	~30 cycles	The reason Prescott was retired
BTB size, mainstream Intel	~4096 entries	Per core, 2-way to 4-way set-associative
RAS depth, modern x86	16–32 entries	Apple silicon ~64
Indirect branch mispredict cost	5–10× direction mispredict	Cross-page fetch + BTB miss + decode + execute
Spectre disclosed	January 2018	Mitigations ongoing 2026

Branch prediction and speculation

Why prediction is mandatory

Static prediction

Predict, run ahead, roll back

The 2-bit saturating counter

Three predictors on the same stream

Two-level / gshare

TAGE — what mainstream chips actually use

Perceptron predictors

Forty years of predictor evolution

BTB and RAS — predicting where

What a mispredict actually costs

Spectre — the cost of speculation

Why a sorted array is faster

Branchless code, and when it actually helps

Common misconceptions

Numbers worth remembering

Further reading

06 — Out-of-order execution