Branch prediction and speculation
Branches are everywhere. Roughly 1 in 5–10 instructions is a branch. On a deep pipeline, a branch's outcome isn't known until it reaches the execute stage — fourteen cycles later, on Intel. The CPU can't sit idle for fourteen cycles every branch, so it predicts. A 99%-accurate predictor on a 14-stage pipeline gives you ~97% of peak throughput. A 95%-accurate one gives you ~85%. Predictor quality is the single biggest gap between modern microarchitectures and 1990s ones.
Why prediction is mandatory
This is a problem the moment you build a pipeline. A pipeline gets its speed by starting the next instruction before the current one finishes. That works fine for arithmetic, where the next instruction's address is just the current one plus a few bytes. A branch breaks the assumption: the address of the next instruction depends on the branch's own result, and that result is not known until the branch has travelled most of the way down the pipe. So the front end has a choice every time it meets a branch — stall until the outcome is known, or guess and keep fetching. Stalling is safe and slow. Guessing is fast and occasionally wrong. Every modern CPU guesses.
In a 5-stage pipeline, the branch resolves at EX (cycle 3). Three speculatively fetched instructions are at risk on every branch. In a 14-stage pipeline, the branch resolves around stage 12. Twelve cycles of speculative fetch hang on every prediction. If you fetch nothing while waiting, you pay the full pipeline depth on every single branch. Branches are 10–20% of instructions, so doing nothing means throughput drops by roughly 1.5×–3×. The actual loss in CPI is the fraction-of-branches × pipeline-depth:
No prediction, 14-stage pipeline, 15% branch density:
cost per instruction = 0.15 × 14 = 2.1 cycles wasted on average
98% prediction, 14-stage pipeline, 15% branch density:
cost per instruction = 0.15 × 0.02 × 14 = 0.042 cycles wasted
Difference: 50× less waste. The whole modern CPU depends on this.Modern predictors hit 97–99% accuracy on real workloads. The remaining 1–3% is where the pipeline still bleeds, which is why branch prediction is the most actively-researched part of microarchitecture even in 2026.
Static prediction
The cheapest predictor is no predictor at all — just always guess one direction. Always-not-taken is the trivial baseline; it's right roughly 50% of the time on balanced code. Slightly better is the heuristic from the early MIPS days: backward branches taken, forward branches not-taken (BTFNT). Loops branch backward and are usually taken; ifs branch forward and are usually not-taken. This static rule gets you ~65% accuracy with no state at all. It's still in some embedded cores and as a fallback when the dynamic predictor doesn't have history yet.
Predict, run ahead, roll back
Prediction on its own only answers "what should I fetch next." The harder part is what the rest of the machine does with that guess. A prediction lets the front end keep fetching down the predicted path, but those instructions then go on to decode, rename, issue, and execute — and some of them write results, read memory, or feed other instructions long before the branch that justified them has resolved. The CPU has to run all of that speculatively while keeping a way to undo it if the guess turns out wrong.
This is where branch prediction meets out-of-order execution. Speculative instructions execute and produce results, but those results are held in the reorder buffer rather than committed to the registers and memory the program can see. A branch only retires once its real outcome is known. If the prediction was right, the speculative results retire in order and become visible — the speculation was free. If it was wrong, the CPU squashes every instruction younger than the branch, restores the register-rename state to the point just before the branch, and steers the front end to the correct target. The machinery that makes mispredicts cheap to undo is the same machinery that makes out-of-order execution possible, which is why the two topics are always taught together.
So a mispredict is not just a wasted fetch. It is a full flush: a dozen or more instructions, some of which had already executed, all discarded, plus the rename-table rollback, plus the refetch from the right address. That is why the penalty is roughly the pipeline depth and not just one or two cycles. The deeper and wider the machine, the more in-flight work a single bad guess throws on the floor.
The 2-bit saturating counter
The classic dynamic predictor. Each branch has two bits of state representing one of four positions on a confidence axis: Strong-Not-Taken / Weak-Not-Taken / Weak-Taken / Strong-Taken. The predictor outputs Taken if the state is in the upper half, Not-Taken otherwise. After each branch resolves, the state slides one notch toward the actual direction.
The two bits give the predictor inertia. A single anomaly doesn't flip the prediction. A loop that goes T-T-T-T-T-N (the loop-end branch flipping once) keeps predicting T because one miss only moves Strong-T to Weak-T. The next iteration starts at Strong-T again. Without inertia (a 1-bit predictor), that loop-end miss would cost two mispredicts per loop iteration — once at the end, once at the start.
| step | state in | predicted | actual | state out |
|---|---|---|---|---|
| No history yet. Click feed T or feed N. | ||||
Three predictors on the same stream
Pick a pattern and watch three predictors race: static-not-taken, 1-bit, and 2-bit saturating. The 2-bit predictor wins on patterns with occasional deviations; the 1-bit wins nothing; the static loses everywhere except on a heavily not-taken stream.
Two-level / gshare
The 2-bit predictor still misses anything that depends on which other branches
just executed. Consider if (a) { … } if (b) { … } where the
second branch is correlated with the first. A predictor that only sees its own
branch's history can't pick this up. Two-level predictors add a
global history register — the last k branch outcomes across the whole
program — and use that to index a Pattern History Table.
Gshare (1992, McFarling) does this elegantly: XOR the program counter with the global history register, use the result to index a table of 2-bit counters. Branches at different addresses with the same recent history get different table entries. Branches at the same address with different histories also get different entries. With a 4 KB table and 12 bits of history, gshare hits ~93–95% on most workloads. Pentium Pro (1995) was the first mainstream chip to ship one.
gshare(PC, history) = PHT[PC ⊕ history]
PC: 0x4012a8 (the branch's program counter)
history: 0b1011 0010 1110 (last 12 outcomes; T=1, N=0)
XOR: 0x401_a96 → table index
PHT[idx]: 2-bit saturating counter → predict T or NTAGE — what mainstream chips actually use
TAGE (TAgged GEometric history length) is the current state of the art for direction prediction. Multiple tables indexed by geometrically growing history lengths — say 8, 16, 32, 64, 128 bits of recent history. Each entry carries a tag derived from PC + history, so different (PC, history) combinations don't collide. On lookup, all tables check in parallel; the longest-history hit that has confidence wins. Branches with simple short-history correlation use the short table; branches with long-distance correlation use the long table.
Intel's Core architecture has used a TAGE variant since Skylake (2015). AMD's Zen-series uses a similar geometric design. Modern accuracy on integer benchmarks like SPEC CPU is 97–99%. The improvements since 2015 have been almost entirely in indirect branch prediction (jump tables, virtual calls) and in handling pathological patterns better.
Perceptron predictors
AMD shipped the first perceptron-based predictor in Phenom (2007) and refined it in Zen. Apple uses perceptrons in the M-series. The idea: instead of a counter, store a small set of weights — one per history bit. To predict, dot the recent history vector with the weights; if positive, predict T; if negative, predict N. Train by adjusting weights toward the actual outcome.
Perceptrons are good at handling long histories without exploding table size (weights scale linearly, not exponentially). They're also relatively easy to extend with hashed history (combine recent outcomes with PC bits to get richer input). Apple's M4 perceptron front-end is rumored to track ~96 bits of branch history per branch, which is why pathological loop patterns that defeat 16-bit gshare often work on Apple silicon.
Forty years of predictor evolution
| Year | Predictor | Accuracy | First used in |
|---|---|---|---|
| 1985 | Static (always not-taken) | ~50% | MIPS R2000 — simple, no hardware, hopeless on loops |
| 1990 | BTFNT — backward taken / forward not-taken | ~65% | Static heuristic; loops backward, ifs forward — works without state |
| 1992 | 1-bit dynamic | ~80% | Pentium prefetch buffer; one bit per branch in a small table |
| 1993 | 2-bit saturating | ~88% | Pentium era; takes two consecutive misses to flip — handles loop ends |
| 1996 | Two-level / gshare | ~94% | Pentium Pro; uses recent branch history XOR PC to index a PHT |
| 2000 | Tournament (hybrid) | ~96% | Alpha 21264; meta-predictor picks between local and global predictors |
| 2006 | Perceptron | ~97% | AMD Phenom, Apple silicon — simple neural-net per branch |
| 2014 | TAGE-SC-L (TAgged GEometric) | ~99% | Intel since Skylake; tagged tables indexed by geometrically-growing histories |
| 2020 | Hashed perceptron + ITTAGE | 99%+ | Apple M-series, AMD Zen 5; combines perceptron for direction with TAGE for indirect targets |
BTB and RAS — predicting where
Direction prediction (T or N) is half the problem. The other half is predicting the target address of taken branches and indirect jumps. The CPU needs the target by the next fetch cycle, but the branch target hasn't been decoded yet. Two structures handle this:
- Branch Target Buffer (BTB). A cache keyed by branch PC, holding the recently-observed target for each branch. ~4–16 KB on modern chips, holding thousands of entries. Hits 95%+ of taken branches. Misses fall back to the slower decoder.
- Return Address Stack (RAS). A small hardware stack — typically 16–64 entries — that mirrors the call stack.
callpushes the return address;retpops. The pop predicts the return target instantly, even if the calling pattern is impossible to capture in the BTB. Modern CPUs back this with overflow recovery so deeply nested calls don't break it. - Indirect branch predictor (ITTAGE). A separate predictor for jump tables, virtual function calls, and computed gotos. Targets vary by call site and history. Modern Intel/AMD have dedicated structures for this; mispredictions on indirect branches are roughly 5–10× more expensive than direction mispredictions.
What a mispredict actually costs
Use the sliders to model your own scenario. Pipeline depth determines the mispredict penalty — every flushed cycle is silicon you paid for and threw away. Branch density determines how often you pay it. Predictor accuracy determines what fraction of the time you pay it.
The picture below shows what those words mean on the pipeline itself. The top track is a correct prediction: the CPU guesses the branch is taken, fetches the predicted instructions, and by the time the branch resolves the speculative work is already done. The bottom track is a mispredict: every instruction fetched after the branch belongs to the wrong path, so when the branch resolves the CPU throws all of it away and refetches from the correct target. The gap between the two is the penalty, and it is exactly the number of stages the branch had to walk before its outcome was known.
Spectre — the cost of speculation
Speculative execution doesn't just speed up your code. It also leaks information across security boundaries. The Spectre class of attacks, disclosed in January 2018, abuses the predictor's training to make the CPU speculatively execute code past a security check (a bounds check, a permission check, a virtualization boundary). The architectural results are discarded when the misprediction is detected — but the cache footprint of the speculatively-executed instructions persists. An attacker times subsequent loads to learn what was loaded, and infers the secret.
Mitigations are still ongoing eight years later. Some categories:
- Hardware fences.
LFENCEon x86 stops speculation across a marker. Used inside libraries to protect bounds-check loops. Costly — every fence is ~5–20 cycles. - Indirect Branch Restricted Speculation (IBRS). Microcode patch that stops one privilege ring from training the predictor of another. Throughput hit: ~5–30% on workloads with many context switches.
- Retpolines. Compiler turns indirect calls into a return-based trampoline that the CPU's RAS predicts independently of the BTB. Linux kernel-wide for years; AMD chips eventually got hardware-level IBRS that made retpolines unnecessary.
- Process isolation. Some side-channel attacks become much harder if attacker and victim aren't on the same physical core. SMT can be disabled in security-sensitive deployments.
The fundamental issue isn't that branch prediction is bad. It's that speculation crosses security boundaries the architecture didn't expect to be boundaries. Every modern CPU now has explicit microarchitectural state that distinguishes "speculative" from "committed", and timing channels are an active area of research and patching.
Why a sorted array is faster
The most-asked question in this whole area comes from a Stack Overflow post: why does the same loop run several times faster over a sorted array than an unsorted one? The loop is something like this — sum the elements that pass a threshold:
for (int i = 0; i < N; i++)
if (data[i] >= 128) // the branch the predictor sees
sum += data[i];The arithmetic is identical either way. The only thing that changes is the branch.
Fill data with random values and the data[i] >= 128
test comes out taken about half the time, in no pattern the predictor can learn.
Accuracy collapses toward 50%, and on a deep pipeline each mispredict costs the
full flush. Sort the array first and the branch becomes trivially predictable: it
is not-taken for the whole first half of the array, then taken for the whole second
half. The predictor learns "not-taken" instantly, eats one mispredict at the
crossover point, and then learns "taken" for the rest. Accuracy jumps to ~100% and
the loop runs at full speed.
Nothing about the cache, the memory traffic, or the instruction count changed. The entire speedup is the difference between a predictable branch (nearly free) and an unpredictable one (a flush roughly half the time). It is the cleanest possible demonstration that data-dependent branches on random data are where prediction falls apart, and that the cost is real and measurable.
Branchless code, and when it actually helps
When a branch is intrinsically unpredictable, the fix is sometimes to remove the branch instead of trying to predict it. The same threshold-sum can be written without a conditional jump:
for (int i = 0; i < N; i++) {
int v = data[i];
sum += v & -(v >= 128); // mask: keeps v when true, 0 when false
}The comparison still happens, but it produces a 0/1 value used as a mask rather
than a jump. There is no branch for the predictor to get wrong, so the random-data
case stops mispredicting. On x86 the compiler can also emit a cmov
(conditional move), which selects between two values with no change in control
flow. SIMD code does the same trick at width: compute both sides, build a mask, and
blend.
Branchless is not a free win, though, and reaching for it everywhere is a mistake. A branch the predictor gets right is close to free, and removing it forces the CPU to compute both sides every iteration instead of skipping one. So branchless helps when the branch is unpredictable and the work on each side is cheap; it hurts when the branch is predictable or the skipped side is expensive. The honest workflow is to measure first — a top-down performance analysis will tell you whether your stall is bad-speculation (mispredicts) before you spend effort rewriting code branchless. If the top-down breakdown does not show a bad-speculation bottleneck, going branchless usually just makes the code slower and harder to read.
cmov. Unpredictable
branch with expensive both-sides: you may be stuck — restructure the data (like
sorting) so the branch becomes predictable instead.Common misconceptions
- "Mispredicts only cost a few cycles." Only on shallow pipelines. On Skylake, ~16 cycles. On Zen 5, ~19. On Pentium 4 Prescott, ~30. A 1% mispredict rate on Skylake wastes ~16 cycles per 100 instructions if every branch were mispredicted; in practice modern code's mispredict cost shows up as ~5% of cycles lost.
- "Branch prediction is the same as speculative execution." Related but distinct. Prediction tells you what to fetch next. Speculative execution actually runs the speculative instructions, accumulating side effects in the speculative core state. Both have to be undone on a mispredict.
- "The compiler can hint the predictor." Limited.
__builtin_expecton GCC influences code layout (so the predicted-taken side is the fall-through), but most modern predictors override static hints with their own learned data quickly. The hint helps mostly the first time a branch is seen. - "All branches are predictable." No. Random data, hash-table lookups, virtual calls on data-dependent receivers — these can be intrinsically unpredictable. The fix is sometimes to remove the branch entirely (branchless code with
cmov, SIMD masks) rather than try to predict it.
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| Modern direction-predictor accuracy | 97–99% | Integer code, modern x86 / ARM64 |
| Branch density in typical code | 10–20% | 1 in 5–10 instructions |
| Skylake mispredict penalty | ~16 cycles | Roughly the front-end depth |
| Zen 5 mispredict penalty | ~19 cycles | Slightly deeper to chase frequency |
| Apple M4 P-core mispredict penalty | ~9 cycles | Shorter pipeline, slightly higher recovery cost amortised against bigger ROB |
| Pentium 4 Prescott mispredict penalty | ~30 cycles | The reason Prescott was retired |
| BTB size, mainstream Intel | ~4096 entries | Per core, 2-way to 4-way set-associative |
| RAS depth, modern x86 | 16–32 entries | Apple silicon ~64 |
| Indirect branch mispredict cost | 5–10× direction mispredict | Cross-page fetch + BTB miss + decode + execute |
| Spectre disclosed | January 2018 | Mitigations ongoing 2026 |
Further reading
- Wikipedia — Branch predictor — comprehensive coverage of every predictor variant.
- JILP Championship Branch Prediction — academic competition for predictor designs; the source of many ideas in commercial chips.
- Agner Fog — Microarchitecture of Intel, AMD, and VIA CPUs — Section 3 documents the actual predictor structures (BTB sizes, history lengths) on every recent x86.
- Spectre & Meltdown — the original disclosure pages with worked exploits.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Section 3.3 covers predictor design at graduate-level depth, including Tomasulo + speculation interactions.
- Chips and Cheese — measured branch-predictor accuracy and BTB sizes on every recent CPU release.
- Wikipedia — Speculative execution — the broader concept; covers how predicted-down paths interact with reorder buffers and renaming.