Top-down microarchitecture

Ahmad Yasin's framework for finding the pipeline stall behind a slow piece of code. Every CPU cycle gets classified into one of four buckets — front-end stall, back-end stall, bad speculation, or actually retiring work — and the largest bucket tells you which physical part of the pipeline to chase. Runs on Linux perf with toplev.py, or on Intel VTune.

The CPU pipeline in one paragraph

A modern out-of-order x86 core has two halves. The front end fetches instructions from memory, decodes them, and feeds micro-ops into a queue. The back end reads micro-ops out of that queue, executes them on functional units (ALU, FPU, load/store), and retires them. Each cycle, the CPU either retires a useful instruction or it doesn't — and if it doesn't, it's because the front end didn't deliver a micro-op, or the back end couldn't finish one, or the work it did do was thrown away because of a misprediction.

That two-halves picture hides a lot of machinery, and the machinery is the reason top-down works the way it does. The front end is a small assembly line of its own: it predicts where execution will go next, fetches the bytes from the instruction cache, decodes the variable-length x86 bytes into fixed micro-ops, and parks those micro-ops in a queue. Sitting next to the decoders is a micro-op cache (Intel calls it the DSB) that holds already-decoded loops so the expensive decode step can be skipped on the second pass. The back end is the out-of-order engine: it renames registers to break false dependencies, schedules micro-ops onto execution ports as their inputs become ready, and retires them in program order once they finish. Reorder buffers and scheduler entries are finite, so the back end can fill up and refuse new work even while the front end is still delivering. For the full mechanics of how that engine reorders and retires, the out-of-order execution page is the companion read; this page is about measuring where the resulting cycles went.

The unit top-down counts in is the pipeline slot. A modern Intel core is, roughly, four micro-ops wide: every cycle the front end can issue up to four micro-ops into the back end, so a cycle offers four issue slots. Multiply the width by the cycles a workload ran for and you get the total number of slots that existed during the measurement. Every one of those slots is then classified. A slot that delivered a micro-op which eventually retired is a retiring slot. A slot that delivered a micro-op which got cancelled is a bad-speculation slot. A slot that stayed empty because the front end had nothing to give is a front-end-bound slot. A slot that stayed empty because the back end refused to accept work is a back-end-bound slot. The four counts cover every slot exactly once, which is the whole reason the percentages add to a clean 100.

The four categories

Top-down's contribution is the partition. Every slot goes into exactly one bucket. The percentages sum to 100. The largest bucket is the bottleneck, and each bucket points at a different physical resource and a different class of fix. The picture below is the whole method in one diagram: take all the issue slots a workload consumed, ask one question per slot, and sort.

The top-down decision per slot. Two yes/no questions sort every issue slot into one of four buckets. The biggest bucket is your bottleneck.

Category	What it means	Typical cause
Retiring	The pipeline did productive work this cycle.	None — this is what you want. High retiring % with low IPC suggests the workload is just inherently serial.
Front-end bound	The back end was ready but the front end didn't deliver micro-ops.	Instruction cache misses, branch-target buffer misses, decoder bottlenecks, large hot loops that don't fit in the µop cache.
Back-end bound	Micro-ops were available but the back end couldn't execute them fast enough.	Cache/memory stalls (most common), dependency chains, port contention. Subdivides into memory-bound and core-bound.
Bad speculation	The CPU executed work that got thrown away.	Branch mispredictions, machine clears (memory-ordering violations, self-modifying code). Usually points at unpredictable branches.

Why this partition is the key insight. Before top-down, the usual question was "what's my IPC?" — instructions per cycle, a single number. Low IPC tells you something is wrong; it doesn't tell you what. Top-down replaces one number with a four-way split that points at a specific physical mechanism. The fix for front-end stalls (code layout, instruction cache) is nothing like the fix for back-end stalls (data access pattern, vectorisation), so knowing which one matters more than knowing IPC ever did.

The counters underneath

Top-down is not a model you run in software. It reads physical hardware counters baked into the CPU — the performance monitoring unit, the PMU. The PMU is a small bank of registers that each tick up when a chosen micro-architectural event happens. You program a register to watch "uops issued" or "cycles the reorder buffer was full" and the silicon counts it for free, at full speed, with no instrumentation in your code. The four top-down buckets are not native events; they are simple formulas over a handful of these raw counters, and the formulas are what Yasin's paper specified so that vendors could expose them consistently.

At level 1 the arithmetic is almost embarrassingly direct. Retiring is the share of slots that retired a micro-op, computed as UOPS_RETIRED.RETIRE_SLOTS / (4 × CPU_CLK_UNHALTED) on a four-wide core. Bad speculation is the share of issued slots that never retired plus the recovery bubbles after a flush. The total stall is the slots where nothing issued, and front-end versus back-end is split by asking whether the back end signalled that it was full at the moment the slot went empty — if the reorder buffer was not stalling, the empty slot was the front end's fault. You do not type these formulas yourself; the tool does. But knowing they are ratios of real counters is what tells you why short runs are noise and why the numbers are trustworthy when the run is long enough.
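
As a rough illustration of that arithmetic, the sketch below computes the four level-1 shares from raw counter totals. It assumes a four-wide core and the level-1 event names published for pre-Ice Lake Intel cores; UOPS_ISSUED.ANY, INT_MISC.RECOVERY_CYCLES and IDQ_UOPS_NOT_DELIVERED.CORE are not named in the text above, and newer cores expose different events. The struct and function names are hypothetical, and back-end bound is taken here as the remainder once the other three shares are known.

```c
#include <assert.h>
#include <stdint.h>

#define PIPELINE_WIDTH 4.0  /* assumption: four issue slots per cycle */

/* Raw PMU totals for one measurement (hypothetical container type). */
struct raw_counters {
    uint64_t cycles;              /* CPU_CLK_UNHALTED.THREAD */
    uint64_t uops_issued;         /* UOPS_ISSUED.ANY */
    uint64_t uops_retired_slots;  /* UOPS_RETIRED.RETIRE_SLOTS */
    uint64_t recovery_cycles;     /* INT_MISC.RECOVERY_CYCLES */
    uint64_t uops_not_delivered;  /* IDQ_UOPS_NOT_DELIVERED.CORE */
};

/* The four level-1 buckets, each as a fraction of all issue slots. */
struct level1 {
    double retiring;
    double bad_speculation;
    double frontend_bound;
    double backend_bound;
};

struct level1 topdown_level1(const struct raw_counters *c)
{
    assert(c->cycles > 0);

    /* Total slot budget: width times cycles. */
    double slots = PIPELINE_WIDTH * (double)c->cycles;
    struct level1 r;

    /* Slots whose micro-op eventually retired. */
    r.retiring = (double)c->uops_retired_slots / slots;

    /* Issued-but-never-retired slots plus the recovery bubbles after a flush. */
    r.bad_speculation = ((double)c->uops_issued
                         - (double)c->uops_retired_slots
                         + PIPELINE_WIDTH * (double)c->recovery_cycles) / slots;

    /* Slots left empty while the back end was able to accept work. */
    r.frontend_bound = (double)c->uops_not_delivered / slots;

    /* Whatever remains is attributed to the back end. */
    r.backend_bound = 1.0 - r.retiring - r.bad_speculation - r.frontend_bound;
    return r;
}
```

For example, 1000 cycles with 1000 retired slots, 1200 issued micro-ops, 50 recovery cycles and 600 undelivered slots works out to 25% retiring, 10% bad speculation, 15% front-end bound and 50% back-end bound.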

The buckets are not magic. They are ratios of a few raw PMU counters over the total slot budget. The tool programs the counters and does the division.

One practical wrinkle: there are only a few PMU counter registers, usually four to eight general-purpose ones per logical core. Top-down at deeper levels needs more events than fit at once, so the tool time-multiplexes — it watches one set for a slice, swaps in the next set, and scales the readings back together. That works only if the workload behaves the same across slices, which is another reason the method wants a long, steady run rather than a brief spike. In a cloud VM the hypervisor may hide some counters entirely, so a level that works on bare metal can come back blank on an instance; Brendan Gregg's writeup on which counters survive virtualisation is the reference for that.
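
The scaling step can be sketched as below. It assumes the convention Linux perf uses, where each event reports both how long it was enabled and how long it actually occupied a counter; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the full-run count of an event that only occupied a hardware
 * counter for part of the run: scale the raw count by enabled/running time. */
double scale_multiplexed(uint64_t raw_count,
                         uint64_t time_enabled_ns,
                         uint64_t time_running_ns)
{
    assert(time_running_ns <= time_enabled_ns);

    if (time_running_ns == 0)
        return 0.0;  /* never scheduled: no basis for an estimate */

    return (double)raw_count * (double)time_enabled_ns
                             / (double)time_running_ns;
}
```

The extrapolation treats the event rate during the unobserved slices as equal to the rate during the observed ones, which is exactly why a bursty workload produces misleading deep-level numbers.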

Reading IPC alongside the buckets

IPC — instructions retired per cycle — is the number top-down was built to replace, but you should still read it, because it gives the buckets their scale. A four-wide core can retire at most about four instructions per cycle, so an IPC near 3 to 4 means the machine is running close to flat out and there is little room left at the micro-architectural level. An IPC under 1 means the core spends most cycles waiting, and the buckets tell you what for. The trap is reading either number alone. High retiring with high IPC is a healthy, busy core. High retiring with low IPC is the most easily missed result: the pipeline is not stalling on anything fixable, it is just doing a long chain of dependent work that cannot be done faster on this hardware. When you see that combination, the honest conclusion is that micro-optimisation is finished and the next move is algorithmic.

The mirror image is also worth naming. Low retiring with low IPC is the common, hopeful case — the core is stalling, the stall has a category, and the category has a fix. Low retiring with surprisingly high IPC usually means a lot of issued work is being thrown away, which shows up as a fat bad-speculation bucket and points straight at branch prediction. Carry IPC as the headline and the four buckets as the diagnosis; neither is complete without the other.

Running it

On Linux, the easiest entry point is Andi Kleen's toplev.py — a wrapper around perf stat that reads the right PMU counters for the CPU you're on and prints the top-down breakdown directly. On Intel machines you can also use VTune, which renders the same hierarchy in a GUI.

# Install perf and toplev (pmu-tools)
sudo apt install linux-tools-common linux-tools-$(uname -r)
git clone https://github.com/andikleen/pmu-tools
export PATH="$PATH:$(pwd)/pmu-tools"

# Run top-down at level 1 (the four buckets); toplev measures for as long as the workload runs, so use one that lasts about 10 seconds
toplev -l1 -v --no-desc -- ./your-workload

# Example output
# FE             Frontend_Bound:        12.4 %
# BAD            Bad_Speculation:        4.1 %
# BE             Backend_Bound:         63.8 %  <-- bottleneck
# RET            Retiring:              19.7 %

# Drill into level 2 — splits BE into Memory_Bound vs Core_Bound
toplev -l2 -v --no-desc -- ./your-workload

# BE/Mem  Backend_Bound.Memory_Bound:    51.2 %  <-- memory!
# BE/Core Backend_Bound.Core_Bound:      12.6 %

# Drill into level 3 — splits Memory_Bound by cache level
toplev -l3 -v --no-desc -- ./your-workload

# BE/Mem/L1   L1_Bound:        8.4 %
# BE/Mem/L2   L2_Bound:        4.1 %
# BE/Mem/L3   L3_Bound:       15.8 %
# BE/Mem/DRAM DRAM_Bound:     22.9 %   <-- DRAM-bound: working set spills L3

The hierarchy is what makes top-down useful in practice. Level 1 tells you which of the four buckets dominates; level 2 splits the bound bucket into its sub-categories; level 3 splits further. You don't have to memorise the hundred PMU events Intel exposes — toplev picks the right ones for each level and your specific CPU.

What each bucket sends you to fix

The four buckets are useful because each one points at a different part of the chip and a different kind of code change. Reading the dominant bucket as a direction, not an answer, is the skill.

Front-end bound means the back end was idle and willing but the front end could not feed it. The usual culprits are instruction supply problems: the hot code does not fit in the instruction cache, so fetches miss and stall; or a big loop overflows the micro-op cache and pays for decode every iteration; or the branch-target buffer mispredicts where to fetch next and the front end fetches the wrong bytes. The fixes live in code layout rather than data. Profile-guided optimisation reorders functions so hot paths sit together and cold paths fall out of the working set. Inlining decisions matter both ways — too little inlining costs call overhead, too much bloats the hot loop past the micro-op cache. When the front end is bound, you are tuning the shape and placement of instructions in memory.

Back-end bound is the common case and it splits in two. The back end stalled either because data was not ready (memory-bound) or because the execution resources themselves were the limit (core-bound). Memory-bound is the more frequent of the two and subdivides again by cache level: an L1 or L2 stall is a near miss you can often prefetch or restructure around; an L3 or DRAM stall means the working set has outgrown the cache and the fix is to make the data smaller or access it more locally. Cache blocking, switching array-of-structs to struct-of-arrays so each cache line carries only the fields you touch, and software prefetch all attack memory-bound code. Core-bound means the data was available but the execution units could not get through the micro-ops fast enough: a long dependency chain where each operation waits on the previous one, or contention for a particular execution port. The cure there is to shorten dependency chains, hand-vectorise so more work happens per instruction, or spread work across ports that are sitting idle.
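
To make the array-of-structs to struct-of-arrays change concrete, here is a minimal sketch. The particle types and function names are hypothetical, and the "8 useful bytes" remark assumes the 64-byte cache line typical of current x86 parts.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structs: a loop that only needs `mass` still pulls the
 * position and velocity fields through the cache with it. */
struct particle_aos {
    double x, y, z;
    double vx, vy, vz;
    double mass;
    double pad;          /* rounds the struct up to 64 bytes */
};

double total_mass_aos(const struct particle_aos *p, size_t n)
{
    assert(p != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].mass;   /* 8 useful bytes out of each 64-byte struct */
    return sum;
}

/* Struct-of-arrays: each field is contiguous, so every cache line
 * fetched by the loop below is filled with mass values only. */
struct particles_soa {
    double *x, *y, *z;
    double *vx, *vy, *vz;
    double *mass;
    size_t count;
};

double total_mass_soa(const struct particles_soa *p)
{
    assert(p != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < p->count; i++)
        sum += p->mass[i];
    return sum;
}
```

Both functions return the same total; the difference is how many bytes have to travel from memory per useful value, which is what a DRAM_Bound or L3_Bound reading is measuring.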

Drilling the back-end bucket. Level 2 asks memory or core; level 3 names the cache level or the resource. The right column is the kind of fix each leaf calls for.

Bad speculation means the core ran ahead on a guess and the guess was wrong, so the work was discarded and the pipeline had to refill from the right path. Almost always this is branch misprediction on a branch the predictor cannot learn — data-dependent conditions, a comparison on unsorted input, a virtual dispatch that hits a different target each call. The fixes are about removing the branch or making it predictable: replace a conditional with a branchless cmov or a bitwise select, sort the data so the branch goes one way for long runs, or restructure the hot path so the unpredictable decision happens once instead of per element. The deeper background on why some branches are learnable and others are not lives on the branch prediction page; for top-down purposes the signal is simple — a fat bad-speculation bucket means find the branch the predictor keeps missing.
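
A small sketch of the bitwise-select idea, with hypothetical function names: both versions sum the elements at or above a threshold, but the second turns the comparison into a mask instead of a branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy version: on random data the `if` is close to a coin flip,
 * so the predictor misses often. */
uint64_t sum_at_least_branchy(const uint32_t *v, size_t n, uint32_t threshold)
{
    assert(v != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            sum += v[i];
    }
    return sum;
}

/* Branchless version: the comparison yields 0 or 1, and subtracting it
 * from zero gives an all-zeros or all-ones mask. ANDing with the mask
 * keeps or discards the element without a data-dependent branch. */
uint64_t sum_at_least_branchless(const uint32_t *v, size_t n, uint32_t threshold)
{
    assert(v != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = (uint32_t)0 - (uint32_t)(v[i] >= threshold);
        sum += v[i] & mask;
    }
    return sum;
}
```

An optimising compiler may already turn the branchy loop into a cmov or a vectorised compare, so it is worth checking the generated assembly and re-running toplev before assuming the rewrite changed anything.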

Retiring is the bucket you want to grow, but it is also the one that ends the investigation. A high retiring share with healthy IPC means the core is busy and there is nothing micro-architectural left to chase. A high retiring share with low IPC, as covered above, means the work is inherently serial and the lever has moved from the hardware to the algorithm. Either way, when retiring dominates, top-down has done its job and is telling you to stop.

Patterns and what to do about them

A few characteristic signatures and the kind of fix each one points at. None of these are universal — top-down narrows the search; the fix still has to come from understanding the code.

Signature	Likely cause	Direction to chase
Backend bound > 50%, DRAM_Bound dominates	Working set doesn't fit in cache.	Cache blocking, smaller hot data structures, AoS→SoA layout, prefetching.
Backend bound, L3_Bound dominates	Hot loop spills L2.	Loop tiling, reduce per-element memory traffic, vectorisation that streams.
Backend bound, Core_Bound dominates	Long dependency chains, port contention.	Break dependency chains, hand-vectorise with SIMD intrinsics, check if a different functional unit is free.
Front-end bound, ICache or BACLEARS high	Code is too large or branches mispredict at decode.	PGO (profile-guided optimization), function inlining decisions, code layout reordering, smaller hot loops.
Bad speculation > 10%	Unpredictable branches.	Branchless code (cmov, bitwise tricks), branch hints, restructure to make outcomes more predictable, sort data so branches predict well.
Retiring > 70%, IPC low	Workload inherently serial.	Algorithmic change is the only remaining lever — micro-optimisation has run out.
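
For the Core_Bound row, "break dependency chains" can be as simple as the sketch below: a floating-point reduction split across several accumulators. The function names are hypothetical, and four accumulators is an arbitrary choice for illustration rather than a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add has to wait for the previous add to finish,
 * so the loop runs at the latency of a single floating-point add. */
double sum_single_chain(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Four accumulators: four independent chains that the out-of-order
 * engine can overlap, combined once at the end. */
double sum_four_chains(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum changes the order of the additions, so floating-point results can differ in the last bits from the single-chain version. That reordering is also why compilers will not do this for you without a flag such as -ffast-math.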

A worked example

A hash-map lookup loop, profiled before and after a small change.

# Before: linear probing hash map, capacity = 1M, hot loop reads
#         32 entries scattered across the table per outer iteration.

toplev -l3 ./hash-bench
# Frontend_Bound:                       9 %
# Bad_Speculation:                      6 %
# Backend_Bound:                       71 %
#   Backend_Bound.Memory_Bound:        61 %
#     Memory_Bound.DRAM_Bound:         42 %   <-- DRAM!
#     Memory_Bound.L3_Bound:           14 %
# Retiring:                            14 %
# IPC: 0.39

# Diagnosis: DRAM-bound. Each lookup misses L3 and waits on DRAM (on the order of 200 cycles).
# Fix: software prefetch one iteration ahead.

# After: __builtin_prefetch(table + hash(next_key)) one iter ahead.

toplev -l3 ./hash-bench
# Frontend_Bound:                      10 %
# Bad_Speculation:                      6 %
# Backend_Bound:                       42 %
#   Backend_Bound.Memory_Bound:        28 %
#     Memory_Bound.DRAM_Bound:         11 %   <-- much better
#     Memory_Bound.L3_Bound:           13 %
# Retiring:                            42 %
# IPC: 1.18

# IPC up about 3x (0.39 -> 1.18). DRAM_Bound dropped from 42% to 11% because prefetch
# overlaps the miss latency with useful work.

This is the shape of a top-down session. Run, read the breakdown, identify the largest bucket, decide whether the fix is worth doing, change one thing, re-run, compare. The change-one-thing discipline matters — if you modify two things at once, top-down can't tell you which one helped.
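
The benchmark's source is not shown here, so the following is only a sketch of what a "prefetch one iteration ahead" change could look like. The table layout, hash function and every name in it are hypothetical; it assumes key 0 marks an empty slot, a power-of-two capacity, a table that is never completely full, and a GCC or Clang compiler for __builtin_prefetch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical linear-probing table: key 0 means "empty slot". */
struct entry { uint64_t key; uint64_t value; };
struct table { struct entry *slots; size_t capacity; /* power of two */ };

/* Slot where probing for `key` starts (multiplicative hash). */
static size_t home_slot(const struct table *t, uint64_t key)
{
    return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 32) & (t->capacity - 1);
}

void table_insert(struct table *t, uint64_t key, uint64_t value)
{
    assert(key != 0);
    size_t i = home_slot(t, key);
    while (t->slots[i].key != 0 && t->slots[i].key != key)
        i = (i + 1) & (t->capacity - 1);    /* linear probing */
    t->slots[i].key = key;
    t->slots[i].value = value;
}

uint64_t table_lookup(const struct table *t, uint64_t key)
{
    size_t i = home_slot(t, key);
    while (t->slots[i].key != 0) {
        if (t->slots[i].key == key)
            return t->slots[i].value;
        i = (i + 1) & (t->capacity - 1);
    }
    return 0;                               /* not found */
}

/* Sum the values for a batch of keys. Before each lookup, request the
 * next key's home slot so its cache miss overlaps the current probe. */
uint64_t sum_lookups_prefetch(const struct table *t,
                              const uint64_t *keys, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&t->slots[home_slot(t, keys[i + 1])]);
        sum += table_lookup(t, keys[i]);
    }
    return sum;
}
```

The prefetch is only a hint and costs an extra hash per iteration, so it pays off when the lookups really are missing cache; on a table that fits in L2 it can make things slightly worse, which is another reason to re-run toplev after the change.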

When this is the right tool

Top-down earns its keep on hot inner loops — the small piece of code that a profiler says runs for most of the wall-clock time. That is the regime where a few percent of micro-architectural efficiency is worth chasing, because the loop runs billions of times and any per-iteration win multiplies. Numeric kernels, parsers, serialisers, hash tables, codecs, the inner step of a simulation: these are the workloads where the bottleneck lives inside the pipeline and where knowing whether it is memory or branches or ports changes what you do next.

It is the wrong tool for code that is not CPU-bound or not hot. A request handler that spends its life waiting on a database has no micro-architectural story worth telling; a function that runs once at startup is not worth the analysis. The order of operations matters: a sampling profiler finds the hot function first, then top-down explains why that function is slow at the hardware level, and only then do you reach for a fix. Skipping the profiler and running top-down on a whole program gives you an average across everything, which usually means a muddy back-end-bound result that points nowhere. Narrow to the loop, measure the loop, fix the loop.

Top-down also pairs naturally with a higher-level model. The roofline model answers the coarser question — is this kernel limited by memory bandwidth or by compute? — by plotting achieved performance against arithmetic intensity. Roofline tells you which ceiling you are under; top-down tells you which pipeline mechanism is holding you there. A kernel that roofline calls memory-bound will usually show a fat memory-bound bucket in top-down, and the two views reinforce each other: roofline sets the budget, top-down spends the investigation. Reach for roofline when you are deciding whether an optimisation is even possible, and for top-down when you have decided it is and need to know where to cut.

Where top-down falls short

Top-down is a CPU-pipeline tool. It assumes the bottleneck lives in the processor, and it's brilliant when it does. There are several cases where it isn't the right tool:

The bottleneck isn't the CPU. If the process is blocked on I/O, on a lock, or on a network round trip, the CPU is idle and top-down sees almost nothing. Run the USE method (utilisation, saturation, errors) first to confirm the CPU is the resource that's hot.
Cross-thread contention. Top-down profiles cycles per core, not the global picture. A lock that bounces between cores shows up as back-end stalls on every core but doesn't tell you the contention is the cause; profiling with lock tracing is the better tool there.
Workload too short. The PMU counters need a meaningful sample. Anything under about 100 ms of CPU time produces noise. Wrap a hot kernel in a loop that takes 10 seconds to measure cleanly.
Non-Intel hardware. The four categories began as an Intel formalism; AMD's counters support a similar but not identical breakdown; ARM has its own performance-counter taxonomy. toplev targets Intel CPUs only.
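
For the "workload too short" case, one way to stretch a short kernel into a measurable run is a repeat harness along these lines. This is a sketch assuming a POSIX system with clock_gettime; repeat_for and bump_counter are hypothetical names, and bump_counter only stands in for a real kernel.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Call kernel(arg) repeatedly until at least min_seconds have elapsed.
 * Returns the number of calls made (always at least one). */
uint64_t repeat_for(void (*kernel)(void *), void *arg, double min_seconds)
{
    assert(kernel != NULL && min_seconds >= 0.0);
    uint64_t calls = 0;
    double start = now_seconds();
    do {
        kernel(arg);
        calls++;
    } while (now_seconds() - start < min_seconds);
    return calls;
}

/* Placeholder kernel: increments the counter that arg points to. */
void bump_counter(void *arg)
{
    (*(uint64_t *)arg)++;
}
```

The clock is read once per call, so this suits kernels that take at least microseconds each; for a very small kernel the timing call itself would show up in the top-down numbers, and running a fixed batch of iterations between clock reads is the safer shape.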

Production checklist

Confirm the CPU is the bottleneck first. If the USE method shows the CPU is idle, top-down has nothing useful to say.
Use toplev -l1 as the entry point. Read the four percentages. The largest one names the bucket.
Drill one level at a time. -l2 after -l1; -l3 when level 2 narrows to one bucket. Don't start at -l5 — you'll drown in numbers.
Always measure long enough. 10 seconds of CPU time minimum. Short measurements lie.
Change one thing at a time. Re-run after each change. Top-down's value is the comparison; if you change two things you've lost the signal.
Match the fix to the bucket. Front-end ↔ code layout. Back-end memory ↔ data layout. Back-end core ↔ dependency chains. Bad speculation ↔ unpredictable branches. Don't generalise.
If you hit "retiring high, IPC low", you're done micro-optimising. The workload is serial. Algorithmic change is the only remaining lever.