The roofline model

Sam Williams' single-chart model for compute kernels. One axis is arithmetic intensity — operations per byte of memory traffic. The other is throughput. Two ceilings cap the chart: the compute peak (FLOPS the chip can do) and the memory-bandwidth peak (FLOPS the bandwidth can sustain at this intensity). Plot the kernel; the closer ceiling is the bottleneck.

Why one chart is enough

Most performance work starts with a vague feeling — "this kernel is slow" — and no way to tell whether slow means the chip is starved of data or the chip is simply maxed out. Those two failures look identical from the outside: the wall-clock time is high either way. They need opposite fixes. If the chip is starved, you spend your effort cutting memory traffic. If the chip is maxed out, cutting memory traffic does nothing and you spend your effort on the arithmetic units. Pick the wrong story and you can burn a week tuning the thing that was never the limit.

The roofline model exists to settle that question before you write any tuning code. It puts two hard machine limits on a single chart and asks one thing of your kernel: how much arithmetic does it do per byte it moves? Answer that, plot the point, and the chart tells you which limit you are pressed against and how far you are from it. Sam Williams, Andrew Waterman, and David Patterson published it in 2009 for multicore CPUs, and it has aged well precisely because the two limits it draws — peak arithmetic and peak bandwidth — are the two limits every processor has, CPU or GPU. The numbers move from chip to chip; the shape does not.

The chart is a back-of-the-envelope tool with teeth. It will not tell you the exact runtime of your kernel. It will tell you the best runtime this machine can give your kernel as written, and whether you are near it. That ceiling is the useful part. A kernel running at three percent of its ceiling has a story to chase. A kernel at eighty percent does not, and the chart stops you wasting time on it.

The model in one diagram

The roofline chart is log–log. The horizontal axis is arithmetic intensity in FLOPS per byte. The vertical axis is throughput in GFLOPS. Two lines form the "roof":

A flat horizontal line at the machine's peak FLOPS — the compute roofline. No kernel can run faster than this no matter how arithmetic-dense it is.
A sloped line at peak_bandwidth × intensity — the memory roofline. For low-intensity kernels, this is the binding ceiling: you can only do as much arithmetic as the memory bus can feed you.

Read the sloped line carefully, because it is the part people miss. The line is not the bandwidth itself. It is bandwidth converted into a throughput limit at a given intensity. If your kernel does two FLOPS for every byte it touches, and the bus moves a hundred billion bytes a second, then the most arithmetic the bus can support is two hundred billion FLOPS a second. Double the intensity and the supported throughput doubles, which is why the line slopes up. The bus speed never changed; you just got more arithmetic out of each byte it delivered.
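The conversion from bandwidth to a throughput limit can be written down in a few lines. This is a minimal sketch, not part of any profiler: the function names are illustrative, bandwidth is in GB/s, intensity in FLOPS per byte, and throughput in GFLOPS.

```c
#include <assert.h>

/* Memory roof: bandwidth (GB/s) turned into a throughput limit (GFLOPS)
   at a given arithmetic intensity (FLOPS per byte). */
double memory_roof_gflops(double bandwidth_gbs, double intensity)
{
    assert(bandwidth_gbs > 0.0 && intensity >= 0.0);
    return bandwidth_gbs * intensity;
}

/* Roofline bound: the lower of the sloped memory roof and the flat
   compute roof at this intensity. */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double intensity)
{
    assert(peak_gflops > 0.0);
    double memory_roof = memory_roof_gflops(bandwidth_gbs, intensity);
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

With the numbers from the paragraph above, memory_roof_gflops(100.0, 2.0) gives 200 GFLOPS; doubling the intensity to 4 doubles the result, while the bandwidth argument stays the same.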

The two lines meet at the ridge point: the intensity at which the chip can simultaneously consume all the bandwidth and saturate the FLOPS. Left of the ridge, kernels are memory-bound — the sloped line is lower, so bandwidth caps you first. Right of the ridge, they are compute-bound — the flat line is lower, so the arithmetic units cap you first. The ridge is the break-even intensity for the machine, and it is a single number you can compute once: peak FLOPS divided by peak bandwidth. The diagram below is the canonical shape.

Notice where the sample kernels land. Stream copy sits far left and low — it does almost no arithmetic, so the memory roof pins it near the floor. Dense matrix multiply sits far right and high — it reuses data heavily, so it lives under the flat compute roof. The blocked stencil and sparse matrix–vector product fall in between, which is where the interesting tuning happens: close enough to the ridge that moving an inch in either direction changes the answer.
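The ridge point is simple enough to compute directly. The sketch below assumes peaks expressed in GFLOPS and GB/s; the function names are made up for illustration.

```c
#include <assert.h>

/* Ridge intensity in FLOPS per byte: peak GFLOPS divided by peak GB/s. */
double ridge_intensity(double peak_gflops, double bandwidth_gbs)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}

/* Returns 1 if a kernel with this intensity sits left of the ridge
   (memory-bound), 0 if it sits at or right of it (compute-bound). */
int is_memory_bound(double intensity, double peak_gflops, double bandwidth_gbs)
{
    return intensity < ridge_intensity(peak_gflops, bandwidth_gbs);
}
```

For a machine with a 2,000 GFLOPS peak and 100 GB/s of bandwidth, the ridge sits at 20 FLOPS per byte, so a kernel at 0.125 FLOPS per byte is memory-bound and one at 50 is compute-bound.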

Arithmetic intensity

Arithmetic intensity (sometimes "operational intensity") is the workload's ratio of useful arithmetic to memory traffic. You measure two things: (1) how many floating-point operations the kernel performs, (2) how many bytes it loads from or stores to DRAM. Divide the first by the second. The result is intrinsic to the algorithm and the way it touches memory; it does not depend on the machine's peak numbers.

That last sentence is the reason intensity is worth defining carefully. The number of FLOPS your kernel does is fixed by the maths it computes. A thousand-by-thousand matrix multiply does two billion floating-point operations whether it runs on a laptop or a supercomputer. The bytes side is where everything interesting happens, because "bytes moved" is not fixed by the maths — it is fixed by how the kernel touches memory. The same algorithm can move sixteen gigabytes or half a gigabyte depending on whether it reuses data while it is still in cache. Intensity is the dial that this reuse turns, and the whole craft of memory-aware optimisation is the craft of turning that dial up.

There is a subtlety in which bytes you count. Roofline draws a separate roof for each level of the memory system, so you have a choice of denominator. Count only the bytes that reach DRAM and you get the DRAM intensity, which is the one that matters when your working set spills out of cache. Count the bytes that cross the L2 boundary and you get a different intensity against a higher roof: the intensity is usually lower, because L2 serves the cache hits as well as the misses that go on to DRAM, and the roof is higher, because L2 is faster than DRAM. The right level to look at is the one your kernel is actually bound by, and a good profiler will plot all of them so you can see which roof is closest. For a first pass, DRAM bytes is the right denominator, because for most code the trip to main memory is the slow one.
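One way to pick the level is to compute the roof each level imposes and take the lowest. The sketch below assumes you already have per-level byte counts from a profiler; the struct, its field names, and any bandwidth figures you feed it are hypothetical, not measurements of a real machine.

```c
#include <assert.h>
#include <stddef.h>

/* One level of the memory system as seen by a single kernel run. */
struct mem_level {
    const char *name;      /* label only, e.g. "L2" or "DRAM" */
    double bandwidth_gbs;  /* peak bandwidth of this level, GB/s */
    double bytes;          /* bytes the kernel moved across this level */
};

/* Index of the level whose roof, bandwidth * (flops / bytes), is lowest.
   That level is the one the kernel is bound by. */
size_t binding_level(const struct mem_level *levels, size_t count, double flops)
{
    assert(levels != NULL && count > 0 && flops > 0.0);
    assert(levels[0].bytes > 0.0);

    size_t best = 0;
    double best_roof = levels[0].bandwidth_gbs * (flops / levels[0].bytes);
    for (size_t i = 1; i < count; i++) {
        assert(levels[i].bytes > 0.0);
        double roof = levels[i].bandwidth_gbs * (flops / levels[i].bytes);
        if (roof < best_roof) {
            best_roof = roof;
            best = i;
        }
    }
    return best;
}
```

A kernel that moves four times as many bytes through L2 as through DRAM can still be DRAM-bound if L2 is ten times faster; the comparison of roofs, not of byte counts, decides it.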

Kernel	FLOPS	Bytes from DRAM	Intensity (FLOPS/byte)
STREAM copy (a[i] = b[i])	0	16 (one load + one store)	~0
AXPY (y[i] = a*x[i] + y[i])	2	24 (two loads + one store)	0.083
SpMV (sparse matrix × vector)	~2 per non-zero	~12 per non-zero	0.16
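Dense GEMM (N×N matmul, naive, matrices larger than cache)	2N³	on the order of N³ elements (operands re-streamed with no reuse)	~0.1–0.25 with 4-byte floats, independent of N
Dense GEMM with cache blocking	2N³	~2N³/b elements for block size b (tiles reused from cache)	up to ~b/4 with 4-byte floats, grows with block size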
3D stencil (untiled)	~10 per point	~80 per point	0.125
3D stencil (tiled)	~10 per point	~16 per point	0.625
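The streaming rows of the table follow from a per-element count. A minimal sketch, assuming every load and store goes to DRAM (no cache reuse) and using an illustrative function name:

```c
#include <assert.h>
#include <stddef.h>

/* Intensity of a streaming kernel from its per-element counts:
   FLOPS per element divided by bytes moved per element. */
double per_element_intensity(double flops, int loads, int stores,
                             size_t element_bytes)
{
    assert(flops >= 0.0 && loads >= 0 && stores >= 0);
    assert(loads + stores > 0 && element_bytes > 0);
    return flops / ((double)(loads + stores) * (double)element_bytes);
}
```

AXPY on doubles is per_element_intensity(2.0, 2, 1, sizeof(double)), which is 2/24 or about 0.083, matching the table. Passing sizeof(float) instead doubles the intensity, which is the arithmetic behind "smaller types" as a memory optimisation.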

Why blocking and tiling exist. Cache blocking and loop tiling don't reduce the FLOPS the kernel performs; they reduce the bytes loaded from DRAM by reusing what's already in cache. That moves the kernel right on the roofline chart — from memory-bound territory toward the ridge. Same algorithm, same answer; the only thing that changed is the memory access pattern. This is why "data layout" usually beats "instruction count" in modern performance work.

Reading the chart

Plotting a kernel on a roofline chart tells you three things in one glance — and each one points at a different next step.

Which side of the ridge? Left → memory-bound; the kernel's intensity is too low to keep the FLOPS units busy. Right → compute-bound; bandwidth isn't the limit.
How close to the ceiling? A kernel running at 80% of its applicable roofline is doing about as well as it can. A kernel at 10% has up to 10× of headroom, and the strategy for claiming it depends on which roofline applies.
What does increasing intensity buy? For memory-bound kernels, raising intensity (blocking, tiling, smaller types, AoS→SoA) walks the kernel along the sloped memory roofline up to the ridge. For compute-bound kernels, more intensity buys nothing — the FLOPS ceiling is the cap.

The first two questions, taken together, decide whether you should be tuning at all. There are really three positions a kernel can hold. It can sit well below the sloped roof on the left, which means it is memory-bound and leaving bandwidth on the table — usually a sign of a poor access pattern that does not even use the bandwidth it could. It can sit right on a roof, which means it is doing as well as the machine allows along that axis and your only moves are to raise intensity or change machine. Or it can sit in open space below both roofs, which is the unhappy middle: neither bandwidth nor arithmetic is saturated, and the limit is something roofline does not draw — latency stalls, poor vectorisation, branch misprediction, or simply not enough parallelism to hide the memory latency. That last case is where roofline hands you off to top-down analysis, which splits the wasted cycles inside the core into named buckets.
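The first two questions reduce to a small calculation once you have a measured throughput and an intensity. The sketch below is one way to package it; the struct and function names are illustrative, and the 80% threshold mentioned in the text is left to the caller.

```c
#include <assert.h>

/* What one plotted point says about a kernel. */
struct roofline_reading {
    int memory_bound;    /* 1 if the intensity is left of the ridge */
    double roof_gflops;  /* applicable ceiling at this intensity */
    double fraction;     /* achieved throughput / applicable ceiling */
    double headroom;     /* applicable ceiling / achieved throughput */
};

struct roofline_reading read_chart(double achieved_gflops, double intensity,
                                   double peak_gflops, double bandwidth_gbs)
{
    assert(achieved_gflops > 0.0 && intensity > 0.0);
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);

    struct roofline_reading r;
    double memory_roof = bandwidth_gbs * intensity;
    r.memory_bound = memory_roof < peak_gflops;
    r.roof_gflops = r.memory_bound ? memory_roof : peak_gflops;
    r.fraction = achieved_gflops / r.roof_gflops;
    r.headroom = r.roof_gflops / achieved_gflops;
    return r;
}
```

A kernel measured at 40 GFLOPS with an intensity of 4 on a 2,000 GFLOPS, 100 GB/s machine reads as memory-bound under a 400 GFLOPS roof, at 10% of it. A low fraction on its own does not say why the kernel is below the roof; that is where the hand-off to top-down analysis happens.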

The strategy split is worth stating plainly because it is the entire payoff of the model. Memory-bound means your problem is traffic: you reduce the bytes by reusing data, shrinking types, or laying data out so the cache lines you pull in are fully used rather than half wasted. Compute-bound means your problem is the arithmetic units: you make sure every cycle issues the widest vector instruction it can, that fused multiply-add is in play, that all the cores or warps are busy. These are different teams of techniques and they barely overlap. The chart is the thing that tells you which playbook to open, and it does so before you have spent any effort guessing.

Where the rooflines come from

The two ceiling values are machine constants. You compute them once per machine, or look them up.

Ceiling	How to compute	Example (Intel Xeon Gold 6248R)
Compute roofline (single precision)	cores × clock × FLOPS_per_cycle	24 cores × 3.0 GHz × 32 SP-FLOPS/cycle (AVX-512 FMA) = ~2,300 GFLOPS
Compute roofline (double precision)	same, halved (AVX-512 DP does 16 ops/cycle)	~1,150 GFLOPS
Memory bandwidth roofline	channels × DDR rate × bytes/transfer	6 channels × DDR4-2933 × 8 B = ~141 GB/s
Ridge intensity	compute_peak / bandwidth_peak	2,300 / 141 ≈ 16 FLOPS/byte (single precision)

Two things worth noting. First, the nominal peaks are theoretical; real bandwidth is usually 60–80% of nameplate. Tools like Intel Advisor measure the achievable roofline using benchmarks. Second, a chip has multiple memory rooflines — one for each cache level. The full chart can show L1, L2, L3, and DRAM rooflines stacked together; a kernel that's "L3-bound" is limited by L3 bandwidth, not DRAM.
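The nameplate formulas in the table are plain products. A minimal sketch with illustrative function names; the inputs are whatever the datasheet gives you, and the results are theoretical peaks, not the empirical ones a benchmark would report.

```c
#include <assert.h>

/* Theoretical compute peak in GFLOPS:
   cores * clock (GHz) * FLOPS per cycle per core. */
double compute_peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Theoretical memory peak in GB/s:
   channels * transfer rate (MT/s) * bytes per transfer, scaled from MB/s. */
double memory_peak_gbs(int channels, double megatransfers_per_s,
                       int bytes_per_transfer)
{
    assert(channels > 0 && megatransfers_per_s > 0.0 && bytes_per_transfer > 0);
    return channels * megatransfers_per_s * bytes_per_transfer / 1000.0;
}
```

Plugging in the table's inputs, compute_peak_gflops(24, 3.0, 32) is 2,304 GFLOPS and memory_peak_gbs(6, 2933.0, 8) is about 140.8 GB/s, which gives the ridge of roughly 16 FLOPS per byte.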

The ridge intensity in that table — about sixteen FLOPS per byte for single precision — is the single most telling number on the whole chart, so it is worth dwelling on. It says that to keep this Xeon's arithmetic units fed, your kernel must do sixteen useful floating-point operations for every byte it pulls from DRAM. Anything below that and bandwidth runs out before the arithmetic does. Sixteen sounds modest until you count how many real kernels clear it. Adding two arrays clears one FLOP per several bytes. A dot product is barely better. Most of the everyday numerical work an engineer writes is nowhere near sixteen, which is the quiet reason the next section is true.

Why most real code is memory-bound

The ridge has crept rightward for thirty years. Arithmetic throughput has grown far faster than memory bandwidth — more cores, wider vectors, fused multiply-add, and on GPUs thousands of lanes — while the bytes per second a chip can pull from DRAM has grown slowly by comparison. The gap between the two compounds every generation. The practical effect is that the ridge intensity keeps rising, so the bar a kernel must clear to be compute-bound keeps moving up, and more and more code falls to the left of it. A kernel that was compute-bound on a machine from a decade ago can be memory-bound on today's, having changed not one line.

This is why "data layout beats instruction count" has become the default advice in modern performance work. The arithmetic units are usually idle, waiting for data. Shaving an instruction off the inner loop helps a compute-bound kernel and does nothing for a memory-bound one, which is most of them. Reducing traffic — fewer cache misses, fuller cache lines, reused data, smaller types — helps the kernel that is actually waiting. The roofline chart is, in a sense, a formal argument for why the boring memory work pays off more often than the clever arithmetic work.

It also reframes what a cache is for. A cache does not make memory faster; it lets you avoid the trip to DRAM when the data is already nearby, which is the same as raising the kernel's effective intensity. Every technique in the locality toolbox — blocking, tiling, fusing passes, choosing structure-of-arrays over array-of-structures — is a way of getting more arithmetic out of each byte that crosses the slow boundary. The memory hierarchy page covers the boundaries themselves; roofline tells you when crossing one less often is the win to chase.
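The layout point can be made concrete with a small example. The particle struct below is hypothetical and chosen only for illustration: a loop that needs one field out of four still drags the other three across the memory boundary in the array-of-structures layout.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures: x is interleaved with fields this loop never reads. */
struct particle { float x, y, z, mass; };

float sum_x_aos(const struct particle *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;   /* each cache line fetched also carries y, z, mass */
    return sum;
}

/* Structure-of-arrays: the x values are contiguous, so every byte
   fetched is a byte the loop uses. */
float sum_x_soa(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Fraction of the fetched bytes that the x-only loop uses in the AoS layout. */
double aos_useful_fraction(void)
{
    return (double)sizeof(float) / (double)sizeof(struct particle);
}
```

Both loops do the same arithmetic, but the AoS version uses only a quarter of the bytes it pulls in, so its effective intensity for this loop is a quarter of the SoA version's.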

The rule of thumb. If you have not measured anything yet, assume your kernel is memory-bound. You will be right more often than not, and the first experiment — does cutting traffic speed it up? — is cheap to run and tells you immediately whether the assumption held.

Roofline and LLM inference

The clearest modern example of a memory-bound workload is the decode phase of a large language model. When a model generates text it runs in two regimes that sit on opposite sides of the ridge, and seeing them on the roofline chart explains why serving these models is a bandwidth problem rather than an arithmetic one.

During prefill — when the model first reads your whole prompt — it processes many tokens at once. Each weight loaded from memory is multiplied against many token vectors before it is discarded, so the arithmetic intensity is high and prefill sits to the right of the ridge, compute-bound. This is the part that uses the GPU's enormous FLOPS, and it is why long prompts cost compute time roughly in proportion to their length.

During decode — generating one token at a time — the picture inverts. To produce a single token the model must read its entire set of weights from memory: tens or hundreds of gigabytes, every step. But each weight is now used in just one small matrix–vector product before being thrown away, because there is only one new token in flight. The arithmetic per byte collapses. Decode lands far to the left of the ridge, deeply memory-bound, and the GPU's arithmetic units sit mostly idle while the memory system grinds through the weights. The speed of token generation is set almost entirely by how fast the hardware can stream those weights, which is why a chip's memory bandwidth, not its FLOPS, predicts its tokens-per-second.
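The bandwidth limit on decode gives a back-of-the-envelope ceiling on generation speed. This sketch assumes a single request in flight and counts only weight traffic; it ignores the KV cache and activations, and the function name is illustrative.

```c
#include <assert.h>

/* Upper bound on single-request decode speed: each generated token
   requires streaming the full set of weights from memory once, so
   tokens per second cannot exceed bandwidth / weight size. */
double decode_tokens_per_second_bound(double bandwidth_gbs, double weights_gb)
{
    assert(bandwidth_gbs > 0.0 && weights_gb > 0.0);
    return bandwidth_gbs / weights_gb;
}
```

With hypothetical figures of 2,000 GB/s of memory bandwidth and 140 GB of weights, the bound is about 14 tokens per second, whatever the chip's FLOPS rating.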

This diagnosis points straight at the main serving trick. The reason decode is memory-bound is that one token in flight reuses each weight only once. Batch many requests together and each weight, loaded once, is multiplied against many tokens from many users — the intensity rises and the batched decode point walks rightward along the memory roof toward the ridge, where it finally starts using the arithmetic the chip has been holding in reserve. That is why throughput-oriented serving systems work so hard to keep batches full, and why a model that is fast for one user can be far more cost-efficient for a thousand. The mechanics of prefill, decode, the KV cache, and batching are covered on the inference and serving page; roofline is the model that explains why those mechanics take the shape they do.
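The effect of batching on intensity can be estimated with a simplified count. The sketch below assumes each weight contributes one multiply and one add per token in the batch and is read from memory once per step; it leaves out KV-cache and activation traffic, so treat the result as a rough lower bound on the batch needed. The function names are illustrative.

```c
#include <assert.h>

/* Intensity of one batched decode step under the simplified count:
   2 FLOPS per weight per token, each weight read once for the whole batch. */
double batched_decode_intensity(int batch, double bytes_per_weight)
{
    assert(batch > 0 && bytes_per_weight > 0.0);
    return 2.0 * batch / bytes_per_weight;
}

/* Smallest batch size at which the simplified intensity reaches the ridge. */
int batch_to_reach_ridge(double ridge, double bytes_per_weight)
{
    assert(ridge > 0.0 && bytes_per_weight > 0.0);
    int batch = 1;
    while (batched_decode_intensity(batch, bytes_per_weight) < ridge)
        batch++;
    return batch;
}
```

With 2-byte weights, a single request runs at 1 FLOP per byte; on a machine whose ridge is at 100 FLOPS per byte, this count says the batch has to reach about 100 before the arithmetic units become the limit.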

Tools that draw the chart

You don't draw roofline charts by hand for production work. Modern profilers generate them directly from a real run.

Tool	Targets	Notes
Intel Advisor	Intel CPUs (and Intel GPUs)	The reference implementation. Runs the kernel, measures actual bandwidth and FLOPS, plots multiple rooflines (L1/L2/L3/DRAM) on one chart. Free with oneAPI Base Toolkit.
NVIDIA Nsight Compute	NVIDIA GPUs	Roofline for CUDA kernels. Built into Nsight Compute; works at warp level.
LIKWID	Linux x86	Open source. Computes intensity and throughput from hardware performance counters (likwid-perfctr); pair with gnuplot for the chart.
ERT (Empirical Roofline Toolkit)	CPUs, GPUs, accelerators	Berkeley Lab's measurement kit. Produces the empirical rooflines for the machine you're on.
roofline-on-nvidia-gpus	NVIDIA GPUs	Berkeley Lab project for CUDA kernels; complements Nsight.

Worked example: GEMM

Matrix multiply is the canonical example because cache blocking moves it cleanly across the roofline chart.

// Naive C = A × B on 1000×1000 matrices, single precision.
// Assumes float A[1000][1000], B[1000][1000], C[1000][1000], with C zeroed.
for (int i = 0; i < 1000; i++)
  for (int j = 0; j < 1000; j++)
    for (int k = 0; k < 1000; k++)
      C[i][j] += A[i][k] * B[k][j];

// Counts:
//   FLOPS = 2 × 1000³ = 2 × 10⁹
//   Memory traffic: each B[k][j] streams from DRAM per inner k,
//     A[i][k] partially cached, C[i][j] one read + one write per outer.
//     Effective bytes ≈ 4 × 1000³ × 4 B = 16 GB.
//   Intensity ≈ 2×10⁹ / 16×10⁹ = 0.125 FLOPS/byte
//   → memory-bound, well below the ridge.

// Roofline bound on a 100 GB/s machine: ~100 GB/s × 0.125 = ~12.5 GFLOPS
// (against a 2,000 GFLOPS peak: about 0.6% efficiency)

// Same algorithm, blocked for L1 (block size 64).
// 1000 is not a multiple of 64, so each tile is clamped at the matrix edge.
for (int ii = 0; ii < 1000; ii += 64)
  for (int jj = 0; jj < 1000; jj += 64)
    for (int kk = 0; kk < 1000; kk += 64)
      for (int i = ii; i < ii+64 && i < 1000; i++)
        for (int j = jj; j < jj+64 && j < 1000; j++) {
          float sum = C[i][j];
          for (int k = kk; k < kk+64 && k < 1000; k++) sum += A[i][k] * B[k][j];
          C[i][j] = sum;
        }

// Counts:
//   FLOPS unchanged: 2 × 10⁹
//   Memory traffic: each block of A, B, C loaded once per outer tile.
//     Effective bytes ≈ 3 × N² + cross-tile traffic ≈ 0.5 GB
//   Intensity ≈ 2×10⁹ / 0.5×10⁹ = 4 FLOPS/byte
//   → still memory-bound, but 32× higher intensity.

// Roofline bound: ~100 GB/s × 4 = ~400 GFLOPS, 32× the naive version.

// Tuned BLAS libraries (OpenBLAS, MKL): intensity ≈ 30-100 FLOPS/byte through
// nested L1/L2/L3 blocking. Pushes the kernel past the ridge.
// Achieved: 1,500–1,800 GFLOPS, roughly 75–90% of the 2,000 GFLOPS compute peak.

The progression captures the whole roofline lesson. The algorithm is identical at each step — the same multiplies and adds happen in the same order. The only thing that changes is the memory access pattern. Each blocking step raises intensity, which moves the kernel right on the chart, which raises the applicable ceiling, which raises achievable throughput.
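For readers who want to run the comparison, here is a self-contained sketch of both variants on flat row-major arrays, plus a small check that they agree. The function names, the test data, and the choice of heap allocation are assumptions made for this example; it makes no attempt at the vectorisation or multi-level blocking a tuned BLAS uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Naive C += A * B on n x n row-major matrices. */
void gemm_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Same arithmetic, visited tile by tile. Tile edges are clamped to n,
   so n does not have to be a multiple of the block size bs. */
void gemm_blocked(size_t n, size_t bs, const float *a, const float *b, float *c)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t jj = 0; jj < n; jj += bs) {
            size_t j_end = min_size(jj + bs, n);
            for (size_t kk = 0; kk < n; kk += bs) {
                size_t k_end = min_size(kk + bs, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t j = jj; j < j_end; j++) {
                        float sum = c[i * n + j];
                        for (size_t k = kk; k < k_end; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
            }
        }
    }
}

/* Runs both variants on small integer-valued inputs, which float
   represents exactly. Returns 1 if the results are identical, 0 if
   they differ, -1 if allocation failed. */
int gemm_variants_agree(size_t n, size_t bs)
{
    size_t count = n * n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    float *c1 = calloc(count, sizeof *c1);
    float *c2 = calloc(count, sizeof *c2);
    int result = -1;
    if (a && b && c1 && c2) {
        for (size_t i = 0; i < count; i++) {
            a[i] = (float)(i % 3);
            b[i] = (float)(i % 5);
        }
        gemm_naive(n, a, b, c1);
        gemm_blocked(n, bs, a, b, c2);
        result = 1;
        for (size_t i = 0; i < count; i++)
            if (c1[i] != c2[i])
                result = 0;
    }
    free(a);
    free(b);
    free(c1);
    free(c2);
    return result;
}
```

Calling gemm_variants_agree(100, 64) exercises the clamped edge tiles, since 100 is not a multiple of 64. Timing the two functions at a size that exceeds the last-level cache is the natural next experiment, though the measured gap will depend on the compiler and machine.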

Where roofline doesn't help

Roofline is purpose-built for compute kernels — code with measurable FLOPS and a clean memory-access pattern. It's a poor fit for:

Branchy, irregular code. Tree traversals, graph algorithms with pointer chasing, JSON parsing. "Bytes moved" and "operations performed" don't have stable counts; the model degenerates.
I/O-bound or RPC-bound workloads. If the bottleneck is the network or the disk, neither roofline applies. Start with the USE method (utilisation, saturation, errors) instead.
Workloads dominated by integer or branch ops. Roofline's vertical axis is FLOPS; integer-heavy code (compression, hashing, parsing) lives on a different chart entirely (instructions per byte, etc.).
Low intensity that's already at the ceiling. A kernel running at 95% of the memory roofline can't be sped up by tuning — it needs an algorithmic change to raise intensity, or a hardware change to raise bandwidth.

Production checklist

Compute the machine's two peaks. Compute roofline and memory bandwidth roofline. Either from nameplate (theoretical) or from a benchmark (empirical). Use the empirical numbers.
Estimate the kernel's intensity. FLOPS performed ÷ bytes moved. The bytes-moved estimate matters most — measure with hardware counters or Intel Advisor.
Place the kernel on the chart. Where is it relative to the ridge? How close to the applicable ceiling?
If memory-bound: raise intensity. Blocking, tiling, smaller data types, structure-of-arrays layout. Each one walks the kernel rightward. Prefetching does not change intensity, but it can lift a kernel that sits below the memory roof closer to it.
If compute-bound: hit the actual peak. Vectorisation (AVX-512, NEON, GPU warps), FMA, occupying all functional units. The cap is the cap.
If at the ceiling: stop tuning, change the algorithm. Sub-3% headroom isn't a tuning problem; it's a "you've reached this machine's limit" signal.
Watch for multiple memory rooflines. L1, L2, L3, DRAM. A kernel that's "L3-bound" is limited by L3 bandwidth, not DRAM — different fix.

