
GPUs and accelerators

A CPU is a latency machine — it makes one thread go fast. A GPU is a throughput machine — it lets thousands of threads share execution units, with massive memory bandwidth to feed them. Modern AI training has made GPUs the most economically important silicon on the planet; modern inference has made TPUs, NPUs, and on-CPU matrix accelerators (Intel AMX, Apple AMX) the post-Moore answer to the question of "how do we keep getting more performance per watt?". This page covers the SIMT execution model, the GPU memory hierarchy, and where the dedicated accelerators fit alongside.


CPU vs GPU — the latency / throughput split

A modern CPU core has ~6 execution units, a ~600-entry reorder buffer, branch prediction, register renaming, and deep speculation. It is built to run one thread fast. A modern GPU has ~16,000 execution units across a single chip, no branch predictors per unit, no speculation, and much weaker single-thread performance. It is built to run thousands of threads at once. The two chips answer two different questions. A CPU answers "how do I finish this one task as soon as possible?" A GPU answers "how do I finish the most tasks per second?" Almost every design difference falls out of that one split.

Look at where the transistors go. On a CPU die, most of the silicon is spent on the machinery that makes a single instruction stream go fast: large caches to keep data close, branch predictors and speculation to keep the pipeline full, out-of-order scheduling to find work to do while a load is in flight, and wide issue to run several instructions per cycle. The arithmetic units themselves are a small fraction of the area. On a GPU die, the ratio flips. The arithmetic units dominate, the control logic is shared across many of them, and the caches are small relative to the compute they feed. A GPU spends its transistor budget on doing math, not on cleverly scheduling math.

[Figure: two die layouts side by side. CPU: a few fat cores, surrounded by control and cache. GPU: thousands of small cores. Solid blocks are arithmetic units; faint blocks are control and cache.]
Where the transistors go. A CPU spends most of its area on control and cache around a few large cores; a GPU fills the die with arithmetic units and shares the control logic across them.

The trade is real and it cuts both ways. A CPU thread that does sequential pointer chasing through 1 KB of data will beat a GPU thread on the same task: the GPU has higher latency to its caches and no sophisticated prediction to hide it. A workload that does the same operation on 4 million elements in parallel can run on the order of 100× faster on the GPU, because the GPU has on the order of 100× the execution units and the memory bandwidth to feed them. Neither chip is "better." They sit at opposite ends of the latency/throughput curve, and the right one depends entirely on the shape of the work. The GPU's whole design is a bet that you have a great deal of identical, independent work to do, and that you care about finishing all of it rather than any single piece quickly.

This is the same bet that vector units make inside a CPU, just taken much further. If you have read the SIMD and vector units page, a GPU will feel familiar: it is SIMD's idea — one instruction driving many lanes of arithmetic — scaled from the 8 or 16 lanes of an AVX register up to thousands of lanes across a whole chip, with hardware threading layered on top to hide the long memory latencies that come with that scale.

SIMT — Single Instruction, Multiple Threads

NVIDIA's execution model groups 32 threads together as a warp. AMD calls them wavefronts and they are 32 or 64 threads. All threads in a warp execute the same instruction at the same time, but each has its own register state, its own program counter, and its own data. From the programmer's view they are independent threads that happen to be scheduled in lockstep. From the hardware's view they share a single instruction fetch and decode, and only the arithmetic fans out across the 32 lanes. That sharing is the whole trick: one expensive front end (fetch, decode, schedule) drives 32 cheap back ends (the lanes), so the chip pays for control logic once and gets 32 results.

[Figure: one instruction, b[i] = a[i] * 2, is fetched and decoded once, then broadcast to lanes 0 through 31. Lane i computes a[i] × 2 and writes b[i]. One instruction, 32 lanes, 32 different data elements: the SIMT model.]
SIMT in one picture: a warp fetches and decodes one instruction, then every lane runs it on its own slice of the data. The front end is paid for once and shared 32 ways.

Warp executes:  load  r = a[tid]     // 32 different a[tid]s loaded in parallel
                mul   r = r * 2      // same instruction on all 32 lanes
                store b[tid] = r

If the warp diverges (some threads take an if-branch, others don't):
  → both sides are executed serially, with masking
  → throughput drops by the divergence factor
  → "branch divergence" — the CUDA programmer's biggest performance pitfall

If memory accesses are coalesced (consecutive threads access consecutive addresses, e.g. 32 adjacent 4-byte words):
  → one 128-byte memory transaction serves the whole warp
  → "coalesced access" — the second-biggest performance lever

The two failure modes in that listing are worth dwelling on, because they are where most GPU performance is won or lost. Branch divergence is what happens when an if sends some lanes of a warp one way and the rest another. Because the warp shares one instruction stream, the hardware cannot run both sides at once. It runs the first branch with the other lanes masked off and idle, then runs the second branch with the first set masked off. A warp where half the lanes take each side of a branch does twice the work for the same result. Code full of data-dependent branches turns a wide machine into a narrow one.
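To make the cost of divergence concrete, here is a minimal Python sketch of a warp handling a data-dependent `if`. It is a toy model rather than real hardware behaviour: the function name `run_warp`, the example branch (halve even values, map odd values to 3v + 1), and the pass counting are all illustrative assumptions.

```python
WARP_SIZE = 32

def run_warp(values):
    """Run `v // 2 if v is even else 3 * v + 1` across one warp with lane masking.

    Returns (results, passes), where passes is how many branch bodies the
    warp had to issue. A uniform warp issues one; a divergent warp issues two.
    """
    assert len(values) == WARP_SIZE
    take_if = [v % 2 == 0 for v in values]   # per-lane predicate (the mask)
    results = list(values)
    passes = 0

    # Pass 1: the "if" side runs; lanes with a false predicate sit idle.
    if any(take_if):
        passes += 1
        for lane in range(WARP_SIZE):
            if take_if[lane]:
                results[lane] = values[lane] // 2

    # Pass 2: the "else" side runs with the mask inverted.
    if not all(take_if):
        passes += 1
        for lane in range(WARP_SIZE):
            if not take_if[lane]:
                results[lane] = 3 * values[lane] + 1

    return results, passes


_, uniform_passes = run_warp([2] * WARP_SIZE)
_, mixed_passes = run_warp(list(range(WARP_SIZE)))
print(uniform_passes, mixed_passes)   # prints "1 2"
```

When every lane agrees, the warp issues one branch body. When the lanes split, it issues both, and each pass leaves part of the warp masked off.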

Coalescing is the memory-side version of the same idea. If the 32 lanes of a warp read 32 consecutive addresses, the memory system can satisfy them with a single wide transaction. If they read scattered addresses, it may need 32 separate transactions, each fetching a full cache line of which only a few bytes are used. The difference is often an order of magnitude in effective bandwidth, and bandwidth is the resource a GPU is built to spend. The practical upshot for anyone writing GPU code: keep lanes doing the same thing, and keep their memory accesses next to each other.
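The transaction count can be sketched in a few lines of Python. This is a simplified model that assumes 128-byte transactions (the figure used in the listing above) and one 4-byte element per lane; the helper names `transactions` and `efficiency` are made up for illustration, and real hardware works at finer granularity.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128   # assumed transaction size, as in the listing above
ELEM_BYTES = 4        # assumed: each lane reads one 4-byte value

def transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct memory segments touched by one warp's byte addresses."""
    return len({addr // segment_bytes for addr in addresses})

def efficiency(addresses, elem_bytes=ELEM_BYTES, segment_bytes=SEGMENT_BYTES):
    """Fraction of the bytes moved that the warp actually asked for."""
    useful = len(addresses) * elem_bytes
    moved = transactions(addresses, segment_bytes) * segment_bytes
    return useful / moved

# Lane i reads element i: 32 adjacent words, one segment.
coalesced = [ELEM_BYTES * lane for lane in range(WARP_SIZE)]
# Lane i reads element 32 * i, e.g. walking down a column of a row-major matrix.
strided = [ELEM_BYTES * 32 * lane for lane in range(WARP_SIZE)]

print(transactions(coalesced), efficiency(coalesced))   # prints "1 1.0"
print(transactions(strided), efficiency(strided))       # prints "32 0.03125"
```

In this model the strided pattern moves 32 times as many bytes to deliver the same 128 useful ones.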

Streaming multiprocessor — the GPU's "core"

An NVIDIA GPU is built from Streaming Multiprocessors (SMs); AMD calls them Compute Units (CUs). An H100 has 132 SMs. Each SM contains:

  • ~128 FP32 ALUs ("CUDA cores")
  • 4 specialised matrix-multiply units (Tensor Cores)
  • ~256 KB register file (yes, KB: far larger than the few hundred bytes of architectural integer registers in a CPU core)
  • ~256 KB L1 cache + shared memory (programmer-managed scratchpad)
  • 4 warp schedulers, each issuing one instruction every 1–2 cycles
  • Texture units, special-function units (sin, sqrt, etc.)

At any time, an SM holds 32–64 active warps and switches between them cycle-by-cycle. When one warp stalls on a memory load, another warp's instruction runs. This is how GPUs hide DRAM latency — not with predictors and reorder buffers but with sheer thread parallelism.

Occupancy and latency hiding

A CPU hides the hundreds of cycles it takes to fetch from DRAM by prefetching, by issuing the load early, and by finding other work in the same thread to run while it waits. A GPU has no such machinery per lane. Instead it hides the same latency by having so many warps resident that there is always one ready to run. When warp A issues a load and stalls, the scheduler picks warp B, which is ready; when B stalls, it picks C. By the time the wheel comes back around to A, the load has landed. The latency never disappears; it is simply covered by other warps' work. This is latency hiding by oversubscription, and it is the central idea of GPU performance.
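The round-robin picture above can be turned into a small simulation. The sketch below is a deliberately crude model, not a description of a real warp scheduler: each warp computes for `compute_cycles`, then waits `latency` cycles on a load, and a single issue slot serves whichever warp has been ready longest. The function names and the 10-cycle / 400-cycle figures are illustrative assumptions.

```python
def warps_needed(compute_cycles, latency):
    """Fewest warps that keep the issue slot busy in this model (ceiling division)."""
    return -(-(compute_cycles + latency) // compute_cycles)

def utilization(num_warps, compute_cycles, latency, total_cycles=100_000):
    """Fraction of cycles in which the issue slot is doing useful work."""
    ready_at = [0] * num_warps      # cycle at which each warp can issue again
    busy = 0
    t = 0
    while t < total_cycles:
        w = min(range(num_warps), key=lambda i: ready_at[i])
        if ready_at[w] <= t:
            t += compute_cycles             # warp w computes...
            busy += compute_cycles
            ready_at[w] = t + latency       # ...then stalls on its load
        else:
            t = ready_at[w]                 # nothing is ready: the slot idles
    return busy / t

print(warps_needed(10, 400))                 # prints "41"
print(round(utilization(1, 10, 400), 3))     # one warp: mostly idle
print(round(utilization(20, 10, 400), 3))    # half the warps needed: about half busy
print(utilization(41, 10, 400))              # prints "1.0"
```

Utilization climbs roughly linearly with the number of resident warps until the latency is fully covered, and adding warps beyond that point buys nothing.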

The metric for "how many warps are resident" is occupancy: the ratio of active warps to the hardware maximum. Occupancy is bounded by resources. Each SM has a fixed register file and a fixed slab of shared memory, and every warp that wants to be resident must fit its registers and its block's shared memory into those budgets. A kernel that uses many registers per thread, or a lot of shared memory per block, can fit fewer warps and so has lower occupancy. The tension is real: registers and shared memory make each thread faster, but using more of them leaves fewer threads in flight to hide latency. Tuning a GPU kernel is often a search for the point where you have just enough warps to keep the memory pipeline covered without spending so much per thread that occupancy collapses.
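A rough occupancy estimate follows directly from those budgets. The sketch below uses this page's round per-SM figures (256 KB of registers, 256 KB of shared memory, 64 warps) as assumptions, takes registers as 4 bytes each, and ignores allocation granularity and per-SM block limits, so real tools will report somewhat different numbers. The function name `occupancy` is hypothetical.

```python
def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              regfile_bytes=256 * 1024, smem_bytes=256 * 1024,
              max_warps=64, warp_size=32):
    """Estimate occupancy as resident warps / max warps for one SM."""
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division
    reg_bytes_per_block = regs_per_thread * 4 * warps_per_block * warp_size

    # How many whole blocks fit under each limit.
    blocks_by_warps = max_warps // warps_per_block
    blocks_by_regs = regfile_bytes // reg_bytes_per_block
    blocks_by_smem = (smem_bytes // smem_per_block) if smem_per_block else blocks_by_warps

    blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    return blocks * warps_per_block / max_warps

print(occupancy(32, 0, 256))           # prints "1.0"  (lean kernel)
print(occupancy(128, 0, 256))          # prints "0.25" (register-heavy)
print(occupancy(32, 64 * 1024, 256))   # prints "0.5"  (shared-memory-heavy)
```

Quadrupling the registers per thread cuts the resident warps to a quarter in this model, which is the tension the paragraph above describes.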

An important corollary: high occupancy is a means, not an end. You do not need the maximum number of warps, only enough to hide the latency you actually have. A kernel that is already compute-bound — doing a lot of arithmetic per byte loaded — can run near peak at modest occupancy, because there is little memory latency left to hide. A kernel that is memory-bound needs enough warps in flight to cover the round trip to HBM, and below that threshold it leaves throughput on the table. The roofline model on the performance methods page is the clean way to reason about which regime a kernel is in and how much headroom is left.

The GPU memory hierarchy

Layer                         Capacity     Latency        Bandwidth
Register file (per SM)        ~256 KB      1 cy           ~10 TB/s
Shared memory / L1 (per SM)   ~256 KB      ~30 cy         ~5 TB/s
L2 cache (per chip)           ~50 MB       ~200 cy        ~6 TB/s
HBM (per chip)                80–192 GB    ~400 cy        3–8 TB/s
NVLink (between chips)        -            ~µs            900 GB/s
PCIe Gen5 (host RAM)          -            ~1 µs setup    63 GB/s

The numbers worth comparing: HBM bandwidth (3 TB/s of HBM3 on H100, 8 TB/s of HBM3e on B200) is roughly 10–25× a server CPU socket's DDR5 bandwidth (about 300 GB/s from eight channels of DDR5-4800), and a larger multiple still against a dual-channel desktop. The latency is worse (~400 cycles vs ~250), but the GPU compensates by keeping dozens of warps in flight at any moment. The effective throughput on memory-bound workloads is what wins.

HBM — High Bandwidth Memory

Conventional DDR uses a 64-bit-wide interface running at multi-GHz speeds: high frequency, narrow bus. HBM goes the other way: a 1024-bit-wide interface running at lower frequency. The HBM stack is built as eight DRAM dies stacked vertically on a silicon interposer right next to the GPU die, with thousands of through-silicon vias connecting them.

One HBM3 stack delivers ~819 GB/s. A GPU with 5–8 stacks reaches 3–8 TB/s. The cost is real: HBM is ~3× the per-GB price of DDR5, requires expensive 2.5D packaging, and adds significant power. The trade-off makes sense only when memory bandwidth is the dominant constraint, which describes essentially every AI workload of the past five years.
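The ~819 GB/s per-stack figure is just bus width times per-pin data rate. The short Python check below assumes 6.4 Gbit/s per pin for HBM3 and 4.8 Gbit/s per pin for a DDR5-4800 channel; neither rate appears on this page, so treat both as assumptions, and the helper name as hypothetical.

```python
def bus_bandwidth_gb_s(width_bits, gbit_per_pin):
    """Peak bandwidth in GB/s: pins times Gbit/s per pin, divided by 8 bits per byte."""
    return width_bits * gbit_per_pin / 8

hbm3_stack = bus_bandwidth_gb_s(1024, 6.4)   # wide and comparatively slow per pin
ddr5_channel = bus_bandwidth_gb_s(64, 4.8)   # narrow bus

print(round(hbm3_stack, 1))                  # prints "819.2"
print(round(ddr5_channel, 1))                # prints "38.4"
print(round(hbm3_stack / ddr5_channel, 1))   # prints "21.3"
```

Under these assumptions one HBM3 stack carries about as much as twenty DDR5 channels, almost entirely because the bus is sixteen times wider.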

The host/device boundary

A discrete GPU is a separate computer on the far side of a PCIe link. It has its own memory (HBM), its own scheduler, and its own address space. The CPU is the host; the GPU is the device. Nothing the GPU computes on can live only in host RAM: it has to be copied across PCIe into HBM first, and any result you want back has to be copied out. That copy is slow relative to everything else in the system. PCIe Gen5 moves about 63 GB/s; HBM moves 3–8 TB/s. The link to the host is roughly 50–125× narrower than the GPU's own memory, which makes the boundary a wall you plan around rather than cross casually.

The consequence shapes how GPU code is written. You do not ship one small array over, run one kernel, and ship the answer back; the transfer would dominate the runtime and the GPU would sit idle most of the time. You move data over once, keep it resident in HBM, and run many kernels against it before paying to bring a result home. Good GPU pipelines also overlap transfer with compute — copying the next batch in while the current batch is being processed — so the PCIe cost hides behind work that is already happening. The recurring lesson is that a GPU is fast once the data is there; getting the data there and back is the part that needs care. Apple's unified memory sidesteps this entirely by giving the CPU and GPU one shared pool, which is one of its real advantages for inference.
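A back-of-envelope model shows why keeping data resident matters. The Python sketch below uses this page's round figures (63 GB/s for PCIe Gen5, 3 TB/s for HBM) and assumes each kernel is bandwidth-bound and reads the data exactly once. It ignores launch overhead and transfer/compute overlap, and the function name is hypothetical.

```python
PCIE_GB_S = 63.0     # host <-> device link, from the figures above
HBM_GB_S = 3000.0    # device memory bandwidth, from the figures above

def pipeline_seconds(data_gb, kernels, keep_resident=True):
    """Estimated time to run `kernels` passes over `data_gb` of data.

    keep_resident=True : copy in once, run every kernel, copy out once.
    keep_resident=False: copy in and out around every single kernel.
    """
    round_trip = 2 * data_gb / PCIE_GB_S    # copy in plus copy out
    kernel_time = data_gb / HBM_GB_S        # one streaming pass over the data
    transfers = 1 if keep_resident else kernels
    return transfers * round_trip + kernels * kernel_time

resident = pipeline_seconds(1.0, kernels=10, keep_resident=True)
shuttled = pipeline_seconds(1.0, kernels=10, keep_resident=False)
print(f"resident: {resident * 1e3:.1f} ms, shuttled: {shuttled * 1e3:.1f} ms")
```

For 1 GB and ten kernels the shuttled version comes out roughly nine times slower in this model, and almost all of that time is spent on the PCIe link rather than in the kernels.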

Why GPUs only sometimes hit peak

Real workloads rarely hit a GPU's peak FLOPS rating. The bottleneck is usually memory bandwidth, not compute. What decides which one binds is arithmetic intensity (FLOPs per byte loaded): achievable throughput is the smaller of peak compute and intensity × memory bandwidth.

At intensity ~10 FLOPs/byte (a matmul with only a handful of inputs batched together), a 67 TFLOPS H100 with 3 TB/s HBM is bandwidth-bound at ~30 TFLOPS, or 45% utilization. A pure matrix-vector multiply in 16-bit precision sits lower still, near 1 FLOP/byte. The crossover is at 67 / 3 ≈ 22 FLOPs/byte; at intensity ~200 (a large batched matmul) the kernel is firmly compute-bound and can approach peak. This is why batch size matters so much for training throughput.
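The bandwidth-bounded ceiling is a one-line formula, sketched here in Python with the same 67 TFLOPS and 3 TB/s figures as defaults. The function names are illustrative, and the model assumes perfect overlap of compute and memory traffic, which real kernels only approximate.

```python
def roofline_tflops(intensity, peak_tflops=67.0, bandwidth_tb_s=3.0):
    """Achievable TFLOPS at a given arithmetic intensity (FLOPs per byte)."""
    return min(peak_tflops, intensity * bandwidth_tb_s)

def ridge_point(peak_tflops=67.0, bandwidth_tb_s=3.0):
    """Intensity at which a kernel stops being bandwidth-bound."""
    return peak_tflops / bandwidth_tb_s

for intensity in (1, 10, 22, 200):
    achieved = roofline_tflops(intensity)
    print(f"{intensity:>4} FLOPs/byte -> {achieved:5.1f} TFLOPS "
          f"({achieved / 67.0:.0%} of peak)")

print(round(ridge_point(), 1))   # prints "22.3"
```

Everything to the left of the ridge point is limited by how fast bytes arrive; everything to the right is limited by how fast the ALUs can consume them.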

Modern accelerator landscape

Accelerator                   Peak TFLOPS   Memory           Bandwidth    Notes
NVIDIA H100                   67            80 GB HBM3       3 TB/s       Hopper, 2022. 132 SMs × 128 FP32 cores each.
NVIDIA B200                   240           192 GB HBM3e     8 TB/s       Blackwell, 2024. Two-die package, ~600 W.
AMD MI300X                    163           192 GB HBM3      5.3 TB/s     ~750 W. Big edge in memory capacity.
Apple M3 Ultra GPU            27            512 GB unified   800 GB/s     Unified memory, no PCIe transfer needed.
Google TPU v5e                197           16 GB HBM        819 GB/s     INT8/BF16 dense matmul. Cloud-only.
Intel AMX (Sapphire Rapids)   8             L2 cache         L2 BW        On-CPU matmul accelerator. Tile registers.
Apple Neural Engine (M4)      38            unified          shared SLC   16 cores. INT8/FP16. Used by CoreML.

The peak figures mix precisions (FP32 for the H100 row, BF16 for TPU v5e, INT8 TOPS for the Neural Engine), so read them as rough scale rather than a like-for-like ranking.

The pattern: NVIDIA dominates training, AMD competes on capacity, Apple's unified memory is very good within its niche, and Google's TPUs are the most tightly matmul-focused design. On-CPU accelerators (Intel AMX, Apple's undocumented AMX) bring matmul into the cache hierarchy with no PCIe round trip, which is useful for inference of small models and as a "free" boost to math libraries.

TPUs and systolic arrays

Where GPUs are general-purpose parallel processors, TPUs are matrix-multiply specialists. Each TPU chip is built around a large systolic array, typically 128×128 or 256×256 multiply-add units arranged in a grid. Data flows through the grid wave-by-wave: weights enter from one side, inputs from another, partial sums accumulate as they propagate. Every cycle, every cell does one multiply-add. A 128×128 array has 16,384 cells, so the whole array performs ~16K multiply-adds, or ~32K operations, per cycle.
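The wave-by-wave flow is easier to see in a small simulation. The Python sketch below models an output-stationary array, where each cell keeps its own running sum while operands stream past. That is one common textbook variant, chosen here for brevity, and not necessarily the dataflow a TPU uses. All names are illustrative.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells.

    Each cycle, every cell passes its A operand right and its B operand down,
    then multiply-accumulates whatever it now holds. Row i of A and column j
    of B are fed in with a delay of i and j cycles, so matching operands meet.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # running sum held in each cell
    a_reg = [[0] * m for _ in range(n)]   # operand travelling rightwards
    b_reg = [[0] * m for _ in range(n)]   # operand travelling downwards
    cycles = k + n + m - 2                # time for the last wave to reach the far corner

    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                # left edge: inject A, skewed by row
                    s = t - i
                    new_a[i][j] = A[i][s] if 0 <= s < k else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                # top edge: inject B, skewed by column
                    s = t - j
                    new_b[i][j] = B[s][j] if 0 <= s < k else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]   # one multiply-add per cell per cycle
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)   # prints "[[19, 22], [43, 50]] 4"
```

No cell ever fetches from memory; operands arrive from a neighbour each cycle, which is why the array can run so many multiply-adds on so little bandwidth.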

The cost: only matrix-multiply runs at peak. Anything else (activations, softmax, layer norm) runs on the small "vector unit" attached to the array, often at 10–50× lower throughput. TPU code has to keep the systolic array fed; XLA (Google's compiler) does this automatically by fusing operations.

Tensor Cores — the GPU's matmul lane

NVIDIA introduced Tensor Cores in Volta (2017) to compete with TPUs without ditching the GPU's general-purpose programmability. Each Tensor Core is a small systolic array (originally 4×4×4 FP16-with-FP32-accumulate) inside the SM. A single instruction (HMMA) performs the matrix-multiply-accumulate. Modern Tensor Cores support FP8, INT8, INT4, sparse matrices, and the new "FP4" format on Blackwell.
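The original 4×4×4 operation, D = A × B + C with FP16 inputs and an FP32 accumulator, can be emulated in plain Python. This is a numerical sketch only: it rounds through the standard library's `struct` half- and single-precision formats, and the accumulation order and the helper names are assumptions, not a description of how the hardware sums.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_4x4x4(A, B, C):
    """D = A @ B + C for 4x4 tiles: FP16 inputs, FP32 accumulation."""
    D = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_fp32(C[i][j])
            for k in range(4):
                # The product of two FP16 values is exact in FP32; the sum is rounded.
                acc = to_fp32(acc + to_fp16(A[i][k]) * to_fp16(B[k][j]))
            D[i][j] = acc
    return D

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

print(mma_4x4x4(identity, B, ones)[0])   # prints "[1.0, 2.0, 3.0, 4.0]"
print(to_fp16(0.1))                      # prints "0.0999755859375"
```

The second line of output shows why the accumulator is wider than the inputs: FP16 carries only about three decimal digits, so summing many products at that precision would lose accuracy quickly.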

Tensor Cores deliver the overwhelming majority of an H100's matrix throughput: roughly 990 TFLOPS of dense BF16, against about 67 TFLOPS of FP32 on the regular CUDA cores. AI workloads that don't use Tensor Cores see well under 10% of peak, a strong incentive to write code that compilers can lower onto them. The cuBLAS / cuDNN library calls are tuned to use them; hand-written CUDA that bypasses them gives up an order of magnitude.

Apple's unified memory

Apple silicon (M1/M2/M3/M4) breaks the GPU/CPU memory split. Both the CPU and GPU share the same DRAM through the same memory controller. There's no "copy data to GPU" step — pointers are valid in either context. Memory bandwidth is shared, capping out at ~800 GB/s on M3 Ultra (vs 3 TB/s on H100), but the latency is much better and the practical throughput on smaller models is competitive.

This is good for inference workloads with model sizes that fit (an M3 Ultra has up to 512 GB of unified memory, enough for most current open-weights models, quantised where necessary). It's less good for training, where the bandwidth gap matters more. For laptops and edge devices, unified memory is the dominant architecture; for datacenters, discrete HBM-equipped GPUs still win.

NPUs and AMX — accelerators on the SoC

Apple's Neural Engine, Qualcomm's NPU, Intel's AMX (Advanced Matrix Extensions), and AMD's XDNA NPU are all on-chip accelerators designed for tensor math. They're fundamentally smaller than discrete GPUs but much more power-efficient: an Apple M4's Neural Engine delivers ~38 TOPS at INT8 for ~1 W; an H100 delivers ~4,000 TOPS at INT8 (with structured sparsity, about half that dense) for 700 W.

The trade: these on-CPU units are tightly bound to specific frameworks (CoreML on Apple, OpenVINO on Intel, DirectML on Windows). They can't run arbitrary CUDA kernels. But for inference of pre-converted models — image recognition, speech, modest LLMs — they're the right tool. Battery-life workloads will increasingly target them; datacenter training will not.

When GPUs win, and when they do not

A GPU wins when the work is massively data-parallel: the same operation applied to a large number of independent elements, with regular memory access and few data-dependent branches. Graphics is the original case — shading millions of pixels, each independent, with the same program. Machine learning is the case that made GPUs the most valuable silicon on the planet, because training and inference are stacks of matrix multiplies, which are the most parallel and regular operations there are. Scientific simulation — fluid dynamics, weather, molecular dynamics, finite-element solvers — fits the same mould: a grid of cells, each updated by the same stencil, over and over.

A GPU loses when the work is branchy, serial, or small. Branchy work — code where the path taken depends heavily on the data, like parsing or graph traversal — causes warp divergence and wastes most of the lanes. Serial work, where each step depends on the previous one, cannot be spread across lanes at all and runs on a single weak GPU thread, far slower than a CPU. And small work simply does not pay for itself: spinning up a kernel and copying data across PCIe has fixed overhead, so a task that takes microseconds on a CPU can take longer on a GPU once the launch and transfer costs are counted. The rule of thumb is blunt but useful: if you cannot describe the work as "do this same thing to a very large number of things," a GPU is probably the wrong tool.

Why LLM inference is memory-bandwidth-bound

The headline use of GPUs today is running large language models, and that workload lands in a surprising place: it is limited by memory bandwidth, not by compute. The reason is arithmetic intensity. Generating one token at a time means reading every one of the model's weights from HBM to do a single multiply-add per weight, then moving on to the next token. At batch size one, almost every weight is read once and used once. The GPU's enormous arithmetic throughput sits mostly idle because the bottleneck is how fast weights can stream out of HBM.

You can read this straight off the numbers. A 70-billion-parameter model in 16-bit precision is ~140 GB of weights. On a GPU with 3 TB/s of HBM bandwidth, a single pass over those weights takes about 140 GB / 3,000 GB/s ≈ 47 ms, which sets a floor on the time per token regardless of how many TFLOPS the chip can do. The compute to go with that pass (a couple of operations per weight) finishes in a fraction of the time. The model is bandwidth-bound, and the lever that helps is batching: serving many requests at once reuses each weight across all of them, raising the arithmetic intensity and pushing the work toward the compute-bound regime. This is the central tension of LLM serving, and the inference and serving page is built around it: prefill versus decode, the KV cache, and why throughput and latency pull against each other. It is also why every new accelerator generation chases HBM bandwidth as hard as it chases FLOPS.
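The same arithmetic, extended to batching, fits in a short Python sketch. It assumes a decode step costs the slower of two things: streaming every weight once, or doing one multiply-add (two FLOPs) per weight per sequence in the batch. It uses this page's round 3 TB/s and 67 TFLOPS figures and ignores the KV cache, activations, and multi-GPU sharding, so treat the outputs as lower bounds. Function names are illustrative.

```python
def decode_step_ms(params_billion, batch=1, bytes_per_param=2,
                   bandwidth_tb_s=3.0, peak_tflops=67.0):
    """Lower bound on one decode step, in milliseconds."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    flops = 2 * params_billion * 1e9 * batch       # multiply + add per weight per sequence
    memory_ms = weight_bytes / (bandwidth_tb_s * 1e12) * 1e3
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    return max(memory_ms, compute_ms)

def tokens_per_second(params_billion, batch=1):
    """Aggregate tokens/s across the batch at that step time."""
    return batch / (decode_step_ms(params_billion, batch) / 1e3)

for batch in (1, 16, 64):
    print(f"batch {batch:>2}: {decode_step_ms(70, batch):6.1f} ms/step, "
          f"{tokens_per_second(70, batch):6.1f} tokens/s")
```

At batch 1 and batch 16 the step time is the same ~47 ms, so the extra fifteen sequences ride along for free. By batch 64 the model has crossed into the compute-bound regime and each step takes longer.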

Common misconceptions

  • "GPUs are good at everything parallel." They're great at SIMD-shaped, regularly-strided, dense work. They're terrible at branchy, irregular, pointer-chasing parallelism (graph traversal, sparse linear algebra without specialised support). The execution model penalises divergence.
  • "More TFLOPS means faster training." Only if you can keep them fed. On a bandwidth-bound kernel, a 240 TFLOPS B200 with 8 TB/s HBM beats a 67 TFLOPS H100 with 3 TB/s by about the bandwidth ratio (8 / 3 ≈ 2.7×), not the compute ratio (240 / 67 ≈ 3.6×). Compute-density gains help only when the workload has the arithmetic intensity to use them.
  • "TPUs are obsolete because of GPUs." Google trained Gemini and serves Search inference on TPUs. The economics depend on whether you control the workload (TPUs win when matmul dominates) and whether you need flexibility (GPUs win for everything else, including most academic workloads).
  • "On-CPU accelerators replace GPUs." Only for inference of compact models. Training and large-model inference still need discrete GPUs. The split is roughly: laptops use NPUs/AMX; phones use NPUs; datacenters use GPUs.

Numbers worth remembering

Quantity                          Value
NVIDIA H100 peak FP32 TFLOPS      67
NVIDIA B200 peak FP8 TFLOPS       ~5,000 (with sparsity)
H100 HBM3 bandwidth               ~3 TB/s
B200 HBM3e bandwidth              ~8 TB/s
NVIDIA warp size                  32 threads
AMD wavefront size                32 or 64 threads
Active warps per H100 SM          up to 64
Number of SMs in H100             132
Tensor Core dimensions, modern    16×16×16 (varies by precision)
NVLink 4.0 bandwidth              900 GB/s total (450 GB/s per direction)
Apple M3 Ultra unified memory     up to 512 GB
Apple Neural Engine (M4)          ~38 TOPS INT8
Google TPU v5e BF16 TFLOPS        ~197
