SIMT stands for Single Instruction, Multiple Threads. It's NVIDIA's execution model where 32 threads (a 'warp') execute the same instruction together but with their own register state and program counter. From the programmer's perspective they look like independent threads; from the hardware's perspective they share an instruction-fetch pipeline. SIMT lets GPUs run thousands of threads with the silicon cost of just hundreds of execution units.

How is GPU memory different from CPU memory?

GPU HBM (High Bandwidth Memory) achieves 3-8 TB/s on modern accelerators — roughly 50× a CPU socket. The trade is latency (HBM ~400 ns vs DDR5 ~80 ns), capacity (96-192 GB vs hundreds of GB), and cost. GPUs hide the latency through massive thread parallelism: while one warp waits for memory, thousands of others can run.

A Tensor Processing Unit is Google's custom accelerator built around large matrix-multiply units. A TPU v5e has a 128×128 systolic array that performs matrix multiply at much higher density than a GPU. TPUs trade off general-purpose programmability for matmul throughput — they can't easily run graphics or arbitrary CUDA kernels, but they sustain higher utilization on the matrix-heavy workloads that dominate transformer training and inference.

GPUs and accelerators

A CPU is a latency machine — it makes one thread go fast. A GPU is a throughput machine — it lets thousands of threads share execution units, with massive memory bandwidth to feed them. Modern AI training has made GPUs the most economically important silicon on the planet; modern inference has made TPUs, NPUs, and on-CPU matrix accelerators (Intel AMX, Apple AMX) the post-Moore answer to the question of "how do we keep getting more performance per watt?". This page covers the SIMT execution model, the GPU memory hierarchy, and where the dedicated accelerators fit alongside.

CPU vs GPU — the latency / throughput split

A modern CPU core has ~6 execution units, a ~600-entry reorder buffer, branch prediction, register renaming, and deep speculation. It is built to run one thread fast. A modern GPU has ~16,000 execution units across a single chip, no branch predictors per unit, no speculation, and much weaker single-thread performance. It is built to run thousands of threads at once. The two chips answer two different questions. A CPU answers "how do I finish this one task as soon as possible?" A GPU answers "how do I finish the most tasks per second?" Almost every design difference falls out of that one split.

Look at where the transistors go. On a CPU die, most of the silicon is spent on the machinery that makes a single instruction stream go fast: large caches to keep data close, branch predictors and speculation to keep the pipeline full, out-of-order scheduling to find work to do while a load is in flight, and wide issue to run several instructions per cycle. The arithmetic units themselves are a small fraction of the area. On a GPU die, the ratio flips. The arithmetic units dominate, the control logic is shared across many of them, and the caches are small relative to the compute they feed. A GPU spends its transistor budget on doing math, not on cleverly scheduling math.

Where the transistors go. A CPU spends most of its area on control and cache around a few large cores; a GPU fills the die with arithmetic units and shares the control logic across them.

The trade is real and it cuts both ways. A CPU thread that does sequential pointer chasing through 1 KB of data will beat a GPU thread on the same task — the GPU has higher latency to its caches and no sophisticated prediction to hide it. A workload that does the same operation on 4 million elements in parallel will run 100× faster on the GPU, because the GPU has 100× the execution units and the memory bandwidth to feed them. Neither chip is "better." They sit at opposite ends of the latency/throughput curve, and the right one depends entirely on the shape of the work. The GPU's whole design is a bet that you have a great deal of identical, independent work to do, and that you care about finishing all of it rather than any single piece quickly.

This is the same bet that vector units make inside a CPU, just taken much further. If you have read the SIMD and vector units page, a GPU will feel familiar: it is SIMD's idea — one instruction driving many lanes of arithmetic — scaled from the 8 or 16 lanes of an AVX register up to thousands of lanes across a whole chip, with hardware threading layered on top to hide the long memory latencies that come with that scale.

SIMT — Single Instruction, Multiple Threads

NVIDIA's execution model groups 32 threads together as a warp. AMD calls the equivalent group a wavefront, and it is 32 or 64 threads wide. All threads in a warp execute the same instruction at the same time, but each has its own register state, its own program counter, and its own data. From the programmer's view they are independent threads that happen to be scheduled in lockstep. From the hardware's view they share a single instruction fetch and decode, and only the arithmetic fans out across the 32 lanes. That sharing is the whole trick: one expensive front end (fetch, decode, schedule) drives 32 cheap back ends (the lanes), so the chip pays for control logic once and gets 32 results.

SIMT in one picture: a warp fetches and decodes one instruction, then every lane runs it on its own slice of the data. The front end is paid for once and shared 32 ways.

Warp executes:  load  a[tid]              // 32 different a[tid]s loaded in parallel
                mul   b[tid] = a[tid] * 2
                store b[tid]

If the warp diverges (some threads take an if-branch, others don't):
  → both sides are executed serially, with masking
  → throughput drops by the divergence factor
  → "branch divergence" — the CUDA programmer's biggest performance pitfall

If memory accesses are coalesced (consecutive threads access consecutive 4-byte words):
  → one 128-byte memory transaction serves the whole warp
  → "coalesced access" — the second-biggest performance lever

The two failure modes in that listing are worth dwelling on, because they are where most GPU performance is won or lost. Branch divergence is what happens when an if sends some lanes of a warp one way and the rest another. Because the warp shares one instruction stream, the hardware cannot run both sides at once. It runs the first branch with the other lanes masked off and idle, then runs the second branch with the first set masked off. A warp where half the lanes take each side of a branch does twice the work for the same result. Code full of data-dependent branches turns a wide machine into a narrow one.
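The cost of divergence can be counted directly. The following is a minimal sketch, not a hardware model: it assumes a single if/else whose two sides take equal time, and the function names are illustrative.

```python
def warp_passes(takes_branch):
    """Number of serialized passes one warp needs for a single if/else.

    takes_branch holds one boolean per lane (True = lane takes the 'if' side).
    """
    passes = 0
    if any(takes_branch):        # at least one lane runs the 'if' side
        passes += 1
    if not all(takes_branch):    # at least one lane runs the 'else' side
        passes += 1
    return passes


def lane_utilization(takes_branch):
    """Fraction of lane-slots doing useful work across the serialized passes."""
    lanes = len(takes_branch)
    # Each lane is active in exactly one pass and masked off in any other.
    return lanes / (lanes * warp_passes(takes_branch))


WARP = 32
uniform = [True] * WARP                           # every lane agrees
split = [lane % 2 == 0 for lane in range(WARP)]   # half the lanes go each way

print(warp_passes(uniform), lane_utilization(uniform))  # 1 1.0
print(warp_passes(split), lane_utilization(split))      # 2 0.5
```

A single dissenting lane is enough to force the second pass, which is why data-dependent branches are costly even when most lanes agree.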

Coalescing is the memory-side version of the same idea. If the 32 lanes of a warp read 32 consecutive addresses, the memory system can satisfy them with a single wide transaction. If they read scattered addresses, it may need 32 separate transactions, each fetching a full cache line of which only a few bytes are used. The difference is often an order of magnitude in effective bandwidth, and bandwidth is the resource a GPU is built to spend. The practical upshot for anyone writing GPU code: keep lanes doing the same thing, and keep their memory accesses next to each other.
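To make the transaction count concrete, here is a small sketch that counts how many 128-byte segments one warp-wide load touches. It assumes 4-byte elements and a 128-byte transaction size; real hardware may split transactions into smaller sectors, which this ignores, and the names are illustrative.

```python
def transactions(addresses, line_bytes=128):
    """Count the distinct line_bytes-sized segments touched by one warp-wide load."""
    return len({addr // line_bytes for addr in addresses})


def bandwidth_efficiency(addresses, word_bytes=4, line_bytes=128):
    """Useful bytes divided by bytes actually fetched."""
    useful = len(addresses) * word_bytes
    fetched = transactions(addresses, line_bytes) * line_bytes
    return useful / fetched


WARP = 32
coalesced = [lane * 4 for lane in range(WARP)]    # lanes read neighbouring words
strided = [lane * 128 for lane in range(WARP)]    # each lane lands in its own line

print(transactions(coalesced), bandwidth_efficiency(coalesced))  # 1 1.0
print(transactions(strided), bandwidth_efficiency(strided))      # 32 0.03125
```

In the strided case the warp fetches 4 KB to use 128 bytes, which is the order-of-magnitude loss described above.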

Streaming multiprocessor — the GPU's "core"

An NVIDIA GPU is built from Streaming Multiprocessors (SMs); AMD calls them Compute Units (CUs). An H100 has 132 SMs. Each SM contains:

~128 FP32 ALUs ("CUDA cores")
4 specialised matrix-multiply units (Tensor Cores)
~256 KB register file (yes, KB — far larger than a CPU's 256 bytes)
~256 KB L1 cache + shared memory (programmer-managed scratchpad)
4 warp schedulers, each issuing one instruction every 1–2 cycles
Texture units, special-function units (sin, sqrt, etc.)

At any time, an SM holds 32–64 active warps and switches between them cycle-by-cycle. When one warp stalls on a memory load, another warp's instruction runs. This is how GPUs hide DRAM latency — not with predictors and reorder buffers but with sheer thread parallelism.

Occupancy and latency hiding

A CPU hides the few hundred cycles it takes to fetch from DRAM by prefetching data it predicts will be needed and by finding other work in the same thread to run while it waits. A GPU has no such machinery per lane. Instead it hides the same latency by having so many warps resident that there is always one ready to run. When warp A issues a load and stalls, the scheduler picks warp B, which is ready; when B stalls, it picks C. By the time the wheel comes back around to A, the load has landed. The latency never disappears; it is simply covered by other warps' work. This is latency hiding by oversubscription, and it is the central idea of GPU performance.
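A back-of-envelope version of that round-robin argument, as a minimal sketch: assume a single issue slot, and that each warp issues `work_cycles` of instructions between loads and then stalls for `latency_cycles`. Both names and the example values are illustrative assumptions, not measurements.

```python
import math


def warps_to_hide(latency_cycles, work_cycles):
    """Smallest number of resident warps that keeps one issue slot busy.

    While one warp waits latency_cycles for its load, the remaining warps
    must supply at least that many cycles of work between them.
    """
    return 1 + math.ceil(latency_cycles / work_cycles)


print(warps_to_hide(400, 10))    # 41: little work per load, many warps needed
print(warps_to_hide(400, 100))   # 5: lots of work per load, few warps needed
```

The more arithmetic a warp does between loads, the fewer warps it takes to cover the same latency.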

The metric for "how many warps are resident" is occupancy: the ratio of active warps to the hardware maximum. Occupancy is bounded by resources. Each SM has a fixed register file and a fixed slab of shared memory, and every warp that wants to be resident must fit its registers and its block's shared memory into those budgets. A kernel that uses many registers per thread, or a lot of shared memory per block, can fit fewer warps and so has lower occupancy. The tension is real: registers and shared memory make each thread faster, but using more of them leaves fewer threads in flight to hide latency. Tuning a GPU kernel is often a search for the point where you have just enough warps to keep the memory pipeline covered without spending so much per thread that occupancy collapses.
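The resource arithmetic behind occupancy can be sketched as follows. The defaults are assumptions taken from the round figures on this page (a 256 KB register file, i.e. 65,536 32-bit registers; 256 KB treated as the shared-memory budget; 64 warps maximum). Real hardware also rounds allocations to fixed granules and caps the number of blocks per SM, which this sketch ignores.

```python
def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              regs_per_sm=65536, smem_per_sm=256 * 1024,
              max_warps=64, warp_size=32):
    """Fraction of the SM's warp slots a kernel can fill (simplified model)."""
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size

    # Each resource independently limits how many blocks fit on the SM.
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_smem = smem_per_sm // smem_per_block if smem_per_block else float("inf")
    blocks_by_warps = max_warps // warps_per_block

    blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_warps)
    return blocks * warps_per_block / max_warps


print(occupancy(32, 0, 256))           # 1.0  (lean kernel fills every slot)
print(occupancy(128, 0, 256))          # 0.25 (register-hungry kernel)
print(occupancy(32, 64 * 1024, 256))   # 0.5  (shared-memory-hungry kernel)
```

Quadrupling the registers per thread cuts the resident warps to a quarter, which is the tension the paragraph above describes.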

An important corollary: high occupancy is a means, not an end. You do not need the maximum number of warps, only enough to hide the latency you actually have. A kernel that is already compute-bound — doing a lot of arithmetic per byte loaded — can run near peak at modest occupancy, because there is little memory latency left to hide. A kernel that is memory-bound needs enough warps in flight to cover the round trip to HBM, and below that threshold it leaves throughput on the table. The roofline model on the performance methods page is the clean way to reason about which regime a kernel is in and how much headroom is left.

The GPU memory hierarchy

Layer	Capacity	Latency	Bandwidth
Register file (per SM)	~256 KB	1 cy	~10 TB/s
Shared memory / L1 (per SM)	~256 KB	~30 cy	~5 TB/s
L2 cache (per chip)	~50 MB	~200 cy	~6 TB/s
HBM (per chip)	80–192 GB	~400 cy	3–8 TB/s
NVLink (between chips)	—	~µs	900 GB/s
PCIe Gen5 (host RAM)	—	~1 µs setup	63 GB/s

The numbers worth comparing: HBM bandwidth (3 TB/s of HBM3 on H100, 8 TB/s of HBM3e on B200) is roughly 50× a CPU socket's DDR5 bandwidth. The latency is worse in cycles (~400 vs ~250), but the GPU compensates by keeping dozens of warps in flight at any moment. The effective throughput on memory-bound workloads is what wins.

HBM — High Bandwidth Memory

Conventional DDR uses a 64-bit-wide interface running at multi-GHz speeds: high frequency, narrow bus. HBM goes the other way, with a 1024-bit-wide interface per stack running at lower frequency. The HBM stack is built as eight DRAM dies stacked vertically on a silicon interposer right next to the GPU die, with thousands of through-silicon vias connecting them.

One HBM3 stack delivers up to ~819 GB/s, and HBM3e pushes that to roughly 1 TB/s or more. A GPU with 5–8 stacks reaches 3–8 TB/s. The cost is real: HBM is ~3× the per-GB price of DDR5, requires expensive 2.5D packaging, and adds significant power. The trade-off makes sense only when memory bandwidth is the dominant constraint, which describes essentially every AI workload of the past five years.
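The per-stack figure follows from bus width times per-pin data rate. A minimal check, assuming 6.4 Gbit/s per pin (the HBM3 rate that reproduces the ~819 GB/s quoted above); the function name is illustrative.

```python
def bus_bandwidth_gbs(width_bits, gbit_per_pin):
    """Peak bandwidth in GB/s of a parallel bus: pins times rate, bits to bytes."""
    return width_bits * gbit_per_pin / 8


hbm3_stack = bus_bandwidth_gbs(1024, 6.4)   # assumed HBM3 pin rate

print(f"{hbm3_stack:.1f} GB/s per stack")             # 819.2 GB/s per stack
print(f"{5 * hbm3_stack / 1000:.1f} TB/s, 5 stacks")  # 4.1 TB/s, 5 stacks
```

The multi-stack number is a peak at the full pin rate; a shipping part may run its pins slower and land below it.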

The host/device boundary

A discrete GPU is a separate computer on the end of a PCIe link. It has its own memory (HBM), its own scheduler, and its own address space. The CPU is the host; the GPU is the device. Nothing the GPU computes on can live only in host RAM: it has to be copied across PCIe into HBM first, and any result you want back has to be copied out. That copy is slow relative to everything else in the system. PCIe Gen5 moves about 63 GB/s; HBM moves 3–8 TB/s. The link to the host is roughly 50–125× narrower than the GPU's own memory, which makes the boundary a wall you plan around rather than cross casually.

The consequence shapes how GPU code is written. You do not ship one small array over, run one kernel, and ship the answer back; the transfer would dominate the runtime and the GPU would sit idle most of the time. You move data over once, keep it resident in HBM, and run many kernels against it before paying to bring a result home. Good GPU pipelines also overlap transfer with compute — copying the next batch in while the current batch is being processed — so the PCIe cost hides behind work that is already happening. The recurring lesson is that a GPU is fast once the data is there; getting the data there and back is the part that needs care. Apple's unified memory sidesteps this entirely by giving the CPU and GPU one shared pool, which is one of its real advantages for inference.
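A rough way to see why reuse matters is to compute what share of the wall time goes to PCIe copies. This is a sketch under stated assumptions: the data is copied in once and a same-sized result copied out once, every kernel streams the data through HBM exactly once, and nothing overlaps. The bandwidth defaults are the 63 GB/s and 3 TB/s figures used on this page; the function name is illustrative.

```python
def transfer_fraction(data_bytes, kernels, pcie_gbs=63.0, hbm_gbs=3000.0):
    """Fraction of wall time spent on PCIe copies for a simple offload."""
    copy_s = 2 * data_bytes / (pcie_gbs * 1e9)           # copy in + copy out
    kernel_s = kernels * data_bytes / (hbm_gbs * 1e9)    # one HBM pass per kernel
    return copy_s / (copy_s + kernel_s)


one_gb = 1e9
print(f"{transfer_fraction(one_gb, kernels=1):.2f}")     # 0.99
print(f"{transfer_fraction(one_gb, kernels=100):.2f}")   # 0.49
```

With a single kernel almost all of the time is transfer; it takes on the order of a hundred passes over resident data before compute time matches the copy.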

Why GPUs only sometimes hit peak

Real workloads rarely hit a GPU's peak FLOPS rating. The bottleneck is usually memory bandwidth, not compute. Which limit applies depends on arithmetic intensity (FLOPs per byte loaded).

[Interactive model: sliders for batch size, arithmetic intensity (FLOPs/byte), memory bandwidth (GB/s), and peak compute (TFLOPS); outputs are achievable TFLOPS (the bandwidth-bounded ceiling) and utilization as a percentage of peak compute.]

At intensity ~10 FLOPs/byte (low-reuse work such as a small-batch matmul; a pure matrix-vector multiply is lower still), a 67 TFLOP H100 with 3 TB/s HBM is bandwidth-bound at ~30 TFLOPS, or 45% utilization. The crossover sits at 67 / 3 ≈ 22 FLOPs/byte, so at intensity ~200 (a large batched matmul) the chip is firmly compute-bound and approaches peak. This is why batch size matters so much for training throughput.
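The calculation behind those figures is the roofline bound: achievable throughput is the smaller of peak compute and intensity times bandwidth. A minimal sketch using the same example numbers (67 TFLOPS, 3,000 GB/s); the function names are illustrative.

```python
def achievable_tflops(intensity, bandwidth_gbs, peak_tflops):
    """Roofline bound: min(peak compute, intensity x memory bandwidth)."""
    # FLOPs/byte x GB/s = GFLOP/s; divide by 1000 for TFLOP/s.
    bandwidth_ceiling = intensity * bandwidth_gbs / 1000.0
    return min(peak_tflops, bandwidth_ceiling)


def ridge_intensity(bandwidth_gbs, peak_tflops):
    """Intensity (FLOPs/byte) at which the two ceilings meet."""
    return peak_tflops * 1000.0 / bandwidth_gbs


PEAK, BW = 67.0, 3000.0
low = achievable_tflops(10, BW, PEAK)
high = achievable_tflops(200, BW, PEAK)

print(low, f"{low / PEAK:.0%}")              # 30.0 45%
print(high, f"{high / PEAK:.0%}")            # 67.0 100%
print(f"{ridge_intensity(BW, PEAK):.1f}")    # 22.3
```

Below the ridge, more bandwidth helps and more FLOPS do not; above it, the reverse holds.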

Modern accelerator landscape

Accelerator	Peak TFLOPS (precision varies by row)	Memory	Bandwidth	Notes
NVIDIA H100	67	80 GB HBM3	3 TB/s	Hopper, 2022. 132 SMs × 128 FP32 cores each; 67 is the FP32 (non-Tensor-Core) rate.
NVIDIA B200	240	192 GB HBM3e	8 TB/s	Blackwell, 2024. Two-die package, ~600 W.
AMD MI300X	163	192 GB HBM3	5.3 TB/s	~750 W. Big edge in memory capacity.
Apple M3 Ultra GPU	27	512 GB unified	800 GB/s	Unified memory, no PCIe transfer needed.
Google TPU v5e	197	16 GB HBM	819 GB/s	INT8/BF16 dense matmul. Cloud-only.
Intel AMX (Sapphire Rapids)	8	L2 cache	L2 BW	On-CPU matmul accelerator. Tile registers.
Apple Neural Engine (M4)	38	unified	shared SLC	16 cores. INT8/FP16. Used by CoreML.

The pattern: NVIDIA dominates training, AMD competes on capacity, Apple's unified memory is special-purpose-good, Google's TPUs are the tightest matmul-focused design. On-CPU accelerators (Intel AMX, Apple's hidden AMX) bring matmul into the cache hierarchy with no PCIe round trip — useful for inference of small models and as a "free" boost to math libraries.

TPUs and systolic arrays

Where GPUs are general-purpose parallel processors, TPUs are matrix-multiply specialists. Each TPU chip is built around a large systolic array, typically 128×128 or 256×256 multiply-add units arranged in a grid. Data flows through the grid wave-by-wave: weights enter from one side, inputs from another, partial sums accumulate as they propagate. Every cycle, every cell does one multiply-add. A 128×128 array has 16,384 cells, so it performs ~32K operations per cycle when each multiply-add is counted as two.

The cost: only matrix-multiply runs at peak. Anything else (activations, softmax, layer norm) runs on the small "vector unit" attached to the array, often at 10–50× lower throughput. TPU code has to keep the systolic array fed; XLA (Google's compiler) does this automatically by fusing operations.
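The wave-by-wave data flow can be imitated in a few lines. The sketch below simulates an output-stationary grid, where each cell keeps its own accumulator and operands arrive skewed by one cycle per row and column. That is a simplifying assumption chosen for brevity; it is not a description of how any particular TPU arranges its array, and the function name is illustrative.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic grid.

    Returns (C, cycles), where cycles is the number of clock ticks until
    the last cell has consumed its last operand pair.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    cycles = n + m + k - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                step = t - i - j          # cell (i, j) starts i + j cycles late
                if 0 <= step < k:
                    # One multiply-add per active cell per cycle.
                    C[i][j] += A[i][step] * B[step][j]
    return C, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)        # [[19, 22], [43, 50]]
print(cycles)   # 4
```

The fill and drain cycles (the `n + m - 2` part) are overhead; with long inner dimensions they are amortized and nearly every cell is busy on nearly every cycle.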

Tensor Cores — the GPU's matmul lane

NVIDIA introduced Tensor Cores in Volta (2017) to compete with TPUs without ditching the GPU's general-purpose programmability. Each Tensor Core is a small systolic array (originally 4×4×4 FP16-with-FP32-accumulate) inside the SM. A single instruction (HMMA) performs the matrix-multiply-accumulate. Modern Tensor Cores support FP8, INT8, INT4, sparse matrices, and the new "FP4" format on Blackwell.

Tensor Cores deliver the overwhelming majority of an H100's matrix throughput: roughly 990 TFLOPS of dense BF16, against the ~67 TFLOPS of FP32 that the regular CUDA cores provide. AI workloads that don't use Tensor Cores therefore see well under a tenth of peak, a strong incentive to write code that compilers can lower onto them. The cuBLAS / cuDNN library calls are tuned to use them; hand-written CUDA that bypasses them gives up an order of magnitude.

Apple's unified memory

Apple silicon (M1/M2/M3/M4) breaks the GPU/CPU memory split. Both the CPU and GPU share the same DRAM through the same memory controller. There's no "copy data to GPU" step — pointers are valid in either context. Memory bandwidth is shared, capping out at ~800 GB/s on M3 Ultra (vs 3 TB/s on H100), but the latency is much better and the practical throughput on smaller models is competitive.

This is good for inference workloads with model sizes that fit (an M3 Ultra has up to 512 GB of unified memory, enough for most current open-weights models, though the very largest fit only when quantized). It's less good for training, where the bandwidth gap matters more. For laptops and edge devices, unified memory is the dominant architecture; for datacenters, discrete HBM-equipped GPUs still win.

NPUs and AMX — accelerators on the SoC

Apple's Neural Engine, Qualcomm's NPU, Intel's AMX (Advanced Matrix Extensions), and AMD's XDNA NPU are all on-chip accelerators designed for tensor math. They're far smaller than discrete GPUs but much more power-efficient: an Apple M4's Neural Engine delivers ~38 TOPS at INT8 for ~1 W; an H100 delivers ~4,000 TOPS at INT8 (with structured sparsity; about half that dense) for 700 W.
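Dividing those figures through gives the efficiency gap. A trivial sketch using only the approximate numbers quoted above; the function name is illustrative.

```python
def tops_per_watt(tops, watts):
    """Throughput per unit power, in TOPS/W."""
    return tops / watts


print(f"{tops_per_watt(38, 1):.1f}")       # 38.0  (Neural Engine, approximate)
print(f"{tops_per_watt(4000, 700):.1f}")   # 5.7   (H100, approximate)
```

On these inputs the on-chip unit comes out several times more efficient per watt, while the H100 delivers roughly a hundred times the absolute throughput.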

The trade: these on-CPU units are tightly bound to specific frameworks (CoreML on Apple, OpenVINO on Intel, DirectML on Windows). They can't run arbitrary CUDA kernels. But for inference of pre-converted models — image recognition, speech, modest LLMs — they're the right tool. Battery-life workloads will increasingly target them; datacenter training will not.

When GPUs win, and when they do not

A GPU wins when the work is massively data-parallel: the same operation applied to a large number of independent elements, with regular memory access and few data-dependent branches. Graphics is the original case — shading millions of pixels, each independent, with the same program. Machine learning is the case that made GPUs the most valuable silicon on the planet, because training and inference are stacks of matrix multiplies, which are the most parallel and regular operations there are. Scientific simulation — fluid dynamics, weather, molecular dynamics, finite-element solvers — fits the same mould: a grid of cells, each updated by the same stencil, over and over.

A GPU loses when the work is branchy, serial, or small. Branchy work — code where the path taken depends heavily on the data, like parsing or graph traversal — causes warp divergence and wastes most of the lanes. Serial work, where each step depends on the previous one, cannot be spread across lanes at all and runs on a single weak GPU thread, far slower than a CPU. And small work simply does not pay for itself: spinning up a kernel and copying data across PCIe has fixed overhead, so a task that takes microseconds on a CPU can take longer on a GPU once the launch and transfer costs are counted. The rule of thumb is blunt but useful: if you cannot describe the work as "do this same thing to a very large number of things," a GPU is probably the wrong tool.

Why LLM inference is memory-bandwidth-bound

The headline use of GPUs today is running large language models, and that workload lands in a surprising place: it is limited by memory bandwidth, not by compute. The reason is arithmetic intensity. Generating one token at a time means reading the model's entire weight matrix from HBM to do a handful of multiply-adds per weight, then moving on to the next token. At batch size one, almost every weight is read once and used once. The GPU's enormous arithmetic throughput sits mostly idle because the bottleneck is how fast weights can stream out of HBM.

You can read this straight off the numbers. A 70-billion-parameter model in 16-bit precision is ~140 GB of weights. On a GPU with 3 TB/s of HBM bandwidth, a single pass over those weights takes about 140 / 3000 = 47 ms, which sets a floor on the time per token regardless of how many TFLOPS the chip can do. The compute to go with that pass — a couple of operations per weight — finishes in a fraction of the time. The model is bandwidth-bound, and the lever that helps is batching: serving many requests at once reuses each weight across all of them, raising the arithmetic intensity and pushing the work toward the compute-bound regime. This is the central tension of LLM serving, and the inference and serving page is built around it — prefill versus decode, the KV cache, and why throughput and latency pull against each other. It is also why every new accelerator generation chases HBM bandwidth as hard as it chases FLOPS.
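The same arithmetic as a small sketch, with one further assumption made explicit: each weight contributes one multiply-add (2 FLOPs) per request in the batch, and KV-cache and activation traffic are ignored. The function names are illustrative.

```python
def decode_floor_ms(params, bytes_per_param, bandwidth_gbs):
    """Minimum time per decode step: one full pass over the weights."""
    weight_bytes = params * bytes_per_param
    return weight_bytes / (bandwidth_gbs * 1e9) * 1e3


def decode_intensity(batch_size, bytes_per_param):
    """FLOPs per weight byte read when a batch shares one pass over the weights."""
    return 2 * batch_size / bytes_per_param


print(f"{decode_floor_ms(70e9, 2, 3000):.0f} ms")   # 47 ms
print(decode_intensity(1, 2))     # 1.0 FLOP/byte at batch size 1
print(decode_intensity(64, 2))    # 64.0 FLOPs/byte at batch size 64
```

The time for one pass over the weights does not change with batch size, so batching raises the work done per pass rather than shortening the pass.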

Common misconceptions

"GPUs are good at everything parallel." They're great at SIMD-shaped, regularly-strided, dense work. They're terrible at branchy, irregular, pointer-chasing parallelism (graph traversal, sparse linear algebra without specialised support). The execution model penalises divergence.
"More TFLOPS means faster training." Only if you can keep them fed. On layers that are bandwidth-bound (the attention layers, for example), a 240 TFLOP B200 with 8 TB/s HBM outruns a 67 TFLOP H100 with 3 TB/s by roughly the 2.7× bandwidth ratio, not the 3.6× compute ratio. Compute-density gains help only when the workload has the arithmetic intensity to use them.
"TPUs are obsolete because of GPUs." Google trained Gemini and serves Search inference on TPUs. The economics depend on whether you control the workload (TPUs win when matmul dominates) and whether you need flexibility (GPUs win for everything else, including most academic workloads).
"On-CPU accelerators replace GPUs." Only for inference of compact models. Training and large-model inference still need discrete GPUs. The split is roughly: laptops use NPUs/AMX; phones use NPUs; datacenters use GPUs.

Numbers worth remembering

Quantity	Value
NVIDIA H100 peak FP32 TFLOPS (CUDA cores)	67
NVIDIA B200 peak FP8 TFLOPS	~5,000 (with sparsity)
H100 HBM3 bandwidth	~3 TB/s
B200 HBM3e bandwidth	~8 TB/s
NVIDIA warp size	32 threads
AMD wavefront size	32 or 64 threads
Active warps per H100 SM	up to 64
Number of SMs in H100	132
Tensor Core dimensions, modern	16×16×16 (varies by precision)
NVLink 4.0 bandwidth	900 GB/s total (450 GB/s per direction)
Apple M3 Ultra unified memory	up to 512 GB
Apple Neural Engine (M4)	~38 TOPS INT8
Google TPU v5e BF16 TFLOPS	~197