SIMD and vector throughput
One AVX-512 instruction does the work of 16 scalar float32 instructions. SIMD (Single Instruction, Multiple Data) is the cheapest way to multiply throughput on data-parallel work: vector registers up to 512 bits wide are packed with float, double, or integer elements, and one instruction operates on all N lanes in lockstep. The catch: not every workload is vector-shaped, the wider registers cost real power, and Intel's AVX-512 on early Skylake-X lowered clock speeds whenever it ran. This page walks the story from MMX to AMX.
The vector idea
Most CPU instructions operate on one number at a time. add x10, x11, x12
adds two 64-bit integers and produces one. SIMD changes that: a single instruction
operates on a vector of 4, 8, or 16 elements packed into a wide register,
and produces a corresponding vector of results. The hardware has multiple ALUs
operating in lockstep — the same operation, different data. Hence "Single
Instruction, Multiple Data".
The shape of a SIMD instruction is straightforward: vector add (VPADDD in x86),
vector multiply (VPMULLD), shuffle (VPSHUFB), gather (VPGATHERDD, which loads
from N different addresses into one vector register). The interesting work is in
how you lay out the elements, in what order, and whether the compiler can
vectorize your code.
The picture to hold in your head is two ways of spending a cycle. A scalar add occupies one execution port and produces one result. A SIMD add occupies one execution port, takes the same single instruction, and produces eight results. The cost in instruction slots, decode bandwidth, and program size is identical; the work done is multiplied by the lane count. That is the whole pitch.
The two halves of that picture run on the same core, often on the same execution port, decoded from the same instruction stream. The difference is purely how wide the operands are. This is why SIMD is sometimes called free parallelism: you do not spawn threads, you do not synchronise, you do not pay scheduling cost. You widen the data and let one instruction chew through more of it. The whole game then becomes feeding those wide lanes fast enough, which is where alignment, cache behaviour, and memory bandwidth start to dominate.
Data parallelism, not task parallelism
SIMD is one specific flavour of parallelism, and it helps to place it against the others. Task parallelism runs different work on different cores: thread A handles one request, thread B handles another, and they coordinate through locks or queues. Data parallelism runs the same operation across many data elements at once. SIMD is data parallelism inside a single instruction stream — one core, one thread, one program counter, but each arithmetic instruction touches a whole vector of values.
The two compose. A typical numeric kernel uses threads to split a large array across cores (task-level, one chunk per core) and SIMD to process each chunk eight or sixteen elements at a time (data-level, inside each core). A machine with 16 cores and AVX-512 has 16 × 16 = 256 float32 lanes available per cycle before you count the second vector port or fused multiply-add. That product, cores times lanes, is the peak the hardware can ever reach, and it is the ceiling the roofline model draws on its vertical axis when it asks whether your loop is compute-bound or memory-bound.
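The cores-times-lanes arithmetic can be written down as a tiny helper. This is only a sketch: the function name and parameters are hypothetical, and the port count and FMA support are inputs you would look up for a specific CPU rather than facts this page states.

```c
#include <assert.h>
#include <stdint.h>

// Peak arithmetic operations per cycle for the whole chip.
// lanes_per_vector: e.g. 16 for float32 in a 512-bit register.
// vector_ports:     how many vector units each core can issue to per cycle.
// has_fma:          a fused multiply-add counts as 2 operations per lane.
uint32_t peak_ops_per_cycle(uint32_t cores, uint32_t lanes_per_vector,
                            uint32_t vector_ports, int has_fma) {
    assert(cores > 0 && lanes_per_vector > 0 && vector_ports > 0);
    uint32_t lanes = cores * lanes_per_vector;   // the "cores times lanes" product
    return lanes * vector_ports * (has_fma ? 2u : 1u);
}
```

With 16 cores and 16 lanes, one port and no FMA gives the 256 figure from the text; two ports with FMA would give four times that.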
The reason this distinction matters in practice: data parallelism has almost no coordination cost, while task parallelism pays for synchronisation, false sharing, and scheduler overhead. If your work is data-parallel, reach for SIMD first. It is the cheapest multiplier you have, and it stacks on top of threads rather than competing with them.
Lanes in a vector add
Each lane is one 32-bit element processed in parallel with the others, and the vector width sets how many lanes there are: 4 at 128 bits, 8 at 256 bits, 16 at 512 bits. A single instruction processes the whole row.
Inside a vector register
A vector register is a fixed-width box of bits with no inherent type. The same 512 bits in an AVX-512 register can be read as 16 float32s, 8 float64s, 32 int16s, or 64 bytes, depending on which instruction you point at it. The instruction carries the type; the register just holds bits. That is why the same physical register file backs integer SIMD, float SIMD, and string scanning — the lanes are an interpretation, not a property of the hardware.
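One way to see "the register just holds bits" without any SIMD hardware is to model a 512-bit register as 64 raw bytes and read the same lane back under two interpretations. The struct and function names below are made up for illustration, and the sketch assumes a 4-byte IEEE-754 `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// A 512-bit "register" modeled as 64 untyped bytes.
typedef struct { unsigned char bytes[64]; } vec512_model;

// Fill all 16 float32 lanes with the same value.
vec512_model broadcast_f32(float value) {
    assert(sizeof value == 4);                 // assumption: 32-bit float
    vec512_model v;
    for (int lane = 0; lane < 16; lane++)
        memcpy(&v.bytes[lane * 4], &value, sizeof value);
    return v;
}

// Read 32-bit lane `lane` as an unsigned integer: same bits, different type.
uint32_t lane_as_u32(const vec512_model *v, int lane) {
    assert(lane >= 0 && lane < 16);
    uint32_t out;
    memcpy(&out, &v->bytes[lane * 4], sizeof out);
    return out;
}

// Read 32-bit lane `lane` as a float.
float lane_as_f32(const vec512_model *v, int lane) {
    assert(lane >= 0 && lane < 16);
    float out;
    memcpy(&out, &v->bytes[lane * 4], sizeof out);
    return out;
}
```

Broadcasting `1.0f` and reading a lane with `lane_as_u32` yields the IEEE-754 bit pattern `0x3F800000`; nothing in the storage changed, only the reader.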
This is the single most useful thing to internalise about SIMD performance: lane count is the register width divided by the element size. Cut your element from 32 bits to 8 bits and you quadruple your lanes for free. It is why machine-learning inference moved aggressively to int8 and even int4 quantisation — narrower types are not just smaller in memory, they are wider in the vector unit, so the same hardware does four to eight times the arithmetic per cycle. The same logic drives image pipelines that work on 8-bit pixels and audio codecs on 16-bit samples.
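The lane-count rule is simple enough to state as code. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>

// Lane count = register width / element width, both in bits.
unsigned lane_count(unsigned register_bits, unsigned element_bits) {
    assert(element_bits != 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

For a 512-bit register this gives 16 lanes at 32 bits, 64 at 8 bits, and 128 at 4 bits, which is the four-to-eight-times gain that quantisation buys.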
Throughput vs scalar
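For an array of N independent integer additions, the ideal cycle count drops by a factor equal to the lane count: 16 lanes turn N scalar adds into roughly N / 16 vector adds. Memory bandwidth and the leftover tail elements keep real speedups below that ideal.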
Twenty-eight years of vector ISAs
| Year | ISA | Width | Notes |
|---|---|---|---|
| 1996 | MMX | 64-bit | Intel — integer only, shared with FP registers |
| 1999 | SSE | 128-bit | 4 × float32 — Pentium III |
| 2000 | SSE2 | 128-bit | Added 128-bit integer and float64 operations (2 × f64, 16 × i8); Pentium 4 |
| 2008 | SSE4.2 | 128-bit | String + CRC instructions |
| 2011 | AVX | 256-bit | 8 × float32 — Sandy Bridge debut |
| 2013 | AVX2 | 256-bit | 256-bit integer ops; gather instructions |
| 2013 | NEON (ARMv8-A) | 128-bit | Standard on every ARM64 chip; the date is when 64-bit ARM cores first shipped |
| 2017 | AVX-512 F | 512-bit | 16 × float32; mask registers; conflict detection |
| 2021 | SVE / SVE2 | variable | ARM scalable vectors — code is width-agnostic, runtime length 128–2048 bits |
| 2023 | AMX | 8 × 1 KB tiles | Intel matrix-multiply accelerator on Sapphire Rapids+ |
| 2024 | AVX10 | 128–512 | Unified, versioned successor to AVX-512; designed to span P-cores and E-cores |
Each width doubling brought a new ISA generation. AVX-512's 512-bit register holds 16 floats, 8 doubles, or 64 bytes. Beyond that, the newest entries stop doubling. ARM SVE (2021) uses scalable vectors whose hardware width is discovered at runtime, so the same compiled binary runs at 128 bits on a small core and 512 bits on a server core. AVX10 (2024) is not scalable in that sense: it repackages the AVX-512 feature set as a single versioned ISA with fixed 128-, 256-, and 512-bit vector lengths.
AVX-512 license-down — and why it's mostly dead
On Intel Skylake-X (2017) and Cascade Lake (2019) chips, sustained use of heavy 512-bit instructions moved the core into a lower-frequency power "license" (the lowest level is usually called license 2). Frequency dropped by ~10–25%, and the drop also applied to any scalar code running on the same core. The penalty lasted hundreds of microseconds after the last AVX-512 instruction. On workloads that only occasionally used AVX-512 between long stretches of scalar code (most of them), this was a net loss: AVX-512 sped up its own work but slowed everything else down enough to cancel the gain.
The community's response was fierce. Cloudflare, Mozilla, and others published benchmarks showing how disabling AVX-512 made workloads faster. Linus Torvalds famously wrote in 2020 that he hoped AVX-512 "dies a painful death". Subsequent generations fixed the issue: Ice Lake (2019) reduced the penalty to ~5%; Sapphire Rapids (2023) eliminated it almost entirely. AMD Zen 4 (2022) and Zen 5 (2024) implement AVX-512 without any meaningful frequency drop.
Auto-vectorization vs intrinsics
Modern compilers (GCC, Clang, MSVC) auto-vectorize loops that look like this:
```c
void add_arrays(int *a, int *b, int *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
// Clang -O3 -march=skylake-avx512 -mprefer-vector-width=512 produces roughly
// this inner loop (simplified; real output is unrolled and guarded by a
// runtime overlap check, and without the width flag Clang prefers 256-bit ymm):
// .loop:
//   vmovdqu64 zmm0, [rdi + rcx*4]        ; load 16 ints from a
//   vpaddd    zmm0, zmm0, [rsi + rcx*4]  ; add 16 ints from b
//   vmovdqu64 [rdx + rcx*4], zmm0        ; store 16 ints to out
//   add       rcx, 16
//   cmp       rcx, r8
//   jl        .loop
```
This works when the loop is simple enough: a trip count known on entry, no complex internal branches, no data dependencies between iterations, and no aliasing the compiler cannot rule out. When the compiler can't or won't vectorize, you fall back to intrinsics, C functions that map roughly 1:1 to SIMD instructions. They look like:
```c
#include <immintrin.h>

// Processes 16 ints per iteration; assumes n is a multiple of 16.
void add_arrays_avx512(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512i va   = _mm512_loadu_epi32(&a[i]);
        __m512i vb   = _mm512_loadu_epi32(&b[i]);
        __m512i vsum = _mm512_add_epi32(va, vb);
        _mm512_storeu_epi32(&out[i], vsum);
    }
}
// Same generated assembly. Verbose but explicit.
```
Intrinsics give you exact control: you choose the alignment, the masking, and the load and store variants. The cost is portability: AVX-512 intrinsics don't run on Apple silicon, which uses NEON. Workarounds: write intrinsic versions for each ISA, or use a portable wrapper library like Highway (Google), xsimd, or std::experimental::simd (adopted into the standard as std::simd in C++26).
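The "one intrinsic version per ISA" workaround might look like the sketch below. The function name is hypothetical, the ISA is chosen at compile time from the compiler's predefined macros, and the scalar loop doubles as both the tail handler and the fallback when neither vector path is available. It assumes the sums do not overflow `int32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// out[i] = a[i] + b[i], using whichever vector ISA the build targets.
void add_arrays_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__AVX512F__)
    for (; i + 16 <= n; i += 16) {           // 16 x int32 per 512-bit vector
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        _mm512_storeu_si512(out + i, _mm512_add_epi32(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {             // 4 x int32 per 128-bit vector
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(out + i, vaddq_s32(va, vb));
    }
#endif
    for (; i < n; i++)                       // scalar tail, or the whole array
        out[i] = a[i] + b[i];
}
```

Every extra ISA adds another branch to maintain, which is the argument for a wrapper library once more than two targets are in play.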
A practical workflow that holds up: write the loop in plain scalar form first, get
it correct, then check whether the compiler already vectorized it. Most compilers
emit a vectorization report on request (-fopt-info-vec in GCC,
-Rpass=loop-vectorize in Clang) that tells you which loops became SIMD
and which were rejected, and why. Reach for intrinsics only on the loops that show
up hot in a profile and that the compiler refused to vectorize. Hand-writing
intrinsics everywhere is a common mistake: it is slow to write, hard to read, easy
to get wrong, and usually no faster than letting the compiler handle the
straightforward cases. Save the effort for the inner kernel that actually pays for
it.
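Possible aliasing is a common reason a vectorization report shows a loop as rejected or guarded by a runtime check. A minimal sketch of the usual fix, using C99 `restrict` on a hypothetical kernel:

```c
#include <assert.h>
#include <stddef.h>

// restrict promises the compiler that out, a, and b never overlap, so it can
// vectorize without emitting a runtime overlap check. Calling this with
// overlapping buffers is undefined behaviour; the promise is the caller's.
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n) {
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i];
}
```

Building this with `-O3` plus the report flag named above (`-fopt-info-vec` or `-Rpass=loop-vectorize`) should show whether the loop was vectorized on your compiler; the exact message wording varies by version.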
Mask registers — the AVX-512 quiet revolution
AVX-512 introduced eight mask registers (k0–k7),
each holding up to 64 bits. Almost every AVX-512 instruction takes an optional
mask: lanes whose mask bit is 0 either keep the destination's old value
(merge-masking) or are set to zero (zero-masking). Mask registers turn
data-dependent control flow into branchless vector code.
```c
// Conditional add: out[i] = a[i] + b[i] if a[i] > 0 else a[i]
__m512i va = _mm512_loadu_epi32(&a[i]);
__m512i vb = _mm512_loadu_epi32(&b[i]);
__mmask16 m = _mm512_cmpgt_epi32_mask(va, _mm512_setzero_si512());
// merge-masked add: where mask is 1, use va+vb; where 0, keep va (the src operand)
__m512i vsum = _mm512_mask_add_epi32(va, m, va, vb);
_mm512_storeu_epi32(&out[i], vsum);
```
Without masks, the same operation would require either a branch (which kills vectorization) or a separate compare-and-blend sequence. Masks fold the conditional into the instruction itself: no branch, no divergence cost, full SIMD throughput even when only some lanes do "work". This is the same predication idea GPUs use for divergent threads, brought to CPU SIMD.
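To make the masking semantics concrete without AVX-512 hardware, here is a scalar model of the three operations involved. The `*_model` names are invented for this sketch; each function imitates, lane by lane, what the corresponding intrinsic (`_mm512_cmpgt_epi32_mask`, `_mm512_mask_add_epi32`, `_mm512_maskz_add_epi32`) is documented to do.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

// Compare a > b per lane and pack the results into a 16-bit mask.
uint16_t cmpgt_mask_model(const int32_t *a, const int32_t *b) {
    assert(a && b);
    uint16_t mask = 0;
    for (int lane = 0; lane < LANES; lane++)
        if (a[lane] > b[lane])
            mask |= (uint16_t)(1u << lane);
    return mask;
}

// Merge-masking: lanes with mask bit 0 take their value from src.
void mask_add_model(int32_t *dst, const int32_t *src, uint16_t mask,
                    const int32_t *a, const int32_t *b) {
    assert(dst && src && a && b);
    for (int lane = 0; lane < LANES; lane++)
        dst[lane] = ((mask >> lane) & 1) ? a[lane] + b[lane] : src[lane];
}

// Zero-masking: lanes with mask bit 0 become 0.
void maskz_add_model(int32_t *dst, uint16_t mask,
                     const int32_t *a, const int32_t *b) {
    assert(dst && a && b);
    for (int lane = 0; lane < LANES; lane++)
        dst[lane] = ((mask >> lane) & 1) ? a[lane] + b[lane] : 0;
}
```

The model runs one lane at a time, but the point carries over: the mask is plain data, so "which lanes do work" is decided by a compare, not by a branch.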
When SIMD wins, loses, partly works
| Pattern | SIMD fit | Notes |
|---|---|---|
| Vector add of two arrays | YES | Textbook fit. Auto-vectorized by every compiler since GCC 4.x. |
| Sum reduction (a + b + c + …) | YES | Use multiple accumulators or tree reduction to avoid the dependency chain. |
| Dot product | YES | Vectorized multiply-accumulate (FMA) in the loop, then one horizontal add at the end. The final reduce is a short shuffle-and-add sequence, not a single instruction. |
| Linear interpolation, gamma correction | YES | Pure data-parallel; one of SIMD's best cases. |
| Pointer chasing (linked list traversal) | NO | Each step depends on the previous load — no parallelism to exploit. |
| Hash table probe | PART | Possible with gather instructions but each load can miss independently — bandwidth-limited, not compute-limited. |
| JSON parsing | PART | simdjson uses SIMD for byte-level scanning, but control flow is inherently scalar. |
| String comparison / substring search | YES | SSE4.2 has dedicated string instructions; AVX2 and AVX-512 code uses wide byte compares instead. memcmp / strchr / strstr all win. |
| Sorting | PART | Bitonic sort and quicksort partitions vectorize well; merge phases are scalar. |
| Cryptography (AES, SHA, BLAKE) | YES | AES-NI for ciphers, SHA-NI for SHA-1 / SHA-256; modern x86 has dedicated SIMD-shaped crypto units. |
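The "multiple accumulators" advice in the sum-reduction row can be sketched in plain C. Four independent running sums break the single dependency chain, which gives the compiler (or the out-of-order core) independent work to overlap. The function name is hypothetical, and because float addition is not associative, the result can differ in the last bits from a strict left-to-right sum.

```c
#include <assert.h>
#include <stddef.h>

// Sum with four independent accumulators instead of one serial chain.
float sum_four_accumulators(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {       // four adds per step, none depends on another
        acc0 += x[i];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
        acc3 += x[i + 3];
    }
    float total = (acc0 + acc1) + (acc2 + acc3);   // small tree reduction
    for (; i < n; i++)                 // leftover elements
        total += x[i];
    return total;
}
```

In a real vectorized reduction each accumulator would itself be a vector register; the structure is the same.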
ARM SVE — variable-width vectors
ARM's Scalable Vector Extension (SVE) takes a different approach from AVX. Instead of a fixed register width, SVE registers can be any multiple of 128 bits from 128 to 2048; the actual width is fixed by the silicon and discovered at runtime. The same compiled binary runs unchanged on a 128-bit implementation (current Armv9 phone cores), a 256-bit one (Neoverse V1), and a 512-bit one (Fujitsu A64FX). The loop just iterates more or fewer times to cover the array.
The mechanism: SVE instructions take a predicate (similar to AVX-512 masks) that says which elements of the current iteration to actually process. The compiler emits a "vector-length-agnostic" loop that asks the hardware "how many elements fit in your registers?" and processes that many per iteration. Apple silicon is the notable holdout: it relies on NEON, and as of M4 it adds the matrix extension SME rather than general-purpose SVE.
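A vector-length-agnostic loop can be modeled in portable C by passing the hardware lane count in as a runtime parameter. This is a scalar model, not SVE code: `hw_lanes` stands in for what SVE code would read from the hardware (for example via the ACLE `svcntw()` intrinsic), and the inner `if` stands in for a `WHILELT`-style predicate. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// out[i] = a[i] + b[i], written so the same code works for any lane count.
// Returns how many "vector iterations" were needed.
size_t vla_add_model(const int32_t *a, const int32_t *b, int32_t *out,
                     size_t n, size_t hw_lanes) {
    assert(hw_lanes > 0);
    size_t iterations = 0;
    for (size_t base = 0; base < n; base += hw_lanes) {
        for (size_t lane = 0; lane < hw_lanes; lane++) {
            // Predicate: a lane is active only while its index is in range,
            // so the final partial vector needs no separate tail loop.
            if (base + lane < n)
                out[base + lane] = a[base + lane] + b[base + lane];
        }
        iterations++;
    }
    return iterations;
}
```

Ten elements take three iterations at 4 lanes and one iteration at 16 lanes, with identical results: the width changes the trip count, not the code.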
Alignment, and why it matters
A vector load pulls 16, 32, or 64 bytes from memory in one go. The hardware is happiest when that block starts on an address that is a multiple of the vector width — a 32-byte AVX2 load from an address divisible by 32, a 64-byte AVX-512 load from an address divisible by 64. This is called alignment, and it matters because an aligned load maps cleanly onto a single cache line access, while an unaligned load can straddle two cache lines and cost an extra access.
There are two families of vector load instruction for exactly this reason. The
aligned form (vmovdqa, _mm256_load_si256) assumes the
address is aligned and faults if it is not — historically it was also faster. The
unaligned form (vmovdqu, _mm256_loadu_si256) handles any
address. On modern Intel and AMD cores the speed gap between the two has nearly
closed when the data happens to be aligned anyway, so most code now uses the
unaligned form everywhere and simply tries to allocate aligned buffers. The penalty
that remains shows up when an access actually crosses a cache-line or page boundary,
which the unaligned form makes legal but not free.
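Whether a given access pays the boundary-crossing penalty is plain address arithmetic. A small sketch, assuming 64-byte cache lines (typical on current x86, but an assumption here) and hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   // assumption: 64-byte cache lines

// True if address is a multiple of alignment (alignment must be a power of two).
int is_aligned(uintptr_t address, size_t alignment) {
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}

// True if an access of access_bytes starting at address touches two cache lines.
int crosses_cache_line(uintptr_t address, size_t access_bytes) {
    assert(access_bytes > 0 && access_bytes <= CACHE_LINE_BYTES);
    uintptr_t first_line = address / CACHE_LINE_BYTES;
    uintptr_t last_line  = (address + access_bytes - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

A 32-byte load at offset 32 stays inside one line; the same load at offset 48 straddles two. A 64-byte AVX-512 load crosses a line at every address that is not 64-byte aligned, which is why alignment matters most at the widest width.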
Two practical habits follow. Allocate vector buffers with an aligned allocator
(aligned_alloc, posix_memalign, or an over-aligned type)
so the common case is cheap. And mind the tail: arrays rarely divide evenly
by the lane count, so the last few elements that do not fill a full vector need
either a scalar cleanup loop or a masked vector operation. The tail is a frequent
source of off-by-one bugs and of disappointing speedups on short arrays, because the
fixed setup cost of a vector loop is amortised over too few elements.
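Both habits fit in a few lines. The sketch below uses C11 `aligned_alloc`, which expects the size to be a multiple of the alignment, so the byte count is rounded up first; the helper names and the `loop_split` struct are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Allocate room for n floats on a 64-byte boundary. Caller frees with free().
float *alloc_vector_buffer(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 63u) / 64u * 64u;   // size must be a multiple of 64
    if (rounded == 0)
        rounded = 64u;
    return aligned_alloc(64, rounded);
}

// How many elements the vector loop covers, and how many are left for the tail.
typedef struct {
    size_t vector_elems;   // processed in full vectors
    size_t tail_elems;     // left for a scalar or masked cleanup
} loop_split;

loop_split split_for_lanes(size_t n, size_t lanes) {
    assert(lanes > 0);
    loop_split s;
    s.tail_elems = n % lanes;
    s.vector_elems = n - s.tail_elems;
    return s;
}
```

For 100 floats at 16 lanes the split is 96 vector elements plus a 4-element tail. Computing the split once, up front, is one way to keep the off-by-one risk in a single place.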
SIMD, SIMT, and tensor units
SIMD has two close relatives that are easy to confuse with it. The first is SIMT — Single Instruction, Multiple Thread — which is how GPUs work. A GPU groups threads into a warp (32 on NVIDIA) or wavefront (32 or 64 on AMD), and the whole group executes the same instruction in lockstep, each thread on its own data. That is SIMD in spirit, but exposed to the programmer as independent threads rather than explicit vector registers. The compiler and hardware hide the lanes behind a thread abstraction, which is why GPU code reads like scalar code that happens to run thousands of times.
The differences are mostly about scale and latency. A CPU runs tens of SIMD lanes with single-digit-nanosecond instruction latency and a deep cache hierarchy to keep them fed. A GPU runs tens of thousands of SIMT lanes and hides memory latency by swapping between warps instead of caching aggressively. CPUs win on branchy, latency-sensitive, low-parallelism work; GPUs win when you have enough independent data to drown out memory latency. They are not competitors so much as different points on the same data-parallel curve. The roofline model is the tool that tells you which one a given kernel belongs on.
The second relative is the tensor unit: Intel AMX, NVIDIA tensor cores, Apple AMX, Google TPU MXUs. These do not process a vector with one instruction; they process a small matrix multiply with one instruction. A single AMX or tensor-core instruction multiplies, say, a 16×16 tile by another and accumulates, which is thousands of multiply-adds (16 × 16 × 16 = 4096) folded into one op. They exist because matrix multiply is the dominant operation in deep learning, and feeding it through ordinary SIMD lanes leaves too much throughput on the table. If you care about how these get fed in production, the inference and serving page covers how the work is batched and scheduled to keep the matrix units busy.
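To get a feel for how much work one tile operation folds together, here is a scalar model of a 16×16 tile multiply-accumulate that also counts its multiply-adds. It is an illustration only: the function name is invented, and real tile units differ in data types, tile shapes, and memory layout.

```c
#include <assert.h>
#include <stdint.h>

#define TILE 16

// C += A * B on 16x16 tiles (int8 inputs, int32 accumulators).
// Returns the number of multiply-adds a single tile instruction would replace.
unsigned tile_multiply_accumulate(int32_t c[TILE][TILE],
                                  int8_t a[TILE][TILE],
                                  int8_t b[TILE][TILE]) {
    assert(c && a && b);
    unsigned multiply_adds = 0;
    for (int row = 0; row < TILE; row++) {
        for (int col = 0; col < TILE; col++) {
            for (int k = 0; k < TILE; k++) {
                c[row][col] += (int32_t)a[row][k] * (int32_t)b[k][col];
                multiply_adds++;
            }
        }
    }
    return multiply_adds;
}
```

The triple loop runs 16 × 16 × 16 = 4096 multiply-adds for one tile pair, which is the work a tensor unit retires as a single operation.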
The progression is one of widening granularity. Scalar does one element per instruction. SIMD does a vector. SIMT does a vector but calls each lane a thread. Tensor units do a whole tile. Each step trades flexibility for throughput, and each step is the right tool only when your work has the matching shape: enough independent elements, enough independent rows, enough matrix structure to fill the wider unit.
Common misconceptions
- "AVX-512 is always faster than AVX2." On Skylake-X, often slower in mixed code due to license-down. On modern Intel and AMD, faster on data-parallel work. On Apple silicon, irrelevant: Apple uses NEON, not AVX.
- "Auto-vectorization is good enough." For straightforward loops, yes. For complex inner kernels (image processing, parser inner loops, math libraries) hand-written intrinsics can still beat auto-vectorization, often by 20–60% and occasionally by 10×.
- "Wider is always better." No. On a 4-element array, AVX-512 has nothing to gain over narrower SIMD: four int32 values fill only a quarter of a 512-bit register, and setup and tail handling dominate the work. The break-even depends on the workload but is often around 16–32 elements; below that, narrower SIMD or scalar wins.
- "SIMD only matters for HPC." Modern memcpy, memset, JSON parsers, video codecs, regex engines, hash functions, and database scan operators all rely heavily on SIMD. The "boring" library routines you call every day are vectorized internally.
- "GPUs replaced SIMD." No, they're complementary. GPUs are SIMT (Single Instruction Multiple Thread), which is similar but optimized for thousands of lanes and high latency tolerance. CPUs use SIMD for tens of lanes with low latency. Different shapes for different workloads, and the roofline model tells you which one a kernel belongs on.
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| SSE register width | 128 bits | 4 × i32, 2 × i64, 4 × f32, 2 × f64 |
| AVX / AVX2 register width | 256 bits | 8 × i32, 8 × f32, 4 × f64 |
| AVX-512 register width | 512 bits | 16 × i32, 16 × f32, 8 × f64 |
| ARM NEON register width | 128 bits | Standard on every ARM64 chip |
| Number of AVX-512 mask registers | 8 (k0..k7) | k0 cannot serve as a write mask (its encoding means "no masking"); up to 64 bits each |
| Skylake-X AVX-512 license-down penalty | ~10–25% | Frequency drop; persisted ~600 µs after last AVX-512 op |
| Sapphire Rapids AVX-512 frequency drop | < 5% | The license-down problem largely resolved |
| Apple AMX matrix multiply throughput | ~2 TFLOPS per AMX unit (unofficial estimate) | Undocumented coprocessor outside the standard ISA, accessed via Accelerate framework |
| Intel AMX tile size (Sapphire Rapids+) | 16 rows × 64 bytes (1 KB) | Each core has 8 tile registers, 8 KB in total |
| simdjson parsing throughput | ~3 GB/s on AVX-512 | Roughly 4× a scalar parser |
Further reading
- Intel Intrinsics Guide — searchable reference for every x86 SIMD instruction with throughput / latency on each microarchitecture.
- ARM Intrinsics — equivalent reference for NEON / SVE.
- Agner Fog — Optimizing Assembly — Section 13 covers vector code in detail, including AVX-512 license-down measurements.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Chapter 4 (Data-Level Parallelism) covers SIMD as one of three flavours alongside vector machines and GPUs.
- Google Highway — portable SIMD wrapper that compiles to AVX-512, NEON, SVE, and SSE from one source.
- simdjson — the JSON parser that uses SIMD aggressively. The paper and source are excellent reading on practical SIMD application.
- Wikipedia — AVX-512 — comprehensive list of every AVX-512 sub-extension and the chips that support each.
- Chips and Cheese — measured SIMD throughput and license-down behaviour on every recent CPU.