
SIMD and vector throughput

One AVX-512 instruction does the work of up to 16 scalar instructions. SIMD (Single Instruction, Multiple Data) is the cheapest way to multiply throughput on data-parallel work: vector registers up to 512 bits wide hold packed float, double, or integer elements, and one instruction operates on all N lanes in lockstep. The catch: not every workload is vector-shaped, the wider registers cost real power, and Intel's AVX-512 on early Skylake-X clocked down every core that ran it. This page walks the story from MMX to AMX, with lane diagrams along the way.


The vector idea

Most CPU instructions operate on one number at a time. add x10, x11, x12 adds two 64-bit integers and produces one. SIMD changes that: a single instruction operates on a vector of 4, 8, or 16 elements packed into a wide register, and produces a corresponding vector of results. The hardware has multiple ALUs operating in lockstep — the same operation, different data. Hence "Single Instruction, Multiple Data".

The shape of a SIMD instruction set is straightforward: vector add (VPADDD on x86), vector multiply (VPMULLD), shuffle (VPSHUFB), gather (VPGATHERDD, which loads from N different addresses into one vector register). The interesting work is in how you lay out the data, in what order, and whether the compiler can vectorize your code.

The picture to hold in your head is two ways of spending a cycle. A scalar add occupies one execution port and produces one result. A SIMD add occupies one execution port, takes the same single instruction, and produces eight results. The cost in instruction slots, decode bandwidth, and program size is identical; the work done is multiplied by the lane count. That is the whole pitch, and the diagram below is the entire idea in one frame.

[Figure: scalar vs. 8-lane SIMD. Scalar, 1 result per instruction: a0 + b0 = c0, repeated 8 times = 8 instructions, 8 cycles. 8-lane SIMD, 8 results per instruction: (a0..a7) + (b0..b7), one VPADDD produces c0..c7 in a single cycle.]
Top: scalar code spends one instruction per element. Bottom: an 8-lane SIMD add does the same eight additions with one instruction in one cycle.

The two halves of that picture run on the same core, often on the same execution port, decoded from the same instruction stream. The difference is purely how wide the operands are. This is why SIMD is sometimes called free parallelism: you do not spawn threads, you do not synchronise, you do not pay scheduling cost. You widen the data and let one instruction chew through more of it. The whole game then becomes feeding those wide lanes fast enough, which is where alignment, cache behaviour, and memory bandwidth start to dominate.
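A minimal sketch of the two halves of that picture in C. It assumes GCC or Clang, whose `vector_size` attribute is a compiler extension rather than standard C. The names `v8i32`, `add_scalar`, and `add_vec8` are made up for this example. The compiler lowers the 8-lane add to whatever the target offers, which may be one 256-bit instruction or two 128-bit ones.

```c
#include <stddef.h>
#include <string.h>

/* Eight 32-bit ints in one 256-bit value (GCC/Clang vector extension). */
typedef int v8i32 __attribute__((vector_size(32)));

/* Scalar: one addition per loop iteration. */
void add_scalar(const int *a, const int *b, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* 8-lane: one vector addition per eight elements, then a scalar tail. */
void add_vec8(const int *a, const int *b, int *out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v8i32 va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* load that is safe at any alignment */
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                    /* all eight lanes in one expression */
        memcpy(out + i, &vc, sizeof vc); /* store that is safe at any alignment */
    }
    for (; i < n; i++) {                 /* leftover elements */
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce identical output for any `n`; the difference is only how many elements each loop iteration retires.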

Data parallelism, not task parallelism

SIMD is one specific flavour of parallelism, and it helps to place it against the others. Task parallelism runs different work on different cores: thread A handles one request, thread B handles another, and they coordinate through locks or queues. Data parallelism runs the same operation across many data elements at once. SIMD is data parallelism inside a single instruction stream — one core, one thread, one program counter, but each arithmetic instruction touches a whole vector of values.

The two compose. A typical numeric kernel uses threads to split a large array across cores (task-level, one chunk per core) and SIMD to process each chunk eight or sixteen elements at a time (data-level, inside each core). A machine with 16 cores and AVX-512 has 16 × 16 = 256 float32 lanes available per cycle before you count the second vector port or fused multiply-add. That product, cores times lanes, is the peak the hardware can ever reach, and it is the ceiling the roofline model draws on its vertical axis when it asks whether your loop is compute-bound or memory-bound.
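The cores-times-lanes arithmetic can be written down directly. This is a sketch with hypothetical helper names; the port count and clock speed used in the note after it are illustrative assumptions, not measurements of any particular chip.

```c
/* Lanes in one register: register width divided by element width. */
unsigned lanes_per_register(unsigned register_bits, unsigned element_bits) {
    return register_bits / element_bits;
}

/* Peak lanes per cycle across the machine, at one vector op per core per cycle. */
unsigned peak_lanes_per_cycle(unsigned cores, unsigned lanes) {
    return cores * lanes;
}

/* Peak GFLOP/s: cores x lanes x vector ports x (2 if an FMA counts as
   multiply plus add) x clock in GHz. */
double peak_gflops(unsigned cores, unsigned lanes, unsigned vector_ports,
                   int has_fma, double ghz) {
    return (double)cores * lanes * vector_ports * (has_fma ? 2.0 : 1.0) * ghz;
}
```

With the 16 cores and 16 float32 lanes from the text, `peak_lanes_per_cycle(16, 16)` is 256. Assuming two vector ports with FMA at a hypothetical 2.0 GHz, `peak_gflops(16, 16, 2, 1, 2.0)` comes to 2048 GFLOP/s.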

The reason this distinction matters in practice: data parallelism has almost no coordination cost, while task parallelism pays for synchronisation, false sharing, and scheduler overhead. If your work is data-parallel, reach for SIMD first. It is the cheapest multiplier you have, and it stacks on top of threads rather than competing with them.

Lanes in a vector add

Each cell is one lane: one 32-bit element being processed in parallel with the others. A single instruction processes the whole row.

AVX2 (256-bit, 8 × i32)

A   =   3   7   1   4   9   2   6   8
B   =   5   2   8   3   1   6   4   7
SUM =   8   9   9   7  10   8  10  15

Eight 32-bit additions in 1 instruction on a 256-bit register. The execution unit has 8 parallel adders, all firing in the same cycle. Throughput per cycle: 8 additions vs 1 for scalar.

Inside a vector register

A vector register is a fixed-width box of bits with no inherent type. The same 512 bits in an AVX-512 register can be read as 16 float32s, 8 float64s, 32 int16s, or 64 bytes, depending on which instruction you point at it. The instruction carries the type; the register just holds bits. That is why the same physical register file backs integer SIMD, float SIMD, and string scanning — the lanes are an interpretation, not a property of the hardware.

[Figure: one 256-bit register, three readings: 8 × f32 (f0..f7), 16 × i16, 32 × i8. Same 256 bits; the instruction picks the lane width.]
One register, three lane layouts. Narrower elements pack more lanes into the same width, so byte work runs at 32 or 64 lanes while double-precision runs at 4 or 8.

This is the single most useful thing to internalise about SIMD performance: lane count is the register width divided by the element size. Cut your element from 32 bits to 8 bits and you quadruple your lanes for free. It is why machine-learning inference moved aggressively to int8 and even int4 quantisation — narrower types are not just smaller in memory, they are wider in the vector unit, so the same hardware does four to eight times the arithmetic per cycle. The same logic drives image pipelines that work on 8-bit pixels and audio codecs on 16-bit samples.
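One way to see "the lanes are an interpretation" is a C union over 32 bytes of storage. This is only a model of a 256-bit register, with the hypothetical names `reg256` and `LANES`; it assumes IEEE 754 single-precision floats, which holds on every mainstream CPU.

```c
#include <stdint.h>

/* 256 bits of storage with four typed views. Reading a member other than
   the one last written reinterprets the same bytes, which C permits. */
typedef union {
    float    f32[8];   /* 8 lanes of 32-bit float   */
    uint32_t u32[8];   /* 8 lanes of 32-bit integer */
    int16_t  i16[16];  /* 16 lanes of 16-bit integer */
    int8_t   i8[32];   /* 32 lanes of 8-bit integer  */
} reg256;

/* Lane count = register width / element width. */
#define LANES(register_bits, element_bits) ((register_bits) / (element_bits))
```

Writing `1.0f` into `f32[0]` and reading `u32[0]` gives `0x3F800000`, the IEEE 754 bit pattern for 1.0: same bits, different reading. `LANES(512, 8)` is 64 and `LANES(512, 64)` is 8, which is the whole int8-versus-double gap in one division.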

Throughput vs scalar

For an array of N independent integer additions, the ideal cycle count drops by a factor equal to the lane count. For N = 1,024:

Width | Cycles | Speedup
------|--------|--------
scalar | 1,024 | 1×
SSE (4 lanes) | 256 | 4×
AVX2 (8 lanes) | 128 | 8×
AVX-512 (16 lanes) | 64 | 16×

This is the textbook upper bound: perfect lane utilisation, no startup cost, cache hits all the way. Real-world speedups are often closer to 0.5–0.8× of this because of memory bandwidth, alignment overhead, and tail handling for the last incomplete vector.
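The upper bound itself is a ceiling division, and writing it out shows where the tail starts to hurt. A small sketch with hypothetical function names, assuming one vector add retires per cycle and `n` is greater than zero:

```c
#include <stddef.h>

/* Ideal cycles for n independent adds at `lanes` results per cycle.
   The last, partly filled vector still costs a full cycle. */
size_t ideal_cycles(size_t n, size_t lanes) {
    return (n + lanes - 1) / lanes;
}

/* Fraction of lane slots that carry real data over the whole array. */
double lane_utilisation(size_t n, size_t lanes) {
    return (double)n / (double)(ideal_cycles(n, lanes) * lanes);
}
```

For 1,024 elements every width divides evenly, so utilisation is 1.0. For 20 elements on 16 lanes the loop needs 2 cycles and fills only 20 of 32 lane slots, a utilisation of 0.625 before any memory effects are counted.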

Twenty-eight years of vector ISAs

Year | ISA | Width | Notes
-----|-----|-------|------
1996 | MMX | 64-bit | Intel; integer only, shared with FP registers
1999 | SSE | 128-bit | 4 × float32; Pentium III
2001 | SSE2 | 128-bit | Doubled to handle int / double / 16 × i8
2008 | SSE4.2 | 128-bit | String + CRC instructions
2011 | AVX | 256-bit | 8 × float32; Sandy Bridge debut
2013 | AVX2 | 256-bit | 256-bit integer ops; gather instructions
2017 | AVX-512 F | 512-bit | 16 × float32; mask registers; conflict detection
2018 | NEON v8 | 128-bit | Standard on every ARM64 chip
2021 | SVE / SVE2 | variable | ARM scalable vectors; code is width-agnostic, runtime length 128–2048 bits
2023 | AMX | 8 KB tiles | Intel matrix-multiply accelerator on Sapphire Rapids+
2024 | AVX10 | 128–512 | Unified successor to AVX-512; available on E-cores

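Each width doubling brought a new ISA generation. AVX-512's 512-bit register holds 16 floats, 8 doubles, 64 bytes, or any other packing of equal-width elements. Beyond that, the two newest entries go in different directions. ARM SVE uses scalable vectors whose hardware width is discovered at runtime, so the same compiled binary runs at 128 bits on a small core and 512 bits on an HPC core. AVX10 keeps fixed-width registers and instead folds the many AVX-512 sub-extensions into one versioned feature set, so software checks a version number rather than a long list of feature flags.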

AVX-512 license-down — and why it's mostly dead

On Intel Skylake-X (2017) and Cascade Lake (2019) server chips, a core that executed AVX-512 instructions transitioned to a lower-frequency power "license" (license 2 in the worst case): that core's frequency dropped by ~10–25%, which also slowed any scalar code running on it. The penalty lasted hundreds of microseconds after the last AVX-512 instruction. On workloads that occasionally used AVX-512 between long stretches of scalar code (most of them), this was a net loss: the AVX-512 sped up its own work but slowed everything else down enough to cancel the gain.

The community's response was fierce. Cloudflare, Mozilla, and others published benchmarks showing that disabling AVX-512 made some workloads faster. Linus Torvalds wrote in 2020 that he hoped AVX-512 "dies a painful death". Subsequent generations fixed the issue: Ice Lake (2019) reduced the penalty to ~5%; Sapphire Rapids (2023) eliminated it almost entirely. AMD Zen 4 (2022) and Zen 5 (2024) implement AVX-512 without any meaningful frequency drop.

The general lesson: wider vectors mean more transistors switching per cycle, which means more dynamic power. Up to a point this is fine. Past that point, modern CPUs use voltage and frequency scaling to stay within their power budget. AVX-512 on Skylake-X was over the line; on modern silicon it's not.

Auto-vectorization vs intrinsics

Modern compilers (GCC, Clang, MSVC) auto-vectorize loops that look like this:

void add_arrays(int *a, int *b, int *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

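// Clang -O3 -march=skylake-avx512 -mprefer-vector-width=512 produces an inner
// loop roughly like this (without the width flag it prefers 256-bit ymm registers):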
//   .loop:
//       vmovdqu64  zmm0, [rdi + rcx*4]    ; load 16 ints from a
//       vpaddd     zmm0, zmm0, [rsi + rcx*4]  ; add 16 ints from b
//       vmovdqu64  [rdx + rcx*4], zmm0    ; store 16 ints to out
//       add        rcx, 16
//       cmp        rcx, r8
//       jl         .loop

This works when the loop is straightforward enough: a trip count known before the loop starts, no internal branches, no data dependencies between iterations, and no aliasing that the compiler cannot rule out or check at runtime. When the compiler can't or won't vectorize, you fall back to intrinsics: C functions that map almost 1:1 to SIMD instructions. They look like:

#include <immintrin.h>

__m512i va = _mm512_loadu_epi32(&a[i]);
__m512i vb = _mm512_loadu_epi32(&b[i]);
__m512i vsum = _mm512_add_epi32(va, vb);
_mm512_storeu_epi32(&out[i], vsum);

// Same generated assembly. Verbose but explicit.

Intrinsics give you exact control: you choose the alignment, the masking, and the exact load and store variants. The cost is portability: AVX-512 intrinsics don't run on Apple silicon (which uses NEON). Workarounds: write intrinsic versions for each ISA, or use a portable wrapper library like Highway (Google), xsimd, or std::experimental::simd (standardised as std::simd in C++26).
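The "one intrinsic version per ISA" workaround can be sketched with preprocessor dispatch. This is an illustrative layout, not how any particular library does it: `add_block`, `add_i32`, and `ADD_I32_LANES` are hypothetical names, and the selection relies on the predefined macros `__AVX2__`, `__SSE2__`, and `__ARM_NEON` that GCC and Clang set according to the target. Other compilers may fall through to the scalar branch.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#define ADD_I32_LANES 8
/* x86 with AVX2: eight 32-bit lanes per 256-bit register. */
static void add_block(const int32_t *a, const int32_t *b, int32_t *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(va, vb));
}
#elif defined(__SSE2__)
#include <emmintrin.h>
#define ADD_I32_LANES 4
/* Baseline x86-64: four 32-bit lanes per 128-bit register. */
static void add_block(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#define ADD_I32_LANES 4
/* ARM64: four 32-bit lanes per 128-bit NEON register. */
static void add_block(const int32_t *a, const int32_t *b, int32_t *out) {
    vst1q_s32(out, vaddq_s32(vld1q_s32(a), vld1q_s32(b)));
}
#else
#define ADD_I32_LANES 1
/* Unknown target: plain scalar fallback. */
static void add_block(const int32_t *a, const int32_t *b, int32_t *out) {
    out[0] = a[0] + b[0];
}
#endif

/* Same entry point on every ISA: full blocks first, then a scalar tail. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    size_t i = 0;
    for (; i + ADD_I32_LANES <= n; i += ADD_I32_LANES) {
        add_block(a + i, b + i, out + i);
    }
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Three copies of a two-line kernel is manageable. Three copies of a real inner loop is the maintenance cost that pushes people toward the wrapper libraries.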

A practical workflow that holds up: write the loop in plain scalar form first, get it correct, then check whether the compiler already vectorized it. Most compilers emit a vectorization report on request (-fopt-info-vec in GCC, -Rpass=loop-vectorize in Clang) that tells you which loops became SIMD and which were rejected, and why. Reach for intrinsics only on the loops that show up hot in a profile and that the compiler refused to vectorize. Hand-writing intrinsics everywhere is a common mistake: it is slow to write, hard to read, easy to get wrong, and usually no faster than letting the compiler handle the straightforward cases. Save the effort for the inner kernel that actually pays for it.
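Two loops that are worth feeding to the vectorization report, as a sketch with hypothetical names. The first removes the aliasing question with C99 `restrict`. The second carries a dependency from one iteration to the next, which compilers typically decline to vectorize as written.

```c
/* restrict promises that the three arrays do not overlap, so the compiler
   can vectorize without emitting a runtime overlap check. */
void add_arrays_restrict(const int *restrict a, const int *restrict b,
                         int *restrict out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Loop-carried dependency: iteration i reads what iteration i-1 wrote,
   so the iterations cannot simply run side by side in separate lanes. */
void prefix_sum(int *x, int n) {
    for (int i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

Compiling this with the report flags named above should list the first loop as vectorized and give a reason for rejecting the second. The exact wording varies by compiler and version.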

Mask registers — the AVX-512 quiet revolution

AVX-512 introduced eight mask registers (k0–k7), each holding up to 64 bits. Almost every AVX-512 instruction takes an optional mask: lanes whose mask bit is 0 are not computed, and either keep the destination's old value (merge-masking) or are set to zero (zero-masking). Mask registers turn data-dependent control flow into branchless vector code.

// Conditional add: out[i] = a[i] + b[i] if a[i] > 0 else a[i]
__m512i va = _mm512_loadu_epi32(&a[i]);
__m512i vb = _mm512_loadu_epi32(&b[i]);
__mmask16 m = _mm512_cmpgt_epi32_mask(va, _mm512_setzero_epi32());
// blend: where mask is 1, use va+vb; where 0, use va
__m512i vsum = _mm512_mask_add_epi32(va, m, va, vb);
_mm512_storeu_epi32(&out[i], vsum);

Without masks, the same operation would require either a branch (which kills vectorization) or a separate select-and-blend pattern. Masks fold the conditional into the instruction itself — no branch, no divergence cost, full SIMD throughput even when only some lanes do "work". This is the move that made GPU programming techniques portable to CPU SIMD.
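What the compare-into-mask and merge-masked add do per lane can be modelled in portable C. This is a scalar model for reading, not AVX-512 code: the helper names and `MASK_LANES` are hypothetical, and a real implementation would use the intrinsics shown earlier.

```c
#include <stdint.h>

#define MASK_LANES 16

/* Model of a compare-into-mask: bit i is set when a[i] > 0. */
uint16_t cmpgt_zero_mask(const int32_t a[MASK_LANES]) {
    uint16_t k = 0;
    for (int i = 0; i < MASK_LANES; i++) {
        if (a[i] > 0) {
            k |= (uint16_t)(1u << i);
        }
    }
    return k;
}

/* Model of a merge-masked add: lanes with mask bit 1 receive a[i] + b[i],
   lanes with mask bit 0 keep the value from src[i]. */
void mask_add(int32_t dst[MASK_LANES], const int32_t src[MASK_LANES],
              uint16_t k, const int32_t a[MASK_LANES],
              const int32_t b[MASK_LANES]) {
    for (int i = 0; i < MASK_LANES; i++) {
        dst[i] = ((k >> i) & 1u) ? a[i] + b[i] : src[i];
    }
}
```

The per-lane `?:` in the model is what the hardware resolves for all 16 lanes inside one instruction, which is why no branch appears in the vector version.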

When SIMD wins, loses, partly works

Pattern | SIMD fit | Notes
--------|----------|------
Vector add of two arrays | YES | Textbook fit. Auto-vectorized by every compiler since GCC 4.x.
Sum reduction (a + b + c + …) | YES | Use multiple accumulators or tree reduction to avoid the dependency chain.
Dot product | YES | Vectorized FMA into several accumulators, then one horizontal add at the end.
Linear interpolation, gamma correction | YES | Pure data-parallel; one of SIMD's best cases.
Pointer chasing (linked list traversal) | NO | Each step depends on the previous load, so there is no parallelism to exploit.
Hash table probe | PART | Possible with gather instructions, but each load can miss independently; limited by memory latency and bandwidth, not compute.
JSON parsing | PART | simdjson uses SIMD for byte-level scanning, but control flow is inherently scalar.
String comparison / substring search | YES | Wide byte compares produce a match mask; SSE4.2 also has dedicated string instructions. memcmp / strchr / strstr all win.
Sorting | PART | Bitonic sort and quicksort partitions vectorize well; merge phases are scalar.
Cryptographic hashing (SHA, BLAKE) | YES | SHA-NI accelerates SHA on modern x86 (AES-NI covers the cipher side); BLAKE maps well onto ordinary SIMD lanes.
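The "multiple accumulators" note for reductions is easiest to see in scalar form. This is a sketch with hypothetical function names: four independent running sums break the single dependency chain, and a vectorizer does the same thing with one accumulator per lane.

```c
#include <stddef.h>

/* One accumulator: every add waits for the previous one to finish. */
float sum_single(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Four accumulators: four independent chains the core can overlap. */
float sum_four_acc(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);  /* combine the partial sums once */
    for (; i < n; i++) {              /* leftover elements */
        s += x[i];
    }
    return s;
}
```

Floating-point addition is not associative, so the two functions can differ in the last bits on general input. That is also why compilers will not make this transformation on floats by themselves unless given permission, for example with `-ffast-math` on GCC or Clang.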

ARM SVE — variable-width vectors

ARM's Scalable Vector Extension (SVE) takes a different approach from AVX. Instead of a fixed register width, SVE registers can be any multiple of 128 bits from 128 to 2048; the actual width is fixed by the silicon and discovered at runtime. The same compiled binary runs unchanged on a 128-bit core, a 256-bit core, and a 512-bit HPC chip; the loop just iterates more or fewer times to cover the array.

The mechanism: SVE instructions take a predicate (similar to AVX-512 masks) that says which elements of the current iteration to actually process. The compiler emits a "vector-length-agnostic" loop that asks the hardware "how many elements fit in your registers?" and processes that many per iteration. Apple silicon as of M4 does not implement general-purpose SVE: its vector ISA is 128-bit NEON, with SVE-style instructions available only in the streaming mode of SME. Among shipping SVE hardware, widths range from 128 bits on recent Arm Neoverse cores to 512 bits on Fujitsu's A64FX.
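The vector-length-agnostic loop can be modelled in portable C. This is a scalar stand-in, not SVE code: the parameter `vl` is a hypothetical substitute for the lane count the hardware reports, and the inner loop plays the role of one predicated vector instruction. In real SVE code the equivalents are ACLE intrinsics from `arm_sve.h`, such as `svcntw` for the lane count and `svwhilelt_b32` for the predicate.

```c
#include <stddef.h>

/* Scalar model of a vector-length-agnostic add. `vl` (must be > 0) stands
   in for the number of 32-bit lanes the hardware reports at runtime. */
void add_vla_model(const int *a, const int *b, int *out, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        /* Predicate: how many lanes of this iteration are still in bounds. */
        size_t active = (n - i < vl) ? (n - i) : vl;
        /* One predicated vector operation, modelled lane by lane. */
        for (size_t lane = 0; lane < active; lane++) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

Calling it with `vl` set to 4, 8, or 16 gives the same output; only the number of outer iterations changes. There is also no separate tail loop, because the predicate handles the final partial vector.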

Alignment, and why it matters

A vector load pulls 16, 32, or 64 bytes from memory in one go. The hardware is happiest when that block starts on an address that is a multiple of the vector width — a 32-byte AVX2 load from an address divisible by 32, a 64-byte AVX-512 load from an address divisible by 64. This is called alignment, and it matters because an aligned load maps cleanly onto a single cache line access, while an unaligned load can straddle two cache lines and cost an extra access.

There are two families of vector load instruction for exactly this reason. The aligned form (vmovdqa, _mm256_load_si256) assumes the address is aligned and faults if it is not — historically it was also faster. The unaligned form (vmovdqu, _mm256_loadu_si256) handles any address. On modern Intel and AMD cores the speed gap between the two has nearly closed when the data happens to be aligned anyway, so most code now uses the unaligned form everywhere and simply tries to allocate aligned buffers. The penalty that remains shows up when an access actually crosses a cache-line or page boundary, which the unaligned form makes legal but not free.

Two practical habits follow. Allocate vector buffers with an aligned allocator (aligned_alloc, posix_memalign, or an over-aligned type) so the common case is cheap. And mind the tail: arrays rarely divide evenly by the lane count, so the last few elements that do not fill a full vector need either a scalar cleanup loop or a masked vector operation. The tail is a frequent source of off-by-one bugs and of disappointing speedups on short arrays, because the fixed setup cost of a vector loop is amortised over too few elements.
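Both habits fit in a few helpers. This is a sketch with hypothetical names, assuming a C11 library that provides `aligned_alloc` (not every toolchain does) and a 64-byte target alignment to match an AVX-512 register and a typical cache line.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round a byte count up to a multiple of `align` (a power of two). */
size_t round_up(size_t bytes, size_t align) {
    return (bytes + align - 1) & ~(align - 1);
}

/* Allocate n ints on a 64-byte boundary. C11 aligned_alloc expects the size
   to be a multiple of the alignment, hence the rounding. Release with free(). */
int *alloc_vector_buffer(size_t n) {
    return aligned_alloc(64, round_up(n * sizeof(int), 64));
}

/* True when p sits on an `align`-byte boundary. */
int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p % align) == 0;
}

/* Split n elements into full vectors plus a leftover tail. */
void split_tail(size_t n, size_t lanes, size_t *full_vectors, size_t *tail) {
    *full_vectors = n / lanes;
    *tail = n % lanes;
}

/* Mask with the low `tail` bits set, for a masked final vector (tail < 16). */
uint16_t tail_mask(size_t tail) {
    return (uint16_t)((1u << tail) - 1u);
}
```

For 100 ints on 16 lanes, `split_tail` reports 6 full vectors and a tail of 4, and `tail_mask(4)` is `0xF`: the last vector operation touches only its four lowest lanes.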

Rule of thumb: below roughly 16 to 32 elements, the setup and tail handling of a wide vector loop can cost more than it saves. Short, hot, fixed-size arrays sometimes run faster on a narrower SIMD width — or on plain scalar code that the branch predictor and out-of-order engine handle perfectly well.

SIMD, SIMT, and tensor units

SIMD has two close relatives that are easy to confuse with it. The first is SIMT — Single Instruction, Multiple Thread — which is how GPUs work. A GPU groups threads into a warp (32 on NVIDIA) or wavefront (32 or 64 on AMD), and the whole group executes the same instruction in lockstep, each thread on its own data. That is SIMD in spirit, but exposed to the programmer as independent threads rather than explicit vector registers. The compiler and hardware hide the lanes behind a thread abstraction, which is why GPU code reads like scalar code that happens to run thousands of times.

The differences are mostly about scale and latency. A CPU runs tens of SIMD lanes with single-digit-nanosecond instruction latency and a deep cache hierarchy to keep them fed. A GPU runs tens of thousands of SIMT lanes and hides memory latency by swapping between warps instead of caching aggressively. CPUs win on branchy, latency-sensitive, low-parallelism work; GPUs win when you have enough independent data to drown out memory latency. They are not competitors so much as different points on the same data-parallel curve. The roofline model is the tool that tells you which one a given kernel belongs on.

The second relative is the tensor unit: Intel AMX, NVIDIA tensor cores, Apple AMX, Google TPU MXUs. These do not process a vector with one instruction; they process a small matrix multiply with one instruction. A single AMX or tensor-core instruction multiplies, say, a 16×16 tile by another and accumulates, which folds thousands of multiply-adds (16 × 16 × 16 = 4,096 for square tiles) into one op. They exist because matrix multiply is the dominant operation in deep learning, and feeding it through ordinary SIMD lanes leaves too much throughput on the table. If you care about how these get fed in production, the inference and serving page covers how the work is batched and scheduled to keep the matrix units busy.

The progression is one of widening granularity. Scalar does one element per instruction. SIMD does a vector. SIMT does a vector but calls each lane a thread. Tensor units do a whole tile. Each step trades flexibility for throughput, and each step is the right tool only when your work has the matching shape: enough independent elements, enough independent rows, enough matrix structure to fill the wider unit.

Common misconceptions

  • "AVX-512 is always faster than AVX2." On Skylake-X, often slower in mixed code due to license-down. On modern Intel and AMD, faster on data-parallel work. On Apple silicon, irrelevant: Apple uses NEON, not AVX.
  • "Auto-vectorization is good enough." For straightforward loops, yes. For complex inner kernels — image processing, parser inner loops, math libraries — hand-written intrinsics still beat auto-vec by 20–60%, sometimes 10×.
  • "Wider is always better." No. On a 4-element array, a 16-lane AVX-512 operation leaves three quarters of its lanes empty and still pays for masking and setup, so it tends to lose to a narrower width. The break-even is around 16–32 elements; below that, narrower SIMD or scalar wins.
  • "SIMD only matters for HPC." Modern memcpy, memset, JSON parsers, video codecs, regex engines, hash functions, and database scan operators all rely heavily on SIMD. The "boring" library routines you call every day are vectorized internally.
  • "GPUs replaced SIMD." No, they're complementary. GPUs are SIMT (Single Instruction Multiple Thread), which is similar but optimized for thousands of lanes and high latency tolerance. CPUs use SIMD for tens of lanes with low latency. Different shapes for different workloads, and the roofline model tells you which one a kernel belongs on.

Numbers worth remembering

Quantity | Value | Notes
---------|-------|------
SSE register width | 128 bits | 4 × i32, 2 × i64, 4 × f32, 2 × f64
AVX / AVX2 register width | 256 bits | 8 × i32, 8 × f32, 4 × f64
AVX-512 register width | 512 bits | 16 × i32, 16 × f32, 8 × f64
ARM NEON register width | 128 bits | Standard on every ARM64 chip
Number of AVX-512 mask registers | 8 (k0..k7) | k0 used as a write mask means "no masking"; up to 64 bits each
Skylake-X AVX-512 license-down penalty | ~10–25% | Frequency drop; persisted ~600 µs after last AVX-512 op
Sapphire Rapids AVX-512 frequency drop | < 5% | The license-down problem largely resolved
Apple AMX matrix multiply throughput | ~2 TFLOPS / core | Outside the standard ISA, accessed via Accelerate framework
Intel AMX tile size (Sapphire Rapids+) | 16 rows × 64 bytes (1 KB) | Each core has 8 tile registers, 8 KB in total
simdjson parsing throughput | ~3 GB/s on AVX-512 | Roughly 4× a scalar parser

Further reading

  • Intel Intrinsics Guide — searchable reference for every x86 SIMD instruction with throughput / latency on each microarchitecture.
  • ARM Intrinsics — equivalent reference for NEON / SVE.
  • Agner Fog — Optimizing Assembly — Section 13 covers vector code in detail, including AVX-512 license-down measurements.
  • Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Chapter 4 (Data-Level Parallelism) covers SIMD as one of three flavours alongside vector machines and GPUs.
  • Google Highway — portable SIMD wrapper that compiles to AVX-512, NEON, SVE, and SSE from one source.
  • simdjson — the JSON parser that uses SIMD aggressively. The paper and source are excellent reading on practical SIMD application.
  • Wikipedia — AVX-512 — comprehensive list of every AVX-512 sub-extension and the chips that support each.
  • Chips and Cheese — measured SIMD throughput and license-down behaviour on every recent CPU.