NUMA and memory bandwidth
On a single-socket laptop, every byte of DRAM is the same distance from the CPU. On a multi-socket server, DRAM near one socket is one hop away; DRAM near the next socket is two hops. Same-socket: 80 nanoseconds. Adjacent socket: 140 nanoseconds. Far corner of an 8-socket box: 290 nanoseconds. Non-Uniform Memory Access is what we call this asymmetry, and it's the dominant performance constraint on every workload that scales beyond one socket.
Why memory access stopped being uniform
On 1990s-era servers, RAM hung off a shared front-side bus. Every CPU went through the same memory controller. Latency was uniform — slow for everyone. As core counts grew, the front-side bus became the bottleneck: every additional core added contention without adding bandwidth. The fix, around 2003 with AMD's Opteron and 2008 with Intel's Nehalem, was to put the memory controller on the CPU and give each socket its own private DRAM channels.
That solved the bandwidth problem. It created a new one: memory attached to one socket is faster than memory attached to another. The CPUs talk to each other over an inter-socket interconnect (Intel calls it UPI / QPI; AMD calls it Infinity Fabric); requests for remote memory traverse that interconnect, then the remote memory controller, then the remote DRAM, then come back. The trip is roughly 60 ns longer than a local access, about 80% extra latency.
The word "node" is the unit you'll see everywhere once you start measuring this. A NUMA node is one memory domain: a set of CPU cores plus the DRAM channels wired directly to those cores. On most servers a node maps one-to-one to a physical socket, but that's a convention, not a rule. Modern chips can split a single socket into several nodes, and pooled memory devices can show up as nodes with no cores at all. What stays constant is the meaning: inside a node, access is cheap; crossing a node boundary costs latency and shares a limited interconnect with everyone else doing the same thing.
The latency figure above is the quiet-system number, measured with one core pulling data and nothing else competing. The other half of the story is bandwidth. The interconnect that carries remote traffic is far narrower than the combined DRAM channels of all the nodes, so once many cores start pulling remote data at the same time, they queue for the same link. Latency under load climbs well past the idle figure, and throughput flattens. This is why a workload can look healthy in a micro-benchmark and fall apart in production: the micro-benchmark never saturated the interconnect.
Local and remote, side by side
Two NUMA nodes, two memory accesses. A core on node 0 reading its own DRAM takes the short path: core to local memory controller to local DRAM and back. A core on node 0 reading node 1's DRAM takes the long path: it hops across the interconnect to node 1's memory controller, waits for that controller to fetch from node 1's DRAM, then carries the data all the way back. Same instruction in your code, very different cost.
Notice what's not different: the core, the cache hierarchy, the instruction. A load is a load. The address decides the path, and the address was fixed when the page was allocated, usually long before the load runs. That's why NUMA tuning is mostly about placement, not about rewriting hot loops. You're not making the access faster; you're making sure the page sits in the node that reads it. The same principle that governs the memory hierarchy (keep data close to the unit that uses it) applies one level up: between sockets instead of between caches.
The latency matrix
Pick a topology. The cell at row i, column j shows the latency when a thread on socket i reads DRAM attached to socket j. Local-socket cells are dark; remote-socket cells lighten as the distance grows.
| CPU → MEM ↓ | S0 | S1 | S2 | S3 |
|---|---|---|---|---|
| S0 | 80 ns | 130 ns | 150 ns | 180 ns |
| S1 | 130 ns | 80 ns | 180 ns | 150 ns |
| S2 | 150 ns | 180 ns | 80 ns | 130 ns |
| S3 | 180 ns | 150 ns | 130 ns | 80 ns |
Bandwidth scales with sockets and channels
Each DDR5 channel sustains ~38 GB/s. Modern Intel Sapphire Rapids and AMD Genoa servers have 8–12 channels per socket. Total bandwidth scales linearly with channels and sockets — but only if the workload's memory accesses spread evenly across them.
Where bandwidth comes from
| Technology | Per-channel bandwidth |
|---|---|
| DDR3-1600 | 12.8 GB/s |
| DDR4-2400 | 19.2 GB/s |
| DDR4-3200 | 25.6 GB/s |
| DDR5-4800 | 38.4 GB/s |
| DDR5-6400 | 51.2 GB/s |
| DDR5-8000 | 64.0 GB/s |
| HBM3 (1 stack) | 819.0 GB/s |
DDR roughly doubles per-channel bandwidth every two generations. HBM (High Bandwidth Memory) on GPUs and accelerators uses 1024-bit interfaces with hundreds of pins, sacrificing capacity and cost for bandwidth. NVIDIA H100 has ~3 TB/s of HBM3 bandwidth — about 40× a typical CPU socket. This is the fundamental reason GPU memory-bound workloads outpace CPU ones.
First-touch policy
Linux's default policy: when a thread first touches a page (reads or writes), the OS allocates physical memory on whichever NUMA node that thread is running on. This sounds reasonable. It causes one of the most common production bugs.
The policy makes sense in isolation. A thread that allocates a buffer is usually the thread that fills it and reads it, so putting the pages on that thread's node is the right default most of the time. The trouble starts when allocation and use are split across threads on different nodes, which is exactly how most server software is built: one thread (or a startup routine) carves out a big region, and a pool of worker threads spread across every node do the actual work on it.
// Bug: allocator thread is wrong
void *buffer = malloc(1024 * 1024 * 1024); // alloc on socket 0
memset(buffer, 0, 1024 * 1024 * 1024); // fault in on socket 0
// Pass buffer to a worker on socket 1
launch_worker_on_socket(1, buffer);
// Worker now reads at remote-socket latency. ~70% throughput loss.
// Fix: have the worker do its own first-touch
void *fix_first_touch(void *buffer, size_t n, int node) {
// Bind to target node, then memset.
numa_run_on_node(node);
memset(buffer, 0, n);
}The fix is to have the consuming thread do the first-touch. Or use
numa_alloc_onnode() from libnuma, which forces allocation on a
specific node regardless of who calls it. Or use a NUMA-aware allocator like
jemalloc with per-arena binding.
Linux NUMA balancing
The Linux kernel has automatic NUMA balancing
(/proc/sys/kernel/numa_balancing). It periodically samples memory
access patterns; if a thread is consistently reading from a remote node, the
kernel migrates the pages closer or moves the thread. This works for steady-state
workloads but causes problems with bursty ones — the migration itself is
expensive (TLB shootdown + page copy), and the heuristics can flip-flop.
Production guidance: explicit binding always beats automatic balancing for predictable workloads. The tools:
numactl --cpunodebind=N --membind=N— pin a process to node N for both CPU and memory.taskset— pin to specific CPUs (lets the OS choose memory).numastat -p <pid>— see how a process's memory is distributed across nodes.perf c2c— find cache lines that bounce between sockets (the worst kind of NUMA pathology).cgroups v2 memory.numa_stat— per-cgroup NUMA usage, useful in containerized environments.
Affinity, pinning, and why a migrating thread tanks
Affinity is the rule that ties a thread to a set of cores. Pinning is affinity taken to its strict form: this thread runs on this core (or this node) and nowhere else. Both exist because the kernel scheduler, left alone, treats cores as interchangeable. It will happily move a thread from a core on node 0 to a core on node 1 to balance load, fill an idle core, or react to a sleeping neighbour. That decision is good for fairness and bad for NUMA, because the thread's memory does not move with it. The pages it allocated under first-touch stay on node 0; the thread now runs on node 1; every access it makes is suddenly remote.
This is the single most common way a service quietly loses a third of its throughput. Nothing crashes, no error is logged, and the code is unchanged. A scheduler decision made microseconds ago turned a stream of local reads into a stream of remote ones. Worse, the caches the thread had warmed on its old core are cold on the new one, so it pays cache-miss penalties on top of the NUMA penalty. The scheduler may then notice the thread is now reading remote memory and move it back, or move its pages, and the flip-flop costs more than either placement would have. The interaction between the scheduler and NUMA is subtle enough that it's worth understanding how scheduling makes these decisions before you fight it.
The cure is to stop the migration. Two tools cover most cases:
taskset -c 0-31 ./serverpins a process to a fixed set of CPUs. The scheduler can move the thread within that set but never off it. This fixes migration but says nothing about where memory lands.numactl --cpunodebind=0 --membind=0 ./serverpins both: the threads run only on node 0's cores, and allocations come only from node 0's DRAM. This is the safe default for a single-node service. Both halves matter, because pinning CPUs without binding memory still lets the allocator scatter pages, and binding memory without pinning CPUs still lets the scheduler move the thread away from its own pages.
For services that legitimately need more than one node, the pattern is to shard:
run one instance per node, each pinned to its node with numactl, and put
a load balancer or connection pooler in front. Each instance then behaves like a
small, fully-local machine, and the only cross-node traffic is whatever the front
end forces. That's almost always faster than one process spread across all nodes and
hoping the balancer in the kernel does the right thing.
What this does to databases and high-throughput services
Databases feel NUMA harder than almost anything else, because their whole job is to
keep a large working set in memory and touch it from many threads at once. A
Postgres or MySQL buffer pool sized at tens of gigabytes, allocated at startup on one
thread, gets placed on one node by first-touch. Every query thread on the other nodes
then reads that pool remotely. The fix that the Postgres and MySQL tuning guides both
reach for is interleaving: numactl --interleave=all spreads the pages
round-robin across all nodes, so no single thread is fully local but no thread is
fully remote either. Interleaving trades best-case latency for predictable average
latency, which is usually the right trade for a shared buffer pool that everyone
touches.
In-memory stores and caches hit the same wall once they go multi-threaded. A sharded
design helps here too: partition the keyspace by node, pin each shard's threads and
memory together, and route a request to the shard that owns its key. The request
stays local end to end. The pathological opposite is a single hot data structure (a
global counter, a shared free list, one big lock) whose cache line bounces between
nodes on every update. Cache coherence has to ship that line over the interconnect
each time, and the line becomes a serial bottleneck that no amount of cores can
parallelise. perf c2c exists specifically to find these bouncing lines.
Throughput services that aren't databases still pay the tax through the network and storage path. NIC and NVMe interrupts land on whatever cores the kernel assigned them, usually node 0. A worker pinned to node 1 then pays a cross-node hop on every packet or block it handles, even though its own data is local. The lesson is to think about the whole path: pin the user threads to the same node as the device they talk to, or spread the interrupts so each node services its own traffic. NUMA placement is a property of the entire request path, not just the application's heap.
Seeing it: the observation toolkit
You can't tune what you can't measure, and NUMA problems are invisible until you look at the right counter. Start by learning the topology, then watch where memory actually lives, then catch the cross-node traffic in the act.
numactl --hardwareprints the topology: how many nodes, which cores belong to each, how much memory each has free, and the node-to-node distance matrix. Run this first on every new machine or VM size, because the answer is not always what the spec sheet implies.numastat -p <pid>shows how one process's pages are split across nodes. A healthy single-node service shows almost everything on its own node; a botched first-touch shows the heap on the wrong one.perf c2c record / reportfinds cache lines that bounce between nodes (the worst NUMA pathology), and points at the exact source line and the threads fighting over it.perf stat -e node-loads,node-load-missesmeasures how often loads miss the local node and go remote, giving you a single ratio to track before and after a change.cat /sys/devices/system/node/node*/meminfoand cgroup v2'smemory.numa_statgive per-node and per-container memory breakdowns, useful when the workload runs inside a container andnumastatcan't see the whole picture.
Tie these into the wider method you'd use for any resource bottleneck. The USE method (utilisation, saturation, errors, applied to every resource) maps cleanly onto NUMA: utilisation is per-node bandwidth, saturation is the interconnect queueing under load, and the error signal is the remote-access ratio climbing where it shouldn't. Treat the interconnect as a resource with its own limit and the diagnosis stops being guesswork.
Common production bugs
| Bug | Cause | Fix |
|---|---|---|
| First-touch miss | Memory allocated on the wrong NUMA node because the allocator thread isn't the eventual reader. | Initialize memory in the threads that will use it; use numactl --cpunodebind + --membind in deployment. |
| Process migration | The kernel scheduler moves a process to another socket; its memory stays on the original socket. | Pin processes with taskset / numactl. On Linux, set /proc/sys/kernel/numa_balancing thoughtfully. |
| NUMA imbalance after exec | A long-running daemon allocated memory on socket 0 at startup; later threads spawn on socket 1. | Use NUMA-aware allocators (jemalloc, mimalloc); pre-fault memory at the right time. |
| Kernel cache hot-spot | The kernel's page cache is bound to one socket's memory; cross-socket I/O thrashes the interconnect. | Distribute the page cache across nodes; on Linux, use cgroup v2 memory.numa_stat to monitor. |
| Lock-line contention across sockets | A spin-lock cache line bounces between sockets; coherence over UPI / Infinity Fabric is slow. | Use per-CPU data structures; reduce shared mutable state; consider RCU or sharded locks. |
Sub-NUMA Clustering
Modern Intel chips (Skylake-SP and later) and AMD Zen 4 / Zen 5 EPYCs offer a sub-NUMA mode. On a physical socket with 2–4 memory controllers, the chip advertises itself as multiple NUMA nodes — even though it's one socket. Each sub-node has its own local memory channels and lower latency to threads in the same sub-node. The cost: applications that aren't NUMA-aware see twice as many nodes and may make worse placement decisions.
Intel calls this Sub-NUMA Clustering (SNC) or Cluster-on-Die (COD). AMD calls it NPS (Nodes Per Socket). Default is usually NPS1 (one NUMA node per socket); HPC and database workloads often switch to NPS4 to expose the internal mesh structure. A single Genoa EPYC at NPS4 looks like 4 NUMA nodes to the OS.
CXL — the future of pooled memory
Compute Express Link (CXL) is a PCIe-based protocol for cache-coherent, low- latency memory expansion. Released in production with Intel Sapphire Rapids and AMD Genoa (2023). CXL.mem lets a server attach pooled DRAM beyond what fits in DIMM slots — sometimes shared across multiple servers in a rack. From the OS's perspective, CXL memory looks like a high-latency NUMA node (~250 ns vs ~80 ns local).
Use cases: tiered memory hierarchies (hot data in DDR5, warm in CXL pool, cold in NVMe); memory disaggregation in cloud providers; rapid fail-over for in-memory databases. The lessons from NUMA carry over: explicit placement beats automatic, and careless allocation patterns cost real throughput.
Common misconceptions
- "NUMA only matters for HPC." Any workload that scales past one socket cares: databases (Postgres, MySQL, Cassandra all have NUMA tuning guides), JVM applications, in-memory caches like Redis when run multi-threaded, anything bound to total memory bandwidth.
- "NUMA balancing fixes everything automatically." Linux NUMA balancing helps but can't solve cross-socket cache-line contention or first-touch on the wrong node. Explicit binding still wins on predictable workloads.
- "Single-socket boxes are immune." Modern AMD chips are internally multi-die: a single Genoa EPYC has 12 chiplets connected via Infinity Fabric. Cross-CCD latency on Zen 4 is ~70 ns — close enough to "remote" that careless cross-CCD sharing costs measurable throughput.
- "Big DIMM modules are always better." Capacity per channel and rank can affect timings and effective bandwidth. Two ranks per channel is often faster than one big rank because the controller can interleave.
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| Local-socket DRAM latency | ~80 ns | DDR5 |
| Adjacent-socket latency (2-socket) | ~130–150 ns | Cross UPI / Infinity Fabric |
| Far-corner latency (8-socket) | ~290 ns | Up to 7 hops |
| DDR5 channel bandwidth | ~38 GB/s | DDR5-4800 |
| Channels per socket, modern server | 8–12 | Sapphire Rapids / Genoa / Zen 5 |
| Per-socket aggregate bandwidth | ~300–460 GB/s | 8–12 ch × 38 GB/s |
| UPI / Infinity Fabric per direction | ~80 GB/s | Cross-socket bottleneck |
| HBM3 stack bandwidth | ~819 GB/s | One stack on H100/Mi300X |
| CXL.mem latency | ~250 ns | Pooled DRAM via PCIe Gen5 |
| Linux NUMA balancing default | enabled | /proc/sys/kernel/numa_balancing = 1 |
Common mistakes
NUMA is one of those topics where the textbook explanation is straightforward and the production reality is full of corners. The five mistakes below are the ones that recur across teams that don't have a kernel engineer on staff — most are silent, all are diagnosable with the standard tools once you know to look.
- Trusting the default allocator.
- Linux's first-touch policy works only if the thread that first touches a page is the thread that will keep using it. Database workers that allocate buffer-pool pages in a startup loop on the main thread, then read them from per-core workers, are guaranteed cross-node traffic on a multi-socket box. The fix is to allocate on the same thread that consumes — or use
numactl --interleave=allto spread pages deliberately. - Treating cross-node bandwidth as free.
- UPI on Intel and Infinity Fabric on AMD sustain roughly 30–50 GB/s per link — close to local DRAM bandwidth, which is why teams assume parity. The asymmetry shows up at contended bandwidth: when both sockets are pulling traffic, the inter-socket link becomes the bottleneck and tail latency jumps 2–3× without obvious cause.
- Pinning to a CPU but not to a memory node.
tasksetpins the thread; it does not pin allocations. A thread pinned to CPU 0 will happily allocate pages on node 1 if the scheduler last ran an allocation there. Usenumactl --cpunodebind=0 --membind=0to lock both, or callset_mempolicyfrom C/Go.- Ignoring NUMA on cloud VMs.
- "It's a single VM, there's no NUMA." Wrong above ~16 vCPUs on AWS and GCP — large instances are pinned to multi-socket bare metal and the guest sees the topology via
/sys/devices/system/node/. m6i.32xlarge, c6i.32xlarge, n2-highmem-128 are all multi-node. Check withnumactl --hardwareon every new VM size you adopt. - Database tuning that pretends NUMA doesn't exist.
- Postgres on a 2-socket box with
shared_buffersat 64 GB will scatter the buffer pool across both nodes via first-touch, then every query has a ~50% chance of cross-node access. Eithernumactl --interleave=allthe whole instance, or run two Postgres processes (one per node) behind a connection pooler. - Forgetting interrupts.
- The NIC and NVMe IRQs land on whichever cores the kernel assigned them — usually node 0. A networking-heavy workload pinned to node 1 pays the cross-node tax on every packet. Use
set_irq_affinityfrom the kernel scripts to spread interrupts across nodes, or pin user threads to the same node as the device.
Further reading
- Wikipedia — Non-uniform memory access — comprehensive coverage and history.
- man numactl — the canonical tool for NUMA pinning.
- man numastat — per-process NUMA usage.
- Drepper — What Every Programmer Should Know About Memory — Section 5 covers NUMA in depth.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Chapter 5 (Multiprocessors) covers NUMA architectures.
- CXL Consortium — official specs and adoption status.
- Chips and Cheese — measured cross-socket and cross-CCD latencies on every recent chip.
- Postgres — NUMA tuning guide — practical example of database NUMA tuning.