11 / 15
Internals / 11

NUMA and memory bandwidth

On a single-socket laptop, every byte of DRAM is the same distance from the CPU. On a multi-socket server, DRAM near one socket is one hop away; DRAM near the next socket is two hops. Same-socket: 80 nanoseconds. Adjacent socket: 140 nanoseconds. Far corner of an 8-socket box: 290 nanoseconds. Non-Uniform Memory Access is what we call this asymmetry, and it's the dominant performance constraint on every workload that scales beyond one socket.


Why memory access stopped being uniform

On 1990s-era servers, RAM hung off a shared front-side bus. Every CPU went through the same memory controller. Latency was uniform — slow for everyone. As core counts grew, the front-side bus became the bottleneck: every additional core added contention without adding bandwidth. The fix, around 2003 with AMD's Opteron and 2008 with Intel's Nehalem, was to put the memory controller on the CPU and give each socket its own private DRAM channels.

That solved the bandwidth problem. It created a new one: memory attached to one socket is faster than memory attached to another. The CPUs talk to each other over an inter-socket interconnect (Intel calls it UPI / QPI; AMD calls it Infinity Fabric); requests for remote memory traverse that interconnect, then the remote memory controller, then the remote DRAM, then come back. The trip is roughly 60 ns longer than a local access, about 80% extra latency.

The word "node" is the unit you'll see everywhere once you start measuring this. A NUMA node is one memory domain: a set of CPU cores plus the DRAM channels wired directly to those cores. On most servers a node maps one-to-one to a physical socket, but that's a convention, not a rule. Modern chips can split a single socket into several nodes, and pooled memory devices can show up as nodes with no cores at all. What stays constant is the meaning: inside a node, access is cheap; crossing a node boundary costs latency and shares a limited interconnect with everyone else doing the same thing.

The latency figure above is the quiet-system number, measured with one core pulling data and nothing else competing. The other half of the story is bandwidth. The interconnect that carries remote traffic is far narrower than the combined DRAM channels of all the nodes, so once many cores start pulling remote data at the same time, they queue for the same link. Latency under load climbs well past the idle figure, and throughput flattens. This is why a workload can look healthy in a micro-benchmark and fall apart in production: the micro-benchmark never saturated the interconnect.

Local and remote, side by side

Two NUMA nodes, two memory accesses. A core on node 0 reading its own DRAM takes the short path: core to local memory controller to local DRAM and back. A core on node 0 reading node 1's DRAM takes the long path: it hops across the interconnect to node 1's memory controller, waits for that controller to fetch from node 1's DRAM, then carries the data all the way back. Same instruction in your code, very different cost.

NUMA NODE 0NUMA NODE 1coresmem ctrlDRAMcoresmem ctrlDRAMUPI / IFlocal · ~80 nsremote · ~140 ns (+60 ns over the interconnect)the only difference is which DRAM the address lands in
A local read stays inside node 0 (green). A remote read crosses the interconnect to node 1's controller and DRAM, then returns (red). The extra hop is the whole story.

Notice what's not different: the core, the cache hierarchy, the instruction. A load is a load. The address decides the path, and the address was fixed when the page was allocated, usually long before the load runs. That's why NUMA tuning is mostly about placement, not about rewriting hot loops. You're not making the access faster; you're making sure the page sits in the node that reads it. The same principle that governs the memory hierarchy (keep data close to the unit that uses it) applies one level up: between sockets instead of between caches.

The latency matrix

Pick a topology. The cell at row i, column j shows the latency when a thread on socket i reads DRAM attached to socket j. Local-socket cells are dark; remote-socket cells lighten as the distance grows.

4 sockets (mesh)
CPU →
MEM ↓
S0S1S2S3
S0 80 ns130 ns150 ns180 ns
S1 130 ns80 ns180 ns150 ns
S2 150 ns180 ns80 ns130 ns
S3 180 ns150 ns130 ns80 ns
local · < 100 ns adjacent · 100–160 ns 2-hop · 160–220 ns far corner · > 220 ns

Bandwidth scales with sockets and channels

Each DDR5 channel sustains ~38 GB/s. Modern Intel Sapphire Rapids and AMD Genoa servers have 8–12 channels per socket. Total bandwidth scales linearly with channels and sockets — but only if the workload's memory accesses spread evenly across them.

8
2
aggregate bandwidth
608GB/s
8 ch × 2 sockets × 38 GB/s
per-core bandwidth
9.5GB/s
at ~32 cores per socket
A typical 2-socket Sapphire Rapids server with 8 channels gives 608 GB/s of aggregate bandwidth — but only if every core hits its local socket. If half the traffic crosses sockets, you bottleneck on UPI (~80 GB/s per direction) and effective bandwidth halves. This is why NUMA-aware code matters past one socket.

Where bandwidth comes from

TechnologyPer-channel bandwidth
DDR3-160012.8 GB/s
DDR4-240019.2 GB/s
DDR4-320025.6 GB/s
DDR5-480038.4 GB/s
DDR5-640051.2 GB/s
DDR5-800064.0 GB/s
HBM3 (1 stack)819.0 GB/s

DDR roughly doubles per-channel bandwidth every two generations. HBM (High Bandwidth Memory) on GPUs and accelerators uses 1024-bit interfaces with hundreds of pins, sacrificing capacity and cost for bandwidth. NVIDIA H100 has ~3 TB/s of HBM3 bandwidth — about 40× a typical CPU socket. This is the fundamental reason GPU memory-bound workloads outpace CPU ones.

First-touch policy

Linux's default policy: when a thread first touches a page (reads or writes), the OS allocates physical memory on whichever NUMA node that thread is running on. This sounds reasonable. It causes one of the most common production bugs.

The policy makes sense in isolation. A thread that allocates a buffer is usually the thread that fills it and reads it, so putting the pages on that thread's node is the right default most of the time. The trouble starts when allocation and use are split across threads on different nodes, which is exactly how most server software is built: one thread (or a startup routine) carves out a big region, and a pool of worker threads spread across every node do the actual work on it.

1. malloc() reserves virtual addresses — no physical page yetvirtual region 0x7f…000 → 0x7f…fff (unmapped)2. first write (the "touch") faults the page in on the writing thread's nodeNODE 0init threadpage lands herememset → first touchNODE 1worker thread3. worker reads the page…every access is remotewhoever touches first wins — the allocator doesn't decide, the writer does
First-touch: malloc only reserves addresses. The physical page is placed on the node of whichever thread writes to it first. Allocate on node 0, consume on node 1, and every read pays the remote tax for the life of the buffer.
// Bug: allocator thread is wrong
void *buffer = malloc(1024 * 1024 * 1024);  // alloc on socket 0
memset(buffer, 0, 1024 * 1024 * 1024);      // fault in on socket 0

// Pass buffer to a worker on socket 1
launch_worker_on_socket(1, buffer);

// Worker now reads at remote-socket latency. ~70% throughput loss.

// Fix: have the worker do its own first-touch
void *fix_first_touch(void *buffer, size_t n, int node) {
    // Bind to target node, then memset.
    numa_run_on_node(node);
    memset(buffer, 0, n);
}

The fix is to have the consuming thread do the first-touch. Or use numa_alloc_onnode() from libnuma, which forces allocation on a specific node regardless of who calls it. Or use a NUMA-aware allocator like jemalloc with per-arena binding.

Linux NUMA balancing

The Linux kernel has automatic NUMA balancing (/proc/sys/kernel/numa_balancing). It periodically samples memory access patterns; if a thread is consistently reading from a remote node, the kernel migrates the pages closer or moves the thread. This works for steady-state workloads but causes problems with bursty ones — the migration itself is expensive (TLB shootdown + page copy), and the heuristics can flip-flop.

Production guidance: explicit binding always beats automatic balancing for predictable workloads. The tools:

  • numactl --cpunodebind=N --membind=N — pin a process to node N for both CPU and memory.
  • taskset — pin to specific CPUs (lets the OS choose memory).
  • numastat -p <pid> — see how a process's memory is distributed across nodes.
  • perf c2c — find cache lines that bounce between sockets (the worst kind of NUMA pathology).
  • cgroups v2 memory.numa_stat — per-cgroup NUMA usage, useful in containerized environments.

Affinity, pinning, and why a migrating thread tanks

Affinity is the rule that ties a thread to a set of cores. Pinning is affinity taken to its strict form: this thread runs on this core (or this node) and nowhere else. Both exist because the kernel scheduler, left alone, treats cores as interchangeable. It will happily move a thread from a core on node 0 to a core on node 1 to balance load, fill an idle core, or react to a sleeping neighbour. That decision is good for fairness and bad for NUMA, because the thread's memory does not move with it. The pages it allocated under first-touch stay on node 0; the thread now runs on node 1; every access it makes is suddenly remote.

This is the single most common way a service quietly loses a third of its throughput. Nothing crashes, no error is logged, and the code is unchanged. A scheduler decision made microseconds ago turned a stream of local reads into a stream of remote ones. Worse, the caches the thread had warmed on its old core are cold on the new one, so it pays cache-miss penalties on top of the NUMA penalty. The scheduler may then notice the thread is now reading remote memory and move it back, or move its pages, and the flip-flop costs more than either placement would have. The interaction between the scheduler and NUMA is subtle enough that it's worth understanding how scheduling makes these decisions before you fight it.

The cure is to stop the migration. Two tools cover most cases:

  • taskset -c 0-31 ./server pins a process to a fixed set of CPUs. The scheduler can move the thread within that set but never off it. This fixes migration but says nothing about where memory lands.
  • numactl --cpunodebind=0 --membind=0 ./server pins both: the threads run only on node 0's cores, and allocations come only from node 0's DRAM. This is the safe default for a single-node service. Both halves matter, because pinning CPUs without binding memory still lets the allocator scatter pages, and binding memory without pinning CPUs still lets the scheduler move the thread away from its own pages.

For services that legitimately need more than one node, the pattern is to shard: run one instance per node, each pinned to its node with numactl, and put a load balancer or connection pooler in front. Each instance then behaves like a small, fully-local machine, and the only cross-node traffic is whatever the front end forces. That's almost always faster than one process spread across all nodes and hoping the balancer in the kernel does the right thing.

What this does to databases and high-throughput services

Databases feel NUMA harder than almost anything else, because their whole job is to keep a large working set in memory and touch it from many threads at once. A Postgres or MySQL buffer pool sized at tens of gigabytes, allocated at startup on one thread, gets placed on one node by first-touch. Every query thread on the other nodes then reads that pool remotely. The fix that the Postgres and MySQL tuning guides both reach for is interleaving: numactl --interleave=all spreads the pages round-robin across all nodes, so no single thread is fully local but no thread is fully remote either. Interleaving trades best-case latency for predictable average latency, which is usually the right trade for a shared buffer pool that everyone touches.

In-memory stores and caches hit the same wall once they go multi-threaded. A sharded design helps here too: partition the keyspace by node, pin each shard's threads and memory together, and route a request to the shard that owns its key. The request stays local end to end. The pathological opposite is a single hot data structure (a global counter, a shared free list, one big lock) whose cache line bounces between nodes on every update. Cache coherence has to ship that line over the interconnect each time, and the line becomes a serial bottleneck that no amount of cores can parallelise. perf c2c exists specifically to find these bouncing lines.

Throughput services that aren't databases still pay the tax through the network and storage path. NIC and NVMe interrupts land on whatever cores the kernel assigned them, usually node 0. A worker pinned to node 1 then pays a cross-node hop on every packet or block it handles, even though its own data is local. The lesson is to think about the whole path: pin the user threads to the same node as the device they talk to, or spread the interrupts so each node services its own traffic. NUMA placement is a property of the entire request path, not just the application's heap.

Seeing it: the observation toolkit

You can't tune what you can't measure, and NUMA problems are invisible until you look at the right counter. Start by learning the topology, then watch where memory actually lives, then catch the cross-node traffic in the act.

  • numactl --hardware prints the topology: how many nodes, which cores belong to each, how much memory each has free, and the node-to-node distance matrix. Run this first on every new machine or VM size, because the answer is not always what the spec sheet implies.
  • numastat -p <pid> shows how one process's pages are split across nodes. A healthy single-node service shows almost everything on its own node; a botched first-touch shows the heap on the wrong one.
  • perf c2c record / report finds cache lines that bounce between nodes (the worst NUMA pathology), and points at the exact source line and the threads fighting over it.
  • perf stat -e node-loads,node-load-misses measures how often loads miss the local node and go remote, giving you a single ratio to track before and after a change.
  • cat /sys/devices/system/node/node*/meminfo and cgroup v2's memory.numa_stat give per-node and per-container memory breakdowns, useful when the workload runs inside a container and numastat can't see the whole picture.

Tie these into the wider method you'd use for any resource bottleneck. The USE method (utilisation, saturation, errors, applied to every resource) maps cleanly onto NUMA: utilisation is per-node bandwidth, saturation is the interconnect queueing under load, and the error signal is the remote-access ratio climbing where it shouldn't. Treat the interconnect as a resource with its own limit and the diagnosis stops being guesswork.

Common production bugs

BugCauseFix
First-touch missMemory allocated on the wrong NUMA node because the allocator thread isn't the eventual reader.Initialize memory in the threads that will use it; use numactl --cpunodebind + --membind in deployment.
Process migrationThe kernel scheduler moves a process to another socket; its memory stays on the original socket.Pin processes with taskset / numactl. On Linux, set /proc/sys/kernel/numa_balancing thoughtfully.
NUMA imbalance after execA long-running daemon allocated memory on socket 0 at startup; later threads spawn on socket 1.Use NUMA-aware allocators (jemalloc, mimalloc); pre-fault memory at the right time.
Kernel cache hot-spotThe kernel's page cache is bound to one socket's memory; cross-socket I/O thrashes the interconnect.Distribute the page cache across nodes; on Linux, use cgroup v2 memory.numa_stat to monitor.
Lock-line contention across socketsA spin-lock cache line bounces between sockets; coherence over UPI / Infinity Fabric is slow.Use per-CPU data structures; reduce shared mutable state; consider RCU or sharded locks.

Sub-NUMA Clustering

Modern Intel chips (Skylake-SP and later) and AMD Zen 4 / Zen 5 EPYCs offer a sub-NUMA mode. On a physical socket with 2–4 memory controllers, the chip advertises itself as multiple NUMA nodes — even though it's one socket. Each sub-node has its own local memory channels and lower latency to threads in the same sub-node. The cost: applications that aren't NUMA-aware see twice as many nodes and may make worse placement decisions.

Intel calls this Sub-NUMA Clustering (SNC) or Cluster-on-Die (COD). AMD calls it NPS (Nodes Per Socket). Default is usually NPS1 (one NUMA node per socket); HPC and database workloads often switch to NPS4 to expose the internal mesh structure. A single Genoa EPYC at NPS4 looks like 4 NUMA nodes to the OS.

CXL — the future of pooled memory

Compute Express Link (CXL) is a PCIe-based protocol for cache-coherent, low- latency memory expansion. Released in production with Intel Sapphire Rapids and AMD Genoa (2023). CXL.mem lets a server attach pooled DRAM beyond what fits in DIMM slots — sometimes shared across multiple servers in a rack. From the OS's perspective, CXL memory looks like a high-latency NUMA node (~250 ns vs ~80 ns local).

Use cases: tiered memory hierarchies (hot data in DDR5, warm in CXL pool, cold in NVMe); memory disaggregation in cloud providers; rapid fail-over for in-memory databases. The lessons from NUMA carry over: explicit placement beats automatic, and careless allocation patterns cost real throughput.

Common misconceptions

  • "NUMA only matters for HPC." Any workload that scales past one socket cares: databases (Postgres, MySQL, Cassandra all have NUMA tuning guides), JVM applications, in-memory caches like Redis when run multi-threaded, anything bound to total memory bandwidth.
  • "NUMA balancing fixes everything automatically." Linux NUMA balancing helps but can't solve cross-socket cache-line contention or first-touch on the wrong node. Explicit binding still wins on predictable workloads.
  • "Single-socket boxes are immune." Modern AMD chips are internally multi-die: a single Genoa EPYC has 12 chiplets connected via Infinity Fabric. Cross-CCD latency on Zen 4 is ~70 ns — close enough to "remote" that careless cross-CCD sharing costs measurable throughput.
  • "Big DIMM modules are always better." Capacity per channel and rank can affect timings and effective bandwidth. Two ranks per channel is often faster than one big rank because the controller can interleave.

Numbers worth remembering

QuantityValueNotes
Local-socket DRAM latency~80 nsDDR5
Adjacent-socket latency (2-socket)~130–150 nsCross UPI / Infinity Fabric
Far-corner latency (8-socket)~290 nsUp to 7 hops
DDR5 channel bandwidth~38 GB/sDDR5-4800
Channels per socket, modern server8–12Sapphire Rapids / Genoa / Zen 5
Per-socket aggregate bandwidth~300–460 GB/s8–12 ch × 38 GB/s
UPI / Infinity Fabric per direction~80 GB/sCross-socket bottleneck
HBM3 stack bandwidth~819 GB/sOne stack on H100/Mi300X
CXL.mem latency~250 nsPooled DRAM via PCIe Gen5
Linux NUMA balancing defaultenabled/proc/sys/kernel/numa_balancing = 1

Common mistakes

NUMA is one of those topics where the textbook explanation is straightforward and the production reality is full of corners. The five mistakes below are the ones that recur across teams that don't have a kernel engineer on staff — most are silent, all are diagnosable with the standard tools once you know to look.

Trusting the default allocator.
Linux's first-touch policy works only if the thread that first touches a page is the thread that will keep using it. Database workers that allocate buffer-pool pages in a startup loop on the main thread, then read them from per-core workers, are guaranteed cross-node traffic on a multi-socket box. The fix is to allocate on the same thread that consumes — or use numactl --interleave=all to spread pages deliberately.
Treating cross-node bandwidth as free.
UPI on Intel and Infinity Fabric on AMD sustain roughly 30–50 GB/s per link — close to local DRAM bandwidth, which is why teams assume parity. The asymmetry shows up at contended bandwidth: when both sockets are pulling traffic, the inter-socket link becomes the bottleneck and tail latency jumps 2–3× without obvious cause.
Pinning to a CPU but not to a memory node.
taskset pins the thread; it does not pin allocations. A thread pinned to CPU 0 will happily allocate pages on node 1 if the scheduler last ran an allocation there. Use numactl --cpunodebind=0 --membind=0 to lock both, or call set_mempolicy from C/Go.
Ignoring NUMA on cloud VMs.
"It's a single VM, there's no NUMA." Wrong above ~16 vCPUs on AWS and GCP — large instances are pinned to multi-socket bare metal and the guest sees the topology via /sys/devices/system/node/. m6i.32xlarge, c6i.32xlarge, n2-highmem-128 are all multi-node. Check with numactl --hardware on every new VM size you adopt.
Database tuning that pretends NUMA doesn't exist.
Postgres on a 2-socket box with shared_buffers at 64 GB will scatter the buffer pool across both nodes via first-touch, then every query has a ~50% chance of cross-node access. Either numactl --interleave=all the whole instance, or run two Postgres processes (one per node) behind a connection pooler.
Forgetting interrupts.
The NIC and NVMe IRQs land on whichever cores the kernel assigned them — usually node 0. A networking-heavy workload pinned to node 1 pays the cross-node tax on every packet. Use set_irq_affinity from the kernel scripts to spread interrupts across nodes, or pin user threads to the same node as the device.
The non-obvious rule. NUMA penalty is not a constant — it grows non-linearly with contention. A workload that's 5% slower in benchmarks at low utilisation becomes 30% slower at 80% utilisation, because the inter-socket link saturates while local DRAM still has headroom. Always benchmark NUMA effects at the utilisation you'll actually run at.

Further reading

Found this useful?