PCIe, DMA, interrupts

Everything that isn't on the CPU die — GPUs, NVMe SSDs, NICs, FPGAs, AI accelerators — talks to the CPU over PCI Express. Lanes are point-to-point serial links; bandwidth doubles every generation. Devices move data via DMA without bothering the CPU. Interrupts let them signal completion. The IOMMU is the gatekeeper that makes the whole thing safe in a multi-tenant or virtualized world. This page is the operational layer that connects the CPU's silicon to everything else in the box.


PCIe is point-to-point, not a bus

The name is misleading: PCIe is not a shared bus. Each lane is two differential signal pairs (one transmit, one receive) carrying serial data at multi-gigabit rates. A device with an x16 connection has 16 lanes, so 16 transmit pairs and 16 receive pairs. Devices never contend for the same lane; arbitration only happens upstream, at switches and at the root complex.

| Role | Notes |
|---|---|
| Root Complex | Lives on the CPU. Connects PCIe to the rest of the SoC. |
| Switch | Multiplexes one upstream port to many downstream ports. Used to fan out PCIe lanes when the CPU has fewer lanes than slots. |
| Endpoint | The actual device: GPU, NVMe SSD, NIC. Sources or sinks PCIe traffic. |
| Bridge | Translates between PCIe and another bus (e.g., legacy PCI or PCI-X). |

Bandwidth scales with lanes × generation

Per-lane bandwidth doubles every generation. Mainstream slot widths are x1, x4, x8, x16; the M.2 slot used by NVMe SSDs is typically x4. A modern GPU plugs into x16; an NVMe drive into x4. The arithmetic:

Gen5 x16: 16 lanes × 3.94 GB/s per lane (each direction) = 63.0 GB/s total in each direction, typical of an NVIDIA H100 or RTX 5090.

The actual throughput a real device achieves is ~5–10% lower than peak — PCIe framing overhead, ordering rules, and credit-based flow control all eat into the raw line rate. A Gen5 x16 GPU slot's 63 GB/s peak is more like 56–60 GB/s in practice.
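
The lanes × generation arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the standard raw signalling rates (8, 16, and 32 GT/s for Gen3 to Gen5) and 128b/130b line encoding; the function names are made up for this example, and Gen6 is left out because it uses a different encoding scheme.

```python
# Per-lane PCIe bandwidth from raw signalling rate and line encoding.
# Gen3-Gen5 use 128b/130b encoding; Gen6 (PAM-4 with FLIT framing) is
# deliberately left out of this sketch.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}    # giga-transfers per second per lane
ENCODING_EFFICIENCY = 128 / 130           # payload bits per bit on the wire


def lane_gb_per_s(gen: int) -> float:
    """Peak one-direction bandwidth of a single lane, in GB/s."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8   # 8 bits per byte


def link_gb_per_s(gen: int, lanes: int) -> float:
    """Peak one-direction bandwidth of an xN link, in GB/s."""
    return lane_gb_per_s(gen) * lanes


if __name__ == "__main__":
    for gen, lanes in [(3, 16), (4, 4), (5, 4), (5, 16)]:
        print(f"Gen{gen} x{lanes}: {link_gb_per_s(gen, lanes):.1f} GB/s peak")
```

These are peak figures; the framing and flow-control overhead described above still has to be subtracted to estimate what a device actually achieves.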

Lane counts on real CPUs

| CPU | PCIe Gen | Total lanes | Typical use |
|---|---|---|---|
| Intel Core i9-14900K (consumer) | Gen5 + Gen4 | 20 lanes total | x16 GPU + x4 NVMe |
| AMD Ryzen 9 9950X (consumer) | Gen5 | 28 lanes | x16 GPU + 2× x4 NVMe + chipset |
| Apple M4 Max (laptop) | Gen4 (internal) | ~24 lanes equivalent | Internal SSD, Thunderbolt 5, ML accelerators |
| Intel Xeon Sapphire Rapids (server) | Gen5 | 80 lanes per socket | 4× GPU + dozens of NVMe |
| AMD EPYC Genoa (server) | Gen5 | 128 lanes per socket | ~12 GPUs + 24 NVMe |

Lane count is the limiting resource on multi-GPU and NVMe-heavy systems. Consumer CPUs typically max out at one x16 GPU plus a couple of NVMe drives. Server platforms have 4–6× more lanes, which is why HPC and AI training boxes use them.
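
A lane budget is simple addition, which makes it easy to check a planned build. The sketch below is illustrative only: the function name and device labels are hypothetical, and it assumes every device is wired straight to the CPU with no switch or chipset fan-out.

```python
from typing import Dict, Tuple


def fits_lane_budget(total_lanes: int, devices: Dict[str, int]) -> Tuple[bool, int]:
    """Return (fits, lanes_left) for devices attached directly to the CPU.

    `devices` maps a label to the link width it needs (16 for x16, 4 for x4).
    A negative lanes_left is the size of the shortfall.
    """
    used = sum(devices.values())
    return used <= total_lanes, total_lanes - used


if __name__ == "__main__":
    # A 20-lane consumer CPU: one GPU plus one NVMe drive uses every lane.
    print(fits_lane_budget(20, {"gpu0": 16, "nvme0": 4}))
    # A second x16 GPU does not fit without dropping to x8/x8 or adding a switch.
    print(fits_lane_budget(20, {"gpu0": 16, "gpu1": 16, "nvme0": 4}))
```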

DMA — bypassing the CPU

Without DMA, every byte read from a NIC into memory would have to pass through CPU load and store instructions. At 100 Gbps that's ~12.5 GB/s of CPU-mediated copying, enough to consume entire cores just moving data. Unworkable. Direct Memory Access lets the device read or write DRAM directly, with the CPU only setting up the transfer and handling completion.

| Transfer mode | Speed | Notes |
|---|---|---|
| Programmed I/O | ~MB/s | Not DMA at all: the CPU issues every load and store. Used only on legacy devices and slow buses. |
| Coherent DMA | GB/s | Device reads/writes DRAM through the IOMMU; the host CPU sees coherent results without explicit cache flushes. Modern default. |
| Streaming DMA | GB/s | Device transfers data the CPU has explicitly mapped for streaming. Used for large data in/out (NIC packet buffers, NVMe queues). |
| Peer-to-peer DMA | GB/s | Device A writes directly to Device B's memory without involving the host CPU. Used between GPUs on the same PCIe fabric (NVLink is a separate interconnect) or between NVMe drives in some storage systems. |

A typical NIC packet receive cycle: the kernel posts descriptors pointing to free buffers in DRAM. The NIC DMAs incoming packets into those buffers. When a packet arrives, the NIC writes a completion to a different ring buffer and raises an MSI-X interrupt to a specific CPU core. The kernel reads the completion, processes the packet, and posts a new descriptor. The CPU never copies the packet bytes — the NIC put them where they need to be.
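
The receive cycle above can be modelled as two queues. The following is a toy simulation, not driver code: the class and method names are invented for illustration, a Python bytearray stands in for a DMA buffer, and a counter stands in for the MSI-X message. It assumes every packet fits in one buffer.

```python
from collections import deque
from typing import List


class RxRing:
    """Toy model of a NIC receive path: posted descriptors plus a completion ring."""

    def __init__(self, num_buffers: int, buf_size: int = 2048):
        # Stand-in for the DRAM buffers the kernel allocated for incoming packets.
        self.buffers = [bytearray(buf_size) for _ in range(num_buffers)]
        self.free_descriptors = deque(range(num_buffers))   # posted by the kernel
        self.completions = deque()                           # written by the NIC
        self.interrupts = 0

    def nic_receive(self, packet: bytes) -> bool:
        """NIC side: 'DMA' the packet into a posted buffer and signal completion."""
        if not self.free_descriptors:
            return False   # no posted buffer left: a real NIC would drop the packet
        idx = self.free_descriptors.popleft()
        self.buffers[idx][:len(packet)] = packet     # the DMA write into DRAM
        self.completions.append((idx, len(packet)))  # completion ring entry
        self.interrupts += 1                         # stands in for the MSI-X message
        return True

    def kernel_poll(self) -> List[bytes]:
        """Kernel side: drain completions, then repost each buffer."""
        packets = []
        while self.completions:
            idx, length = self.completions.popleft()
            # Copying here only lets the toy return the data; a real stack
            # passes the buffer up by reference instead.
            packets.append(bytes(self.buffers[idx][:length]))
            self.free_descriptors.append(idx)        # post a fresh descriptor
        return packets
```

The point the model makes is that the NIC stalls or drops when the kernel falls behind on reposting descriptors, which is why ring sizes are a tuning parameter.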

The IOMMU — a per-device MMU

Without an IOMMU, a malicious or buggy device could DMA anywhere in physical memory — including kernel code or another VM's data. The IOMMU is a translation layer: each device sees its own virtual address space, and the IOMMU translates those addresses to physical ones (with permission checks). It's structurally identical to the CPU's MMU, but for I/O.

  • Device isolation. Each device gets its own translation domain. A device that walks off its assigned range hits a fault, not someone else's memory.
  • VM passthrough. A GPU assigned to a VM has its DMA traffic translated to the VM's address space, so the guest sees the GPU as a regular PCIe device. Intel's VT-d and AMD-Vi enable this.
  • SR-IOV. A single PCIe device exposes multiple "virtual functions" (VFs), each with its own IOMMU context. Used by NICs (one VF per VM), SmartNICs, and some GPU virtualization stacks.
  • Performance cost. IOMMU translation adds 0–100 ns of latency per DMA. For high-throughput workloads, modern IOMMUs cache translations in IOTLBs — the device-side equivalent of the CPU TLB.
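
The translation-plus-permission-check idea can be sketched as a per-device page table with a small cache in front of it. This is a deliberately simplified model: the class, its method names, and the flat one-level table are assumptions made for illustration, whereas real IOMMUs use multi-level tables and bounded, evictable IOTLBs.

```python
PAGE = 4096  # translation granule assumed for this sketch


class Iommu:
    """Toy IOMMU: one translation domain per device, plus an unbounded IOTLB."""

    def __init__(self):
        self.domains = {}     # device_id -> {io_page: (phys_page, writable)}
        self.iotlb = {}       # (device_id, io_page) -> (phys_page, writable)
        self.iotlb_hits = 0

    def map(self, device_id: str, iova: int, phys: int, writable: bool = True) -> None:
        """Grant one device access to one physical page at I/O virtual address iova."""
        self.domains.setdefault(device_id, {})[iova // PAGE] = (phys // PAGE, writable)

    def translate(self, device_id: str, iova: int, write: bool = False) -> int:
        """Translate a device's DMA address, faulting on unmapped or read-only pages."""
        key = (device_id, iova // PAGE)
        entry = self.iotlb.get(key)
        if entry is None:
            # IOTLB miss: walk the device's own table, never anyone else's.
            entry = self.domains.get(device_id, {}).get(iova // PAGE)
            if entry is None:
                raise PermissionError(f"{device_id}: DMA fault at {iova:#x}")
            self.iotlb[key] = entry
        else:
            self.iotlb_hits += 1
        phys_page, writable = entry
        if write and not writable:
            raise PermissionError(f"{device_id}: write to read-only page at {iova:#x}")
        return phys_page * PAGE + iova % PAGE
```

A device that was never given a mapping faults on its first access, and a repeated access to a mapped page is served from the cache, which is the case where the added latency approaches zero.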

Interrupts — from INTx to MSI-X

| Mechanism | Vector count | Notes |
|---|---|---|
| INTx (legacy PCI) | 4 lines (A/B/C/D), shared | Level-triggered and shared between devices, so prone to interrupt storms. Common until ~2008. |
| MSI | Up to 32 vectors per device | Message-signaled interrupts. The device writes to a preconfigured message address; the CPU receives it as an interrupt. Replaced INTx in the PCIe 1.0 era. |
| MSI-X | Up to 2048 vectors per device | One vector per queue. Modern NICs and NVMe drives use MSI-X to spread interrupts across cores. Standard in PCIe Gen2+. |
| IMS (Interrupt Message Store) | Per-VF / per-context vectors | Intel Scalable IOV / IOMMU-managed. Each VM or container gets its own interrupt vectors without coordinating with the OS. |

Modern NVMe drives have 64+ submission/completion queue pairs. Each queue is pinned to one CPU core via MSI-X, so an interrupt for that queue's completion only ever fires on that core. This is what lets a single NVMe drive saturate its PCIe link (~14 GB/s on Gen5 x4) without thrashing one core to death. Distribute the queues across cores; each core handles its own queue's completions; no shared lock.

Interrupt coalescing

A naive NIC raises an interrupt on every received packet. At 1 Mpps that's 1 million interrupts per second, each costing ~200 ns of context-switch overhead — 20% of one core's time spent just entering interrupt handlers. Interrupt coalescing batches them: the device waits up to N microseconds (or until M packets are buffered) before raising the interrupt. The trade-off is latency: higher coalescing thresholds mean lower CPU overhead but higher tail latency.

NVMe drives support the same idea on completion queues: a driver can configure the device to coalesce until, for example, 16 completions are ready or 100 µs has elapsed. NICs expose it through ethtool: ethtool -c shows the current settings and ethtool -C changes them (rx-usecs, rx-frames). Production guidance: set thresholds to keep the interrupt rate under ~50 K interrupts per second per core, while keeping tail latency within your SLO.
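
The overhead arithmetic is easy to reproduce. The sketch below is a steady-state approximation under stated assumptions: packets arrive at a constant rate, an interrupt fires when either limit is reached, and each interrupt costs a fixed ~200 ns. The parameter names max_wait_us and max_frames are made up here; they correspond only loosely to rx-usecs and rx-frames, whose exact semantics vary by driver.

```python
from typing import Optional, Tuple


def interrupt_load(pps: float,
                   max_wait_us: Optional[float] = None,
                   max_frames: Optional[int] = None,
                   cost_ns: float = 200.0) -> Tuple[float, float]:
    """Return (interrupts per second, fraction of one core spent on entry overhead).

    None disables a limit. With both limits disabled, every packet interrupts.
    """
    if max_wait_us is None and max_frames is None:
        rate = pps
    else:
        # Whichever limit trips first sets the interval between interrupts.
        by_frames = pps / max_frames if max_frames else 0.0
        by_timer = 1e6 / max_wait_us if max_wait_us else 0.0
        rate = min(pps, max(by_frames, by_timer))
    return rate, rate * cost_ns * 1e-9


if __name__ == "__main__":
    for label, kwargs in [("no coalescing", {}),
                          ("50 us / 64 frames", {"max_wait_us": 50, "max_frames": 64})]:
        rate, core = interrupt_load(1_000_000, **kwargs)
        print(f"{label}: {rate:,.0f} irq/s, {core:.1%} of one core")
```

At 1 Mpps with no coalescing this reproduces the 20% figure; with a 50 µs timer and a 64-frame limit the timer dominates and the rate drops to 20,000 interrupts per second.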

The Receive Side Scaling (RSS) trick: NICs hash the 5-tuple of incoming packets (src IP, dst IP, src port, dst port, protocol) and route each flow to a specific RX queue. Each queue has its own MSI-X interrupt vector pinned to one core. Result: parallel packet processing across all cores without cross-core synchronisation. Linux RPS does the same steering in software when the NIC can't; XPS is the transmit-side counterpart.
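
Flow steering comes down to a deterministic hash of the flow identity, reduced to a queue index. The sketch below shows only that idea. It uses CRC32 as a stand-in hash, which is an assumption of this example: real NICs typically use a keyed Toeplitz hash and an indirection table, neither of which is modelled here.

```python
import zlib


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int, num_queues: int) -> int:
    """Map a flow's 5-tuple to an RX queue index.

    CRC32 is an illustrative stand-in for the NIC's real hash function.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_queues


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 40000, 443, 6)   # 6 = TCP
    # Every packet of one flow lands on the same queue, hence the same core.
    print(rss_queue(*flow, num_queues=8))
    print(rss_queue(*flow, num_queues=8))
```

Keeping a flow on one queue preserves packet order within the flow; the cost is that a single heavy flow cannot be spread across cores.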

Bus, hub, and the death of shared

PCI (1992) was a 32-bit shared parallel bus running at 33 MHz: 132 MB/s for the entire system, divided among all attached devices. PCI-X (1999) doubled the width to 64 bits and raised the clock to as much as 133 MHz, for about 1.06 GB/s. Both shared a clock and bus arbiter; one device's transaction blocked the bus for everyone.

PCIe (2003) replaced this with point-to-point serial links. Each lane has its own differential signaling pairs; there's no shared bus. Aggregate bandwidth scales with lane count. Adding devices doesn't slow down existing ones (until the root complex is the bottleneck). It's a far more scalable architecture, which is why it has survived through 6 generations and roughly 32× per-lane bandwidth growth.

Anatomy of a Transaction Layer Packet (TLP)

PCIe is a packet-switched fabric, not a bus. Every read, write, configuration, or interrupt travels as a Transaction Layer Packet. Understanding the TLP is how you understand why PCIe gen-over-gen bandwidth grows but per-transaction latency improves only modestly — the wire is faster, but the protocol envelope around each transaction stays the same size.

+--------+--------+----------------------+--------+
| Frame  | Seq #  | TLP HEADER + PAYLOAD | LCRC   |  ← Data Link Layer
+--------+--------+----------------------+--------+
                  |
            +-----+------------------------------------------+
            | Fmt | Type | TC | TD | EP | Attr | Length      |
            +-----+------+----+----+----+------+-------------+
            | Requester ID    | Tag      | Last/First DW BE  |
            +-----------------+----------+-------------------+
            | Address (64-bit, for memory reads/writes)      |
            +------------------------------------------------+
            | Data (0 to 1024 DW, with completion semantics) |
            +------------------------------------------------+

Four TLP types cover almost everything: memory read/write (the workhorse; every DMA is a sequence of these), configuration read/write (the BIOS / OS enumerating devices at boot), completion (the reply to a non-posted request such as a memory read), and message (legacy INTx interrupt emulation, error signalling, power-management requests).

The header is 12-16 bytes and the payload is bounded by the device's Max Payload Size (typically 256 or 512 bytes on modern systems). That MPS-versus-header ratio is why bulk transfers run close to wire rate while small transactions waste tens of percent on overhead. A 64-byte cache-line read over PCIe carries ~20 bytes of envelope around 64 bytes of data, which is 24% overhead. A bulk 4 KB transfer is split into MPS-sized TLPs, so it pays about 7% at a 256-byte MPS and about 4% at 512 bytes; it would drop below 1% only if a single TLP could carry all 4 KB.
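
The overhead figures follow from one ratio: envelope bytes divided by total bytes on the wire. A minimal sketch, taking the ~20-byte envelope from the paragraph above as a given (the function name is made up for this example):

```python
def tlp_overhead(payload_bytes: int, envelope_bytes: int = 20) -> float:
    """Fraction of wire bytes spent on framing and header rather than data."""
    return envelope_bytes / (payload_bytes + envelope_bytes)


if __name__ == "__main__":
    # 64 B is one cache line; 256 B and 512 B are typical MPS values;
    # 4096 B applies only if a single TLP could carry the whole transfer.
    for payload in (64, 256, 512, 4096):
        print(f"{payload:>4} B payload: {tlp_overhead(payload):.1%} overhead")
```

The useful takeaway is the shape of the curve: overhead falls quickly up to a few hundred bytes of payload and then flattens, so raising MPS from 128 to 256 matters far more than going from 512 to 1024.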

Why Max Read Request Size matters more than MPS. Each device's Max Read Request Size (MRRS) caps how big a single read request can be, typically 512 bytes. A 4 KB DMA read is actually 8 read TLPs in flight at once. The device pipelines them, and completions for different requests may come back out of order. If a device is bandwidth-starved on PCIe, the first thing to check is lspci -vvv for MaxReadReq; on a device stuck at 128 bytes, raising it toward 4096 can raise throughput substantially.
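
How one DMA read turns into several read requests can be sketched as a simple chunking loop. This is an illustration under two assumptions: the device issues the largest request MRRS allows, and it respects the PCIe rule that a memory request must not cross a 4 KB address boundary. Real devices may choose different split points, and the function name is invented for this example.

```python
from typing import List, Tuple


def split_read(addr: int, length: int, mrrs: int = 512) -> List[Tuple[int, int]]:
    """Split a DMA read into (address, size) requests.

    Each request is at most `mrrs` bytes and never crosses a 4 KB boundary.
    """
    requests = []
    end = addr + length
    while addr < end:
        to_boundary = 4096 - (addr % 4096)       # bytes left before the next 4 KB line
        size = min(mrrs, to_boundary, end - addr)
        requests.append((addr, size))
        addr += size
    return requests


if __name__ == "__main__":
    print(len(split_read(0x10000, 4096, mrrs=512)))   # aligned 4 KB read at MRRS 512
    print(len(split_read(0x10000, 4096, mrrs=128)))   # same read at MRRS 128
    print(split_read(0xF00, 512, mrrs=512))           # unaligned read straddling 0x1000
```

With MRRS at 512 the aligned 4 KB read becomes 8 requests; at 128 it becomes 32, each paying its own header and round trip, which is why a low MaxReadReq shows up as lost throughput.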

A DMA, step by step

Putting the pieces together. A simplified NVMe write of a 4 KB block, from write() syscall to completion:

  1. App. Calls write(fd, buf, 4096) with a userspace buffer.
  2. Kernel. Pins the user pages (so the kernel can hand a physical address to the device without those pages getting paged out). For O_DIRECT, no copy; otherwise the kernel may copy into a page-cache page first.
  3. Driver. Builds an NVMe Submission Queue Entry (SQE) containing the PRP (Physical Region Page) list — physical addresses the device should DMA from. Writes the SQE to a queue in DRAM the device has been told about at init.
  4. Driver. Writes a 4-byte value to the device's "doorbell" MMIO register, which goes out as a PCIe memory-write TLP. This is the only MMIO access the CPU makes; from here the device runs the show.
  5. Device. Reads the SQE from DRAM via a memory-read TLP. Reads the 4 KB data buffer via 8 split memory-read TLPs (one per 512 B chunk, given typical MaxReadReq).
  6. Device. Writes the data to NAND, updates the FTL.
  7. Device. Posts a Completion Queue Entry (CQE) to DRAM via a memory-write TLP, then raises an MSI-X interrupt — another memory-write TLP, targeting the interrupt-message address pre-programmed at init.
  8. CPU. Takes the interrupt, runs the bottom-half, reads the CQE, wakes the syscall, returns to userspace.

The whole sequence is 12-15 TLPs across the PCIe fabric. On modern hardware, end-to-end latency is around 50-100 µs for the NVMe path, of which the CPU is doing real work for maybe 1-2 µs; the rest is device latency and PCIe transit. The IOMMU adds 0-100 ns per DMA (often close to 0 with a warm IOTLB).
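
Steps 3 to 8 above can be modelled as a producer/consumer pair sharing two queues. The following toy is a sketch, not an NVMe implementation: all names are hypothetical, a Python dict stands in for NAND, and queue-full handling, PRP lists, and the completion phase bit are omitted.

```python
from typing import Dict, List, Tuple


class QueuePair:
    """Toy submission/completion queue pair living in 'host memory'."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.sq = [None] * depth    # submission queue entries
        self.cq = []                # completion queue entries
        self.sq_tail = 0            # host-owned producer index
        self.sq_head = 0            # device-owned consumer index
        self.doorbell = 0           # stands in for the MMIO doorbell register
        self.mmio_writes = 0
        self.interrupts = 0

    def submit(self, command_id: int, data: bytes) -> None:
        """Steps 3-4: write the entry to memory, then ring the doorbell."""
        self.sq[self.sq_tail] = (command_id, data)
        self.sq_tail = (self.sq_tail + 1) % self.depth
        self.doorbell = self.sq_tail    # the only MMIO write on this path
        self.mmio_writes += 1

    def device_run(self, storage: Dict[int, bytes]) -> None:
        """Steps 5-7: fetch new entries, store the data, post completions."""
        while self.sq_head != self.doorbell:
            command_id, data = self.sq[self.sq_head]   # 'DMA read' of entry and data
            storage[command_id] = data                 # stands in for the NAND write
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq.append((command_id, "OK"))         # 'DMA write' of the completion
            self.interrupts += 1                       # stands in for MSI-X

    def reap(self) -> List[Tuple[int, str]]:
        """Step 8: the interrupt handler drains the completion queue."""
        done, self.cq = self.cq, []
        return done
```

Note what the host never does in this model: it never touches the data after submission and never reads device registers. Everything it learns comes from memory the device wrote.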

Patterns the fast paths reach for

Once you understand the TLP and the DMA flow, several high-throughput subsystems become legible. They are all variations on the same theme: eliminate the CPU from the critical path.

| Technique | What changes | Used by |
|---|---|---|
| NVMe queue pairs (one per core) | Each CPU core has its own SQ/CQ. No cross-core locking on the data path; MSI-X interrupts return to the core that issued the request. | Linux NVMe driver default since 4.0 |
| io_uring | Userspace + kernel share submission/completion ring buffers. A single syscall submits dozens of I/Os. Per-syscall overhead drops by 10x for small I/Os. | Modern Linux storage stacks (5.1+) |
| DPDK / AF_XDP | NIC ring buffers mapped directly into a userspace process. Poll mode, no interrupts. Bypasses the kernel network stack entirely. | Telco workloads, CDN edge stacks, packet brokers |
| GPUDirect Storage / RDMA | NVMe → GPU memory directly via peer-to-peer DMA, skipping a bounce buffer in CPU DRAM. Adds >30% throughput for ML training data pipelines. | NVIDIA Magnum IO, Cloudera, MosaicML |
| SR-IOV virtual functions | One physical device exposes N "virtual" PCIe devices, each with its own BAR + MSI-X vectors. The hypervisor passes a VF through to a VM; the IOMMU enforces isolation. | Cloud network VNICs (AWS ENA, Azure SR-IOV) |
| CXL.cache / CXL.mem | PCIe physical layer, but with a coherent memory protocol on top. The accelerator participates in the CPU's coherence domain. Removes the explicit DMA step for memory-class accelerators. | Memory pooling, accelerator-attached HBM, post-2023 servers |
| Doorbell batching | Driver writes the doorbell only every N submissions, not every one. Trades a few microseconds of latency for substantially less PCIe traffic. | Most modern NVMe + RDMA drivers |

The common pattern. The CPU does setup; the device does the data movement; signaling happens via batched doorbells and coalesced interrupts. The CPU's only critical-path work per transaction is ~50 ns of cache-resident state management. Get below that and you're competing with the speed of light through the PCIe physical layer.
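
The saving from batched doorbells is plain ceiling division, shown here as a small sketch. The function name and the batch sizes are illustrative choices, not values taken from any particular driver.

```python
def doorbell_writes(submissions: int, batch: int) -> int:
    """MMIO doorbell writes needed when ringing once per `batch` submissions."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return -(-submissions // batch)   # ceiling division


if __name__ == "__main__":
    for batch in (1, 4, 16):
        print(f"batch={batch}: {doorbell_writes(1000, batch)} doorbell writes per 1000 I/Os")
```

The latency cost is the time the first submission in a batch waits for the last one, which is why drivers usually also ring the doorbell when the submitter goes idle.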

Common misconceptions

  • "PCIe lanes have a fixed assignment." Most CPUs have lane bifurcation: an x16 slot can run as one x16, two x8, or four x4 depending on BIOS configuration. This is how you get four NVMe drives into a single physical x16 slot via an adapter card.
  • "The IOMMU only matters for VMs." Even on bare metal, the IOMMU enforces device isolation against malicious Thunderbolt devices, faulty firmware, and DMA attacks such as Thunderclap and the older FireWire attacks. Linux enables it by default on most distros now.
  • "PCIe Gen6 will fix bandwidth forever." Each generation halves the signaling time per bit, requiring more aggressive equalization. Gen6 introduced PAM-4 signaling (4 levels per symbol, vs Gen5's 2). Gen7 will need new electrical standards. Each doubling is harder to deliver than the last.
  • "DMA is faster because the CPU doesn't have to do the copy." True for the obvious case, but DMA also has setup costs (descriptor preparation, IOTLB flushes in some IOMMU configurations). For tiny transfers, programmed I/O can actually beat DMA.
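
The last point has a simple break-even model: DMA pays a fixed setup cost and then moves bytes quickly, while programmed I/O has no setup cost but moves bytes slowly. The sketch below solves for the crossover size. The setup time and transfer rates used in the example are purely hypothetical placeholders, not measurements.

```python
def pio_break_even_bytes(dma_setup_s: float,
                         pio_bytes_per_s: float,
                         dma_bytes_per_s: float) -> float:
    """Transfer size below which programmed I/O finishes before DMA does.

    Model: t_pio = n / pio_rate and t_dma = setup + n / dma_rate.
    """
    if pio_bytes_per_s >= dma_bytes_per_s:
        return float("inf")   # PIO is never slower in this model
    return dma_setup_s / (1 / pio_bytes_per_s - 1 / dma_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical figures: 1 us of DMA setup, 100 MB/s PIO, 10 GB/s DMA.
    n = pio_break_even_bytes(1e-6, 100e6, 10e9)
    print(f"PIO wins below about {n:.0f} bytes")
```

Under those placeholder numbers the crossover sits near 100 bytes, which matches the intuition that only register-sized or descriptor-sized transfers favour programmed I/O.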

Numbers worth remembering

| Quantity | Value |
|---|---|
| PCIe Gen3 per-lane bandwidth | ~0.985 GB/s |
| PCIe Gen4 per-lane bandwidth | ~1.97 GB/s |
| PCIe Gen5 per-lane bandwidth | ~3.94 GB/s |
| PCIe Gen6 per-lane bandwidth | ~7.88 GB/s |
| x16 slot, Gen5 (typical GPU) | ~63 GB/s |
| x4 slot, Gen5 (typical NVMe) | ~15.8 GB/s peak (drives reach ~14 GB/s) |
| Consumer CPU lanes (typical) | 20–28 |
| Server CPU lanes (typical) | 80–128 per socket |
| MSI-X max vectors per device | 2048 |
| NIC interrupt coalescing target | < 50 K interrupts/s per core |
| NVMe queue pair count | typically 64–128 |
| IOMMU per-DMA overhead | ~0–100 ns |
