PCIe, DMA, interrupts
Nearly every peripheral that isn't on the CPU die (GPUs, NVMe SSDs, NICs, FPGAs, AI accelerators) talks to the CPU over PCI Express. Lanes are point-to-point serial links; per-lane bandwidth doubles every generation. Devices move data via DMA without bothering the CPU. Interrupts let them signal completion. The IOMMU is the gatekeeper that makes the whole thing safe in a multi-tenant or virtualized world. This page covers the operational layer that connects the CPU's silicon to everything else in the box.
PCIe is point-to-point, not a bus
The name PCIe inherits from PCI is misleading: PCIe is not a shared bus. Each lane consists of two differential pairs (one transmit, one receive) running serial data at multi-gigabit rates. A device with an x16 connection has 16 such lanes, each carrying traffic in both directions at once. There's no contention between devices for the same lane; arbitration only happens upstream at switches and at the root complex.
| Role | Notes |
|---|---|
| Root Complex | Lives on the CPU. Connects PCIe to the rest of the SoC. |
| Switch | Multiplexes one upstream port to many downstream ports. Used to fan out PCIe lanes when the CPU has fewer lanes than slots. |
| Endpoint | The actual device — GPU, NVMe SSD, NIC. Sources or sinks PCIe traffic. |
| Bridge | Translates between PCIe and another bus (e.g., legacy PCI or PCI-X). |
Bandwidth scales with lanes × generation
Per-lane bandwidth doubles every generation. Mainstream slot widths are x1, x4, x8, x16; the M.2 slot used by NVMe SSDs is typically x4. A modern GPU plugs into x16; an NVMe drive into x4. The arithmetic is lanes × per-lane rate: a Gen5 x16 link carries about 16 × 3.94 ≈ 63 GB/s in each direction.
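The lanes × generation arithmetic can be sketched in a few lines of Python. This is a minimal sketch, assuming the published raw signaling rates (GT/s) and line encodings for Gen1 through Gen5; the function names are illustrative, and Gen6 is left out because its FLIT-based framing does not fit the same simple formula.

```python
# Raw signaling rate in GT/s and line-encoding efficiency per generation.
# Gen1/Gen2 use 8b/10b encoding; Gen3 through Gen5 use 128b/130b.
GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}

def lane_gbytes_per_s(gen):
    """Usable bandwidth of one lane, one direction, in GB/s."""
    gt_per_s, efficiency = GENERATIONS[gen]
    return gt_per_s * efficiency / 8  # 8 bits per byte

def link_gbytes_per_s(gen, lanes):
    """Usable bandwidth of an xN link, one direction, in GB/s."""
    return lanes * lane_gbytes_per_s(gen)

if __name__ == "__main__":
    for gen in sorted(GENERATIONS):
        print(f"Gen{gen}: x1 {lane_gbytes_per_s(gen):.3f} GB/s, "
              f"x4 {link_gbytes_per_s(gen, 4):.2f} GB/s, "
              f"x16 {link_gbytes_per_s(gen, 16):.2f} GB/s")
```

These figures are link rates before TLP and flow-control overhead, so measured throughput lands somewhat lower.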
Lane counts on real CPUs
| CPU | PCIe Gen | Total lanes | Typical use |
|---|---|---|---|
| Intel Core i9-14900K (consumer) | Gen5 + Gen4 | 20 lanes total | x16 GPU + x4 NVMe |
| AMD Ryzen 9 9950X (consumer) | Gen5 | 28 lanes | x16 GPU + 2× x4 NVMe + chipset |
| Apple M4 Max (laptop) | Gen4 (internal) | ~24 lanes equivalent | Internal SSD, Thunderbolt 5, ML accelerators |
| Intel Xeon Sapphire Rapids (server) | Gen5 | 80 lanes per socket | 4× x16 GPU + 4× x4 NVMe |
| AMD EPYC Genoa (server) | Gen5 | 128 lanes per socket | 8× x16 GPU, or 4× x16 GPU + 16× x4 NVMe |
The lane count is the gate-keeping resource on multi-GPU and high-NVMe systems. Consumer CPUs typically max out at one x16 GPU plus a couple of NVMe drives. Server platforms have 4–6× more, which is why HPC and AI training boxes use them.
DMA — bypassing the CPU
Without DMA, every byte read from a NIC into memory would require a CPU instruction. At 100 Gbps that's ~12 GB/s of CPU-mediated memory writes — every cycle of every core spent moving data. Unworkable. Direct Memory Access lets the device read or write DRAM directly, with the CPU only setting up the transfer and handling completion.
| Transfer mode | Speed | Notes |
|---|---|---|
| Programmed I/O | ~MB/s | CPU issues every load and store. Used only on legacy devices and slow buses. |
| Coherent DMA | GB/s | Device reads/writes DRAM through the IOMMU; the host CPU sees coherent results without explicit cache flushes. Modern default. |
| Streaming DMA | GB/s | Device transfers data the CPU has explicitly mapped for streaming. Used for large data in/out (NIC packet buffers, NVMe queues). |
| Peer-to-peer DMA | GB/s | Device A writes directly to Device B's memory without involving the host CPU. Used between GPUs over PCIe (NVLink serves the same purpose on a separate interconnect) or between NVMe drives in some storage systems. |
A typical NIC packet receive cycle: the kernel posts descriptors pointing to free buffers in DRAM. The NIC DMAs incoming packets into those buffers. When a packet arrives, the NIC writes a completion to a different ring buffer and raises an MSI-X interrupt to a specific CPU core. The kernel reads the completion, processes the packet, and posts a new descriptor. The CPU never copies the packet bytes — the NIC put them where they need to be.
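The receive cycle above can be modelled as two rings. The following is a toy Python simulation, not driver code: the class and method names are hypothetical, bytearrays stand in for DRAM buffers, and the interrupt is reduced to the caller invoking `kernel_poll()`.

```python
from collections import deque

class RxRing:
    """Toy model of a NIC receive path: posted descriptors plus a completion ring."""

    def __init__(self, num_buffers, buf_size=2048):
        self.buf_size = buf_size
        self.buffers = [bytearray(buf_size) for _ in range(num_buffers)]  # stands in for DRAM
        self.posted = deque(range(num_buffers))  # descriptors the kernel has posted
        self.completions = deque()               # entries written by the NIC

    def nic_receive(self, packet):
        """NIC side: DMA the packet into a posted buffer, then write a completion."""
        if len(packet) > self.buf_size or not self.posted:
            return False  # no usable buffer: the packet is dropped
        idx = self.posted.popleft()
        self.buffers[idx][:len(packet)] = packet
        self.completions.append((idx, len(packet)))
        return True

    def kernel_poll(self):
        """Kernel side, run after the interrupt: reap completions and repost buffers."""
        packets = []
        while self.completions:
            idx, length = self.completions.popleft()
            # The copy here is only for the demo; a real driver hands the
            # buffer up the stack and posts a fresh one instead.
            packets.append(bytes(self.buffers[idx][:length]))
            self.posted.append(idx)
        return packets

if __name__ == "__main__":
    ring = RxRing(num_buffers=2)
    print(ring.nic_receive(b"ping"), ring.nic_receive(b"pong"), ring.nic_receive(b"lost"))
    print(ring.kernel_poll())
```

The third packet in the demo is dropped because both descriptors are in use, which is what happens on real hardware when the kernel falls behind on reposting buffers.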
The IOMMU — a per-device MMU
Without an IOMMU, a malicious or buggy device could DMA anywhere in physical memory — including kernel code or another VM's data. The IOMMU is a translation layer: each device sees its own virtual address space, and the IOMMU translates those addresses to physical ones (with permission checks). It's structurally identical to the CPU's MMU, but for I/O.
- Device isolation. Each device gets its own translation domain. A device that walks off its assigned range hits a fault, not someone else's memory.
- VM passthrough. A GPU assigned to a VM has its DMA traffic translated to the VM's address space, so the guest sees the GPU as a regular PCIe device. Intel's VT-d and AMD-Vi enable this.
- SR-IOV. A single PCIe device exposes multiple "virtual functions" (VFs), each with its own IOMMU context. Used by NICs (one VF per VM), SmartNICs, and some GPU virtualization stacks.
- Performance cost. IOMMU translation adds 0–100 ns of latency per DMA. For high-throughput workloads, modern IOMMUs cache translations in IOTLBs — the device-side equivalent of the CPU TLB.
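A toy model makes the per-device translation concrete. This is a minimal Python sketch with hypothetical names (`ToyIommu`, `IommuFault`); it uses a flat dictionary per device instead of the multi-level page tables and IOTLB that real IOMMUs implement.

```python
PAGE = 4096

class IommuFault(Exception):
    """Raised when a device touches an address it has no mapping or permission for."""

class ToyIommu:
    def __init__(self):
        # device id -> {I/O virtual page number: (physical page number, writable)}
        self.domains = {}

    def map(self, device, iova, phys, writable=True):
        """Map one page of the device's I/O virtual address space to physical memory."""
        self.domains.setdefault(device, {})[iova // PAGE] = (phys // PAGE, writable)

    def translate(self, device, iova, write=False):
        """Translate a device-issued address, enforcing isolation and permissions."""
        entry = self.domains.get(device, {}).get(iova // PAGE)
        if entry is None:
            raise IommuFault(f"{device}: no mapping for IOVA {iova:#x}")
        ppn, writable = entry
        if write and not writable:
            raise IommuFault(f"{device}: write to read-only IOVA {iova:#x}")
        return ppn * PAGE + iova % PAGE

if __name__ == "__main__":
    iommu = ToyIommu()
    iommu.map("nic", iova=0x1000, phys=0x7F000)
    print(hex(iommu.translate("nic", 0x1010)))
    try:
        iommu.translate("gpu", 0x1000)  # different domain, same IOVA
    except IommuFault as fault:
        print("fault:", fault)
```

The same IOVA resolves for the NIC and faults for the GPU, which is the isolation property the bullets above describe.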
Interrupts — from INTx to MSI-X
| Mechanism | Vector count | Notes |
|---|---|---|
| INTx (legacy PCI) | 4 lines (A/B/C/D) shared | Level-sensitive lines shared between devices; prone to interrupt storms when shared. Used until ~2008; PCIe only emulates it with in-band messages. |
| MSI | Up to 32 vectors per device | Message-signaled interrupts. The device performs a memory write to a preprogrammed address; the platform turns that write into a CPU interrupt. Replaced INTx in the PCIe 1.0 era. |
| MSI-X | Up to 2048 vectors per device | One vector per queue. Modern NICs and NVMe drives use MSI-X to spread interrupts across cores. Standard in PCIe Gen2+. |
| IMS (Interrupt Message Store) | Per-VF / per-context vectors | Part of Intel Scalable IOV, IOMMU-managed. Each VM or container gets its own interrupt vectors without coordinating with the OS. |
Modern NVMe drives have 64+ submission/completion queue pairs. Each queue is pinned to one CPU core via MSI-X: an interrupt for that queue's completion only ever fires on that core. This is what lets a single NVMe drive saturate its PCIe link without thrashing one core to death. Distribute the queues across cores; each core handles its own queue's completions; no shared lock.
Interrupt coalescing
A naive NIC raises an interrupt on every received packet. At 1 Mpps that's 1 million interrupts per second, each costing ~200 ns of context-switch overhead — 20% of one core's time spent just entering interrupt handlers. Interrupt coalescing batches them: the device waits up to N microseconds (or until M packets are buffered) before raising the interrupt. The trade-off is latency: higher coalescing thresholds mean lower CPU overhead but higher tail latency.
NVMe drives support the same idea for completion queues: the interrupt coalescing feature takes an aggregation threshold and an aggregation time, for example 16 completions or 100 µs, whichever comes first. NICs expose the equivalent knobs through ethtool: ethtool -c shows the current settings and ethtool -C changes them (rx-usecs, rx-frames).
Production guidance: set thresholds to keep the interrupt rate under ~50 K interrupts per second per core, while keeping tail latency within your SLO.
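The trade-off is easy to put numbers on. The sketch below reuses the figures from the text (1 Mpps, ~200 ns per interrupt, a 50 K interrupts/s target); the function names are illustrative, and it assumes a steady packet rate, which real traffic rarely has.

```python
import math

def interrupt_cpu_fraction(packets_per_s, cost_ns, frames_per_interrupt=1):
    """Fraction of one core spent just entering interrupt handlers."""
    interrupts_per_s = packets_per_s / frames_per_interrupt
    return interrupts_per_s * cost_ns * 1e-9

def frames_needed(packets_per_s, max_interrupts_per_s):
    """Smallest batch size that keeps the interrupt rate at or under the target."""
    return math.ceil(packets_per_s / max_interrupts_per_s)

def batch_fill_time_us(packets_per_s, frames_per_interrupt):
    """Approximate time to accumulate one batch, i.e. the latency added
    to the first packet of the batch at a steady arrival rate."""
    return frames_per_interrupt / packets_per_s * 1e6

if __name__ == "__main__":
    pps, cost_ns = 1_000_000, 200
    batch = frames_needed(pps, 50_000)
    print(f"no coalescing: {interrupt_cpu_fraction(pps, cost_ns):.0%} of a core")
    print(f"batch of {batch}: {interrupt_cpu_fraction(pps, cost_ns, batch):.0%} of a core, "
          f"about {batch_fill_time_us(pps, batch):.0f} µs added latency")
```

At lower packet rates the same batch size takes longer to fill, which is why devices pair the frame threshold with a time limit such as rx-usecs.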
Bus, hub, and the death of shared
PCI (1992) was a 32-bit shared parallel bus running at 33 MHz: 132 MB/s for the entire system, divided among all attached devices. PCI-X (1999) doubled the width to 64 bits and raised the clock to 133 MHz, about 1 GB/s; PCI-X 2.0 later added speed grades up to 533 MHz. Both shared a clock and bus arbiter; one device's transaction blocked the bus for everyone.
PCIe (2003) replaced this with point-to-point serial links. Each lane has its own differential signaling pairs; there's no shared bus. Aggregate bandwidth scales with lane count. Adding devices doesn't slow down existing ones (until the root complex is the bottleneck). It's a far more scalable architecture, which is why it has survived through 6 generations and roughly 32× per-lane bandwidth growth.
Anatomy of a Transaction Layer Packet (TLP)
PCIe is a packet-switched fabric, not a bus. Every read, write, configuration, or interrupt travels as a Transaction Layer Packet. Understanding the TLP is how you understand why PCIe gen-over-gen bandwidth grows but per-transaction latency improves only modestly — the wire is faster, but the protocol envelope around each transaction stays the same size.
+--------+--------+----------------------+--------+
| Frame | Seq # | TLP HEADER + PAYLOAD | LCRC | ← Data Link Layer
+--------+--------+----------------------+--------+
|
+-----+------------------------------------------+
| Fmt | Type | TC | TD | EP | Attr | Length |
+-----+------+----+----+----+------+-------------+
| Requester ID | Tag | Last/First DW BE |
+-----------------+----------+-------------------+
| Address (64-bit, for memory reads/writes) |
+------------------------------------------------+
| Data (0 to 1024 DW, with completion semantics) |
+------------------------------------------------+
Four TLP types cover almost everything: memory read/write (the workhorse; every DMA is a sequence of these), configuration read/write (the BIOS / OS enumerating devices at boot), completion (the reply to a non-posted request such as a memory read), and message (legacy INTx emulation, error signalling, power-management requests).
The header is 12-16 bytes and the payload is bounded by the device's Max Payload Size (typically 256 or 512 bytes on modern systems). That MPS-versus-header ratio is why bulk transfers run close to wire rate while small transactions waste tens of percent on overhead. A 64-byte cache-line read over PCIe carries ~20 bytes of envelope around 64 bytes of data: 24% overhead. A bulk 4 KB transfer split into MPS-sized TLPs sees roughly 4-7% overhead.
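The overhead of a transfer depends on how many TLPs it is split into. The sketch below works that out, assuming the ~20-byte envelope per TLP used in the text (header plus framing, sequence number, and LCRC); the function name and the default value are illustrative, and flow-control and acknowledgement traffic is ignored.

```python
def transfer_overhead(total_bytes, max_payload, envelope_bytes=20):
    """Fraction of wire bytes spent on TLP envelopes for one transfer."""
    num_tlps = -(-total_bytes // max_payload)  # ceiling division
    wire_bytes = total_bytes + num_tlps * envelope_bytes
    return 1 - total_bytes / wire_bytes

if __name__ == "__main__":
    print(f"64 B in one TLP:     {transfer_overhead(64, 256):.1%}")
    print(f"4 KB at MPS = 256 B: {transfer_overhead(4096, 256):.1%}")
    print(f"4 KB at MPS = 512 B: {transfer_overhead(4096, 512):.1%}")
```

The per-TLP cost is fixed, so the only ways to reduce overhead are larger payloads per TLP or fewer, larger transfers.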
Check lspci -vvv for MaxReadReq: raising it from 128 toward 4096 can markedly improve DMA read throughput.
A DMA, step by step
Putting the pieces together. A simplified NVMe write of a 4 KB block, from write() syscall to completion:
- App. Calls write(fd, buf, 4096) with a userspace buffer.
- Kernel. Pins the user pages (so the kernel can hand a physical address to the device without those pages getting paged out). For O_DIRECT, no copy; otherwise the kernel may copy into a page-cache page first.
- Driver. Builds an NVMe Submission Queue Entry (SQE) containing the PRP (Physical Region Page) list — physical addresses the device should DMA from. Writes the SQE to a queue in DRAM the device has been told about at init.
- Driver. Writes a 4-byte value to the device's "doorbell" MMIO register, which travels as a PCIe memory-write TLP. This is the only access to the device the CPU makes on the submission path; from here the device runs the show.
- Device. Reads the SQE from DRAM via a memory-read TLP. Reads the 4 KB data buffer via 8 split memory-read TLPs (one per 512 B chunk, given typical MaxReadReq).
- Device. Writes the data to NAND, updates the FTL.
- Device. Posts a Completion Queue Entry (CQE) to DRAM via a memory-write TLP, then raises an MSI-X interrupt — another memory-write TLP, targeting the interrupt-message address pre-programmed at init.
- CPU. Takes the interrupt, runs the bottom-half, reads the CQE, wakes the syscall, returns to userspace.
The whole sequence is 12-15 TLPs across the PCIe fabric. On modern hardware end-to-end latency is around 50-100 µs for the NVMe path, of which the CPU is doing real work for maybe 1-2 µs; the rest is device latency and PCIe transit. The IOMMU adds 0-100 ns per DMA (often close to 0 when the translation is already cached in the IOTLB).
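The steps above can be tallied in a short Python sketch. It counts request TLPs only (the completions that carry read data back to the device come on top), assumes the 512-byte read-request size from the walkthrough, and uses an illustrative function name.

```python
import math

def nvme_write_request_tlps(block_bytes=4096, max_read_req=512):
    """List the request TLPs for one simplified NVMe write and their total."""
    steps = [
        ("doorbell MMIO write (CPU to device)", 1),
        ("SQE fetch (memory read)", 1),
        ("data fetch (memory reads)", math.ceil(block_bytes / max_read_req)),
        ("CQE post (memory write)", 1),
        ("MSI-X interrupt (memory write)", 1),
    ]
    return steps, sum(count for _, count in steps)

if __name__ == "__main__":
    steps, total = nvme_write_request_tlps()
    for name, count in steps:
        print(f"{count:3d}  {name}")
    print(f"{total:3d}  request TLPs in total")
```

With a 128-byte MaxReadReq the data fetch alone grows to 32 requests, which shows why the read-request size matters for the device's DMA reads.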
Patterns the fast paths reach for
Once you understand the TLP and the DMA flow, several high-throughput subsystems become legible. They are all variations on the same theme: eliminate the CPU from the critical path.
| Technique | What changes | Used by |
|---|---|---|
| NVMe queue pairs (one per core) | Each CPU core has its own SQ/CQ. No cross-core locking on the data path; MSI-X interrupts return to the core that issued the request. | Linux NVMe driver default since 4.0 |
| io_uring | Userspace + kernel share submission/completion ring buffers. A single syscall submits dozens of I/Os. Per-syscall overhead drops by 10x for small I/Os. | Modern Linux storage stacks (5.1+) |
| DPDK / AF_XDP | NIC ring buffers mapped directly into a userspace process. Poll mode — no interrupts. Bypasses the kernel network stack entirely. | Telco workloads, CDN edge stacks, packet brokers |
| GPUDirect Storage / RDMA | NVMe → GPU memory directly via peer-to-peer DMA, skipping a bounce buffer in CPU DRAM. Adds >30% throughput for ML training data pipelines. | NVIDIA Magnum IO, Cloudera, MosaicML |
| SR-IOV virtual functions | One physical device exposes N "virtual" PCIe devices, each with its own BAR + MSI-X vectors. The hypervisor passes a VF through to a VM; the IOMMU enforces isolation. | Cloud network VNICs (AWS ENA, Azure SR-IOV) |
| CXL.cache / CXL.mem | PCIe physical layer, but with a coherent memory protocol on top. The accelerator participates in the CPU's coherence domain. Removes the explicit DMA step for memory-class accelerators. | Memory pooling, accelerator-attached HBM, post-2023 servers |
| Doorbell batching | Driver writes the doorbell only every N submissions, not every one. Trades a few microseconds of latency for substantially less PCIe traffic. | Most modern NVMe + RDMA drivers |
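Doorbell batching from the last row of the table is simple enough to sketch. This is a toy Python model with hypothetical names: `mmio_writes` stands in for doorbell memory-write TLPs, and queue wrap-around is ignored. A real driver would also ring the doorbell when a batch stays incomplete for too long, modelled here as an explicit `ring()` call.

```python
class BatchedDoorbell:
    """Toy submission queue that rings the device doorbell once per batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.tail = 0           # driver's view: number of entries queued so far
        self.doorbell_tail = 0  # last tail value the device was told about
        self.mmio_writes = 0    # doorbell writes issued (one PCIe write each)

    def submit(self):
        """Queue one entry in DRAM; ring only when a full batch is pending."""
        self.tail += 1
        if self.tail - self.doorbell_tail >= self.batch_size:
            self.ring()

    def ring(self):
        """Publish the current tail to the device if anything is pending."""
        if self.tail != self.doorbell_tail:
            self.doorbell_tail = self.tail
            self.mmio_writes += 1

if __name__ == "__main__":
    sq = BatchedDoorbell(batch_size=8)
    for _ in range(20):
        sq.submit()
    sq.ring()  # flush the partial batch
    print(f"{sq.tail} submissions, {sq.mmio_writes} doorbell writes")
```

Entries that sit between doorbell writes are invisible to the device, which is where the added latency in the table comes from.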
Common misconceptions
- "PCIe lanes have a fixed assignment." Most CPUs have lane bifurcation: an x16 slot can run as one x16, two x8, or four x4 depending on BIOS configuration. This is how you get four NVMe drives into a single physical x16 slot via an adapter card.
- "The IOMMU only matters for VMs." Even on bare metal, the IOMMU enforces device isolation against malicious hot-plugged devices (Thunderbolt, FireWire), faulty firmware, and DMA attacks such as Thunderclap. Linux enables it by default on most distros now.
- "PCIe Gen6 will fix bandwidth forever." Each generation halves the signaling time per bit, requiring more aggressive equalization. Gen6 introduced PAM-4 signaling (4 levels per symbol, vs Gen5's 2) to avoid doubling the symbol rate again. Gen7 doubles the rate once more and tightens the electrical requirements further. Each doubling is harder to engineer than the last.
- "DMA is faster because the CPU doesn't have to do the copy." True for the obvious case, but DMA also has setup costs (descriptor preparation, IOTLB invalidation in some IOMMU configurations). For tiny transfers, programmed I/O can actually beat DMA.
Numbers worth remembering
| Quantity | Value |
|---|---|
| PCIe Gen3 per-lane bandwidth | ~0.985 GB/s |
| PCIe Gen4 per-lane bandwidth | ~1.97 GB/s |
| PCIe Gen5 per-lane bandwidth | ~3.94 GB/s |
| PCIe Gen6 per-lane bandwidth | ~7.88 GB/s |
| x16 slot, Gen5 (typical GPU) | 63 GB/s |
| x4 slot, Gen5 (typical NVMe) | ~15.8 GB/s |
| Consumer CPU lanes (typical) | 20–28 |
| Server CPU lanes (typical) | 80–128 per socket |
| MSI-X max vectors per device | 2048 |
| NIC interrupt coalescing target | < 50 K interrupts/s per core |
| NVMe queue pair count | typically 64–128 |
| IOMMU per-DMA overhead | ~0–100 ns |