Virtual memory

Every process believes it owns a flat, contiguous address space from zero to a few terabytes. It doesn't. Virtual memory is the kernel's lie, signed and notarised in hardware by the MMU, that lets dozens of processes share one finite stick of DRAM without any of them knowing. Here is how the lie is told: pages, page tables, the TLB, faults, copy-on-write, swap, huge pages, and NUMA.


The lie everyone believes.

A process running on a Linux box thinks its memory starts at address zero (well, near zero, page zero is unmapped on purpose) and runs up to 0x7fff_ffff_ffff or so. It thinks the stack grows down from somewhere near the top, the heap grows up from somewhere near the bottom, and the kernel lives above the userland ceiling. None of this is true of the actual silicon. Physical RAM is a flat array of bytes; addresses in it start at zero and end wherever the DIMMs end. Two processes both writing to virtual address 0x401000 are almost certainly writing to two different physical frames.

Virtual memory is the mechanism that holds this illusion together. Every load and store a userspace program issues goes through the MMU, the memory management unit inside the CPU, which translates the virtual address into a physical one using per-process page tables that the kernel maintains. The illusion is a consensual hallucination: the hardware is in on it, and the kernel is the dungeon master.

There are three jobs the illusion does at once, and it is worth separating them because people tend to remember only the first. The first is isolation. Process A cannot read or scribble on process B's memory because A's page tables simply have no entry that points at B's frames. A stray pointer in A faults inside A's own address space; it can never reach into the kernel or a neighbour. That is the property the whole multi-user, multi-process model rests on, and it is enforced by hardware on every single access, not by checks the kernel runs after the fact.

The second job is the illusion of a large, contiguous space. A program can ask for a gigabyte buffer and get back one flat run of addresses even when physical RAM is fragmented into a thousand scattered free frames, because the page table stitches non-contiguous frames into a contiguous virtual run. The third job is over-commit: the kernel can promise far more virtual memory than the machine physically has, on the bet that most of it is never touched at the same time. All three jobs fall out of the same trick, a per-process map from virtual page to physical frame, and the rest of this page is about how that map is built, cached, and faulted in. The same machinery from the silicon's point of view lives on the computer-architecture virtual-memory page; here we stay on the kernel's side of the line. For the layer above this, how a heap is carved out of these pages, see memory management and the memory-allocation walkthrough.

Pages and frames.

The unit of mapping is the page. On x86-64 the default page size is 4 KiB; physical RAM is divided into matching frames of 4 KiB. Each process gets a page table, really a tree, but think of it as a function from virtual page number to physical frame number plus permission bits (read, write, execute, user, present, dirty, accessed). Every memory access the CPU performs is mediated by this lookup. Miss the cache for the translation and you pay; miss the page itself and you pay much more.

Because the unit is 4 KiB, the low 12 bits of any virtual address are the offset within the page and are passed through untouched. Only the upper bits, the virtual page number, need translating. A 48-bit address space therefore has 2³⁶ virtual pages per process, which is why a single flat page table would be absurd: 2³⁶ entries at 8 bytes each is 512 GiB of table to map a few megabytes of code.
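
The split and the flat-table arithmetic can be checked in a few lines. This is a minimal sketch assuming 4 KiB pages, 48-bit addresses, and 8-byte entries; the helper names `split` and `flat_table_bytes` are made up for illustration.

```python
PAGE_SHIFT = 12                  # 4 KiB pages: 2**12 bytes
PAGE_SIZE = 1 << PAGE_SHIFT
VA_BITS = 48                     # x86-64 with 4-level paging
PTE_BYTES = 8                    # assumed size of one page-table entry


def split(vaddr: int) -> tuple[int, int]:
    """Return (virtual page number, offset within the page)."""
    return vaddr >> PAGE_SHIFT, vaddr & (PAGE_SIZE - 1)


def flat_table_bytes() -> int:
    """Size of a hypothetical single-level table covering the whole space."""
    return (1 << (VA_BITS - PAGE_SHIFT)) * PTE_BYTES


vpn, offset = split(0x401A2C)
print(hex(vpn), hex(offset))                 # 0x401 0xa2c
print(flat_table_bytes() // 2**30, "GiB")    # 512 GiB
```

Only `vpn` goes through translation; `offset` is appended unchanged to whatever frame address comes back.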

Multi-level page tables.

The fix is a tree. On x86-64 with 48-bit addressing the tree has four levels: PML4 → PDPT → PD → PT, and each level is itself a 4 KiB page holding 512 8-byte entries. The upper 36 bits of the virtual address are sliced into four 9-bit indices, one per level, plus the 12-bit offset. Only the branches that contain mapped pages need to exist; everything else is a single not-present entry in the parent table, and the subtree it would have pointed at costs no memory at all. Newer Intel and AMD parts ship with 5-level paging (PML5) to support 57-bit addresses, used in systems with terabytes of RAM.
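
The slicing is plain shifts and masks. The sketch below assumes the 4-level, 48-bit layout just described; `walk_indices` is a hypothetical helper that only computes the indices and does not walk any real table.

```python
LEVELS = ("PML4", "PDPT", "PD", "PT")   # top of the tree first
INDEX_BITS = 9                          # 512 entries per table
OFFSET_BITS = 12                        # 4 KiB pages


def walk_indices(vaddr: int) -> dict[str, int]:
    """Slice a 48-bit virtual address into per-level indices plus offset."""
    out = {"offset": vaddr & ((1 << OFFSET_BITS) - 1)}
    for depth, name in enumerate(LEVELS):
        # PML4 uses bits 47..39, PDPT 38..30, PD 29..21, PT 20..12.
        shift = OFFSET_BITS + INDEX_BITS * (len(LEVELS) - 1 - depth)
        out[name] = (vaddr >> shift) & ((1 << INDEX_BITS) - 1)
    return out


print(walk_indices(0x401000))
# {'offset': 0, 'PML4': 0, 'PDPT': 0, 'PD': 2, 'PT': 1}
```

A low address like 0x401000 uses entry 0 of the top two tables, which is why a small program needs only a handful of page-table pages.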

The cost of this elegance is that one userspace load can require four dependent memory loads just to translate the address, before the actual data load happens. On a TLB miss the MMU does this walk in hardware. The page-walk cache hides some of it, but if those page-table pages are themselves cold in the data caches, you can spend 100+ cycles on a translation that, when cached, takes one.

[Figure: a 48-bit virtual address split into PML4, PDPT, PD, and PT indices of 9 bits each, plus a 12-bit offset. CR3 points at the top table; each index selects the next table down, and the final table yields the physical frame, to which the offset is added.]
One translation, four dependent loads. The offset bypasses the tree and is glued straight onto the frame address.

Why the TLB matters so much. A 4-level walk that misses every cache is four memory loads, call it 400 cycles on a modern Intel part. Hit the TLB and it's one cycle. The TLB is the single most important cache for memory-heavy workloads, and you almost never see it directly.

The TLB — translation lookaside buffer.

The TLB is a small, highly associative cache inside the MMU that holds recent virtual → physical translations. A typical Intel core has an L1 dTLB of about 64 entries for 4 KiB pages, an L1 iTLB of similar size, and an L2 STLB of around 1,500–2,000 entries shared between data and instructions. With 4 KiB pages, 64 entries cover exactly 256 KiB of working set. Anything bigger spills into the L2 TLB, and anything bigger than the L2 TLB does a full page walk.

This is why working-set size matters more than total RSS. A program that touches 200 MiB scattered across random pages will have a high TLB miss rate; one that touches 200 MiB sequentially will have almost none, because each translation is reused for every access within its page, and the hardware prefetcher and the page-walk cache together hide most of the remaining cost. Profilers like perf expose this as dTLB-load-misses and iTLB-load-misses.

Page size          L1 dTLB coverage        Where it shines
4 KiB (default)    ~256 KiB (64 entries)   Small, sparse data; everything by default
2 MiB (huge)       ~64 MiB (32 entries)    JVM heaps, Postgres shared_buffers, Redis
1 GiB (gigantic)   ~4 GiB (4 entries)      DPDK, VM hypervisors, databases with TB heaps
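
The coverage column is just entries times page size. The sketch below reproduces it from the entry counts in the table; real entry counts vary by microarchitecture, so treat the numbers as the table's assumptions rather than a specification.

```python
KiB, MiB, GiB = 2**10, 2**20, 2**30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space a TLB level can map without a miss."""
    return entries * page_size


# (label, L1 dTLB entries, page size) taken from the table above.
ROWS = [("4 KiB", 64, 4 * KiB), ("2 MiB", 32, 2 * MiB), ("1 GiB", 4, 1 * GiB)]

for label, entries, page_size in ROWS:
    print(f"{label:>5} pages: {tlb_reach(entries, page_size) // KiB} KiB reach")
```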

Page faults — minor and major.

When the MMU walks the page table and finds an entry whose present bit is zero, the CPU raises a page fault exception and traps into the kernel. The fault handler then decides what to do. There are two flavours, and the difference between them is two to three orders of magnitude: microseconds against hundreds of microseconds or milliseconds.

A minor fault happens when the fault can be resolved without I/O: the data is already in physical RAM but isn't mapped into this process, or the page only needs a fresh zeroed or copied frame. For example, a freshly forked child's COW page on first write, or an mmap'd file page that's in the page cache from a previous read. The handler finds or allocates a frame, updates the page table, and returns. Cost: a few microseconds.

A major fault happens when the page is not in RAM at all and must be read from disk: a swapped-out anonymous page, or a file page evicted from the page cache. The cost is the cost of an I/O: a few hundred microseconds on NVMe, several milliseconds on spinning rust. From the application's perspective a single load instruction just took ten million cycles. Latency-sensitive services treat major faults as incidents.
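
A process can watch its own fault counters through `getrusage`. This is a minimal sketch for a Unix-like system; `touch_fresh_pages` is a hypothetical helper, and the exact counts it reports depend on the kernel, the allocator, and whether huge pages back the mapping.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE


def fault_counts() -> tuple[int, int]:
    """(minor, major) page faults taken by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_fresh_pages(n_pages: int) -> tuple[int, int]:
    """Map n_pages of anonymous memory, write one byte per page, and
    return the (minor, major) faults taken while doing so."""
    before = fault_counts()
    buf = mmap.mmap(-1, n_pages * PAGE)   # anonymous mapping, nothing resident yet
    for i in range(n_pages):
        buf[i * PAGE] = 1                 # first touch of each page
    after = fault_counts()
    buf.close()
    return after[0] - before[0], after[1] - before[1]


minor, major = touch_fresh_pages(4096)
print(f"minor={minor} major={major}")
```

On a machine with free RAM the major count should normally stay at zero here, since no disk read is needed to zero-fill a fresh page.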

The third outcome of a fault is the one no one wants. If the faulting address is not covered by any mapping in the process's VMA list, or the access violates the mapping's permissions (a write to a read-only page that is not a copy-on-write page, an execute on a no-execute page), the handler cannot fix anything. It delivers SIGSEGV to the process, which usually dies with a segmentation fault. So one hardware mechanism, the same page-fault trap, drives three very different paths: a cheap fix-up in RAM, an expensive fetch from disk, and a fatal access violation. The fault handler is a small decision tree the kernel runs on every miss, and reading it as a tree makes the rest of demand paging fall into place.

[Figure: the page-fault decision tree. Page fault trap → address in a VMA? (no: SIGSEGV) → permissions ok? (no: SIGSEGV) → page in RAM? (yes: minor fault; no: major fault with disk I/O).]
Every miss runs the same decision tree. Two paths fix the mapping; the leftmost path kills the process.
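
The tree is small enough to write down as a toy model. Everything here is hypothetical: `VMA`, `classify_fault`, and the `needs_disk` callback are illustration names, and the real handler has many more cases (stack growth, huge pages, userfaultfd, and so on).

```python
from dataclasses import dataclass


@dataclass
class VMA:
    start: int      # first address covered
    end: int        # one past the last address covered
    perms: str      # subset of "rwx" allowed by the mapping


def classify_fault(addr, access, vmas, needs_disk):
    """Return 'SIGSEGV', 'minor' or 'major' for a faulting access.

    access is 'r', 'w' or 'x'. needs_disk(addr) says whether resolving
    the fault requires I/O. Permissions are the mapping's, not the PTE's,
    so a copy-on-write write to a writable mapping passes this check.
    """
    vma = next((v for v in vmas if v.start <= addr < v.end), None)
    if vma is None:
        return "SIGSEGV"                 # no mapping covers the address
    if access not in vma.perms:
        return "SIGSEGV"                 # mapping forbids this access
    return "major" if needs_disk(addr) else "minor"


vmas = [VMA(0x400000, 0x401000, "rx"), VMA(0x600000, 0x700000, "rw")]
print(classify_fault(0x0, "r", vmas, lambda a: False))        # SIGSEGV
print(classify_fault(0x400010, "w", vmas, lambda a: False))   # SIGSEGV
print(classify_fault(0x600000, "w", vmas, lambda a: False))   # minor
print(classify_fault(0x600000, "r", vmas, lambda a: True))    # major
```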

Demand paging and copy-on-write.

The kernel is lazy on purpose. When you fork() a process, it doesn't copy the address space. It duplicates the page tables and marks every writable page read-only in both. The first write from either side traps, the kernel allocates a fresh frame, copies the data, marks the writer's mapping writable, and returns. Pages neither side writes are shared until one of them exits. This is why fork(); exec(); is cheap in practice even though it looks like it should copy gigabytes.

mmap() is the same trick generalised. A mapping is created in the VMA list but no physical frames are allocated. Each page is loaded on first access, either from the backing file (for file-backed mappings) or zero-filled (for anonymous mappings). This is how malloc() can hand back a gigabyte instantly: for a request that large, the allocator typically falls back to mmap(MAP_PRIVATE | MAP_ANONYMOUS), which costs almost nothing until you touch the pages.

[Figure: copy-on-write. After fork, the parent PTE and the child PTE both map frame X read-only. When the child writes, the write traps, the kernel copies the page to frame X', and each side ends up with its own writable mapping.]
Copy-on-write: sharing stays free until someone writes. Most forked pages are read and then thrown away by exec, so most are never copied.
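
The bookkeeping behind copy-on-write fits in a toy model. This is a sketch, not kernel code: `Frame`, `PTE`, `fork`, and `write` are hypothetical names, every page is assumed to belong to a writable mapping, and a reference count stands in for the kernel's page accounting.

```python
class Frame:
    """A physical frame with a count of page-table entries pointing at it."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1


class PTE:
    def __init__(self, frame: Frame, writable: bool):
        self.frame = frame
        self.writable = writable


def fork(parent: dict[int, PTE]) -> dict[int, PTE]:
    """Duplicate the table, share every frame, mark both sides read-only."""
    child = {}
    for vpn, pte in parent.items():
        pte.writable = False
        pte.frame.refs += 1
        child[vpn] = PTE(pte.frame, writable=False)
    return child


def write(table: dict[int, PTE], vpn: int, offset: int, value: int) -> None:
    pte = table[vpn]
    if not pte.writable:                      # the write "traps"
        if pte.frame.refs > 1:                # still shared: copy first
            pte.frame.refs -= 1
            pte.frame = Frame(bytes(pte.frame.data))
        pte.writable = True                   # sole owner: just re-enable writes
    pte.frame.data[offset] = value


parent = {0: PTE(Frame(b"hello"), writable=True)}
child = fork(parent)
print(parent[0].frame is child[0].frame)      # True
write(child, 0, 0, ord("j"))
print(bytes(parent[0].frame.data), bytes(child[0].frame.data))  # b'hello' b'jello'
```

Note the asymmetry in the model: the first writer pays for the copy, while the last remaining owner only has its write permission restored.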

File-backed mmap is more than a lazy allocator; it is a different way to do file I/O. Map a file and its bytes appear as memory, and the page cache becomes the buffer: reads turn into minor faults that the page cache already holds, writes mark pages dirty and the write-back machinery flushes them later. There is no read() or write() syscall per access and no copy between a kernel buffer and a userspace buffer, which is why databases and language runtimes lean on it. The cost is that you give up explicit control over when I/O happens: a load can block on a major fault you did not schedule, and an error reading the file shows up as a SIGBUS mid-instruction rather than as a return value you can check. madvise() is how a program hints the kernel about its access pattern, MADV_SEQUENTIAL to read ahead more aggressively, MADV_RANDOM to stop, MADV_DONTNEED to drop pages it is done with. Shared file-backed mappings are also how processes share memory without a pipe: two processes mapping the same file with MAP_SHARED see each other's writes through the one set of frames the page cache holds.
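
File I/O through a mapping looks like this from userspace. The sketch uses Python's `mmap` module on a temporary file; `edit_through_mapping` is a hypothetical helper, and the `madvise` call is guarded because the method and the `MADV_*` constants exist only on some platforms and Python versions.

```python
import mmap
import os
import tempfile


def edit_through_mapping(path: str) -> bytes:
    """Overwrite the first five bytes of a file via a shared mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:          # whole file, shared on Unix
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)      # access-pattern hint only
            m[0:5] = b"HELLO"                        # a store, not a write() call
            m.flush()                                # ask for write-back now
    with open(path, "rb") as f:
        return f.read()


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello, page cache\n")
print(edit_through_mapping(tmp.name))                # b'HELLO, page cache\n'
os.unlink(tmp.name)
```

The ordinary `read()` at the end sees the new bytes because both paths go through the same page-cache pages.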

Demand paging is the thread running through all of this. The kernel does the least work it can get away with at allocation time and pushes the real work to first touch, where it can be sure the work is actually needed. That is why resident set size (RSS, the pages actually in RAM) is almost always smaller than virtual memory size (VSZ, the address space mapped), and why a process can map a 100 GB file on a 16 GB machine and run fine as long as its working set, the pages it touches in any short window, fits. Over-commit is the kernel betting that the gap between what is mapped and what is touched stays wide. When the bet goes wrong, the system has to start taking pages back, which is the next section.
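
The gap between mapped and resident is visible per process in /proc/<pid>/status as VmSize and VmRSS. A minimal parsing sketch follows; `parse_status` is a hypothetical helper and the fallback sample values are invented for illustration, not measurements.

```python
def parse_status(text: str) -> dict[str, int]:
    """Extract VmSize and VmRSS (both in kB) from /proc/<pid>/status text."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            out[key] = int(rest.split()[0])
    return out


# Illustrative sample, used only when /proc is not available.
SAMPLE = "Name:\tdemo\nVmSize:\t 1048576 kB\nVmRSS:\t   20480 kB\n"

try:
    with open("/proc/self/status") as f:
        mem = parse_status(f.read())
except OSError:
    mem = parse_status(SAMPLE)

print(f"mapped {mem['VmSize']} kB, resident {mem['VmRSS']} kB")
```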

Swap and memory pressure.

When physical RAM fills up, the kernel's page reclaimer (kswapd, plus direct reclaim) starts evicting pages. Clean file-backed pages can be dropped, since they can be re-read from disk if needed. Dirty pages are written back. Anonymous pages (heap, stack) have nowhere to go unless there's a swap device; with swap, they are written out, optionally compressed first. The page table entry is flipped to "not present" and the next access triggers a major fault.

Swap is fine for absorbing transient spikes on a desktop. On a latency-sensitive server it's poison. A handful of pages going to disk under load can cascade. The application gets slower, more pages go cold, more pages get swapped, and the system enters the thrashing region where it's spending most of its cycles waiting for I/O. The standard modern advice is to run with swap off and rely on cgroup memory limits plus the OOM killer to bound damage: a fast death beats a slow one.

Thrashing is worth naming precisely because it is a feedback loop, not just slowness. The reclaimer evicts a page to make room, the program touches that page a moment later, takes a major fault, waits on disk, and meanwhile the reclaimer has had to evict another page to service the fault. The working set no longer fits, so every page brought in pushes out a page that is about to be needed. Useful work collapses toward zero while the disk stays pinned at 100% and the CPU sits mostly idle waiting on it. The system is busy doing nothing but moving pages between RAM and disk. Linux added pressure stall information (/proc/pressure/memory) specifically to make this visible: it reports the fraction of time tasks are stalled waiting on memory, which catches the onset of thrashing far earlier than free-memory counters do, because by the time free memory is gone you are already deep in the loop.
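
Reading the pressure file takes only a few lines. The sketch assumes the `some`/`full` line layout with `avg10`, `avg60`, `avg300`, and `total` fields; `parse_psi` and `is_stalling` are hypothetical helpers, the sample numbers are invented, and the 10% threshold is an arbitrary placeholder, not a recommended value.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse /proc/pressure/memory into {'some': {...}, 'full': {...}}."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out


def is_stalling(psi: dict, threshold_pct: float = 10.0) -> bool:
    """True if all non-idle tasks were stalled on memory for more than
    threshold_pct of the last 10 seconds (threshold is an assumption)."""
    return psi["full"]["avg10"] > threshold_pct


# Illustrative sample, not a real measurement.
SAMPLE = (
    "some avg10=23.51 avg60=8.02 avg300=1.90 total=48213377\n"
    "full avg10=14.20 avg60=4.75 avg300=1.01 total=29177312\n"
)

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"], is_stalling(psi))   # 23.51 True
```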

The newer middle path is zswap and zram, which compress pages and keep them in RAM instead of writing them to a disk-backed swap device. A compressed page is far cheaper to fault back in than a disk read, so a modest amount of it can absorb spikes without dropping a service into disk-thrash. It does not change the arithmetic when the real working set exceeds RAM, only the cost of being a little over. The honest framing is that swap trades a hard failure (out of memory now) for a soft one (everything slow), and on a server that soft failure is usually worse than the hard one, because a slow service still passes health checks while failing every user.

Disable swap on latency-sensitive servers. Major faults are catastrophic for tail latency. Better to have the OOM killer terminate one container than to have every request paying 10 ms swap-in penalties. Until v1.22 the Kubernetes kubelet refused by default to start on a node with swap enabled, and even now most production clusters keep swap off.

Huge pages — coverage vs scan jitter.

A 2 MiB huge page is a single page-table entry that covers 512 × 4 KiB. One TLB entry now covers 2 MiB of working set instead of 4 KiB. For workloads with multi-gigabyte heaps such as JVMs, Postgres shared_buffers, Redis, and MongoDB's WiredTiger cache, this can cut TLB-miss-driven stalls by an order of magnitude. The price is granularity: a 2 MiB page is allocated as a contiguous physical region, which is hard to find on a fragmented system.

Linux offers two flavours. Explicit huge pages are reserved at boot via hugepages=N or at runtime via /proc/sys/vm/nr_hugepages, and are claimed by applications through mmap(MAP_HUGETLB) or hugetlbfs. They are locked in RAM and never paged. Transparent Huge Pages (THP) are opportunistic: the kernel's khugepaged daemon scans process memory and collapses aligned runs of 4 KiB pages into 2 MiB ones in the background. THP is convenient but the scan and the synchronous compaction on allocation can produce noticeable latency spikes. Postgres, Redis, and MongoDB documentation all recommend writing never to /sys/kernel/mm/transparent_hugepage/enabled. JVM workloads often want it on; DPDK takes the explicit route through hugetlbfs instead. There is no single right answer. Measure.
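
The THP control file lists every mode and brackets the active one, for example `always [madvise] never`. A minimal sketch for reading it is below; `thp_mode` is a hypothetical helper and the fallback string stands in for the file on systems that lack it.

```python
import re

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the THP 'enabled' file."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


try:
    with open(THP_PATH) as f:
        current = thp_mode(f.read())
except OSError:
    current = thp_mode("always [madvise] never")   # illustrative fallback

print("THP mode:", current)
```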

NUMA — not all memory is local.

On a multi-socket server, each socket has its own memory controller and its own attached DRAM. A core on socket 0 accessing a page that lives on socket 1's DIMMs has to traverse the inter-socket interconnect (UPI on Intel, Infinity Fabric on AMD), which adds roughly 50–100 ns to the access, call it 1.5× to 2× slower than local. On a four-socket box the worst-case hop is two interconnects and the penalty grows. This is the world of NUMA: Non-Uniform Memory Access.

Linux exposes this through numactl for placement, mbind() and set_mempolicy() for in-program control, and /proc/<pid>/numa_maps for inspection. The default policy is first-touch: a page is allocated on the node where the thread that first writes to it is running. Get this wrong, allocate a huge buffer on the main thread and then pin worker threads to a different socket, and every access pays the remote tax. The kernel's numa_balancing feature tries to migrate pages toward the threads that use them, but it costs CPU and can hurt workloads with steady, balanced access. Production NUMA tuning usually means pinning threads with numactl --cpunodebind and allocating on the same node with --membind.
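
Placement can be checked after the fact by summing the per-node page counts in numa_maps. The sketch assumes each mapping line carries `N<node>=<pages>` tokens; `pages_per_node` is a hypothetical helper, the sample lines are invented, and the counts are in units of each mapping's own page size, which this sketch does not normalise.

```python
import re
from collections import Counter

NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(text: str) -> dict[int, int]:
    """Sum the N<node>=<pages> tokens across all lines of numa_maps text."""
    totals = Counter()
    for line in text.splitlines():
        for token in line.split():
            match = NODE_FIELD.match(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


# Illustrative sample in the style of /proc/<pid>/numa_maps.
SAMPLE = (
    "00400000 default file=/usr/bin/demo mapped=12 N0=12 kernelpagesize_kB=4\n"
    "7f2c4c000000 default anon=4096 dirty=4096 N0=1024 N1=3072 kernelpagesize_kB=4\n"
)

print(pages_per_node(SAMPLE))   # {0: 1036, 1: 3072}
```

A buffer that is mostly on the node opposite its worker threads, as in the second sample line, is the first-touch mistake described above.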

Looking at pressure.

The kernel exposes its accounting in /proc/meminfo, the per-NUMA-node variants in /sys/devices/system/node/node*/meminfo, and per-process detail in /proc/<pid>/status and /proc/<pid>/smaps. vmstat 1 shows page-in / page-out / swap-in / swap-out per second. sar -B keeps the history. perf mem record samples loads and stores and attributes them to cache levels and TLBs.

$ cat /proc/meminfo
MemTotal:       65794132 kB
MemFree:         2103040 kB
MemAvailable:   42118936 kB
Buffers:          392104 kB
Cached:         38901724 kB
SwapCached:            0 kB
Active:         29412800 kB
Inactive:       19738468 kB
Active(anon):    9712688 kB
Inactive(anon):   145128 kB
Dirty:             19704 kB
Writeback:             0 kB
AnonPages:       9857664 kB
Mapped:          1822796 kB
Shmem:               288 kB
KReclaimable:    1124416 kB
Slab:            1923080 kB
PageTables:        96424 kB
SwapTotal:             0 kB
SwapFree:              0 kB
HugePages_Total:    1024
HugePages_Free:      512
Hugepagesize:       2048 kB
DirectMap4k:      718080 kB
DirectMap2M:    33161216 kB
DirectMap1G:    33554432 kB
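
Two numbers worth deriving from that listing are the share of memory still available and how much of the huge page pool is in use. A minimal sketch, using a few lines copied from the output above; `parse_meminfo` is a hypothetical helper.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo key to its integer value (kB or a count)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])
    return info


# Lines taken from the listing above.
SAMPLE = """MemTotal:       65794132 kB
MemAvailable:   42118936 kB
HugePages_Total:    1024
HugePages_Free:      512
Hugepagesize:       2048 kB
"""

info = parse_meminfo(SAMPLE)
available = info["MemAvailable"] / info["MemTotal"]
huge_used_kb = (info["HugePages_Total"] - info["HugePages_Free"]) * info["Hugepagesize"]

print(f"{available:.1%} available")               # 64.0% available
print(huge_used_kb // 1024, "MiB of huge pages in use")   # 1024 MiB of huge pages in use
```

MemAvailable is the figure to watch rather than MemFree, since most of the difference here is page cache the kernel can drop on demand.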

For TLB and cache misses specifically, perf goes deeper than the proc interface. The classic incident playbook: tail latency creeps up, and either vmstat shows si/so non-zero (swap thrash), or perf reveals an unexpectedly high dTLB-load-misses rate (THP off when it should be on, or the process touching more memory than the TLB can map), or perf mem shows a large share of remote-RAM accesses (NUMA imbalance).

$ perf mem record -a -- sleep 10
$ perf mem report --sort=mem,sym

Overhead  Memory access            Symbol
  41.2%   L1 hit                   [.] hash_lookup
  18.6%   LFB hit                  [.] memcpy_avx_unaligned
  12.1%   L3 hit                   [.] btree_walk
   9.4%   Local RAM hit            [.] page_remap
   7.8%   Remote RAM (2 hops)      [.] worker_loop      <-- NUMA tax
   4.0%   L2 hit                   [.] strcmp
   3.1%   Remote Cache (1 hop)     [.] cache_lookup
   2.5%   LFB hit                  [.] memset_avx

BPF tools earn their keep here. Tools such as tlbstat, bcc's cachestat and oomkill, and bpftrace one-liners against kmem:mm_page_alloc and vmscan:* tracepoints expose behaviour the proc files can't. Brendan Gregg's memory flame graphs are the canonical reference.
