Virtual memory
Every process believes it owns a flat, contiguous address space from zero to a few terabytes. It doesn't. Virtual memory is the kernel's lie, signed and notarised in hardware by the MMU, that lets dozens of processes share one finite stick of DRAM without any of them knowing. Here is how the lie is told: pages, page tables, the TLB, faults, copy-on-write, swap, huge pages, and NUMA.
The lie everyone believes.
A process running on a Linux box thinks its memory starts at address zero (well, near
zero; page zero is unmapped on purpose) and runs up to 0x7fff_ffff_ffff or
so. It thinks the stack grows down from somewhere near the top, the heap grows up from
somewhere near the bottom, and the kernel lives above the userland ceiling. None of this
is true of the actual silicon. Physical RAM is a flat array of bytes; addresses in it
start at zero and end wherever the DIMMs end. Two processes both writing to virtual
address 0x401000 are almost certainly writing to two different physical
frames.
Virtual memory is the mechanism that holds this illusion together. Every load and store a userspace program issues goes through the MMU, the memory management unit inside the CPU, which translates the virtual address into a physical one using per-process page tables that the kernel maintains. The illusion is a consensual hallucination: the hardware is in on it, and the kernel is the dungeon master.
There are three jobs the illusion does at once, and it is worth separating them because people tend to remember only the first. The first is isolation. Process A cannot read or scribble on process B's memory because A's page tables simply have no entry that points at B's frames. A stray pointer in A faults inside A's own address space; it can never reach into the kernel or a neighbour. That is the property the whole multi-user, multi-process model rests on, and it is enforced by hardware on every single access, not by checks the kernel runs after the fact.
The second job is the illusion of a large, contiguous space. A program can ask for a gigabyte buffer and get back one flat run of addresses even when physical RAM is fragmented into a thousand scattered free frames, because the page table stitches non-contiguous frames into a contiguous virtual run. The third job is over-commit: the kernel can promise far more virtual memory than the machine physically has, on the bet that most of it is never touched at the same time. All three jobs fall out of the same trick, a per-process map from virtual page to physical frame, and the rest of this page is about how that map is built, cached, and faulted in. The same machinery from the silicon's point of view lives on the computer-architecture virtual-memory page; here we stay on the kernel's side of the line. For the layer above this, how a heap is carved out of these pages, see memory management and the memory-allocation walkthrough.
Pages and frames.
The unit of mapping is the page. On x86-64 the default page size is 4 KiB; physical RAM is divided into matching frames of 4 KiB. Each process gets a page table, really a tree, but think of it as a function from virtual page number to physical frame number plus permission bits (read, write, execute, user, present, dirty, accessed). Every memory access the CPU performs is mediated by this lookup. Miss the cache for the translation and you pay; miss the page itself and you pay much more.
Because the unit is 4 KiB, the low 12 bits of any virtual address are the offset within the page and are passed through untouched. Only the upper bits, the virtual page number, need translating. A 48-bit address space therefore has 2³⁶ virtual pages per process, which is why a single flat page table would be absurd: 2³⁶ entries at 8 bytes each is 512 GiB of table to map a few megabytes of code.
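The split is plain bit arithmetic. A minimal sketch, assuming 4 KiB pages, 48-bit addresses, and 8-byte table entries (the x86-64 entry size); the function names are illustrative, not any kernel API:

```python
PAGE_SHIFT = 12   # 4 KiB pages: the low 12 bits are the in-page offset
VA_BITS = 48      # width of a virtual address
PTE_BYTES = 8     # assumed entry size (x86-64 uses 8-byte entries)


def split_address(va: int) -> tuple[int, int]:
    """Return (virtual page number, offset within the page)."""
    return va >> PAGE_SHIFT, va & ((1 << PAGE_SHIFT) - 1)


def flat_table_bytes() -> int:
    """Size of a hypothetical single-level table covering the whole space."""
    return (1 << (VA_BITS - PAGE_SHIFT)) * PTE_BYTES


vpn, offset = split_address(0x401234)
print(hex(vpn), hex(offset))                 # 0x401 0x234
print(flat_table_bytes() // 2**30, "GiB")    # 512 GiB
```

Only the page number goes through translation; the offset is appended unchanged to whatever frame number comes back.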
Multi-level page tables.
The fix is a tree. On x86-64 with 48-bit addressing the tree has four levels: PML4 → PDPT → PD → PT, and each level is itself a 4 KiB page holding 512 8-byte entries. The upper 36 bits of the virtual address are sliced into four 9-bit indices, one per level, plus the 12-bit offset. Only the branches that contain mapped pages need to exist; everything else is a null entry, taking zero memory. Newer Intel and AMD parts ship with 5-level paging (PML5) to support 57-bit addresses, used in systems with terabytes of RAM.
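To make the slicing concrete, here is a small sketch of how a 48-bit address breaks into the four 9-bit indices and the 12-bit offset. The dictionary keys are just labels for the four levels named above; the function is illustrative and does not walk real page tables:

```python
def walk_indices(va: int) -> dict[str, int]:
    """Slice a 48-bit virtual address into four 9-bit indices plus the offset."""
    return {
        "pml4": (va >> 39) & 0x1FF,   # bits 47..39
        "pdpt": (va >> 30) & 0x1FF,   # bits 38..30
        "pd": (va >> 21) & 0x1FF,     # bits 29..21
        "pt": (va >> 12) & 0x1FF,     # bits 20..12
        "offset": va & 0xFFF,         # bits 11..0, passed through untranslated
    }


print(walk_indices(0x401000))
# {'pml4': 0, 'pdpt': 0, 'pd': 2, 'pt': 1, 'offset': 0}
```

Each index selects one of the 512 entries in the table at that level, which is why every level fits in exactly one 4 KiB page.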
The cost of this elegance is that one userspace load can require four separate memory dereferences just to translate the address, before the actual data load happens. On a TLB miss the MMU does this walk in hardware. The page-walk cache hides some of it, but if the page-table entries themselves miss in the data caches, you can spend 100+ cycles on a translation that takes about one cycle when it hits the TLB.
The TLB — translation lookaside buffer.
The TLB is a small, highly associative cache inside the MMU that holds recent virtual → physical translations. A typical Intel core has an L1 dTLB of about 64 entries for 4 KiB pages, an L1 iTLB of similar size, and an L2 STLB of around 1,500–2,000 entries shared between data and instructions. With 4 KiB pages, 64 entries cover exactly 256 KiB of working set. Anything bigger spills into the L2 TLB, and anything beyond the L2 TLB's reach falls back to a full page walk.
This is why working-set size matters more than total RSS. A program that touches 200 MiB
scattered across random pages will have a high TLB miss rate; one that touches 200 MiB
sequentially will have almost none, because the hardware prefetcher and the page-walk
cache together hide most of the cost. Profilers like perf expose this as
dTLB-load-misses and iTLB-load-misses.
| Page size | L1 dTLB coverage | Where it shines |
|---|---|---|
| 4 KiB (default) | ~256 KiB (64 entries) | Small, sparse data; everything by default |
| 2 MiB (huge) | ~64 MiB (32 entries) | JVM heaps, Postgres shared_buffers, Redis |
| 1 GiB (gigantic) | ~4 GiB (4 entries) | DPDK, VM hypervisors, databases with TB heaps |
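The coverage column is just entries multiplied by page size. A short sketch that reproduces it, using the entry counts quoted in the table (real parts vary by microarchitecture, so treat the counts as assumptions):

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes covered when every TLB entry maps a distinct page."""
    return entries * page_size


# (label, assumed L1 dTLB entries, page size in bytes)
for label, entries, size in [
    ("4 KiB", 64, 4 * KIB),
    ("2 MiB", 32, 2 * MIB),
    ("1 GiB", 4, GIB),
]:
    print(f"{label}: {tlb_reach(entries, size) // KIB} KiB of reach")
```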
Page faults — minor and major.
When the MMU walks the page table and finds an entry whose present bit is
zero, the CPU raises a page fault exception and traps into the kernel.
The fault handler then decides what to do. There are two flavours, and the difference
between them is roughly two to three orders of magnitude.
A minor fault happens when the page is already in physical RAM but isn't mapped into this process: for example, a freshly forked child's COW page on first write, or an mmap'd file page that's in the page cache from a previous read. The handler grabs a frame, updates the page table, returns. Cost: a few microseconds.
A major fault happens when the page is not in RAM at all and must be read from disk: a swapped-out anonymous page, or a file page evicted from the page cache. The cost is the cost of an I/O: a few hundred microseconds on NVMe, several milliseconds on spinning rust. From the application's perspective a single load instruction just took ten million cycles. Latency-sensitive services treat major faults as incidents.
The third outcome of a fault is the one no one wants. If the faulting address is not
covered by any mapping in the process's VMA list, or the access violates the
mapping's permissions (a write to a read-only page that is not a copy-on-write page, an
execute on a no-execute page), the handler cannot fix anything. It delivers
SIGSEGV to the process, which usually dies with a segmentation fault. So one
hardware mechanism, the same page-fault trap, drives three very different paths: a cheap
fix-up in RAM, an expensive fetch from disk, and a fatal access violation. The fault
handler is a small decision tree the kernel runs on every miss, and reading it as a tree
makes the rest of demand paging fall into place.
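The decision tree can be sketched as a toy model. Everything here is hypothetical and heavily simplified: `Vma` stands in for the kernel's mapping record, and `in_ram` stands in for "the page's contents are already resident". It is not how the kernel's handler is written, only the shape of its choices:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vma:
    """Hypothetical mapping record covering [start, end) with permissions."""
    start: int
    end: int
    writable: bool
    executable: bool = False


def find_vma(vmas: list[Vma], addr: int) -> Optional[Vma]:
    return next((v for v in vmas if v.start <= addr < v.end), None)


def classify_fault(vmas: list[Vma], addr: int, access: str, in_ram: bool) -> str:
    """Return 'minor', 'major', or 'SIGSEGV'. access is 'read', 'write' or 'exec'."""
    vma = find_vma(vmas, addr)
    if vma is None:
        return "SIGSEGV"                  # no mapping covers the address
    if access == "write" and not vma.writable:
        return "SIGSEGV"                  # permission violation
    if access == "exec" and not vma.executable:
        return "SIGSEGV"
    return "minor" if in_ram else "major" # fix up in RAM, or go to disk


vmas = [Vma(0x400000, 0x401000, writable=False, executable=True),
        Vma(0x600000, 0x700000, writable=True)]
print(classify_fault(vmas, 0x600000, "write", in_ram=True))    # minor
print(classify_fault(vmas, 0x600000, "read", in_ram=False))    # major
print(classify_fault(vmas, 0x0, "read", in_ram=False))         # SIGSEGV
```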
Demand paging and copy-on-write.
The kernel is lazy on purpose. When you fork() a process, it doesn't copy
the address space. It duplicates the page tables and marks every writable page
read-only in both. The first write from either side traps, the kernel allocates a fresh
frame, copies the data, marks the writer's mapping writable, and returns. Pages neither
side writes are shared until one of them exits. This is why fork(); exec();
is cheap in practice even though it looks like it should copy gigabytes.
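The bookkeeping behind copy-on-write can be modelled in a few lines. This is a toy simulation, not the kernel's data structures: `Frame` and `AddressSpace` are invented names, a reference count stands in for the read-only marking, and the "fault" is an ordinary `if`:

```python
class Frame:
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1                       # how many address spaces map this frame


class AddressSpace:
    def __init__(self):
        self.pages: dict[int, Frame] = {}   # virtual page number -> frame

    def map_page(self, vpn: int, data: bytes = b"") -> None:
        self.pages[vpn] = Frame(data)

    def fork(self) -> "AddressSpace":
        child = AddressSpace()
        for vpn, frame in self.pages.items():
            frame.refs += 1                 # share the frame, copy nothing
            child.pages[vpn] = frame
        return child

    def write(self, vpn: int, data: bytes) -> None:
        frame = self.pages[vpn]
        if frame.refs > 1:                  # shared frame: the "write fault" path
            frame.refs -= 1
            frame = Frame(bytes(frame.data))  # private copy for the writer only
            self.pages[vpn] = frame
        frame.data[:len(data)] = data

    def read(self, vpn: int) -> bytes:
        return bytes(self.pages[vpn].data)


parent = AddressSpace()
parent.map_page(0, b"hello")
child = parent.fork()
child.write(0, b"J")
print(parent.read(0), child.read(0))        # b'hello' b'Jello'
```

Until the write, both address spaces point at the same `Frame` object, so the fork itself copies only the map.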
mmap() is the same trick generalised. A mapping is created in the VMA list
but no physical frames are allocated. Each page is loaded on first access, either from
the backing file (for file-backed mappings) or zero-filled (for anonymous mappings).
This is how malloc() can hand back a gigabyte instantly. A request that large is typically served by an
mmap(MAP_ANONYMOUS) that costs nothing until you touch it.
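One way to watch demand paging happen is to count minor faults around an anonymous mapping. This sketch assumes a Unix system where `getrusage` reports `ru_minflt`; the 16 MiB size is arbitrary, and the exact counts depend on the kernel, on THP settings, and on whatever else the interpreter allocates in the meantime:

```python
import mmap
import resource


def minor_faults() -> int:
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def measure(size: int = 16 * 1024 * 1024) -> tuple[int, int]:
    """Return (faults while creating the mapping, faults while touching each page)."""
    before = minor_faults()
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = minor_faults()
    for off in range(0, size, mmap.PAGESIZE):
        buf[off] = 1                  # first write to a page is what faults it in
    touched = minor_faults()
    buf.close()
    return mapped - before, touched - mapped


map_faults, touch_faults = measure()
print(f"creating the mapping: {map_faults} minor faults")
print(f"touching every page:  {touch_faults} minor faults")
```

On a typical Linux machine the first number is near zero and the second is on the order of the number of pages touched, though some sandboxes do not report fault counts at all.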
File-backed mmap is more than a lazy allocator; it is a different way to do
file I/O. Map a file and its bytes appear as memory, and the page cache becomes the
buffer: reads of pages the page cache already holds turn into minor faults, writes mark pages
dirty and the write-back machinery flushes them later. There is no read() or
write() syscall per access and no copy between a kernel buffer and a userspace
buffer, which is why databases and language runtimes lean on it. The cost is that you give
up explicit control over when I/O happens: a load can block on a major fault you did not
schedule, and an error reading the file shows up as a SIGBUS mid-instruction
rather than as a return value you can check. madvise() is how a program hints
the kernel about its access pattern: MADV_SEQUENTIAL to read ahead more
aggressively, MADV_RANDOM to turn read-ahead off, MADV_DONTNEED to drop pages
it is done with. Shared file-backed mappings are also how processes share memory without a
pipe: two processes mapping the same file with MAP_SHARED see each other's
writes through the one set of frames the page cache holds.
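A minimal sketch of the shared-mapping behaviour, using two mappings of one temporary file inside a single process as a stand-in for two processes. It assumes a Unix system; the `madvise` call needs Python 3.8 or later and is only a hint, so it is guarded:

```python
import mmap
import tempfile


def shared_mapping_demo() -> bytes:
    """Store through one MAP_SHARED mapping, load the bytes through another."""
    with tempfile.TemporaryFile() as f:
        f.truncate(mmap.PAGESIZE)                  # file must cover the mapped range
        writer = mmap.mmap(f.fileno(), mmap.PAGESIZE, flags=mmap.MAP_SHARED)
        reader = mmap.mmap(f.fileno(), mmap.PAGESIZE, flags=mmap.MAP_SHARED)
        if hasattr(mmap, "MADV_SEQUENTIAL"):
            reader.madvise(mmap.MADV_SEQUENTIAL)   # access-pattern hint; may be ignored
        writer[0:5] = b"hello"                     # a plain store, no write() syscall
        seen = bytes(reader[0:5])                  # same page-cache frame, so visible
        writer.close()
        reader.close()
        return seen


print(shared_mapping_demo())                       # b'hello'
```

No `flush()` is needed for the second mapping to see the bytes, because both views are backed by the same page-cache frame; flushing only matters for when the data reaches the disk.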
Demand paging is the thread running through all of this. The kernel does the least work it can get away with at allocation time and pushes the real work to first touch, where it can be sure the work is actually needed. That is why resident set size (RSS, the pages actually in RAM) is almost always smaller than virtual memory size (VSZ, the pages mapped), and why a process can map a 100 GB file on a 16 GB machine and run fine as long as its working set, the pages it touches in any short window, fits. Over-commit is the kernel betting that the gap between what is mapped and what is touched stays wide. When the bet goes wrong, the system has to start taking pages back, which is the next section.
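The gap between mapped and resident memory is easy to read for a live process. A small sketch, assuming Linux, where `/proc/<pid>/status` reports `VmSize` and `VmRSS` in kB; the parser name is illustrative:

```python
import os


def parse_status(text: str) -> dict[str, int]:
    """Extract VmSize and VmRSS (in kB) from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS"}
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            out[key] = int(rest.split()[0])   # the number is followed by "kB"
    return out


if os.path.exists("/proc/self/status"):       # Linux only
    with open("/proc/self/status") as f:
        mem = parse_status(f.read())
    print(f"VSZ {mem.get('VmSize')} kB, RSS {mem.get('VmRSS')} kB")
```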
Swap and memory pressure.
When physical RAM fills up, the kernel's page reclaimer (kswapd, plus
direct reclaim) starts evicting pages. Clean file-backed pages can be dropped, since they
can be re-read from disk if needed. Dirty pages are written back. Anonymous pages
(heap, stack) have nowhere to go unless there's a swap device; with swap, they're
compressed-and-written or just written. The page table entry is flipped to "not present"
and the next access triggers a major fault.
Swap is fine for absorbing transient spikes on a desktop. On a latency-sensitive server it's poison. A handful of pages going to disk under load can cascade. The application gets slower, more pages go cold, more pages get swapped, and the system enters the thrashing region where it's spending most of its cycles waiting for I/O. The standard modern advice is to run with swap off and rely on cgroup memory limits plus the OOM killer to bound damage: a fast death beats a slow one.
Thrashing is worth naming precisely because it is a feedback loop, not just slowness. The
reclaimer evicts a page to make room, the program touches that page a moment later, takes a
major fault, waits on disk, and meanwhile the reclaimer has had to evict another page to
service the fault. The working set no longer fits, so every page brought in pushes out a
page that is about to be needed. Useful work collapses toward zero while the disk stays
pinned at 100% and the CPU sits mostly idle waiting on it. The system is busy doing nothing
but moving pages between RAM and disk. Linux added pressure stall information
(/proc/pressure/memory) specifically to make this visible: it reports the
fraction of time tasks are stalled waiting on memory, which catches the onset of thrashing
far earlier than free-memory counters do, because by the time free memory is gone you are
already deep in the loop.
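Reading the pressure file takes only a few lines. This sketch assumes the usual PSI layout of a `some` line and a `full` line, each carrying `avg10`, `avg60`, `avg300` and `total` fields; the file is missing or unreadable on kernels without PSI enabled, so the read is guarded:

```python
PSI_PATH = "/proc/pressure/memory"


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Turn 'some avg10=0.00 ... total=0' lines into nested dictionaries."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return out


try:
    with open(PSI_PATH) as f:
        print(parse_psi(f.read()))
except OSError:
    print("memory pressure information is not available on this system")
```

The `avg10` value on the `some` line is the percentage of the last ten seconds in which at least one task was stalled on memory, which makes it a reasonable number to alert on.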
The newer middle path is zswap and zram, which compress pages and keep them in RAM instead of writing them to a disk-backed swap device. A compressed page is far cheaper to fault back in than a disk read, so a modest amount of it can absorb spikes without dropping a service into disk-thrash. It does not change the arithmetic when the real working set exceeds RAM, only the cost of being a little over. The honest framing is that swap trades a hard failure (out of memory now) for a soft one (everything slow), and on a server that soft failure is usually worse than the hard one, because a slow service still passes health checks while failing every user.
Huge pages — coverage vs scan jitter.
A 2 MiB huge page is a single page-table entry that covers 512 × 4 KiB. One TLB entry now covers 2 MiB of working set instead of 4 KiB. For workloads with multi-gigabyte heaps such as JVMs, Postgres shared_buffers, Redis, and MongoDB's WiredTiger cache, this can cut TLB-miss-driven stalls by an order of magnitude. The price is granularity: a 2 MiB page is allocated as a contiguous physical region, which is hard to find on a fragmented system.
Linux offers two flavours. Explicit huge pages are reserved at boot via
hugepages=N or /proc/sys/vm/nr_hugepages and are claimed by
applications through mmap(MAP_HUGETLB) or hugetlbfs. They are
locked in RAM and never paged. Transparent Huge Pages (THP) are
opportunistic: the kernel's khugepaged daemon scans process memory and
promotes contiguous 4 KiB pages into 2 MiB ones in the background. THP is convenient
but the scan and the synchronous compaction on allocation can produce noticeable
latency spikes. Postgres, Redis, and MongoDB documentation all recommend writing
never to /sys/kernel/mm/transparent_hugepage/enabled. JVM and DPDK workloads
generally want it on. There is no single right answer. Measure.
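Before measuring, it helps to know which THP mode a machine is actually in. The sysfs file lists every mode and puts brackets around the active one, for example `always [madvise] never`. A small sketch that extracts it (the function name is illustrative, and the file exists only on Linux kernels built with THP):

```python
import re

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text: str) -> str:
    """Return the bracketed mode from text such as 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed mode in {text!r}")
    return match.group(1)


try:
    with open(THP_PATH) as f:
        print("THP mode:", active_thp_mode(f.read()))
except (OSError, ValueError):
    print("THP setting not readable on this system")
```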
NUMA — not all memory is local.
On a multi-socket server, each socket has its own memory controller and its own attached DRAM. A core on socket 0 accessing a page that lives on socket 1's DIMMs has to traverse the inter-socket interconnect (UPI on Intel, Infinity Fabric on AMD), which adds roughly 50–100 ns to the access, call it 1.5× to 2× slower than local. On a four-socket box the worst-case hop is two interconnects and the penalty grows. This is the world of NUMA: Non-Uniform Memory Access.
Linux exposes this through numactl for placement, mbind() and
set_mempolicy() for in-program control, and /proc/<pid>/numa_maps
for inspection. The default policy is first-touch: a page is allocated on the
node where the thread that first writes to it is running. Get this wrong, say by initialising a
huge buffer on the main thread and then pinning worker threads to a different socket, and
every access pays the remote tax. The kernel's numa_balancing feature
tries to migrate pages toward the threads that use them, but it costs CPU and can hurt
workloads with steady, balanced access. Production NUMA tuning usually means pinning
threads with numactl --cpunodebind and allocating on the same node with
--membind.
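To check where a process's pages actually landed, the `N<node>=<pages>` fields in `numa_maps` can be summed per node. A sketch under the assumption that each line is an address, a policy, and then space-separated `key=value` fields; the counts are in units of each mapping's own page size, so this is a rough picture rather than an exact byte count:

```python
import re
from collections import Counter

NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text: str) -> Counter:
    """Sum the N<node>=<pages> fields across all mappings."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for field in line.split()[2:]:        # skip the address and the policy
            m = NODE_FIELD.match(field)
            if m:
                totals[int(m.group(1))] += int(m.group(2))
    return totals


try:
    with open("/proc/self/numa_maps") as f:
        print(dict(pages_per_node(f.read())))
except OSError:
    print("numa_maps is not available on this system")
```

A worker pinned to node 1 whose pages are nearly all counted under `N0` is the first-touch mistake described above.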
Looking at pressure.
The kernel exposes its accounting in /proc/meminfo, the per-NUMA-node
variants in /sys/devices/system/node/node*/meminfo, per-process detail in
/proc/<pid>/status and /proc/<pid>/smaps.
vmstat 1 shows page-in / page-out / swap-in / swap-out per second.
sar -B keeps a history of the same counters. perf mem record samples loads and
stores and attributes them to cache levels and TLBs.
$ cat /proc/meminfo
MemTotal: 65794132 kB
MemFree: 2103040 kB
MemAvailable: 42118936 kB
Buffers: 392104 kB
Cached: 38901724 kB
SwapCached: 0 kB
Active: 29412800 kB
Inactive: 19738468 kB
Active(anon): 9712688 kB
Inactive(anon): 145128 kB
Dirty: 19704 kB
Writeback: 0 kB
AnonPages: 9857664 kB
Mapped: 1822796 kB
Shmem: 288 kB
KReclaimable: 1124416 kB
Slab: 1923080 kB
PageTables: 96424 kB
SwapTotal: 0 kB
SwapFree: 0 kB
HugePages_Total: 1024
HugePages_Free: 512
Hugepagesize: 2048 kB
DirectMap4k: 718080 kB
DirectMap2M: 33161216 kB
DirectMap1G: 33554432 kB
For TLB and cache misses specifically, perf goes deeper than the proc
interface. The classic incident playbook: tail latency creeps up, vmstat
shows si/so non-zero (swap thrash), or perf
reveals an unexpectedly high dTLB-load-misses rate (THP off when it should be on, or
the process touching more memory than the TLB can map) or a large share of remote-RAM accesses (NUMA imbalance).
$ perf mem record -a -- sleep 10
$ perf mem report --sort=mem,sym
Overhead Memory access Symbol
41.2% L1 hit [.] hash_lookup
18.6% LFB hit [.] memcpy_avx_unaligned
12.1% L3 hit [.] btree_walk
9.4% Local RAM hit [.] page_remap
7.8% Remote RAM (2 hops) [.] worker_loop <-- NUMA tax
4.0% L2 hit [.] strcmp
3.1% Remote Cache (1 hop) [.] cache_lookup
2.5% LFB hit [.] memset_avx
bcc's tlbstat, cachestat, oomkill, and
bpftrace one-liners against kmem:mm_page_alloc and
vmscan:* tracepoints expose behaviour the proc files can't. Brendan
Gregg's memory-analysis flame charts are the canonical reference.
Further reading.
- Ulrich Drepper — What Every Programmer Should Know About Memory — the 114-page reference that explains caches, the MMU, NUMA, and prefetching from first principles. Twenty years on, still the best single document on the subject.
- Waldspurger — Memory Resource Management in VMware ESX (OSDI 2002) — ballooning, page sharing, and the techniques every modern hypervisor inherited.
- Linux kernel — Transparent Huge Pages and hugetlbpage — the source of truth on THP tunables and explicit huge pages.
- Brendan Gregg — Memory Flame Graphs and Linux perf tools map — the practical observability toolbox.
- mmap(2), madvise(2), numactl(8) — man pages for the syscalls and tools this page named.
- LWN — Five-level page tables — how the kernel grew from 48-bit to 57-bit virtual addresses.
- OSTEP — virtualisation of memory chapters — the textbook treatment of paging, segmentation, TLBs, and swap.