Virtual memory and the TLB
Every load is two loads. The CPU starts with a virtual address, walks a multi-level page table to find the physical address, then issues the actual data fetch. The page table walk itself is 4 memory accesses on x86-64 — five with LA57 — which would be unbearable if the CPU paid that cost on every load. The TLB hides it. The TLB is the most precious silicon in your CPU you've never heard of, and the single biggest performance lever for workloads with large working sets.
Why virtual memory exists
This page looks at virtual memory from the hardware side: the silicon that turns a virtual address into a physical one on every memory access. The operating-system side, how pages get allocated, faulted in, swapped to disk, and shared between processes, lives on the OS virtual memory page. The two halves meet at the page table: the OS builds and edits it, the hardware reads and walks it. Here we follow the read side.
Without virtual memory, every program would see the same physical RAM directly. Two processes accessing address 0x1000 would collide. There'd be no isolation between them, no protection from a buggy program scribbling on the kernel, no way to over-commit memory or swap pages to disk. Virtual memory fixes all of this with one trick: each process gets its own private address space, and the hardware quietly translates every memory access through a per-process page table.
The cost is real: every memory access now has two phases. First, translate the virtual address into a physical one. Second, fetch the actual data. The translation requires reading the page table — which is itself in memory. Without help, this is one memory access per level of the page table, plus the data access. On x86-64 with 4-level paging, that's 5 memory accesses per load. Unbearable. The fix is to cache the translations in the TLB.
The MMU does the translating
The hardware that performs translation is the memory management unit, or MMU. It sits between the CPU core and the memory system. The core only ever deals in virtual addresses. The moment a load or store leaves the core, the MMU intercepts the address and turns it into a physical one before the request reaches DRAM. To the program this is invisible. To the silicon it is a fixed pipeline stage with a hard latency budget, because it runs on every access.
The MMU has three jobs and it does all of them on each access. It translates the virtual page number to a physical page-frame number. It checks permissions: is this page readable, writable, executable, and is the current privilege level allowed to touch it? And it signals faults when a translation is missing or a permission is violated, handing control to the OS. Translation and protection ride on the same lookup, which is why the page table is both a map and an access-control list.
The MMU does not hold the map itself. The map is the page table, which lives in DRAM and is far too large to keep on chip. A full 4-level table for a large process can span many megabytes. So the MMU has two working parts: a small fast cache of recent translations (the TLB) and a small state machine that reads the page table from memory when the TLB misses (the page-table walker, sometimes called the hardware page walker or PMH on Intel). The fast path is a TLB hit. The slow path is a walk. The rest of this page is mostly about the gap between the two.
A 4-level page walk
x86-64 splits a 48-bit virtual address into five parts: four 9-bit indices, one for each level of the page table, plus a 12-bit byte offset within the 4 KB page. Each level lookup reads one 8-byte page-table entry. The bottom level entry holds the physical page-frame number; combine it with the offset and you have the physical address.
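The split described above can be sketched in a few lines of Python. This is an illustration, not something the hardware or an OS exposes: the function name split_virtual_address and the example address are assumptions made for the sketch, and it assumes 4-level paging with 4 KB pages.

```python
PAGE_SHIFT = 12                          # 4 KB page -> 12 offset bits
INDEX_BITS = 9                           # 512 entries per table
LEVELS = ("PML4", "PDPT", "PD", "PT")    # top level first


def split_virtual_address(va):
    """Return ({level_name: index}, offset) for a 48-bit virtual address."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indices = {}
    for depth, name in enumerate(LEVELS):
        # The PT index sits just above the offset; each higher level
        # is another 9 bits further up the address.
        shift = PAGE_SHIFT + INDEX_BITS * (len(LEVELS) - 1 - depth)
        indices[name] = (va >> shift) & ((1 << INDEX_BITS) - 1)
    return indices, offset


# Hypothetical address, assembled from known fields so the split can be checked.
example_va = (255 << 39) | (510 << 30) | (350 << 21) | (7 << 12) | 576
indices, offset = split_virtual_address(example_va)
print(hex(example_va), indices, offset)
```

Four 9-bit fields plus the 12-bit offset account for all 48 bits, which is why each table has exactly 512 entries of 8 bytes and fits in one 4 KB page.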
The walk is a pointer chase through a tree. CR3 points at the top
table; each index selects an entry; each entry points at the next table down; the
last entry holds the frame. The diagram below traces one address all the way to a
physical frame. The offset bits never get translated, they pass straight through,
which is the whole reason a page is the unit of mapping.
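The pointer chase can be simulated with nested dictionaries standing in for the tables. This is a toy model under stated assumptions: the names map_page, walk, and PageFault are hypothetical, the root dictionary plays the role of the table CR3 points at, and real entries also carry permission bits that are ignored here.

```python
PAGE_SHIFT = 12
INDEX_BITS = 9
LEVELS = 4
INDEX_MASK = (1 << INDEX_BITS) - 1


class PageFault(Exception):
    """Raised when the walk reaches a missing entry."""


def map_page(root, vpn, frame):
    """Install vpn -> frame, creating intermediate tables as needed."""
    table = root
    for level in range(LEVELS - 1, 0, -1):
        index = (vpn >> (INDEX_BITS * level)) & INDEX_MASK
        table = table.setdefault(index, {})
    table[vpn & INDEX_MASK] = frame


def walk(root, va):
    """Return (physical_address, table_reads) for va, or raise PageFault."""
    vpn = va >> PAGE_SHIFT
    offset = va & ((1 << PAGE_SHIFT) - 1)
    entry, reads = root, 0
    for level in range(LEVELS - 1, -1, -1):
        index = (vpn >> (INDEX_BITS * level)) & INDEX_MASK
        reads += 1                      # one 8-byte entry read per level
        if index not in entry:
            raise PageFault(hex(va))
        entry = entry[index]
    # The offset bits pass through untranslated.
    return (entry << PAGE_SHIFT) | offset, reads


root = {}
map_page(root, vpn=0x12345, frame=0x9ABCD)
physical, reads = walk(root, 0x12345678)
print(hex(physical), reads)   # 0x9abcd678 4
```

Every successful walk in this model costs exactly four table reads, one per level, regardless of which address is translated.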
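[Diagram: one virtual address traced through the four levels. Each step multiplies a table index by 8 to locate an 8-byte entry: index 255 selects the PML4 entry (~4 cycles if cached, ~80 ns if not), index 510 the PDPT entry, index 350 the PD entry, and the final step adds the page offset (576) to form the physical address. Each read costs ~4 cycles if cached.]
The TLB pays for itself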
Modern CPUs have a small TLB at L1 (~64–256 entries, accessed in 1 cycle) and a larger one at L2 (~1024–4096 entries, accessed in 7–10 cycles). A TLB hit is free — the translation is already in the cache. A miss kicks off a page walk; the cost depends on whether the page-table memory itself is in cache.
The two paths look nothing alike in cost. A hit returns the physical frame in a single cycle, in time for the load to use it in the same pipeline stage. A miss hands the address to the page-table walker, which issues the level reads itself, without interrupting the OS. Those reads hit the regular data caches, so the realistic cost of a miss is a handful of cache accesses, roughly 16 cycles when the page tables are warm in L1 or L2. If the page-table entries are cold and have to come from DRAM, each level read is a full memory access, and four of those is several hundred nanoseconds. The diagram below puts the two paths side by side.
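The gap between the two paths can be turned into a rough average cost per translation. The sketch below is a simplified model, not a measurement: the function name average_translation_cycles is hypothetical, it assumes a single TLB level, takes the 1-cycle hit and 16-cycle warm walk quoted above as defaults, and ignores any overlap between walks and other work.

```python
def average_translation_cycles(hit_rate, hit_cycles=1.0, walk_cycles=16.0):
    """Expected cycles per translation for a given TLB hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * walk_cycles


for rate in (0.999, 0.99, 0.9, 0.5):
    print(rate, round(average_translation_cycles(rate), 3))
```

Even with warm page tables, the average climbs quickly as the hit rate falls: at a 50% hit rate this model gives 8.5 cycles per translation, against 1.15 cycles at 99%.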
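[Diagram and interactive cost model: the TLB hit path versus the page-walk path, with coverage as a function of page size. This is what huge_pages=on and Linux transparent_hugepage=always are about.]
TLB sizes on real chips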
| Chip | L1 D-TLB | L1 I-TLB | L2 unified TLB | Notes |
|---|---|---|---|---|
| Apple M4 (P-core) | 192 | 256 | 4096 | Separate I/D L1 TLBs; unified L2; covers ~16 MB working set with 4 KB pages |
| Intel Raptor Lake (P-core) | 128 | 256 | 2048 | Plus 1024 1-GB-page entries in L2 |
| AMD Zen 5 | 96 | 96 | 4096 | Bigger L2 than Intel; fewer L1 entries |
| AMD Zen 4 | 72 | 64 | 3072 | Predecessor to Zen 5 |
| Apple M1 | 128 | 128 | 3072 | Established Apple's big-TLB approach |
The pattern: Apple invests heavily in TLB capacity (matching its big-ROB, big-cache philosophy); AMD has caught up; Intel's ahead in 1-GB-page entries; everyone's growing. A modern L2 TLB at 4096 entries covers 16 MB of working set at 4 KB pages, or 8 GB at 2 MB pages. Beyond that, TLB miss rates dominate.
TLB coverage: the number that actually matters
Entry count is the headline figure, but the figure that predicts performance is TLB coverage: entries times page size. It is how much memory the TLB can map without a single walk. With 4 KB pages, a 1024-entry TLB covers 4 MB. A program whose hot data fits in 4 MB rarely misses; a program that touches 4 GB at random misses on almost every access, because only one access in a thousand finds its page already cached.
This is why page size is a hardware-performance lever, not just an allocation detail. Bigger pages do not make the TLB bigger, they make each entry cover more ground. One 2 MB entry maps what 512 separate 4 KB entries would, so the same 1024-entry TLB jumps from 4 MB of coverage to 2 GB. A 1 GB entry covers a full gigabyte by itself. For databases, JVMs, and machine-learning training, where the working set is measured in gigabytes, that multiplier is the difference between a TLB that does its job and one that is hopelessly undersized. The interactive cost model above lets you watch coverage collapse and recover as you change the page size.
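The coverage arithmetic is simple enough to check directly. The helper names below (tlb_coverage, random_hit_fraction) are assumptions for illustration, and the hit fraction assumes uniformly random accesses over the working set, which is a worst-case simplification.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def tlb_coverage(entries, page_size):
    """Bytes of memory the TLB can map without a walk."""
    return entries * page_size


def random_hit_fraction(entries, page_size, working_set):
    """Fraction of uniformly random accesses whose page is already cached."""
    return min(1.0, tlb_coverage(entries, page_size) / working_set)


for label, page_size in (("4 KB", 4 * KB), ("2 MB", 2 * MB), ("1 GB", GB)):
    coverage = tlb_coverage(1024, page_size)
    hit = random_hit_fraction(1024, page_size, 4 * GB)
    print(label, coverage // MB, "MB covered,", hit, "hit fraction on 4 GB")
```

With 1024 entries, a 4 GB random working set hits about one time in a thousand at 4 KB pages, half the time at 2 MB pages, and always at 1 GB pages, assuming the TLB actually has that many entries for each page size.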
There is a second-order effect worth knowing. CPUs keep separate TLB entries for each page size, and often separate structures for them. An L2 TLB might hold thousands of 4 KB entries but only a few hundred 1 GB entries. Mixing page sizes in one process can leave some of that capacity stranded. The practical rule: pick a page size that matches the working set, and do not assume the largest page is free, because the structure that maps it may be small.
Translation as the hardware boundary for protection
Process isolation is not a software policy that the kernel enforces by checking
every pointer. It is a hardware property of the address space. Two processes can
both hold the virtual address 0x401000 and never see each other's
data, because each has its own page table and the MMU resolves the same virtual
address to a different physical frame for each. The isolation is real because the
only path from a virtual address to DRAM runs through that translation, and a
process cannot name a physical frame it has no mapping for. There is no virtual
address it can form that reaches another process's private memory.
The same lookup carries permission bits. Each page-table entry has a writable bit, a no-execute bit, a user/supervisor bit, and a few others. The MMU checks them on every access in parallel with the translation, at no extra latency. A store to a read-only page, a jump into a no-execute page, or a user-mode access to a kernel page all fault before the access completes. This is what makes constant data, executable code, and the kernel safe from a misbehaving program: the enforcement is in silicon, on the same critical path as the address itself, so it cannot be skipped for speed.
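The permission check can be sketched as a few bit tests on the page-table entry. The bit positions follow the x86-64 entry layout (present in bit 0, writable in bit 1, user in bit 2, no-execute in bit 63), but the function check_access and its string results are hypothetical, and the sketch deliberately ignores refinements such as CR0.WP, SMEP, SMAP, and protection keys.

```python
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_USER = 1 << 2
PTE_NO_EXECUTE = 1 << 63


def check_access(pte, access, user_mode):
    """Return None if the access is allowed, else a short fault reason.

    access is one of "read", "write", "execute".
    """
    if not pte & PTE_PRESENT:
        return "not present"
    if user_mode and not pte & PTE_USER:
        return "supervisor-only page"
    if access == "write" and not pte & PTE_WRITABLE:
        return "read-only page"
    if access == "execute" and pte & PTE_NO_EXECUTE:
        return "no-execute page"
    return None


kernel_data = PTE_PRESENT | PTE_WRITABLE | PTE_NO_EXECUTE
constant_data = PTE_PRESENT | PTE_USER | PTE_NO_EXECUTE
user_code = PTE_PRESENT | PTE_USER

print(check_access(kernel_data, "read", user_mode=True))
print(check_access(constant_data, "write", user_mode=True))
print(check_access(user_code, "execute", user_mode=True))
```

In hardware these tests run alongside the translation instead of after it, which is the point the paragraph above makes about latency.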
Switching processes means switching maps. On a context switch the OS loads a new
value into CR3, which points the MMU at a different top-level table,
and the new process now sees its own address space. The catch is that TLB entries
from the old process are still cached, so they must not be trusted. Older designs
flushed the whole TLB on every switch, which was correct but threw away every
translation. Modern x86 and ARM tag each TLB entry with an address-space
identifier (a PCID on x86, an ASID on ARM), so entries from different processes
coexist in the TLB and a switch keeps both processes' translations warm. The same
tagging is what lets the OS
keep kernel mappings out of user space cheaply, which matters for the Meltdown fix
below.
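A tagged TLB can be modeled as a cache keyed by the pair (address-space identifier, virtual page number). The class below is a toy under stated assumptions: TaggedTLB and its methods are hypothetical names, capacity is unbounded, and eviction and invalidation are left out.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with an address-space identifier."""

    def __init__(self):
        self.entries = {}       # (asid, vpn) -> frame
        self.current_asid = 0

    def switch_to(self, asid):
        # A context switch changes the active tag; nothing is flushed.
        self.current_asid = asid

    def fill(self, vpn, frame):
        self.entries[(self.current_asid, vpn)] = frame

    def lookup(self, vpn):
        # Entries from other address spaces never match the current tag.
        return self.entries.get((self.current_asid, vpn))


tlb = TaggedTLB()
tlb.switch_to(1)
tlb.fill(vpn=0x401, frame=0xA)

tlb.switch_to(2)
missed = tlb.lookup(0x401)      # same virtual page, different process
tlb.fill(vpn=0x401, frame=0xB)

tlb.switch_to(1)
still_warm = tlb.lookup(0x401)  # process 1's entry survived the switch
print(missed, still_warm)       # None 10
```

The same virtual page number resolves to different frames depending on the active tag, and switching back finds the earlier translation still in place.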
How translation meets the cache: VIPT
Translation sits in front of memory, and the first thing past the MMU is the L1 cache. That creates a timing problem. Caches are indexed to pick the right set and tagged to confirm the right line. If the CPU waited for the full virtual-to-physical translation before it could even start the cache lookup, every load would pay the TLB latency before the cache latency, in series. The fix is a trick that lets the two happen at once, called VIPT: virtually indexed, physically tagged.
The insight is that the low bits of a virtual address are the page offset, and the offset is not translated, so those bits are the same in the virtual and physical address. If the cache index is taken entirely from offset bits, the CPU can start the cache lookup using the untranslated virtual address while the TLB resolves the frame number in parallel. By the time the cache has read out the candidate lines and needs a tag to compare, the TLB has produced the physical frame, and the tag check uses that physical address. Index and translate overlap; the load sees only the longer of the two latencies, not their sum. This is why L1 caches are fast and small, and why their organization is tied to the page size.
The constraint VIPT imposes shapes real cache geometry. For the index to fit inside the offset bits, the number of sets times the line size must not exceed one page. With 4 KB pages and 64-byte lines, that caps a VIPT cache at 64 sets, so an 8-way L1 lands around 32 KB, which is exactly the size that has held for years. Designers who want a larger L1 either add associativity, accept a few index bits that come from the translated part of the address (and deal with the aliasing that creates), or lean on larger pages. The full story of indexing, tags, and associativity lives on the caches page; the point here is that virtual memory and the cache are co-designed, not independent layers. Where translation sits in the broader latency picture is the subject of the memory hierarchy page.
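The geometry constraint reduces to two lines of arithmetic. The function names below are assumptions for illustration, and the sketch covers only the pure case where every index bit comes from the page offset.

```python
def max_vipt_sets(page_size, line_size):
    """Most sets a cache can have if the index must fit in the page offset."""
    return page_size // line_size


def max_vipt_cache_bytes(page_size, line_size, ways):
    """Largest pure-VIPT cache: sets x line size x ways, i.e. page size x ways."""
    return max_vipt_sets(page_size, line_size) * line_size * ways


print(max_vipt_sets(4096, 64))                     # 64
print(max_vipt_cache_bytes(4096, 64, ways=8))      # 32768
print(max_vipt_cache_bytes(4096, 64, ways=12))     # 49152
print(max_vipt_cache_bytes(16384, 64, ways=8))     # 131072
```

The product collapses to page size times associativity, which is why the two escape routes named above are more ways or bigger pages.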
The physical address space
Virtual addresses are 48 bits on mainstream x86-64, giving each process 256 TB of address space. Physical addresses are narrower, because no machine has 256 TB of DRAM. Current cores support roughly 46 to 52 physical address bits depending on the generation, which is plenty to cover the largest servers shipping today. The gap between the two is the point: the virtual space can be sparse and enormous while physical memory stays small and dense, and translation is what bridges them.
That sparseness is what makes over-commit and demand paging possible at the hardware level. A process can reserve a huge virtual range, and only the pages it actually touches need a physical frame behind them; the rest have no mapping and cost nothing until used. When a process touches an unmapped page the MMU faults, the OS allocates a frame and installs the mapping, and the access retries and succeeds. The physical space is also where translation hands off to the rest of the system: a physical address can name DRAM, a memory-mapped device register, or a region another core also maps, and the same frame number can appear in two processes' page tables to give them shared memory. All of that is built on the one primitive, a per-process map from a large virtual space onto a smaller physical one.
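Demand paging can be sketched as a map that is filled in only on first touch. This is a toy model: the class DemandPagedSpace and its fields are hypothetical, frames are handed out sequentially, and the fault handler that a real OS would run is reduced to a dictionary insert.

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class DemandPagedSpace:
    """Toy address space: a page gets a frame only when first touched."""

    def __init__(self):
        self.page_table = {}    # vpn -> frame
        self.faults = 0
        self.next_frame = 0

    def touch(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn not in self.page_table:
            # "Page fault": allocate a frame, install the mapping, retry.
            self.faults += 1
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1
        return (self.page_table[vpn] << PAGE_SHIFT) | (va & PAGE_MASK)


space = DemandPagedSpace()
base = 1 << 40                      # somewhere inside a large reserved range
for va in (base, base + 8, base + 4096, base + 8):
    space.touch(va)
print(space.faults, len(space.page_table))   # 2 2
```

Four accesses to two distinct pages cost two faults and two frames, no matter how large the reserved virtual range is.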
Huge pages — when and why
x86-64 supports three page sizes: 4 KB (default), 2 MB, and 1 GB. The huge-page variants reduce the walk depth (3 levels for 2 MB, 2 for 1 GB), reduce TLB pressure (one entry covers more memory), and reduce page-fault frequency. The downsides:
- Memory overhead. The allocation granularity becomes 2 MB, so 4 KB worth of data can cost 2 MB of physical memory. For sparse workloads with small allocations, this is severe.
- Fragmentation. Linux's "transparent huge pages" (THP) tries to allocate huge pages opportunistically and falls back to 4 KB when contiguous physical memory isn't available. THP can cause unpredictable latency spikes (the kernel's khugepaged compacts in the background, sometimes blocking user-space).
- Less granular swapping. A 2 MB page can't be split when the OS wants to swap part of it out.
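The memory-overhead point in the list above comes down to rounding up to the page size. The sketch below assumes, as a simplification, that every allocation is backed by whole pages of a single size; the function name backing_bytes is hypothetical.

```python
KB, MB = 1 << 10, 1 << 20


def backing_bytes(request, page_size):
    """Physical memory needed when a request is rounded up to whole pages."""
    pages = -(-request // page_size)    # ceiling division
    return pages * page_size


small = 4 * KB
print(backing_bytes(small, 4 * KB) // KB, "KB with 4 KB pages")
print(backing_bytes(small, 2 * MB) // KB, "KB with 2 MB pages")
print(backing_bytes(small, 2 * MB) // small, "x overhead")
```

For a 4 KB object the 2 MB page costs 512 times the memory, while for a request just over 2 MB the waste is bounded by one extra page.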
Production guidance in 2026: explicit huge pages (HugeTLB on Linux, Windows Large
Pages) for databases, JVMs, ML training. Transparent huge pages set to
madvise mode so applications can opt in. Avoid always
mode in latency-sensitive systems unless you've measured it.
TLB shootdowns
When the OS unmaps a page (munmap, page reclaim, COW splitting), every CPU that
might have a translation for that page in its TLB needs to invalidate it. There's
no hardware coherence for TLBs across cores, so the OS has to do it via an
inter-processor interrupt: the TLB shootdown. The initiating core
sends an IPI; every receiving core stops what it's doing, executes
INVLPG for the affected addresses, and acknowledges. Total cost:
thousands of cycles in the best case, tens of thousands when contended.
TLB shootdowns are a notorious source of latency spikes in multi-threaded workloads that frequently allocate and free memory. Fixes:
- Pool allocators. Reuse memory instead of returning it to the OS. jemalloc, tcmalloc, and the Go runtime all do this.
- Larger huge pages. Fewer pages means fewer shootdowns.
- NUMA-local allocation. Shootdowns within one NUMA node are cheap; across nodes are expensive.
- Hardware shootdown extensions. Recent ARMv8 chips have TLBI instructions that broadcast invalidations without IPIs. Intel and AMD are catching up.
Meltdown — the speculative VM crossing
Meltdown (CVE-2017-5754) is an attack on the boundary between user-mode and kernel-mode memory. Both share the same page tables: kernel pages are mapped into every process for performance, marked supervisor-only (user/supervisor bit clear) so they're inaccessible from user code. The CPU enforces this at retirement, but speculatively executes the load before the check takes effect. Meltdown abuses this by speculatively reading a kernel address into a register, then using that value as an index into a user-readable array. The architectural load is squashed, but the cache-line touch persists. Time the array accesses; the index that was fast tells you the kernel byte.
The fix on Intel and ARM: Kernel Page-Table Isolation (KPTI / KAISER). Two separate page tables — one for user mode without kernel mappings, one for kernel mode with full mappings. Switch on every kernel entry. Cost: ~5–30% throughput on syscall-heavy workloads. AMD chips weren't vulnerable to Meltdown (they enforced the user-mode bit on speculation), so they didn't pay the KPTI cost. Apple silicon is similarly safe by design.
5-level paging (LA57)
48-bit virtual addresses cover 256 TB per process. That's enough for now, but cloud server hardware with 12 TB+ of physical memory is starting to exist, and user-space addressable spaces want to be 10× larger. LA57 (Linear Addressing with 57 bits) adds a fifth level above PML4, called PML5. Total virtual address space: 128 PB. Walk cost: 5 levels instead of 4.
Available on Intel Ice Lake-SP (2021) and later, AMD Genoa (2022) and later. Linux supports it as of kernel 5.5, but it's typically disabled at boot unless the machine actually needs >48 bits. The 25% extra walk cost is real, so most workloads stay on 4-level paging.
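The address-space and walk-cost figures follow directly from the level count. The helper below is a small arithmetic check under the assumption of 9 index bits per level and a 12-bit offset; its name is hypothetical.

```python
TIB, PIB = 1 << 40, 1 << 50


def virtual_address_bits(levels, index_bits=9, offset_bits=12):
    """Virtual address width implied by the number of page-table levels."""
    return levels * index_bits + offset_bits


bits_4 = virtual_address_bits(4)
bits_5 = virtual_address_bits(5)
extra_walk = 5 / 4 - 1      # one more level read per walk

print(bits_4, (1 << bits_4) // TIB, "TB")
print(bits_5, (1 << bits_5) // PIB, "PB")
print(f"{extra_walk:.0%} more table reads per walk")
```

One extra level multiplies the address space by 512 while adding a quarter more reads to every walk, which is the trade the paragraph above describes.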
Common misconceptions
- "Page faults and TLB misses are the same thing." No. A TLB miss is a hardware event — the CPU walks the page table, finds the translation, fills the TLB, continues. Cost: ~16 cycles. A page fault is a software event — the OS has to handle it (allocate physical memory, load from disk, COW, etc). Cost: 1 µs to 5 ms.
- "Virtual memory means swapping." Swapping uses the virtual memory machinery, but it is only one use of it, and most modern systems don't swap heavily. On today's machines virtual memory mostly earns its keep through process isolation and over-commit, not paging out to disk.
- "Huge pages are always faster." Not for sparse workloads with small allocations. The 2 MB minimum allocation can waste enormous memory. Measure before turning on transparent huge pages.
- "The kernel uses physical addresses." Mostly false. Modern kernels run with virtual memory enabled and use a "kernel virtual address" map that's pre-populated. Physical addresses are used directly only in early boot, before paging is enabled, and where the kernel hands an address to hardware, such as in page-table entries and some I/O paths (DMA).
Numbers worth remembering
| Quantity | Value | Notes |
|---|---|---|
| x86-64 page size (default) | 4 KB | 2¹² bytes |
| x86-64 huge page sizes | 2 MB, 1 GB | Different walk depths |
| x86-64 page-table levels (default) | 4 | PML4 → PDPT → PD → PT |
| x86-64 page-table levels (LA57) | 5 | Adds PML5 above; 128 PB address space |
| L1 TLB size, mainstream | ~64–256 entries | 1-cycle access |
| L2 TLB size, mainstream | ~1024–4096 entries | ~7–10 cycle access |
| TLB miss penalty (cached page table) | ~16 cycles | 4 sequential L1/L2 hits |
| TLB miss penalty (cold page table) | ~300–500 ns | 4 DRAM accesses |
| Page fault cost (clean fault) | ~1–5 µs | Allocate physical page, zero, return |
| Page fault cost (read from disk) | ~10 µs – 5 ms | Depends on storage backend |
| Meltdown disclosed | January 2018 | Same disclosure as Spectre |
| KPTI throughput cost | ~5–30% | Syscall-heavy workloads worst-affected |
Further reading
- Wikipedia — Translation lookaside buffer — comprehensive coverage of TLB designs.
- Wikipedia — x86-64 virtual address space — exact bit layout for 4-level and 5-level paging.
- Patterson & Hennessy — Computer Organization and Design (RISC-V Edition). Chapter 5 covers virtual memory at undergraduate depth.
- Hennessy & Patterson — Computer Architecture: A Quantitative Approach. Section B.4 covers virtual memory and TLB design with quantitative analysis.
- Meltdown & Spectre disclosure — original papers and timeline.
- LWN — KAISER: hiding the kernel from user space — the kernel-side fix.
- Linux — Transparent Hugepage Support — the official guide on tuning huge pages on Linux.
- Chips and Cheese — measured TLB sizes and miss penalties on every recent CPU.