
Virtual memory and the TLB

Every load is two loads. The CPU starts with a virtual address, walks a multi-level page table to find the physical address, then issues the actual data fetch. The page table walk itself is 4 memory accesses on x86-64 — five with LA57 — which would be unbearable if the CPU paid that cost on every load. The TLB hides it. The TLB is the most precious silicon in your CPU you've never heard of, and the single biggest performance lever for workloads with large working sets.


Why virtual memory exists

This page looks at virtual memory from the hardware side: the silicon that turns a virtual address into a physical one on every memory access. The operating-system side, how pages get allocated, faulted in, swapped to disk, and shared between processes, lives on the OS virtual memory page. The two halves meet at the page table: the OS builds and edits it, the hardware reads and walks it. Here we follow the read side.

Without virtual memory, every program would see the same physical RAM directly. Two processes accessing address 0x1000 would collide. There'd be no isolation between them, no protection from a buggy program scribbling on the kernel, no way to over-commit memory or swap pages to disk. Virtual memory fixes all of this with one trick: each process gets its own private address space, and the hardware quietly translates every memory access through a per-process page table.

The cost is real: every memory access now has two phases. First, translate the virtual address into a physical one. Second, fetch the actual data. The translation requires reading the page table — which is itself in memory. Without help, this is one memory access per level of the page table, plus the data access. On x86-64 with 4-level paging, that's 5 memory accesses per load. Unbearable. The fix is to cache the translations in the TLB.

The MMU does the translating

The hardware that performs translation is the memory management unit, or MMU. It sits between the CPU core and the memory system. The core only ever deals in virtual addresses. The moment a load or store leaves the core, the MMU intercepts the address and turns it into a physical one before the request reaches DRAM. To the program this is invisible. To the silicon it is a fixed pipeline stage with a hard latency budget, because it runs on every access.

The MMU has three jobs and it does all of them on each access. It translates the virtual page number to a physical page-frame number. It checks permissions: is this page readable, writable, executable, and is the current privilege level allowed to touch it? And it signals faults when a translation is missing or a permission is violated, handing control to the OS. Translation and protection ride on the same lookup, which is why the page table is both a map and an access-control list.

[Figure: CPU core (issues VA) → MMU (translate VA to PA, check R/W/X + level, fault if missing) → caches (L1 / L2 / L3) → DRAM (physical). A permission violation traps back to the OS as a fault.]
The MMU is a fixed pipeline stage. Every virtual address the core issues is translated and permission-checked before the request can reach the cache hierarchy.
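
The three jobs described above (translate, check permissions, signal faults) can be sketched as a toy model. This is a minimal illustration, not how any real MMU is built: the `Entry` fields, the `Fault` exception, the flat dictionary page table, and the example mappings are all assumptions made for the sketch.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12                      # 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class Fault(Exception):
    """Stands in for the trap the MMU raises to the OS."""


@dataclass(frozen=True)
class Entry:
    frame: int            # physical page-frame number
    writable: bool
    executable: bool
    user: bool            # True: user mode may touch the page


def translate(page_table, vaddr, access, user_mode):
    """Translate vaddr for an access of kind 'r', 'w' or 'x', or raise Fault."""
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
    entry = page_table.get(vpn)
    if entry is None:
        raise Fault(f"no mapping for page {vpn:#x}")
    if user_mode and not entry.user:
        raise Fault("user access to supervisor page")
    if access == "w" and not entry.writable:
        raise Fault("write to read-only page")
    if access == "x" and not entry.executable:
        raise Fault("fetch from no-execute page")
    return (entry.frame << PAGE_SHIFT) | offset


# Hypothetical mappings: one user code page, one kernel data page.
table = {
    0x401: Entry(frame=0x9A, writable=False, executable=True, user=True),
    0xFFFF8: Entry(frame=0x10, writable=True, executable=False, user=False),
}
print(hex(translate(table, 0x401234, "r", user_mode=True)))   # 0x9a234
```

The permission checks sit in the same function as the frame lookup, which mirrors the point that translation and protection ride on one lookup.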

The MMU does not hold the map itself. The map is the page table, which lives in DRAM and is far too large to keep on chip. A full 4-level table for a large process can span many megabytes. So the MMU has two working parts: a small fast cache of recent translations (the TLB) and a small state machine that reads the page table from memory when the TLB misses (the page-table walker, sometimes called the hardware page walker or PMH on Intel). The fast path is a TLB hit. The slow path is a walk. The rest of this page is mostly about the gap between the two.

A 4-level page walk

x86-64 splits a 48-bit virtual address into five parts: four 9-bit indices, one for each level of the page table, plus a 12-bit byte offset within the 4 KB page. Each level lookup reads one 8-byte page-table entry. The bottom level entry holds the physical page-frame number; combine it with the offset and you have the physical address.

The walk is a pointer chase through a tree. CR3 points at the top table; each index selects an entry; each entry points at the next table down; the last entry holds the frame. The diagram below traces one address all the way to a physical frame. The offset bits never get translated, they pass straight through, which is the whole reason a page is the unit of mapping.

[Figure: a 48-bit virtual address split into PML4 idx (9 bits), PDPT idx (9 bits), PD idx (9 bits), PT idx (9 bits), and page offset (12 bits). CR3 gives the PML4 base; the PML4, PDPT, PD, and PT each hold 512 entries; frame + offset = PA. Four memory reads, one per level, resolve one address; the offset bypasses the tree and joins the frame at the end.]
Each 9-bit index selects 1 of 512 entries at its level. The walker chases pointers down four tables; the 12-bit offset is concatenated with the final frame to form the physical address.
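
The field split is plain shifting and masking. The sketch below assumes 4-level paging with 4 KB pages; `split_va` is a name made up for this example, and 0x7FFFABCD1240 is the hex form of the bit string traced in this section.

```python
def split_va(vaddr):
    """Split a 48-bit x86-64 virtual address into (PML4, PDPT, PD, PT, offset)."""
    offset = vaddr & 0xFFF             # low 12 bits: byte within the 4 KB page
    pt = (vaddr >> 12) & 0x1FF         # each index is 9 bits: 0..511
    pd = (vaddr >> 21) & 0x1FF
    pdpt = (vaddr >> 30) & 0x1FF
    pml4 = (vaddr >> 39) & 0x1FF
    return pml4, pdpt, pd, pt, offset


va = 0x7FFFABCD1240
print(split_va(va))    # (255, 510, 350, 209, 576)
```
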
Worked example: virtual address 0x7FFFABCD1240
011111111 111111110 101011110 011010001 001001000000
PML4 idx · 255   PDPT idx · 510   PD idx · 350   PT idx · 209   offset · 576

  1. CR3 holds the physical address of the PML4 table. The CPU adds 255 × 8 to find the PML4 entry and reads 8 bytes. ~4 cycles if cached, ~80 ns if not.
  2. The PML4 entry holds the physical address of a PDPT. Add 510 × 8, read the PDPT entry. ~4 cycles if cached.
  3. The PDPT entry holds the physical address of a PD. Add 350 × 8, read the PD entry. ~4 cycles if cached.
  4. The PD entry holds the physical address of a PT. Add 209 × 8, read the PT entry, which holds the final physical page-frame number. Combine it with the 12-bit offset (576) to form the physical address. ~4 cycles if cached.

Physical address resolved. The actual data load can now proceed. Total: 4 level lookups on a TLB miss with cached page tables; 4 × 80 ns ≈ 320 ns if the page table itself is cold.
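
The walk traced above can be mimicked with a dictionary standing in for physical memory. This is a sketch under loud assumptions: the table base addresses (`CR3`, `PDPT`, `PD`, `PT`) and the data frame are invented, and entries carry only an address, with the present, permission, and other flag bits of real page-table entries left out.

```python
ENTRY_SIZE = 8            # bytes per page-table entry
FRAME_MASK = ~0xFFF       # keep the table/frame base, drop the low 12 bits


def walk(memory, cr3, vaddr):
    """Return (physical address, number of memory reads) for a 4-level walk."""
    base, reads = cr3, 0
    for shift in (39, 30, 21, 12):                  # PML4, PDPT, PD, PT
        index = (vaddr >> shift) & 0x1FF
        entry = memory[base + index * ENTRY_SIZE]   # one 8-byte read per level
        reads += 1
        base = entry & FRAME_MASK                   # next table, or the final frame
    return base | (vaddr & 0xFFF), reads


# Hypothetical physical layout: four tables and one data frame.
CR3, PDPT, PD, PT, FRAME = 0x1000, 0x2000, 0x3000, 0x4000, 0x9A000
memory = {
    CR3 + 255 * 8: PDPT,
    PDPT + 510 * 8: PD,
    PD + 350 * 8: PT,
    PT + 209 * 8: FRAME,
}
paddr, reads = walk(memory, CR3, 0x7FFFABCD1240)
print(hex(paddr), reads)    # 0x9a240 4
```

The offset never enters the loop; it is OR-ed onto the frame at the end, which is the "offset bypasses the tree" property in code form.
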
Why huge pages help: a 2 MB page reduces the walk from 4 to 3 levels and means a single TLB entry covers 512× more memory. A 1 GB page goes down to 2 levels and covers 262,144× more. For workloads with multi-GB working sets, this can be the difference between a TLB hit rate of 99% and one of 60%.

The TLB pays for itself

Modern CPUs have a small TLB at L1 (~64–256 entries, accessed in 1 cycle) and a larger one at L2 (~1024–4096 entries, accessed in 7–10 cycles). A TLB hit is free — the translation is already in the cache. A miss kicks off a page walk; the cost depends on whether the page-table memory itself is in cache.

The two paths look nothing alike in cost. A hit returns the physical frame in a single cycle, in time for the load to use it in the same pipeline stage. A miss hands the address to the page-table walker, which issues the level reads itself, without interrupting the OS. Those reads hit the regular data caches, so the realistic cost of a miss is a handful of cache accesses, roughly 16 cycles when the page tables are warm in L1 or L2. If the page-table entries are cold and have to come from DRAM, each level read is a full memory access, and four of those is several hundred nanoseconds. The diagram below puts the two paths side by side.

[Figure: VA in (virtual page #) → TLB lookup. Hit: frame ready, ~1 cycle, fast path. Miss: walk, 4 reads (PML4, PDPT, PD, PT), ~16 cycles warm or ~300–500 ns cold, then refill the TLB and move on. One path is a cache; the other is the whole tree.]
The fast path is a TLB hit: one cycle, frame ready. The slow path is a hardware walk through all four levels, which then refills the TLB so the next access to that page is fast.
[Interactive cost model, default state: working set 256 MB, TLB 1024 entries, 4 KB pages. Pages needed: 65,536. TLB hit rate: 2% if accesses are random across the set. Average access cost: 19.8 cycles, including TLB.]
Try (working set 1024 MB, TLB 1024 entries, 4 KB pages) — pages-needed is 262,144 vs TLB capacity 1024, hit rate ≈ 0.4%. Average access cost balloons. Switch to 2 MB pages: pages-needed drops to 512, hit rate jumps to 100%, cost goes back to L1 speeds. This is what JVM tuning, Postgres huge_pages=on, and Linux transparent_hugepage=always are about.
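
A small model in the same spirit can be written in a few lines. This is a guess at the arithmetic, not the page's actual formula: it assumes uniformly random accesses, a 4-cycle cost on a TLB hit, and a 16-cycle warm walk added on a miss. The 4-cycle base is an assumption chosen because it gives 19.75 cycles for the 256 MB case, close to the 19.8 figure quoted here.

```python
KB, MB = 1 << 10, 1 << 20


def avg_access_cycles(working_set_bytes, tlb_entries, page_bytes,
                      hit_cycles=4, walk_cycles=16):
    """Return (pages needed, TLB hit rate, average cycles per access)."""
    pages = -(-working_set_bytes // page_bytes)        # ceiling division
    hit_rate = min(1.0, tlb_entries / pages)           # random access over the set
    return pages, hit_rate, hit_cycles + (1.0 - hit_rate) * walk_cycles


for ws, page in [(256 * MB, 4 * KB), (1024 * MB, 4 * KB), (1024 * MB, 2 * MB)]:
    pages, hit, cost = avg_access_cycles(ws, 1024, page)
    print(f"{ws // MB} MB set, {page // KB} KB pages: "
          f"{pages} pages, hit rate {hit:.1%}, {cost:.2f} cycles")
```

With 2 MB pages the 1024 MB working set needs only 512 pages, fewer than the 1024 entries, so the modelled hit rate saturates at 100% and the walk term vanishes.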

TLB sizes on real chips

Chip | L1 D-TLB | L1 I-TLB | L2 unified TLB | Notes
Apple M4 (P-core) | 192 | 256 | 4096 | Separate I/D L1 TLBs; unified L2; covers ~16 MB working set with 4 KB pages
Intel Raptor Lake (P-core) | 128 | 256 | 2048 | Plus 1024 1-GB-page entries in L2
AMD Zen 5 | 96 | 96 | 4096 | Bigger L2 than Intel; fewer L1 entries
AMD Zen 4 | 72 | 64 | 3072 | Predecessor to Zen 5
Apple M1 | 128 | 128 | 3072 | Established Apple's big-TLB approach

The pattern: Apple invests heavily in TLB capacity (matching its big-ROB, big-cache philosophy); AMD has caught up; Intel's ahead in 1-GB-page entries; everyone's growing. A modern L2 TLB at 4096 entries covers 16 MB of working set at 4 KB pages, or 8 GB at 2 MB pages. Beyond that, TLB miss rates dominate.

TLB coverage: the number that actually matters

Entry count is the headline figure, but the figure that predicts performance is TLB coverage: entries times page size. It is how much memory the TLB can map without a single walk. With 4 KB pages, a 1024-entry TLB covers 4 MB. A program whose hot data fits in 4 MB rarely misses; a program that touches 4 GB at random misses on almost every access, because only one access in a thousand finds its page already cached.

This is why page size is a hardware-performance lever, not just an allocation detail. Bigger pages do not make the TLB bigger, they make each entry cover more ground. One 2 MB entry maps what 512 separate 4 KB entries would, so the same 1024-entry TLB jumps from 4 MB of coverage to 2 GB. A 1 GB entry covers a full gigabyte by itself. For databases, JVMs, and machine-learning training, where the working set is measured in gigabytes, that multiplier is the difference between a TLB that does its job and one that is hopelessly undersized. The interactive cost model above lets you watch coverage collapse and recover as you change the page size.
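
The coverage figures in this section are one multiplication each. The helper names below are made up for the sketch, and the hit fraction assumes uniformly random accesses over the working set.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def tlb_coverage(entries, page_bytes):
    """Bytes of memory the TLB can map without a walk."""
    return entries * page_bytes


def random_hit_fraction(entries, page_bytes, working_set_bytes):
    """Share of uniformly random accesses whose page is already in the TLB."""
    return min(1.0, tlb_coverage(entries, page_bytes) / working_set_bytes)


print(tlb_coverage(1024, 4 * KB) // MB, "MB")       # 4 MB
print(tlb_coverage(1024, 2 * MB) // GB, "GB")       # 2 GB
print(random_hit_fraction(1024, 4 * KB, 4 * GB))    # 1/1024, about one in a thousand
```
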

There is a second-order effect worth knowing. CPUs keep separate TLB entries for each page size, and often separate structures for them. An L2 TLB might hold thousands of 4 KB entries but only a few hundred 1 GB entries. Mixing page sizes in one process can leave some of that capacity stranded. The practical rule: pick a page size that matches the working set, and do not assume the largest page is free, because the structure that maps it may be small.

Translation as the hardware boundary for protection

Process isolation is not a software policy that the kernel enforces by checking every pointer. It is a hardware property of the address space. Two processes can both hold the virtual address 0x401000 and never see each other's data, because each has its own page table and the MMU resolves the same virtual address to a different physical frame for each. The isolation is real because the only path from a virtual address to DRAM runs through that translation, and a process cannot name a physical frame it has no mapping for. There is no virtual address it can form that reaches another process's private memory.

The same lookup carries permission bits. Each page-table entry has a writable bit, a no-execute bit, a user/supervisor bit, and a few others. The MMU checks them on every access in parallel with the translation, at no extra latency. A store to a read-only page, a jump into a no-execute page, or a user-mode access to a kernel page all fault before the access completes. This is what makes constant data, executable code, and the kernel safe from a misbehaving program: the enforcement is in silicon, on the same critical path as the address itself, so it cannot be skipped for speed.

Switching processes means switching maps. On a context switch the OS loads a new value into CR3, which points the MMU at a different top-level table, and the new process now sees its own address space. The catch is that TLB entries from the old process are still cached, so they must not be trusted. Older designs flushed the whole TLB on every switch, which was correct but threw away every translation. Modern x86 and ARM tag each TLB entry with an address-space identifier (a PCID on x86, an ASID on ARM), so entries from different processes coexist in the TLB and a switch keeps both processes' translations warm. The same tagging is what lets the OS keep kernel mappings out of user space cheaply, which matters for the Meltdown fix below.
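
The difference between flushing and tagging shows up in a toy TLB keyed by (address-space id, virtual page number). Everything here is illustrative: `TaggedTLB`, `run`, and the two single-entry page tables are hypothetical, and capacity limits and eviction are left out.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (address-space id, virtual page number)."""

    def __init__(self, tagged=True):
        self.tagged = tagged
        self.entries = {}
        self.asid = 0
        self.walks = 0

    def switch_to(self, asid):
        """Model a context switch (a CR3 reload on x86)."""
        if not self.tagged:
            self.entries.clear()      # untagged design: flush everything
        self.asid = asid

    def lookup(self, vpn, page_table):
        key = (self.asid, vpn)
        if key not in self.entries:
            self.walks += 1           # miss: walk the page table, then refill
            self.entries[key] = page_table[vpn]
        return self.entries[key]


def run(tagged):
    tlb = TaggedTLB(tagged)
    tables = {1: {0x401: 0x9A}, 2: {0x401: 0x7C}}   # same VPN, different frames
    frames = []
    for asid in (1, 2, 1, 2):                        # two processes taking turns
        tlb.switch_to(asid)
        frames.append(tlb.lookup(0x401, tables[asid]))
    return frames, tlb.walks


print(run(tagged=True))     # ([154, 124, 154, 124], 2)
print(run(tagged=False))    # ([154, 124, 154, 124], 4)
```

Both variants return the right frame for each process; the tagged one simply pays for two walks instead of four, because each process finds its earlier translation still cached.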

How translation meets the cache: VIPT

Translation sits in front of memory, and the first thing past the MMU is the L1 cache. That creates a timing problem. Caches are indexed to pick the right set and tagged to confirm the right line. If the CPU waited for the full virtual-to-physical translation before it could even start the cache lookup, every load would pay the TLB latency before the cache latency, in series. The fix is a trick that lets the two happen at once, called VIPT: virtually indexed, physically tagged.

The insight is that the low bits of a virtual address are the page offset, and the offset is not translated, so those bits are the same in the virtual and physical address. If the cache index is taken entirely from offset bits, the CPU can start the cache lookup using the untranslated virtual address while the TLB resolves the frame number in parallel. By the time the cache has read out the candidate lines and needs a tag to compare, the TLB has produced the physical frame, and the tag check uses that physical address. Index and translate overlap; the load sees only the longer of the two latencies, not their sum. This is why L1 caches are fast and small, and why their organization is tied to the page size.

The constraint VIPT imposes shapes real cache geometry. For the index to fit inside the offset bits, the number of sets times the line size must not exceed one page. With 4 KB pages and 64-byte lines, that caps a VIPT cache at 64 sets, so an 8-way L1 lands around 32 KB, which is exactly the size that has held for years. Designers who want a larger L1 either add associativity, accept a few index bits that come from the translated part of the address (and deal with the aliasing that creates), or lean on larger pages. The full story of indexing, tags, and associativity lives on the caches page; the point here is that virtual memory and the cache are co-designed, not independent layers. Where translation sits in the broader latency picture is the subject of the memory hierarchy page.
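
The geometry constraint above reduces to a couple of divisions. The function names are invented for this sketch, and the virtual/physical address pair is a hypothetical example chosen only because both share the page offset 0x240.

```python
def max_vipt_sets(page_bytes, line_bytes):
    """Most sets a cache can have while indexing only with page-offset bits."""
    return page_bytes // line_bytes


def vipt_cache_bytes(page_bytes, line_bytes, ways):
    """Largest cache that satisfies the constraint at a given associativity."""
    return max_vipt_sets(page_bytes, line_bytes) * line_bytes * ways


def set_index(addr, line_bytes=64, sets=64):
    """Set selected by an address; uses only bits 6..11 with these defaults."""
    return (addr // line_bytes) % sets


print(max_vipt_sets(4096, 64))                   # 64
print(vipt_cache_bytes(4096, 64, 8) // 1024)     # 32 (KB)

va, pa = 0x7FFFABCD1240, 0x9A240                 # hypothetical VA/PA pair, same offset
print(set_index(va), set_index(pa))              # 9 9
```

Because the index comes entirely from offset bits, the virtual and physical address select the same set, which is what lets the lookup start before translation finishes.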

The physical address space

Virtual addresses are 48 bits on mainstream x86-64, giving each process 256 TB of address space. Installed physical memory is far smaller, because no machine has 256 TB of DRAM. Current cores implement roughly 46 to 52 physical address bits depending on the generation, which is far more than the largest servers shipping today populate. The gap between the two is the point: the virtual space can be sparse and enormous while physical memory stays small and dense, and translation is what bridges them.

That sparseness is what makes over-commit and demand paging possible at the hardware level. A process can reserve a huge virtual range, and only the pages it actually touches need a physical frame behind them; the rest have no mapping and cost nothing until used. When a process touches an unmapped page the MMU faults, the OS allocates a frame and installs the mapping, and the access retries and succeeds. The physical space is also where translation hands off to the rest of the system: a physical address can name DRAM, a memory-mapped device register, or a region another core also maps, and the same frame number can appear in two processes' page tables to give them shared memory. All of that is built on the one primitive, a per-process map from a large virtual space onto a smaller physical one.
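
Demand paging as described above can be imitated with a page table that starts empty and gains a mapping on first touch. This is a toy: the `AddressSpace` class, its sequential frame allocator, and the 1 TB reservation are assumptions for illustration, and a real OS would also zero the frame, check permissions, and possibly evict other pages.

```python
PAGE = 4096


class AddressSpace:
    """Toy process: a large reserved range, frames assigned on first touch."""

    def __init__(self, reserved_bytes):
        self.reserved_pages = reserved_bytes // PAGE
        self.page_table = {}      # vpn -> frame; absent means "no mapping yet"
        self.next_frame = 0
        self.faults = 0

    def touch(self, vaddr):
        vpn = vaddr // PAGE
        if vpn >= self.reserved_pages:
            raise MemoryError("address outside the reserved range")
        if vpn not in self.page_table:                 # the MMU would fault here
            self.faults += 1
            self.page_table[vpn] = self.next_frame     # the "OS" installs a mapping
            self.next_frame += 1
        return self.page_table[vpn] * PAGE + vaddr % PAGE


space = AddressSpace(reserved_bytes=1 << 40)           # reserve 1 TB of virtual space
for addr in (0x0, 0x10, 0x5000, 0x5008, 0x7FFFF000):
    space.touch(addr)
print(space.faults, len(space.page_table))             # 3 3
```

Five accesses land on three distinct pages, so three faults occur and three frames are in use, while the rest of the reserved terabyte costs nothing.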

Huge pages — when and why

x86-64 supports three page sizes: 4 KB (default), 2 MB, and 1 GB. The huge-page variants reduce the walk depth (3 levels for 2 MB, 2 for 1 GB), reduce TLB pressure (one entry covers more memory), and reduce page-fault frequency. The downsides:

  • Memory overhead. The allocation granularity becomes 2 MB, so storing 4 KB of data can cost 2 MB of physical memory. For sparse workloads with small allocations, this is severe.
  • Fragmentation. Linux's "transparent huge pages" (THP) tries to allocate huge pages opportunistically and falls back to 4 KB when contiguous physical memory isn't available. THP can cause unpredictable latency spikes (the kernel's khugepaged compacts in the background, sometimes blocking user-space).
  • Less granular swapping. A 2 MB page can't be split when the OS wants to swap part of it out.
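
The memory-overhead downside in the list above is a rounding effect and easy to quantify. The sketch assumes the worst case, where a small allocation ends up alone on its own page; `physical_cost` is a name invented here.

```python
KB, MB = 1 << 10, 1 << 20


def physical_cost(alloc_bytes, page_bytes):
    """Physical memory consumed when an allocation is rounded up to whole pages."""
    pages = -(-alloc_bytes // page_bytes)      # ceiling division
    return pages * page_bytes


small = 4 * KB
for page in (4 * KB, 2 * MB):
    cost = physical_cost(small, page)
    print(f"{page // KB} KB pages: {cost // KB} KB backing "
          f"{small // KB} KB of data ({cost // small}x)")
```

With 2 MB pages the 4 KB allocation is backed by 2048 KB, a 512× overhead, which is why sparse workloads with many small allocations suffer.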

Production guidance in 2026: explicit huge pages (HugeTLB on Linux, Windows Large Pages) for databases, JVMs, ML training. Transparent huge pages set to madvise mode so applications can opt in. Avoid always mode in latency-sensitive systems unless you've measured it.

TLB shootdowns

When the OS unmaps a page (munmap, page reclaim, COW splitting), every CPU that might have a translation for that page in its TLB needs to invalidate it. There's no hardware coherence for TLBs across cores, so the OS has to do it via an inter-processor interrupt: the TLB shootdown. The initiating core sends an IPI; every receiving core stops what it's doing, executes INVLPG for the affected addresses, and acknowledges. Total cost: thousands of cycles in the best case, tens of thousands when contended.
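
The protocol can be sketched as a toy simulation. The `Core` class and `unmap` function are hypothetical, and the sketch interrupts every other core unconditionally, the conservative reading of "every CPU that might have a translation"; it models the message pattern only, not the timing.

```python
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.tlb = set()          # virtual page numbers this core has cached
        self.ipis_received = 0

    def handle_shootdown(self, vpn):
        self.ipis_received += 1
        self.tlb.discard(vpn)     # stands in for INVLPG on the affected address
        return True               # acknowledgement to the initiator


def unmap(vpn, page_table, initiator, cores):
    """Remove a mapping and invalidate it everywhere; returns IPIs sent."""
    page_table.pop(vpn, None)
    initiator.tlb.discard(vpn)            # local invalidate needs no IPI
    acks = [c.handle_shootdown(vpn) for c in cores if c is not initiator]
    assert all(acks)                      # initiator waits for every acknowledgement
    return len(acks)


cores = [Core(i) for i in range(8)]
page_table = {0x401: 0x9A}
for c in cores[:3]:
    c.tlb.add(0x401)                      # three cores have the translation cached
sent = unmap(0x401, page_table, cores[0], cores)
print(sent)                               # 7
```

Only three cores held the translation, yet seven were interrupted. That gap is one reason shootdown cost grows with core count.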

TLB shootdowns are a notorious source of latency spikes in multi-threaded workloads that frequently allocate and free memory. Fixes:

  • Pool allocators. Reuse memory instead of returning it to the OS. jemalloc, tcmalloc, and the Go runtime all do this.
  • Larger huge pages. Fewer pages means fewer shootdowns.
  • NUMA-local allocation. Shootdowns within one NUMA node are cheap; across nodes are expensive.
  • Hardware shootdown extensions. Recent ARMv8 chips have TLBI instructions that broadcast invalidations without IPIs. Intel and AMD are catching up.

Meltdown — the speculative VM crossing

Meltdown (CVE-2017-5754) is an attack on the boundary between user-mode and kernel-mode memory. Both share the same page tables: kernel pages are mapped into every process for performance, with the user/supervisor bit cleared (supervisor-only) so they're inaccessible from user code. The CPU enforces this on retire, but speculatively executes the load before checking. Meltdown abuses this by speculatively reading a kernel address into a register, then using that value as an index into a user-readable array. The architectural load is squashed, but the cache-line touch persists. Time the array access; the index that was fast tells you the kernel byte.

The fix on Intel and ARM: Kernel Page-Table Isolation (KPTI / KAISER). Two separate page tables — one for user mode without kernel mappings, one for kernel mode with full mappings. Switch on every kernel entry. Cost: ~5–30% throughput on syscall-heavy workloads. AMD chips weren't vulnerable to Meltdown (they enforced the user-mode bit on speculation), so they didn't pay the KPTI cost. Apple silicon is similarly safe by design.

Why this is in the virtual-memory chapter: Meltdown is fundamentally a virtual-memory bug. The shared page table that made syscalls fast also enabled the attack. Modern OSes have separated user and kernel address spaces — at a real cost — to defend against it.

5-level paging (LA57)

48-bit virtual addresses cover 256 TB per process. That's enough for now, but cloud servers with 12 TB or more of physical memory are starting to appear, and software generally wants a virtual address space around 10× larger than physical memory. LA57 (Linear Addressing with 57 bits) adds a fifth level above PML4, called PML5. Total virtual address space: 128 PB. Walk cost: 5 levels instead of 4.

Available on Intel Ice Lake-SP (2021) and later, AMD Genoa (2022) and later. Linux supports it as of kernel 5.5, but it's typically disabled at boot unless the machine actually needs >48 bits. The 25% extra walk cost is real, so most workloads stay on 4-level paging.
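
The address-space sizes and the 25% figure follow directly from the field widths: 9 index bits per level plus a 12-bit offset. The helper name is made up for this sketch.

```python
TB, PB = 1 << 40, 1 << 50


def va_bits(levels, index_bits=9, offset_bits=12):
    """Virtual address width for a given number of page-table levels."""
    return levels * index_bits + offset_bits


for levels in (4, 5):
    size = 1 << va_bits(levels)
    print(f"{levels} levels: {va_bits(levels)} bits, {size // TB} TB")

extra_walk = 5 / 4 - 1        # one more level read on a full walk
print(f"{extra_walk:.0%} more level reads per walk")
```

Five levels give 57 bits, which is 131,072 TB or 128 PB, at the price of one extra read on every full walk.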

Common misconceptions

  • "Page faults and TLB misses are the same thing." No. A TLB miss is a hardware event — the CPU walks the page table, finds the translation, fills the TLB, continues. Cost: ~16 cycles. A page fault is a software event — the OS has to handle it (allocate physical memory, load from disk, COW, etc). Cost: 1 µs to 5 ms.
  • "Virtual memory means swapping." Swapping uses the virtual memory machinery, but virtual memory predates swapping by decades and most modern systems don't swap heavily. Virtual memory's primary purpose is process isolation and over-commit, not paging out to disk.
  • "Huge pages are always faster." Not for sparse workloads with small allocations. The 2 MB minimum allocation can waste enormous memory. Measure before turning on transparent huge pages.
  • "The kernel uses physical addresses." Mostly false. Modern kernels run with virtual memory enabled and use a pre-populated "kernel virtual address" map. Direct physical access happens only in early boot and in some I/O paths (DMA).

Numbers worth remembering

Quantity | Value | Notes
x86-64 page size (default) | 4 KB | 2¹² bytes
x86-64 huge page sizes | 2 MB, 1 GB | Different walk depths
x86-64 page-table levels (default) | 4 | PML4 → PDPT → PD → PT
x86-64 page-table levels (LA57) | 5 | Adds PML5 above; 128 PB address space
L1 TLB size, mainstream | ~64–256 entries | 1-cycle access
L2 TLB size, mainstream | ~1024–4096 entries | ~7–10 cycle access
TLB miss penalty (cached page table) | ~16 cycles | 4 sequential L1/L2 hits
TLB miss penalty (cold page table) | ~300–500 ns | 4 DRAM accesses
Page fault cost (clean fault) | ~1–5 µs | Allocate physical page, zero, return
Page fault cost (read from disk) | ~10 µs – 5 ms | Depends on storage backend
Meltdown disclosed | January 2018 | Same disclosure as Spectre
KPTI throughput cost | ~5–30% | Syscall-heavy workloads worst-affected
