Virtual Memory Simulator, end to end

Virtual memory gives each process its own flat address space and lets the CPU and kernel translate those virtual pages to physical frames on demand. Here a process has 16 virtual pages, the hardware has 6 physical frames and a 4-entry TLB, and the kernel has a tiny swap area. Click any virtual page. The translation either hits the TLB (1 cycle), walks the page table (50 ns), takes a minor fault (1 µs), or takes a major fault from swap (80 µs). All four happen on a real CPU, all the time.

Virtual address space (click to touch a page · shift+click = write)

TLB (4 entries, LRU)

— empty —

Physical frames

ppn	vpn	lru
p0	—	—
p1	—	—
p2	—	—
p3	—	—
p4	—	—
p5	—	—

Stats · 0 TLB hit · 0 walk · 0 minor · 0 major · 0 evict · 0 cow

Process spawned. Virtual address space empty (all PTEs invalid).

Try this

Click "preset: sequential 8". Watch the first 6 pages become resident; the 7th and 8th evict the LRU pages (0 and 1). TLB warms up.
Click "preset: thrashing". 10 pages walk through 6 frames — every access faults, every frame evicts a different page each time. Hit rate ≈ 0.
Click "preset: fork() + write". After fork(), every page is COW. The write to page 2 triggers a copy; the read of page 3 reuses the shared frame.
Manually shift+click an evicted page. Watch it return as a major fault from swap (or as a minor fault if it was clean and just discarded).

The four kinds of fault

Minor. Page was never in memory but doesn't need disk — usually zero-fill for fresh anon pages. ~1 µs.

Major. Page is in swap or backed by a file not in cache. Real disk read. ~50–500 µs depending on storage.

COW. Page is mapped but read-only; on write, kernel duplicates and remaps writeable. ~1 µs plus a 4 KiB copy.

Invalid. No VMA covers the address — SIGSEGV. The exception your null pointer becomes.
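The four cases above can be read as a decision over the state of a page-table entry. The sketch below is only an illustration: the `PTE` fields and the `classify` function are hypothetical names, not a real kernel layout, and it assumes a non-present page is either in swap (major) or a fresh zero-fill (minor).

```python
from dataclasses import dataclass

@dataclass
class PTE:
    """Toy page-table entry; field names are illustrative only."""
    present: bool = False    # a physical frame is mapped
    writable: bool = False   # hardware permits writes
    cow: bool = False        # shared after fork(), copy on first write
    in_swap: bool = False    # contents live in swap rather than RAM

def classify(pte, is_write):
    """Return the kind of fault an access raises, or "none".

    Pass pte=None when no VMA covers the address.
    """
    if pte is None:
        return "invalid"                      # becomes SIGSEGV
    if not pte.present:
        return "major" if pte.in_swap else "minor"
    if is_write and not pte.writable:
        return "cow" if pte.cow else "invalid"
    return "none"
```

For example, `classify(PTE(present=True, cow=True), is_write=True)` returns `"cow"`, while the same entry read without writing returns `"none"`.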

What you're looking at

The top grid is the process's 16 virtual pages; the two tables below are the hardware the kernel maps them onto — a 4-entry TLB and 6 physical frames — plus a small swap area. Click a page to read it, shift+click to write. Each touch runs the real translation path: a TLB hit (one cycle), a page-table walk (50 ns), a minor fault that zero-fills a fresh frame (1 µs), or a major fault that reads back from swap (80 µs). The log narrates every step and the stats line tallies hits, walks, faults, evictions, and copy-on-writes.

Click "preset: sequential 8" first. Six pages become resident, then the seventh and eighth force the LRU pages out, and the TLB warms up. Now click "preset: thrashing": ten pages cycle through six frames and the hit rate collapses toward zero — every access faults because the working set no longer fits. What should surprise you is how sharp that cliff is. Six frames cope; the moment the active set is seven, the machine spends its time shuffling pages instead of running your code. The fork() preset shows the other trick — every page goes copy-on-write, and only the page you write gets its own copy.
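The cliff is easy to reproduce outside the simulator. Here is a minimal sketch of LRU replacement over six frames; the `run` helper is hypothetical, and it assumes the thrashing preset touches its ten pages in a repeating cycle.

```python
from collections import OrderedDict

def run(accesses, frames=6):
    """Replay page touches through `frames` LRU-managed frames.

    Returns (fault_count, evicted_pages_in_order).
    """
    resident = OrderedDict()   # vpn -> None, least recently used first
    faults, evicted = 0, []
    for vpn in accesses:
        if vpn in resident:
            resident.move_to_end(vpn)          # a hit refreshes the LRU position
            continue
        faults += 1
        if len(resident) == frames:
            victim, _ = resident.popitem(last=False)   # evict the LRU page
            evicted.append(victim)
        resident[vpn] = None
    return faults, evicted

print(run(list(range(8))))          # sequential 8
print(run(list(range(6)) * 3)[0])   # working set of 6, three passes
print(run(list(range(7)) * 3)[0])   # working set of 7, three passes
print(run(list(range(10)) * 3)[0])  # the thrashing case
```

With six pages the three passes cost 6 faults in total; with seven they cost 21, one per access, because LRU always evicts the page the loop needs next.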

Every pointer is a lie

The address in your variable is not the address where the bytes live.

Virtual memory exists because four problems all wanted solving at once. Without it, every process would have to know where it sat in physical RAM and arrange not to collide with its neighbours. Every program would need its own loader-time relocation. Every process would have to trust every other process not to overwrite it. And no program could ever use more memory than the machine physically had. Virtual memory is the kernel's promise to handle all four by lying to every process about where its bytes live.

The mechanism: every memory access goes through the MMU, which translates virtual addresses into physical ones using per-process page tables. On x86_64 those tables are four levels deep (PML4 → PDPT → PD → PT) and each level is itself a 4 KiB page of 512 entries. Walking them costs a memory access per level — 200+ cycles for a full miss. The TLB exists to cache the translation so the next access to the same page is one cycle, not 200.
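To make the four levels concrete, here is a small sketch of how a 48-bit x86_64 virtual address splits into four 9-bit table indices (512 entries each) plus a 12-bit page offset. The function name is made up for illustration, and it assumes 4 KiB pages with 4-level paging.

```python
def split_x86_64(vaddr):
    """Split a virtual address into (pml4, pdpt, pd, pt, offset).

    Assumes 4 KiB pages and 4-level paging: 9 + 9 + 9 + 9 + 12 = 48 bits.
    """
    offset = vaddr & 0xFFF            # low 12 bits: byte within the page
    pt = (vaddr >> 12) & 0x1FF        # index into the page table
    pd = (vaddr >> 21) & 0x1FF        # index into the page directory
    pdpt = (vaddr >> 30) & 0x1FF      # index into the PDPT
    pml4 = (vaddr >> 39) & 0x1FF      # index into the top-level PML4
    return pml4, pdpt, pd, pt, offset

print(split_x86_64(0x00007FFFFFFFF123))
```

Each index selects one of 512 entries in that level's table, which is why a full walk costs one memory access per level.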

Three downstream consequences follow. Processes are isolated — your page table doesn't have an entry for my pages, so your code can't address them at all. Sharing is opt-in — mmap(MAP_SHARED) tells the kernel to install the same physical page in two address spaces. Overcommit is possible — the kernel can promise more virtual address space than physical RAM, and as long as processes don't all touch every page at once, swap absorbs the rest. The third is the source of every "my server got OOM-killed" debugging story.
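The isolation and opt-in sharing claims can be checked from user space. The sketch below assumes a Unix-like system where `os.fork` is available; the helper name is hypothetical. It maps one anonymous page, forks, lets the child overwrite it, and reports whether the parent sees the change.

```python
import mmap
import os

def child_write_visible_to_parent(flags):
    """Return True if a write made by a forked child shows up in the parent."""
    buf = mmap.mmap(-1, mmap.PAGESIZE, flags=flags)   # anonymous mapping
    buf[0:6] = b"parent"
    pid = os.fork()
    if pid == 0:                      # child process
        try:
            buf[0:6] = b"child!"
        finally:
            os._exit(0)               # leave without running parent cleanup
    os.waitpid(pid, 0)                # parent waits for the child to finish
    seen = bytes(buf[0:6])
    buf.close()
    return seen == b"child!"

print(child_write_visible_to_parent(mmap.MAP_SHARED))
print(child_write_visible_to_parent(mmap.MAP_PRIVATE))
```

With `MAP_SHARED` both address spaces point at the same physical page, so the parent observes the child's write. With `MAP_PRIVATE` the child's write lands in its own copy and the parent's page is untouched.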

Copy-on-write made fork() free

The reason your shell can spawn a thousand processes a second.

A naive fork() would have to copy the parent's entire address space — a multi-gigabyte process forking would mean multi-gigabyte allocation and copy on every call. Early Unix did exactly this; the only escape was vfork(), which required the child to immediately exec() and behave well in between.

Copy-on-write turned that cost from "as much as the address space" to "as much as the pages the child writes to." After fork(), the kernel duplicates the page tables but marks every writeable page read-only in both parent and child, with a COW flag set. The first write to any page triggers a fault; the handler allocates a fresh frame, copies the original page's contents, and remaps that page writeable in the writing process only. The other process still sees the original. A child that calls exec() immediately throws its address space away and pays zero copy cost.
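The bookkeeping described above fits in a few lines. This is a toy model, not kernel code: `Frames` and `Process` are hypothetical classes, a reference count stands in for the COW flag, and page contents are plain strings.

```python
class Frames:
    """Toy physical memory: frame id -> contents, plus a reference count."""
    def __init__(self):
        self.data, self.refs, self.next_id = {}, {}, 0

    def alloc(self, contents):
        fid = self.next_id
        self.next_id += 1
        self.data[fid], self.refs[fid] = contents, 1
        return fid

class Process:
    def __init__(self, mem, table=None):
        self.mem = mem
        self.table = table if table is not None else {}   # vpn -> [frame, writable]

    def map_page(self, vpn, contents):
        self.table[vpn] = [self.mem.alloc(contents), True]

    def fork(self):
        child_table = {}
        for vpn, entry in self.table.items():
            entry[1] = False                       # parent loses write permission
            child_table[vpn] = [entry[0], False]   # child shares the same frame
            self.mem.refs[entry[0]] += 1
        return Process(self.mem, child_table)      # only the table is duplicated

    def read(self, vpn):
        return self.mem.data[self.table[vpn][0]]

    def write(self, vpn, contents):
        entry = self.table[vpn]
        frame = entry[0]
        if not entry[1]:                           # write fault
            if self.mem.refs[frame] > 1:           # still shared: copy first
                self.mem.refs[frame] -= 1
                frame = self.mem.alloc(self.mem.data[frame])
                entry[0] = frame
            entry[1] = True                        # remap writable for this process
        self.mem.data[frame] = contents

mem = Frames()
parent = Process(mem)
for vpn in range(4):
    parent.map_page(vpn, f"p{vpn}")
child = parent.fork()            # allocates no frames
child.write(2, "new")            # allocates exactly one
print(parent.read(2), child.read(2), mem.next_id)
```

Forking four pages allocates nothing; the single write allocates one frame, and every page the child only reads keeps pointing at the parent's frame.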

This is the entire reason Redis's BGSAVE works: the parent process keeps serving while the child snapshots its memory to disk, and the cost is only the pages the parent writes during the snapshot (typically a small fraction of the dataset). Containers depend on the same mechanism — every docker run is a clone() with extra namespace flags, and the COW behaviour is what lets the new container start cheaply.

When the working set won't fit, performance falls off a cliff

Not gracefully. Catastrophically.

A program's working set is the set of pages it actually touches over some short window of time. If the working set fits in physical memory, every access is cheap: a TLB hit or a page walk, both well under a microsecond. If the working set exceeds physical memory by even one page, the kernel evicts a page every time a new one comes in, and if the workload is loopy, the evicted page is exactly the one needed next.

This is thrashing. The thrashing preset above triggers it deliberately: 10 pages cycled through 6 frames. Every access faults. Every fault evicts a page that will be needed within the next few accesses. Throughput collapses by 4-5 orders of magnitude because you've replaced sub-microsecond accesses with sub-millisecond ones. The CPU spends almost all its time in the page-fault handler. In production this presents as "my service was fine all day, then suddenly the latency went from 5 ms to 5 s and the CPU is at 100% but the kernel says 80% iowait". That is thrashing through swap.
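A back-of-the-envelope check of that collapse, using the simulator's own latencies. The 4 GHz clock (so one cycle is 0.25 ns) and the 99% TLB hit rate for the healthy case are assumptions chosen for illustration, not measurements.

```python
import math

CYCLE_NS = 0.25        # assumption: 4 GHz clock, 1 cycle = 0.25 ns
WALK_NS = 50.0         # page-table walk, from the simulator's figures
MAJOR_NS = 80_000.0    # major fault from swap, 80 µs

def mean_access_ns(tlb_hit_rate):
    """Average translation cost when every touched page is resident."""
    return tlb_hit_rate * CYCLE_NS + (1 - tlb_hit_rate) * WALK_NS

healthy = mean_access_ns(0.99)        # assumed 99% TLB hit rate
slowdown = MAJOR_NS / healthy         # every access is now a major fault
orders = math.log10(slowdown)
print(f"{healthy:.4f} ns -> {MAJOR_NS:.0f} ns, about 10^{orders:.1f} slower")
```

Under these assumptions the healthy average is under one nanosecond and the thrashing case is about five orders of magnitude slower. A lower TLB hit rate in the healthy case narrows the gap.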

The two ways out: reduce the working set (use a smaller dataset, stream rather than load, call madvise() with MADV_DONTNEED on mapped regions to drop pages you're done with) or add memory (more RAM, or fewer concurrent workloads on the box). The kernel's heuristics (swappiness, the WSS estimator, MGLRU) can shift the cliff but cannot abolish it. Treat physical memory as a hard architectural constraint and the curve never bites.
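As a sketch of dropping pages you are done with, the snippet below uses Python's `mmap.madvise` (available since Python 3.8) on an anonymous private mapping. It assumes Linux semantics, where `MADV_DONTNEED` on such a mapping discards the pages and a later read gets a fresh zero-filled page. Other systems may treat the hint differently, and the helper name is hypothetical.

```python
import mmap

def drop_and_reread(size=mmap.PAGESIZE * 4):
    """Write to an anonymous private mapping, drop it, then read it back."""
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    buf[:] = b"\xff" * size              # touching the pages makes them resident
    before = buf[0]
    buf.madvise(mmap.MADV_DONTNEED)      # tell the kernel the frames can be reclaimed
    after = buf[0]                       # Linux: refaults as a zero-filled page
    buf.close()
    return before, after

print(drop_and_reread())
```

The data is gone after the call, so this only suits buffers whose contents are no longer needed. The read afterwards is a minor fault, not a trip to swap.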
