Syscall Journey Simulator: a read(), end to end
One read(fd, buf, 16) call. Twenty stages between your program asking and the
bytes arriving. Most of them are invisible to strace; some of them dwarf the others
by four orders of magnitude. Click run.
Each row in the trace is one stage a single read(fd, buf, 16) passes through: the libc wrapper, the syscall instruction that drops the CPU from ring 3 to ring 0, the KPTI page-table switch, VFS dispatch, ext4, the page cache, the block layer, the NVMe queue, the drive itself, then DMA and copy_to_user on the way back. The cycle counts are realistic estimates, and the totals at the top convert them to wall-clock time.
Run it once with the cache set to hit and note the total: on the order of a microsecond, with the kernel doing everything in memory. Then switch to miss and run again: one stage, the wait on the SSD, swallows almost the entire budget and outweighs the cheapest stages by up to four orders of magnitude. Try a socket or a pipe to see how much of the path disappears. Two things should surprise you: how many of these stages strace will never show you, and how expensive even the "free" cache-hit path is next to a plain function call, which is exactly why batching and io_uring exist.
A syscall is a controlled fall
You're not calling a function. You're throwing the CPU into a different universe.
From the program's point of view, read(fd, buf, n) looks like an ordinary function
call into glibc. What actually happens at the bottom of that call is the
syscall instruction: two bytes (0F 05) that pull a lever the
rest of userspace cannot pull. The CPU's privilege level flips from ring 3 to ring 0 and
execution resumes at the address stored in MSR_LSTAR, the syscall entry point the kernel
installed at boot. The first instructions there finish the transition: swapgs exchanges
the GS segment base (so the kernel can find its per-CPU data), and the stack pointer is
switched to a per-CPU kernel stack.
None of this is free. On a modern Intel CPU, the bare trap costs about 80–100 cycles each way; with KPTI (the Meltdown mitigation) toggled on it's closer to 250 each way, because the CR3 register has to swap between two distinct page tables — one for userspace, one for the kernel. The TLB rebuilds partially on each switch unless PCID is in use to tag entries by address space. This is why your syscall-heavy benchmark got 5–15% slower after the 2018 kernel updates and never recovered.
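To put those cycle counts in wall-clock terms, here is a minimal sketch of the arithmetic. The 3 GHz clock, the 1,000-cycle workload, and the function names are assumptions for illustration; the per-trap figures are the estimates quoted above, not measurements.

```python
CPU_GHZ = 3.0  # assumed clock rate; adjust for your machine


def round_trip_ns(cycles_each_way: float, ghz: float = CPU_GHZ) -> float:
    """Entry plus exit cost of one syscall trap, in nanoseconds."""
    # A clock of N GHz executes N cycles per nanosecond.
    return 2 * cycles_each_way / ghz


def trap_share(cycles_each_way: float, work_cycles: float) -> float:
    """Fraction of one syscall's total cycles spent purely on the trap."""
    trap = 2 * cycles_each_way
    return trap / (trap + work_cycles)


if __name__ == "__main__":
    for label, cycles in (("bare trap", 100), ("with KPTI", 250)):
        print(
            f"{label}: {round_trip_ns(cycles):.0f} ns per round trip, "
            f"{trap_share(cycles, 1000):.0%} of a call doing 1,000 cycles of work"
        )
```

The smaller the amount of real work per call, the larger the share the trap takes, which is why syscall-heavy code felt the KPTI change most.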
read() is a vtable call
The "everything is a file" promise, paid for in indirection.
Once inside the kernel, __x64_sys_read looks up the file descriptor in the
task's open-file table, gets a struct file *, and calls
file->f_op->read_iter(). That f_op is a vtable;
the implementation depends on what the fd refers to. A regular ext4 file dispatches into
ext4_file_read_iter, which calls generic_file_read_iter, which
either hits the page cache or schedules a disk read. A TCP socket dispatches into
tcp_recvmsg, which walks sk_receive_queue for packets. A pipe
dispatches into pipe_read, which reads from a circular buffer and sleeps when
it's empty. The syscall is the same; the universe behind it is not.
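The dispatch is easy to observe from userspace. The sketch below issues the same 16-byte read against three kinds of descriptor using Python's os.read, a thin wrapper over read(2). The helper names are hypothetical, the temporary file may live on a filesystem other than ext4, and socket.socketpair() creates a Unix-domain pair, so the kernel takes the Unix-socket receive path here rather than tcp_recvmsg.

```python
import os
import socket
import tempfile


def read_16(fd: int) -> bytes:
    """One read(fd, buf, 16). The caller never says what kind of fd it is."""
    return os.read(fd, 16)


def read_from_three_kinds_of_fd() -> dict:
    results = {}

    # Regular file: served by the filesystem's read_iter implementation.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"from a regular file")
        os.lseek(fd, 0, os.SEEK_SET)  # rewind so the read starts at byte 0
        results["file"] = read_16(fd)
    finally:
        os.close(fd)
        os.unlink(path)

    # Pipe: served from the kernel's in-memory pipe buffer.
    r, w = os.pipe()
    try:
        os.write(w, b"from a pipe")
        results["pipe"] = read_16(r)
    finally:
        os.close(r)
        os.close(w)

    # Socket: served from the socket's receive queue.
    a, b = socket.socketpair()
    try:
        a.sendall(b"from a socket")
        results["socket"] = read_16(b.fileno())
    finally:
        a.close()
        b.close()

    return results


if __name__ == "__main__":
    for kind, data in read_from_three_kinds_of_fd().items():
        print(kind, data)
```

The file read returns exactly 16 bytes because more were available; the pipe and socket reads return fewer because read() hands back whatever is queued, up to the requested size.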
This is why read() on a socket "works the same way" as read()
on a file. The VFS — the virtual filesystem layer — exists to make that lie convincing.
And it's why specialised systems like databases sometimes bypass it: a database with its
own page cache wants the kernel to copy bytes from disk into the database's buffer without
passing through the kernel's cache (that's what O_DIRECT does), and a
zero-copy network server wants the kernel to splice bytes from one fd to another without
ever materializing them in userspace (that's what sendfile() and
splice() do).
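As a rough illustration of the zero-copy case, the sketch below pushes a file into a socket with os.sendfile, Python's wrapper over sendfile(2). It assumes Linux; the function names and the 4 KiB payload are made up for the example, and a Unix-domain socket pair stands in for a real network connection.

```python
import os
import socket
import tempfile


def send_file(path: str, sock: socket.socket) -> int:
    """Send a whole file to a connected socket via sendfile(2); returns bytes sent."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # The kernel moves the bytes from the file to the socket itself;
            # they never land in a buffer owned by this process.
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:  # unexpected end of file
                break
            offset += sent
    return offset


def sendfile_demo() -> bytes:
    payload = b"x" * 4096
    fd, path = tempfile.mkstemp()
    a, b = socket.socketpair()
    try:
        os.write(fd, payload)
        sent = send_file(path, a)
        # Read back what arrived on the other end of the pair.
        return b.recv(sent, socket.MSG_WAITALL)
    finally:
        os.close(fd)
        os.unlink(path)
        a.close()
        b.close()


if __name__ == "__main__":
    print(len(sendfile_demo()), "bytes arrived without a userspace copy on the sending side")
```

The loop is there because sendfile may transfer fewer bytes than requested; a production server would also handle non-blocking sockets and errors.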
250,000 cycles in the SSD, 1,500 everywhere else
Cold reads aren't slow because of the syscall path. They're slow because of physics.
Tally the stages on a cache-hit read: it's a few hundred cycles in the syscall trap, a few
hundred more in VFS and ext4, eighty in the actual copy_to_user. The whole
thing is under 1,500 cycles — well under a microsecond on a 3 GHz CPU.
Tally a cache-miss read on a fast NVMe SSD and the picture flips entirely. The block-layer submission, doorbell write, and SSD-side flash lookup add roughly a quarter-million cycles, with the device dominating: typical 4 KiB read latency is 50–100 microseconds on a good drive (150,000–300,000 cycles at 3 GHz), and that's the floor. Everything else in the syscall path is rounding error in comparison. This is the entire argument for the page cache, for prefetching, for io_uring (which lets you have many outstanding requests at once so the device-side wait can overlap), and for keeping working sets in RAM. The CPU is enormously faster than the disk, and the kernel's design is mostly an attempt to hide that fact.
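The overlap argument can be made concrete with a little arithmetic. This is an idealised sketch: the 80 µs latency is an assumed value inside the 50–100 µs range above, the function names are hypothetical, and it pretends requests overlap perfectly, which a real drive only does up to its internal parallelism.

```python
CPU_HZ = 3.0e9  # the 3 GHz clock assumed in the text


def us_to_cycles(latency_us: float, hz: float = CPU_HZ) -> float:
    """CPU cycles that elapse while waiting latency_us microseconds."""
    return latency_us * hz / 1e6


def reads_per_second(latency_us: float, queue_depth: int) -> float:
    """Idealised throughput when queue_depth requests are always in flight."""
    # Each request still takes latency_us, but queue_depth of them overlap.
    return queue_depth * 1e6 / latency_us


if __name__ == "__main__":
    for us in (50, 100):
        print(f"{us} us of device wait = {us_to_cycles(us):,.0f} cycles")
    for depth in (1, 32):
        print(f"queue depth {depth}: {reads_per_second(80, depth):,.0f} reads/s")
```

With one request at a time the CPU idles for the full device latency on every read; keeping many in flight does not shorten any single read, it just stops the waits from adding up.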
Why this matters in real code
Three patterns whose causes live in this diagram.
Why strace slows your program 10×. Every syscall stops twice
at ptrace — once on entry, once on exit. Each stop wakes the tracer, copies the register
state, lets the tracer inspect, then schedules the tracee back in. On syscall-heavy code
(networking, log shipping, anything with a tight read/write loop), each call goes from
~1 µs to ~10 µs.
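One way to see the slowdown for yourself is a tight loop around a cheap syscall. The sketch below is an assumption-laden microbenchmark: the function name and iteration count are arbitrary, and the figure it prints includes Python interpreter overhead, so only the ratio between a plain run and a traced run is meaningful.

```python
import os
import time


def ns_per_syscall(calls: int = 200_000) -> float:
    """Average wall-clock nanoseconds per os.getppid() call."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        os.getppid()  # a cheap call that traps into the kernel on Linux
    return (time.perf_counter_ns() - start) / calls


if __name__ == "__main__":
    print(f"{ns_per_syscall():.0f} ns per getppid() call, interpreter overhead included")
```

Save it under any name (bench.py here is just a placeholder), run it once directly and once as `strace -f -o /dev/null python3 bench.py`, and compare the two numbers.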
Why gettimeofday can run tens of millions of times a second. It never
takes the path in the diagram above. The kernel maps a tiny page (the vDSO) into every
process, and gettimeofday calls a function in that page that reads shared
clock data directly. No trap, no ring switch, no KPTI cost. Same trick covers
clock_gettime and a handful of others.
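The vDSO mapping is visible in any process's address space. This small sketch, assuming Linux with /proc mounted, looks for it in /proc/self/maps; the function name is hypothetical, and it returns None where the file or the mapping is absent.

```python
from typing import Optional


def vdso_address_range(maps_path: str = "/proc/self/maps") -> Optional[str]:
    """Return the 'start-end' address range of the [vdso] mapping, if any."""
    try:
        with open(maps_path) as maps:
            for line in maps:
                fields = line.split()
                # The last column names the mapping; the first is its address range.
                if fields and fields[-1] == "[vdso]":
                    return fields[0]
    except OSError:
        return None  # no /proc here (non-Linux system or restricted sandbox)
    return None


if __name__ == "__main__":
    print("vDSO mapped at:", vdso_address_range())
```

Because of address-space randomisation the range usually differs from one run to the next.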
Why io_uring exists. The epoll model is "one syscall per ready fd":
epoll_wait reports which descriptors are ready, but each one still needs its own read or
write. With 100,000 ready connections you do 100,000 syscalls per round of work, each paying
the trap tax. io_uring (Linux 5.1, 2019) replaces this with a pair of ring buffers shared
between the kernel and userspace: the application fills the submission ring, the kernel
processes the entries and writes results to the completion ring, and a single
io_uring_enter (or none, in SQPOLL mode, where a kernel thread polls the submission ring)
amortises the trap across thousands of operations. The trap at the top of the diagram above
is exactly what gets eliminated, repeatedly.
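The amortisation can be shown with a toy model. This is not the io_uring API: ToyRing and the two counting functions are invented for illustration, the "kernel" is just a Python loop, and real rings are fixed-size shared-memory structures with their own entry formats. The only thing the sketch counts is how many times the boundary is crossed.

```python
from collections import deque
from typing import Callable, List


class ToyRing:
    """Toy stand-in for a submission/completion ring pair; counts simulated traps."""

    def __init__(self) -> None:
        self.submission: deque = deque()
        self.completion: deque = deque()
        self.enters = 0  # simulated traps into the kernel

    def submit(self, op: Callable[[], bytes]) -> None:
        """Userspace side: queue a request. No trap is counted."""
        self.submission.append(op)

    def enter(self) -> int:
        """Stand-in for io_uring_enter: one trap drains the whole submission ring."""
        self.enters += 1
        drained = 0
        while self.submission:
            op = self.submission.popleft()
            self.completion.append(op())  # result lands on the completion ring
            drained += 1
        return drained


def traps_one_call_per_op(ops: List[Callable[[], bytes]]) -> int:
    """Readiness-style loop: every operation is its own syscall."""
    return len(ops)


def traps_with_ring(ops: List[Callable[[], bytes]]) -> int:
    ring = ToyRing()
    for op in ops:
        ring.submit(op)
    ring.enter()
    return ring.enters


if __name__ == "__main__":
    ops = [lambda: b"data"] * 1000
    print("one call per op:", traps_one_call_per_op(ops), "traps")
    print("ring:", traps_with_ring(ops), "trap")
```

The work behind each operation is unchanged in both cases; what shrinks is the number of ring transitions paid to get it done.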