Syscall Journey Simulator: a read(), end to end

One read(fd, buf, 16) call. Twenty stages between your program asking and the bytes arriving. Most of them are invisible to strace; some of them dwarf the others by four orders of magnitude. Click run.

[Simulator counters: stages, cycles, approximate time]

What you're looking at

Each row in the trace is one stage a single read(fd, buf, 16) passes through: the libc wrapper, the syscall instruction that drops the CPU from ring 3 to ring 0, the KPTI page-table switch, VFS dispatch, ext4, the page cache, the block layer, the NVMe queue, the drive itself, then DMA and copy_to_user on the way back. The cycle counts are realistic estimates, and the totals at the top convert them to wall-clock time.

Run it once with the cache set to hit and note the total: on the order of a thousand cycles, well under a microsecond, with the kernel doing everything in memory. Then switch to miss and run again: one stage, waiting on the SSD, outweighs all the others combined by roughly two orders of magnitude. Try a socket or a pipe to see how much of the path disappears. Two things should surprise you: how many of these stages strace will never show you, and how expensive even the "free" cache-hit path is next to a plain function call, which is exactly why batching and io_uring exist.

[Interactive trace. Controls: file, cache, mode, flags. Stages are grouped into three lanes: USER (ring 3), KERNEL (ring 0), and HARDWARE.]

A syscall is a controlled fall

You're not calling a function. You're throwing the CPU into a different universe.

From the program's point of view, read(fd, buf, n) looks like an ordinary function call into glibc. What actually happens at the bottom of that call is the syscall instruction, a two-byte opcode (0F 05) that pulls a lever the rest of userspace cannot pull. The CPU's privilege level flips from ring 3 to ring 0 and execution resumes at the address stored in MSR_LSTAR, the syscall entry point the kernel installed at boot. The hardware does not switch stacks or touch GS on its own; the first instructions of the entry code do that, running swapgs to swap the GS segment base (so the kernel can find its per-CPU data) and then pointing the stack pointer at a kernel stack located through that per-CPU data.

None of this is free. On a modern Intel CPU, the bare trap costs about 80–100 cycles each way; with KPTI (the Meltdown mitigation) toggled on it's closer to 250 each way, because the CR3 register has to swap between two distinct page tables — one for userspace, one for the kernel. The TLB rebuilds partially on each switch unless PCID is in use to tag entries by address space. This is why your syscall-heavy benchmark got 5–15% slower after the 2018 kernel updates and never recovered.
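To put those figures in perspective, here is a back-of-the-envelope sketch in Python. The cycle counts are the estimates from the paragraph above; the 3 GHz clock and the rate of one million syscalls per second are assumptions chosen for illustration, and the helper name trap_overhead is hypothetical.

```python
def trap_overhead(calls_per_sec, cycles_each_way, clock_hz=3e9):
    """Fraction of one core spent purely on entering and leaving the kernel."""
    round_trip_cycles = 2 * cycles_each_way
    return calls_per_sec * round_trip_cycles / clock_hz

CALLS = 1_000_000  # assumed workload: one million syscalls per second

bare = trap_overhead(CALLS, 90)   # midpoint of the 80-100 cycle estimate
kpti = trap_overhead(CALLS, 250)  # estimate with KPTI enabled

print(f"bare trap:  {bare:.1%} of a core")         # 6.0%
print(f"with KPTI:  {kpti:.1%} of a core")         # 16.7%
print(f"extra cost: {kpti - bare:.1%} of a core")  # 10.7%
```

Under these assumptions the mitigation costs roughly ten extra percentage points of one core, which sits inside the 5–15% range quoted above. A workload that makes fewer syscalls per second scales down proportionally.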


read() is a vtable call

The "everything is a file" promise, paid for in indirection.

Once inside the kernel, __x64_sys_read looks up the file descriptor in the task's open-file table, gets a struct file *, and calls file->f_op->read_iter(). That f_op is a vtable; the implementation depends on what the fd refers to. A regular ext4 file dispatches into ext4_file_read_iter, which calls generic_file_read_iter, which either hits the page cache or schedules a disk read. A TCP socket dispatches into tcp_recvmsg, which walks sk_receive_queue for packets. A pipe dispatches into pipe_read, which reads from a circular buffer and sleeps when it's empty. The syscall is the same; the universe behind it is not.
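The dispatch is easy to observe from userspace. The sketch below issues the same 16-byte read against a regular file, a pipe, and a socket, and the calling code never changes. It assumes a Unix-like system; the helper read16 is a hypothetical name, and the socket here is a Unix-domain socketpair rather than TCP, so it reaches a different receive function than tcp_recvmsg, which only reinforces the point.

```python
import os
import socket
import tempfile

PAYLOAD = b"sixteen bytes!!!"  # exactly 16 bytes

def read16(fd):
    """One read(fd, buf, 16); the kernel picks the implementation from the fd."""
    return os.read(fd, 16)

# Regular file: the filesystem's read path serves the request.
file_fd, path = tempfile.mkstemp()
try:
    os.write(file_fd, PAYLOAD)
    os.lseek(file_fd, 0, os.SEEK_SET)  # rewind before reading back
    from_file = read16(file_fd)
finally:
    os.close(file_fd)
    os.unlink(path)

# Pipe: the read comes out of the pipe's in-kernel buffer.
r, w = os.pipe()
os.write(w, PAYLOAD)
from_pipe = read16(r)
os.close(r)
os.close(w)

# Socket: the read drains the socket's receive queue.
a, b = socket.socketpair()
with a, b:
    a.sendall(PAYLOAD)
    from_socket = read16(b.fileno())

print(from_file == from_pipe == from_socket == PAYLOAD)  # True
```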

This is why read() on a socket "works the same way" as read() on a file. The VFS — the virtual filesystem layer — exists to make that lie convincing. And it's why specialised systems like databases sometimes bypass it: a database with its own page cache wants the kernel to copy bytes from disk into the database's buffer without passing through the kernel's cache (that's what O_DIRECT does), and a zero-copy network server wants the kernel to splice bytes from one fd to another without ever materializing them in userspace (that's what sendfile() and splice() do).
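As a small illustration of the sendfile() case, the sketch below hands a file to a socket without the sending process ever reading the bytes into its own buffer. It assumes Linux and uses Python's os.sendfile wrapper; the temporary file and the socketpair stand in for a real document and a real client connection.

```python
import os
import socket
import tempfile

data = b"served without a userspace copy\n"

fd, path = tempfile.mkstemp()
try:
    os.write(fd, data)
    sender, receiver = socket.socketpair()
    with sender, receiver:
        # Arguments are out_fd, in_fd, offset, count.
        # The kernel moves the bytes from the file to the socket itself.
        sent = os.sendfile(sender.fileno(), fd, 0, len(data))
        received = receiver.recv(len(data))
finally:
    os.close(fd)
    os.unlink(path)

print(sent, received)
```

Only the sending side avoids the copy here; the receiver still pulls the bytes into userspace with recv, as any client would.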


250,000 cycles in the SSD, 1,500 everywhere else

Cold reads aren't slow because of the syscall path. They're slow because of physics.

Tally the stages on a cache-hit read: it's a few hundred cycles in the syscall trap, a few hundred more in VFS and ext4, eighty in the actual copy_to_user. The whole thing is under 1,500 cycles — well under a microsecond on a 3 GHz CPU.

Tally a cache-miss read on a fast NVMe SSD and the picture flips entirely. The block-layer submission, doorbell write, and SSD-side flash lookup together add roughly a quarter-million cycles, with the device dominating: typical 4 KiB read latency is 50–100 microseconds on a good drive (150,000–300,000 cycles at 3 GHz), and that's the floor. Everything else in the syscall path is rounding error in comparison. This is the entire argument for the page cache, for prefetching, for io_uring (which lets you have many outstanding requests at once so the device-side wait can overlap), and for keeping working sets in RAM. The CPU is enormously faster than the disk, and the kernel's design is mostly an attempt to hide that fact.
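The two tallies can be written out as a few lines of arithmetic. The per-stage numbers below are illustrative values picked to match the rough figures in the text (a few hundred cycles for the trap, a few hundred for VFS and ext4, eighty for copy_to_user, a quarter-million for the device path); they are assumptions, not measurements, and the 3 GHz clock is the one the text uses.

```python
CLOCK_HZ = 3e9  # assumed 3 GHz core

# Illustrative cycle counts for the cache-hit path.
hit_stages = {
    "syscall trap (entry + exit)": 500,
    "VFS + ext4 + page cache lookup": 600,
    "copy_to_user": 80,
}
DEVICE_PATH_CYCLES = 250_000  # block layer + doorbell + flash read on a miss

def to_us(cycles):
    """Convert a cycle count to microseconds at CLOCK_HZ."""
    return cycles / CLOCK_HZ * 1e6

hit_total = sum(hit_stages.values())
miss_total = hit_total + DEVICE_PATH_CYCLES

print(f"hit:  {hit_total:>7,} cycles  {to_us(hit_total):6.2f} us")
print(f"miss: {miss_total:>7,} cycles  {to_us(miss_total):6.2f} us")
print(f"device share of a miss: {DEVICE_PATH_CYCLES / miss_total:.1%}")
print(f"miss / hit ratio: {miss_total / hit_total:.0f}x")
```

With these inputs the hit path totals 1,180 cycles (about 0.4 microseconds), the miss path about 84 microseconds, and the device accounts for more than 99% of the miss.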

Why this matters in real code

Three patterns whose causes live in this diagram.

Why strace slows your program 10×. Every syscall stops twice at ptrace — once on entry, once on exit. Each stop wakes the tracer, copies the register state, lets the tracer inspect, then schedules the tracee back in. On syscall-heavy code (networking, log shipping, anything with a tight read/write loop), each call goes from ~1 µs to ~10 µs.

Why gettimeofday can run tens of millions of times a second. It never takes the path in the diagram above. The kernel maps a small shared object (the vDSO) into every process, and gettimeofday calls a function in it that reads kernel-maintained clock data directly, typically in a few tens of nanoseconds. No trap, no ring switch, no KPTI cost. The same trick covers clock_gettime and a handful of others.

Why io_uring exists. The epoll model is "one syscall per ready fd": epoll_wait tells you which descriptors are ready, but each one still needs its own read or write. With 100,000 ready connections you do 100,000 syscalls per round of work, each paying the trap tax. io_uring (Linux 5.1, 2019) replaces this with a pair of ring buffers shared between the kernel and userspace: the application fills the submission ring with requests, the kernel processes them and writes results to the completion ring, and a single io_uring_enter (or none, in SQPOLL mode, where a kernel thread polls the submission ring) amortises the trap across thousands of operations. The trap stages in the diagram above are exactly what gets eliminated, repeatedly.
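The amortisation argument can be made concrete with a toy model. The class below is not the io_uring API; ToyRing and its methods are hypothetical names that only mimic the shape of the idea: queueing work is a plain memory write, and a single "enter" lets the kernel side drain everything queued so far. The 500-cycle trap cost is an assumption taken from the KPTI estimate earlier (250 cycles each way).

```python
from collections import deque

TRAP_CYCLES = 500  # assumed round-trip trap cost with KPTI (2 x 250)

class ToyRing:
    """Toy model of a submission/completion ring pair. Not the io_uring API."""

    def __init__(self):
        self.submission = deque()
        self.completion = deque()
        self.traps = 0

    def submit(self, op):
        # Userspace writes into shared memory: no kernel entry needed.
        self.submission.append(op)

    def enter(self):
        # One trap; the "kernel" drains every request queued so far.
        self.traps += 1
        while self.submission:
            op = self.submission.popleft()
            self.completion.append((op, "done"))

OPS = 10_000

ring = ToyRing()
for i in range(OPS):
    ring.submit(("read", i))
ring.enter()

per_call_cycles = OPS * TRAP_CYCLES        # one syscall per operation
batched_cycles = ring.traps * TRAP_CYCLES  # one enter for the whole batch

print(per_call_cycles, batched_cycles, len(ring.completion))
# 5000000 500 10000
```

The model ignores the work each operation still has to do inside the kernel; it only isolates the trap tax, which drops from five million cycles to five hundred when ten thousand operations share one kernel entry.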
