System calls

A user-mode program cannot open a file, send a packet, allocate a page, or learn what time it is without asking the kernel. The mechanism for asking is the syscall: a single hardware instruction flips the privilege level and jumps to a kernel entry point, the kernel runs the requested work, and control returns. Every libc, every language runtime, every container, and every sandbox is ultimately a structure built around that boundary crossing.

What a syscall actually is.

A user-mode program runs at CPU privilege level 3 (ring 3 on x86). It cannot touch the page tables, talk to a NIC, write to disk, or even ask what time it is without going through the kernel, which runs at ring 0. The only legal way to cross that boundary is to execute the SYSCALL instruction (on x86_64; SVC on ARM64, ECALL on RISC-V). The instruction saves the return address into rcx and the flags into r11, switches the CPU to ring 0, and jumps to entry_SYSCALL_64, a single, well-known address the kernel registered with the CPU at boot via the LSTAR MSR. It does not switch stacks: loading the kernel's stack pointer is one of the first things the entry code does for itself.

From there the kernel reads the syscall number out of rax, indexes sys_call_table, and dispatches to sys_read or sys_openat or whichever handler the program asked for. Arguments come in from rdi, rsi, rdx, r10, r8, r9 — the kernel's own calling convention, slightly different from the userspace System V ABI so that the kernel never has to touch rcx, which holds the return address. When the handler is done it returns through SYSRET, dropping back to ring 3 with the result in rax.

On modern x86 the round-trip floor is roughly 50–200 ns, depending on the generation, the workload, and whether any of the Spectre/Meltdown mitigations are active. That floor is the price of the privilege transition itself — no useful work yet, just crossing the boundary. Everything else syscall-related is built on top of, or around, that number.

Why the boundary has to exist at all.

The first question worth answering is why any of this machinery is needed. A program could, in principle, talk to the disk controller itself — write the right bytes to the right I/O ports, set up a DMA transfer, wait for the completion interrupt. The reason it cannot is that the CPU refuses to let it. Instructions that touch device registers, edit the page tables, change the interrupt mask, or read another process's memory are privileged instructions. At ring 3 they fault. The hardware enforces the wall; the kernel does not have to trust the program to stay on its side of it.

That enforcement is the whole point. If any process could write to disk directly, one buggy program could corrupt another's files, read another's memory, or hang the machine by never releasing the controller. The kernel exists to be the single arbiter of every shared resource — the disk, the network card, physical memory, the CPU's own time — and the only way to make that arbitration real is to put the hardware in a mode where the program physically cannot reach those resources, then give it exactly one supervised door. The syscall is that door. Every privileged operation goes through code the kernel wrote, with arguments the kernel checks, at a moment the kernel chose.

This is the same idea as the boundary between a process and the rest of the system, pushed down into silicon. A process gets its own address space so it cannot see its neighbours; a ring gets its own privilege level so it cannot touch the hardware. The two protections compose: even with a valid pointer, a ring-3 program writing to a kernel page faults, and even at ring 0 the kernel walks page tables to reach the right physical frame. The syscall is the one sanctioned crossing of the inner of those two walls.

x86 defines four privilege rings. Linux uses only the outermost and the innermost, ring 3 and ring 0; the syscall instruction is the only sanctioned move from ring 3 to ring 0.

The mechanism, step by step.

It helps to walk one crossing all the way through, because the cost and the security properties both fall out of the steps. Say a program calls read(fd, buf, n). libc's wrapper loads the syscall number 0 into rax, puts fd in rdi, buf in rsi, n in rdx, and executes SYSCALL. That single instruction does a lot: it saves the user return address into rcx and the flags into r11, raises the privilege level to ring 0, and jumps to the address the kernel parked in the LSTAR model-specific register at boot. The CPU has now switched mode and landed inside the kernel at entry_SYSCALL_64, all in hardware, although it is still running on the user's stack.

The entry trampoline is not glamorous, but it is where the real per-call tax lives. It runs swapgs to switch from the user's GS base to the kernel's per-CPU one, swaps the page tables if KPTI is on, switches to the kernel stack, saves the user registers onto it, and only then reads rax, bounds-checks it against the table size, and calls sys_call_table[rax]. The handler (sys_read here) takes its arguments from the saved registers, does the work, and hands the data back through copy_to_user, which rejects any destination outside user space so a bad buf cannot trick the kernel into writing somewhere it should not. It then returns a value. The trampoline restores registers, swaps back, and runs SYSRET, which drops to ring 3 and resumes the program at the saved address with the result in rax. The whole thing is one mode switch out and one mode switch back, with a table lookup in the middle.

This is the same trap-and-dispatch shape the CPU uses for interrupts and page faults — a single hardware event redirects execution to a fixed kernel entry point, the kernel figures out what happened from a number, and a table sends it to the right handler. If you have read the instruction cycle page, a syscall is best understood as a deliberate, software-triggered trap: the program asks for the redirect on purpose instead of having it forced by a fault.
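The dispatch step can be modelled in ordinary user-space C. The sketch below is a toy, not kernel code: every name in it (toy_call_table, toy_dispatch, the two handlers) is hypothetical. It shows only the shape described above, a number that is bounds-checked and then used as an index into a table of handlers, with -ENOSYS returned for a number that has no entry.

```c
#include <errno.h>

/* Toy model only: every name below is hypothetical, not a kernel symbol. */
typedef long (*toy_handler_t)(long a1, long a2, long a3);

long toy_sys_answer(long a1, long a2, long a3)
{
    (void)a1; (void)a2; (void)a3;   /* this handler ignores its arguments */
    return 42;
}

long toy_sys_add(long a1, long a2, long a3)
{
    (void)a3;
    return a1 + a2;
}

/* The table: the call number is simply an index into it. */
static const toy_handler_t toy_call_table[] = {
    toy_sys_answer,   /* number 0 */
    toy_sys_add,      /* number 1 */
};

#define TOY_NR_CALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Bounds-check the number first, then dispatch through the table. */
long toy_dispatch(unsigned long nr, long a1, long a2, long a3)
{
    if (nr >= TOY_NR_CALLS)
        return -ENOSYS;             /* no such call */
    return toy_call_table[nr](a1, a2, a3);
}
```

Calling toy_dispatch(1, 2, 3, 0) runs the second handler; any number past the end of the table comes back as -ENOSYS without touching memory outside the array.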

The ABI — which register holds what.

The contract between userspace and the kernel is fixed down to individual registers, and it is deliberately not the same as the ordinary function-call ABI. On x86_64, a normal C call passes its first six integer arguments in rdi, rsi, rdx, rcx, r8, r9. A syscall passes them in rdi, rsi, rdx, r10, r8, r9 — the fourth argument moves from rcx to r10. The reason is mechanical: the SYSCALL instruction clobbers rcx to stash the return address, so the kernel ABI cannot use rcx for an argument. The syscall number goes in rax, and the return value comes back in rax.

The x86_64 syscall convention. The number and the result share rax; the fourth argument lives in r10 because SYSCALL needs rcx for the return address.

The return value carries the error too. A successful call returns a non-negative number: a byte count, a file descriptor, zero. A failure returns a small negative number whose absolute value is the error code: -EBADF is -9, -ENOENT is -2. There is no separate error channel inside the kernel; the sign of rax is the whole signal. (Strictly, only -4095 through -1 are treated as errors, which is how calls that return addresses or other large values are not mistaken for failures.) This is tight enough that a raw syscall takes one instruction to issue and one comparison to check, which is exactly what you want at the bottom of every I/O path.
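A minimal sketch of the convention, assuming GCC or Clang inline assembly on Linux: the helper name raw_syscall6 is made up for illustration, and the register pattern follows the one described above (number in rax, fourth argument in r10, rcx and r11 listed as clobbered). On other architectures the sketch falls back to libc's syscall() and converts its -1/errno result back into the kernel's negative form, so the function behaves the same way everywhere.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a system call and return the kernel's raw
 * result, non-negative on success and -errno on failure. */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
#if defined(__x86_64__)
    long ret;
    register long r10 __asm__("r10") = a4;  /* 4th argument: r10, not rcx */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile ("syscall"
                      : "=a"(ret)            /* result comes back in rax */
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory"); /* SYSCALL overwrites rcx, r11 */
    return ret;
#else
    /* Other architectures: go through libc and undo its errno translation. */
    long ret = syscall(nr, a1, a2, a3, a4, a5, a6);
    return (ret == -1) ? -(long)errno : ret;
#endif
}
```

With this helper, raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) returns the process ID, and a read on descriptor -1 returns -EBADF directly in the return value, with no errno involved on the x86_64 path.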

libc wrappers, raw syscalls, and errno.

The negative-return convention is the kernel's, not C's, and the gap between them is what the libc wrapper papers over. POSIX programs expect a failing call to return -1 and leave the real code in the thread-local variable errno. So glibc's read() issues the syscall, and if rax comes back negative it flips the sign, stores it in errno, and returns -1. That translation is most of what the wrappers do; the rest is setting up arguments, occasionally splitting a 64-bit value across two registers on 32-bit targets, and picking the right syscall number for the architecture it was built for.

You can skip the wrapper. syscall(SYS_read, fd, buf, n) issues the call by number (glibc's generic syscall() helper still does the errno translation for you), and runtimes such as Go's and Zig's, along with Rust code that bypasses libc, emit the instruction sequence inline rather than linking libc at all. The catch is that you then own the translation: a static Go binary carries its own copy of the syscall numbers, and a number that differs between architectures is a bug the runtime has to handle itself. Raw syscalls are also how you reach calls that libc never wrapped: for years gettid() had no glibc wrapper and you had to call syscall(SYS_gettid) by hand. The rule of thumb is plain: portable code asks libc; code that needs a brand-new or deliberately unwrapped syscall goes raw and accepts the responsibility.

This split also explains a class of confusing bugs. A program that checks errno after a call that succeeded reads a stale value, because the wrappers only set errno on failure. And a program that issues the SYSCALL instruction itself and then expects errno to be populated gets nothing, because the kernel never touched it; the error was sitting in the return value the whole time. The boundary is precise; the confusion comes from forgetting which side of it set what.
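Both halves of that confusion can be shown in a few lines. The sketch below uses glibc's syscall() function, which is itself a thin wrapper and therefore does perform the -1/errno translation; the two function names are hypothetical. The first function shows errno being set by a failing call, the second shows a successful call leaving a stale value in place.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Failing call: returns the errno value left behind by a read on an
 * invalid descriptor, or 0 if the call unexpectedly succeeded. */
int errno_after_bad_read(void)
{
    char byte;
    errno = 0;
    long ret = syscall(SYS_read, -1, &byte, 1);   /* fd -1 is never valid */
    return (ret == -1) ? errno : 0;
}

/* Successful call: errno keeps whatever stale value it held before,
 * because the wrapper only writes it on failure. */
int errno_after_good_call(int stale)
{
    errno = stale;
    (void)getpid();                               /* cannot fail */
    return errno;
}
```

The practical rule that follows is to check the return value first and read errno only when it signals failure.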

The syscall table.

Linux on x86_64 exposes roughly 370 syscalls. They live in arch/x86/entry/syscalls/syscall_64.tbl, a plain text file mapping a number, an ABI tag, a name, and an implementation symbol. read is 0, write is 1, open is 2, close is 3, and so on up through the modern additions. Build scripts turn the table into sys_call_table at compile time; the kernel never reparses it at runtime.

Almost no application code calls syscalls directly. Instead it links against libc (glibc, musl, bionic), and libc provides a thin wrapper for each: read() in libc puts 0 in rax, the user-supplied arguments in the right registers, executes SYSCALL, and translates a negative return into errno. Language runtimes that skip libc (Go's and Zig's by default, Rust code that opts out of it) emit the syscall sequence directly, which is why a statically linked Go binary contains its own copy of the syscall numbers and has to track every per-architecture difference in the kernel ABI itself.

Different architectures have different tables. x86_64 keeps its historical numbering, while ARM64 and RISC-V share the newer generic list in asm-generic/unistd.h, where read is 63 rather than 0. The user-visible name (read) is the same; the integer (SYS_read) is not. Portable code asks libc, not the constant.
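To make the table format concrete, here is a small sketch of a parser for one line of syscall_64.tbl, assuming the four whitespace-separated columns described above (number, ABI tag, name, implementation symbol). The struct and function names are hypothetical, and skipping entries tagged x32 is an assumption about what a native 64-bit lookup would want, not something the kernel's build scripts do in this form.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one table line. */
struct tbl_entry {
    int  nr;
    char abi[16];
    char name[64];
    char entry[64];   /* empty when the line names no entry point */
};

/* Returns 1 and fills *out for a usable line; returns 0 for comments,
 * blank or malformed lines, and entries tagged for the x32 ABI. */
int parse_tbl_line(const char *line, struct tbl_entry *out)
{
    if (line[0] == '#')
        return 0;
    out->entry[0] = '\0';
    int fields = sscanf(line, "%d %15s %63s %63s",
                        &out->nr, out->abi, out->name, out->entry);
    if (fields < 3)
        return 0;                     /* blank or malformed */
    if (strcmp(out->abi, "x32") == 0)
        return 0;                     /* assumption: native 64-bit only */
    return 1;
}
```

Feeding it the line for read yields number 0, ABI tag common, and the symbol sys_read; the same loop over the whole file would reproduce the name-to-number mapping that libc headers expose as SYS_* constants.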

Cost breakdown — where the nanoseconds go.

For a trivial syscall like getpid() there is essentially no work in the handler; almost the entire cost is the boundary crossing itself. On a recent Skylake or later Intel part with mitigations off, the floor is roughly 30 ns for the privilege switch, plus another ~20 ns for argument validation, the table dispatch, and the return path. Total: ~50 ns for the cheapest case.

Then the mitigations land. KPTI (Kernel Page Table Isolation), shipped in early 2018 in response to Meltdown, gave the kernel its own set of page tables and forced a CR3 write on every entry and exit. On Skylake that alone added roughly 100 ns per syscall before PCID-based optimisations clawed some of it back; on older parts without PCID the cost roughly doubled. Spectre v2 mitigations (IBRS, retpoline) added another tax on indirect branches. Together they raised the practical floor on a busy server to closer to 200 ns per syscall.

Spectre fixes raised the syscall floor to ~200 ns. A 2018-era kernel patched against Meltdown and Spectre v2 on Skylake takes roughly 2–3x longer per syscall than the same hardware running an unpatched kernel, in line with the numbers Brendan Gregg's KPTI benchmarks and the Phoronix regressions both reported. Newer silicon (Ice Lake, Sapphire Rapids, AMD Zen 3+) pays much less, largely because it is not vulnerable to Meltdown and can run with KPTI off. The point: when you read "a syscall costs 100 ns" in a pre-2018 textbook, double or triple the number for a patched kernel on affected hardware.

vDSO — syscalls without the trap.

A handful of "syscalls" don't actually need the kernel's help on the hot path. The current time, for instance, lives in a couple of words of kernel memory that the timer interrupt updates regularly. Asking for it doesn't require any privileged work — it just requires reading those words and doing some arithmetic. So Linux maps that arithmetic, plus a small read-only window onto the time data, directly into every process's address space.

That mapping is the vDSO (virtual dynamic shared object). It looks like a tiny .so linked into every program, exporting symbols like __vdso_clock_gettime, __vdso_gettimeofday, and __vdso_getcpu. glibc's clock_gettime() wrapper checks for the vDSO symbol at startup and, if present, jumps to it instead of issuing a syscall. No SYSCALL instruction, no ring transition, no KPTI tax — just a function call.

Why the vDSO cuts clock_gettime cost by an order of magnitude. A real clock_gettime(CLOCK_MONOTONIC) syscall on a KPTI-patched kernel takes roughly 200–400 ns. The vDSO path completes in 15–25 ns: a single function call plus a few loads from a kernel-maintained data page. For any program reading the clock in a hot loop (latency histograms, tracing, request timing), the difference is the gap between "this is free" and "we need to batch this". This is also why gettimeofday() never shows up in strace output for most modern programs: it never went near the kernel.
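The gap is easy to measure for yourself. The sketch below times the same clock read two ways: through the libc wrapper, which normally resolves to the vDSO, and through syscall(SYS_clock_gettime, ...), which forces a real trap. The function names are hypothetical, and the numbers it produces depend entirely on the CPU, kernel, and mitigation settings, so none are quoted here.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic time in nanoseconds, used only as the stopwatch. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost in nanoseconds of one clock read.
 * force_trap == 0: libc wrapper (normally served by the vDSO).
 * force_trap != 0: real syscall issued by number. */
double clock_call_cost_ns(int force_trap, long iterations)
{
    struct timespec ts;
    uint64_t start = now_ns();
    for (long i = 0; i < iterations; i++) {
        if (force_trap)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        else
            clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    return (double)(now_ns() - start) / (double)iterations;
}
```

Comparing clock_call_cost_ns(0, 1000000) with clock_call_cost_ns(1, 1000000) on the same machine gives a rough estimate of what the trap itself costs there.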

Why this matters at scale.

A single syscall at 100 ns sounds negligible. A busy HTTP server serving 100 000 requests/sec through an event loop with maybe 10 syscalls per request — accept, read, epoll_wait, a couple of writev, close — is 1 000 000 syscalls/sec. At 100 ns each that is 100 ms of CPU per wall-clock second burned on syscall overhead alone, before any of the actual request work. On a 16-core box that's 10% of one core gone to boundary crossings.

Add KPTI and the number doubles. Add a tracing wrapper or a strace and it goes up another order of magnitude. This is exactly the cost io_uring was designed to collapse: by batching dozens or hundreds of operations into a single io_uring_enter() call (and with IORING_SETUP_SQPOLL, even zero syscalls in steady state), the per-operation share of the boundary tax goes to near zero. See the I/O models deep dive for how the ring buffers actually work.
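The arithmetic above fits in one function. This is a back-of-the-envelope model, not a benchmark: it assumes every crossing costs the same and that a batched interface pays one crossing for a whole batch of operations. The function and parameter names are invented for the example.

```c
/* CPU-seconds per wall-clock second spent purely on boundary crossings.
 * ops_per_crossing = 1 models one syscall per operation; larger values
 * model a batched interface under the assumptions stated above. */
double crossing_cpu_fraction(double requests_per_sec,
                             double ops_per_request,
                             double ns_per_crossing,
                             double ops_per_crossing)
{
    double crossings_per_sec =
        requests_per_sec * ops_per_request / ops_per_crossing;
    return crossings_per_sec * ns_per_crossing / 1e9;
}
```

With 100 000 requests per second, 10 operations per request, and 100 ns per crossing, the unbatched case comes to 0.1 (the 100 ms per second from the text); batching 100 operations per crossing brings the same workload down to 0.001.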

The same pressure shows up elsewhere. Userspace networking (DPDK, AF_XDP) exists partly to eliminate per-packet syscalls. The vDSO exists to eliminate the timing syscalls. Memory-mapped files (mmap) exist to eliminate a read or write syscall on every access. Every time you see a modern kernel API that looks unusual, the question to ask is usually "what syscall is this trying to avoid?"

strace, ltrace, and bpftrace — watching syscalls happen.

strace attaches to a process via ptrace and logs every syscall with its arguments and return value. It is the first tool to reach for when a program is mysteriously failing — almost always you'll see the failing open() with the wrong path, the connect() to the wrong host, or the read() that returned -1 EAGAIN when the code expected data. ltrace does the same for library calls.

The catch is the cost. ptrace stops the traced process at every syscall entry and exit, copies registers and memory in and out of the tracer, and formats the result. Slowdowns of 10x to 100x are normal; for a syscall-heavy workload it can be worse. strace on a production database is a denial-of-service tool.

$ strace -e trace=openat,read,close -C curl -s example.com > /dev/null
openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt", O_RDONLY|O_CLOEXEC) = 5
read(5, "-----BEGIN CERTIFICATE-----\nMII"..., 200704) = 200704
read(5, "", 4096)                       = 0
close(5)                                = 0
openat(AT_FDCWD, "/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 5
read(5, "nameserver 1.1.1.1\n", 4096)  = 19
close(5)                                = 0
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.21    0.001823          11       163         1 read
 31.04    0.001198          14        83           openat
 21.75    0.000840          10        82           close

The modern answer for production is bpftrace (and the wider eBPF ecosystem). An eBPF program attached to a tracepoint:syscalls:sys_enter_* runs in the kernel, in the syscall path, with overhead in the tens of nanoseconds per event. You can count syscalls per process, histogram their latencies, capture argument values, or sample only the slow ones — all on a running production system. This is what Netflix and Facebook's production tracing is built on; Brendan Gregg's bcc/bpftrace tools (execsnoop, opensnoop, syscount) are the canonical examples.

seccomp — locking the syscall surface down.

A process that has gained a foothold in your system has, by default, the full Linux syscall table available to it. Most programs need maybe twenty syscalls. The other 350 are pure attack surface — old, obscure, occasionally buggy code paths that the program will never legitimately touch.

seccomp (secure computing) lets a process restrict which syscalls it can make. The modern form, seccomp-bpf, takes a BPF program that inspects the syscall number and arguments and returns an action: SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO, SECCOMP_RET_KILL_PROCESS, or trap to a supervisor. Once installed the filter is inherited across execve and cannot be loosened.

Docker installs a seccomp profile by default that blocks roughly 40 syscalls (keyctl, reboot, the older set_thread_area, raw BPF, mount, swapon, and friends). The Chrome renderer process runs under a much tighter seccomp filter — a couple of dozen syscalls — because a compromised renderer is the most likely path into the rest of the machine. systemd services can opt into SystemCallFilter=. OpenSSH's privilege-separated child uses seccomp to lock itself to a minimal set.
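A minimal seccomp-bpf filter looks like the sketch below. It is deliberately tiny and makes several assumptions: it denies only getppid (chosen because nothing else in the example needs it), it installs the filter in a forked child so the caller stays unrestricted, and it skips the architecture check that a production filter should perform first. The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the filter made getppid fail with EPERM, 0 if the call
 * still succeeded, -1 if the filter could not be installed. */
int seccomp_blocks_getppid(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        struct sock_filter filter[] = {
            /* Load the syscall number into the accumulator. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            /* getppid: fall through to ERRNO; anything else: skip to ALLOW. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
            BPF_STMT(BPF_RET | BPF_K,
                     SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof filter / sizeof filter[0]),
            .filter = filter,
        };
        /* An unprivileged process must set no_new_privs before filtering. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);
        errno = 0;
        long ret = syscall(SYS_getppid);
        _exit((ret == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;    /* call was denied */
    case 1:  return 0;    /* call went through */
    default: return -1;   /* filter could not be installed */
    }
}
```

Real profiles invert the logic: they list the handful of syscalls the program needs and make the default action a kill or an error, which is the shape the Docker and Chrome filters take.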

Capabilities — the end of all-or-nothing root.

Classic Unix has two privilege levels: root, which can do anything, and not-root, which can do nothing privileged. That binary is too coarse. A web server needs to bind to port 80 but otherwise should not be able to mount filesystems, load kernel modules, or read arbitrary process memory.

Linux capabilities split root's omnipotence into roughly 40 distinct powers. CAP_NET_BIND_SERVICE allows binding to ports below 1024 but nothing else. CAP_NET_RAW allows opening raw and packet sockets — what ping and tcpdump need. CAP_SYS_ADMIN is the "basically root" capability and the one most container escapes orbit around; CAP_DAC_OVERRIDE bypasses filesystem permission checks; CAP_SYS_PTRACE permits attaching to other processes.

A modern container or systemd service is configured with a minimal capability set — CapabilityBoundingSet=, the Docker --cap-drop ALL --cap-add NET_BIND_SERVICE pattern — so that a compromise gives the attacker only the specific powers the workload actually needed. Capabilities and seccomp are complementary: seccomp removes syscalls, capabilities remove the privileges that some syscalls check for.
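One way to see which powers a process actually holds is to read its effective capability mask. The sketch below assumes a Linux system with /proc mounted and parses the CapEff line of /proc/self/status; the two function names are hypothetical, and the CAP_* constants come from linux/capability.h.

```c
#include <stdio.h>
#include <linux/capability.h>

/* Reads the calling process's effective capability mask from the CapEff
 * line of /proc/self/status. Returns 0 on success, -1 on failure. */
int read_effective_caps(unsigned long long *mask)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    int result = -1;
    while (fgets(line, sizeof line, f)) {
        /* The value is a hexadecimal bitmask, one bit per capability. */
        if (sscanf(line, "CapEff: %llx", mask) == 1) {
            result = 0;
            break;
        }
    }
    fclose(f);
    return result;
}

/* Returns 1 if the given capability bit is set in the mask, else 0. */
int mask_has_cap(unsigned long long mask, int cap)
{
    return (int)((mask >> cap) & 1ULL);
}
```

Inside a container started with --cap-drop ALL --cap-add NET_BIND_SERVICE, mask_has_cap(mask, CAP_NET_BIND_SERVICE) should be the only check that comes back 1.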

The Linux ABI promise — and the historical baggage.

Linux has an unusual stance on the kernel-userspace interface, articulated repeatedly by Linus: "we do not break userspace." Once a syscall is added with a given number, arguments, and semantics, it stays that way forever. Static binaries from the mid-1990s still run on a 2025 kernel. A glibc compiled for Linux 2.0 still works. The kernel has changed almost everything internally a dozen times over; the interface has not.

The trade-off is that the syscall table is full of historical baggage. There are five different stat variants because the struct stat layout grew over time (32-bit inodes, 64-bit inodes, nanosecond timestamps, statx). There are old select versions with 32-bit fd sets, the original socket calls glued into a single multiplexed socketcall on some architectures, and curiosities like getpmsg (a STREAMS holdover that was reserved but never implemented). New variants get added — openat replaces open, preadv2 generalises pread — but the old ones cannot be removed.

The payoff is real. Steam runs ancient game binaries. RHEL ships 10-year support windows because the kernel doesn't break the libc the userspace was built against. The gnarliness of the syscall table is the price of that compatibility.

Recent additions — the table still grows.

Even with the no-break-userspace rule, the kernel keeps adding new syscalls. Each one fixes a category of race, perf cost, or security gap that the existing API couldn't.

io_uring (5.1, May 2019) is the big one: Jens Axboe's rings-based interface that batches dozens of operations per syscall and runs many of them asynchronously. pidfd (5.3, Sept 2019) gives a program a file descriptor that refers to one specific process rather than to a reusable PID number, closing the long-standing race where a PID can be reused between the time you look it up and the time you signal it. pidfd_send_signal and pidfd_open let modern process supervisors (systemd, Kubernetes' OCI shims) avoid the classic kill-the-wrong-process bug.

openat2 (5.6, March 2020) adds a resolve flags field with options like RESOLVE_NO_SYMLINKS and RESOLVE_BENEATH, finally making it possible to safely open a file under a directory without traversing out via a symlink — the bug class that has caused dozens of CVEs in container runtimes. close_range (5.9, Oct 2020) closes an inclusive range of file descriptors in one call, replacing the for (i = 3; i < sysconf(_SC_OPEN_MAX); i++) close(i); loop that every fork-exec helper used to run after fork — a loop that took milliseconds when file descriptor limits got raised to a million.
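A helper that prefers close_range and falls back to the old loop might look like the sketch below. The function name is hypothetical, the call is issued by number so the code does not depend on a libc wrapper being present, and the 1024 default for an unknown descriptor limit is an assumption.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Close every descriptor in [first, last], inclusive.
 * Returns 0 on success, -1 with errno = EINVAL for an empty range. */
int close_fd_range(unsigned int first, unsigned int last)
{
    if (first > last) {
        errno = EINVAL;
        return -1;
    }
#ifdef SYS_close_range
    /* One call on Linux 5.9 and later; the third argument is a flags word. */
    if (syscall(SYS_close_range, (unsigned long)first,
                (unsigned long)last, 0UL) == 0)
        return 0;
#endif
    /* Fallback for older kernels or sandboxes that refuse the call:
     * the classic loop, capped at the descriptor limit. */
    long limit = sysconf(_SC_OPEN_MAX);
    if (limit < 0)
        limit = 1024;                 /* assumption: conservative default */
    for (unsigned long fd = first;
         fd <= last && fd < (unsigned long)limit; fd++)
        close((int)fd);
    return 0;
}
```

A fork-exec helper would call close_fd_range(3, ~0U) in the child just before exec, which is one syscall on a recent kernel regardless of how high the descriptor limit has been raised.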

| Syscall | Purpose | Typical latency |
|---|---|---|
| `getpid` | return current PID | ~50–100 ns (cheapest real syscall) |
| `clock_gettime` (vDSO) | monotonic / wall clock | ~15–25 ns (no trap) |
| `clock_gettime` (real) | e.g. CLOCK_BOOTTIME | ~200–400 ns |
| `read` (cached) | read from page cache | ~300 ns–1 us |
| `write` (buffered) | write into kernel buffer | ~300 ns–1 us |
| `epoll_wait` | reap ready fds | ~200 ns–microseconds |
| `mmap` / `munmap` | map / unmap memory | ~1–5 us (TLB shootdowns) |
| `fork` | clone a process | ~50–200 us |
| `execve` | replace process image | ~100 us–ms |
| `io_uring_enter` | submit N ops, reap N | ~100 ns per call, <10 ns per op |