System calls
A user-mode program cannot open a file, send a packet, allocate a page, or learn what time it is without asking the kernel. The mechanism for asking is the syscall: a single hardware instruction that flips privilege level and jumps to a kernel entry point, where the kernel runs the requested work and returns. Every libc, every language runtime, every container, and every sandbox is ultimately a structure built around that boundary crossing.
What a syscall actually is.
A user-mode program runs at CPU privilege level 3 (ring 3 on x86). It cannot touch the page
tables, talk to a NIC, write to disk, or even ask what time it is without going through the
kernel, which runs at ring 0. The only legal way to cross that boundary is to execute the
SYSCALL instruction (on x86_64; SVC on ARM64, ECALL
on RISC-V). The instruction saves the return address into rcx and the flags
into r11, switches the CPU to ring 0, and jumps to entry_SYSCALL_64, a single,
well-known address the kernel registered with the CPU at boot via the LSTAR MSR.
SYSCALL does not switch stacks; loading the kernel's stack pointer is the first job of the entry code.
From there the kernel reads the syscall number out of rax, indexes
sys_call_table, and dispatches to sys_read or
sys_openat or whichever handler the program asked for. Arguments come in from
rdi, rsi, rdx, r10, r8,
r9 — the kernel's own calling convention, slightly different from the userspace
System V ABI so that the kernel never has to touch rcx, which holds the return
address. When the handler is done it returns through SYSRET, dropping back to
ring 3 with the result in rax.
On modern x86 the round-trip floor is roughly 50–200 ns, depending on the generation, the workload, and whether any of the Spectre/Meltdown mitigations are active. That floor is the price of the privilege transition itself — no useful work yet, just crossing the boundary. Everything else syscall-related is built on top of, or around, that number.
Why the boundary has to exist at all.
The first question worth answering is why any of this machinery is needed. A program could, in principle, talk to the disk controller itself — write the right bytes to the right I/O ports, set up a DMA transfer, wait for the completion interrupt. The reason it cannot is that the CPU refuses to let it. Instructions that touch device registers, edit the page tables, change the interrupt mask, or read another process's memory are privileged instructions. At ring 3 they fault. The hardware enforces the wall; the kernel does not have to trust the program to stay on its side of it.
That enforcement is the whole point. If any process could write to disk directly, one buggy program could corrupt another's files, read another's memory, or hang the machine by never releasing the controller. The kernel exists to be the single arbiter of every shared resource — the disk, the network card, physical memory, the CPU's own time — and the only way to make that arbitration real is to put the hardware in a mode where the program physically cannot reach those resources, then give it exactly one supervised door. The syscall is that door. Every privileged operation goes through code the kernel wrote, with arguments the kernel checks, at a moment the kernel chose.
This is the same idea as the boundary between a process and the rest of the system, pushed down into silicon. A process gets its own address space so it cannot see its neighbours; a ring gets its own privilege level so it cannot touch the hardware. The two protections compose: even with a valid pointer, a ring-3 program writing to a kernel page faults, and even at ring 0 the kernel walks page tables to reach the right physical frame. The syscall is the one sanctioned crossing of the inner of those two walls.
The mechanism, step by step.
It helps to walk one crossing all the way through, because the cost and the security
properties both fall out of the steps. Say a program calls read(fd, buf, n).
libc's wrapper loads the syscall number 0 into rax, puts fd in
rdi, buf in rsi, n in rdx,
and executes SYSCALL. That single instruction does a lot: it saves the user
return address into rcx and the flags into r11, raises the
privilege level to ring 0, and jumps to the address the
kernel parked in the LSTAR model-specific register at boot. The CPU has now
switched mode and landed inside the kernel at entry_SYSCALL_64, all in
hardware. The stack pointer is still the user's at this point; switching to the kernel stack is left to the entry code.
The entry trampoline is not glamorous, but it is where the real per-call tax lives. It runs
swapgs to switch from the user's GS base to the kernel's per-CPU one, swaps the
page tables if KPTI is on, saves the rest of the user registers onto the kernel stack, and
only then reads rax, bounds-checks it against the table size, and calls
sys_call_table[rax]. The handler, sys_read here, checks its
arguments, does the work, copies the data out to the caller's buffer with copy_to_user
(which validates the pointer so a bad buf cannot trick the kernel into writing somewhere it
should not), and returns a value. The trampoline restores registers, swaps back, and runs
SYSRET, which drops to ring 3 and resumes the program at the saved address with
the result in rax. The whole thing is one mode switch out and one mode switch
back, with a table lookup in the middle.
This is the same trap-and-dispatch shape the CPU uses for interrupts and page faults — a single hardware event redirects execution to a fixed kernel entry point, the kernel figures out what happened from a number, and a table sends it to the right handler. If you have read the instruction cycle page, a syscall is best understood as a deliberate, software-triggered trap: the program asks for the redirect on purpose instead of having it forced by a fault.
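The table lookup in the middle of that path can be modelled in ordinary userspace C. This is a toy sketch, not kernel code: `toy_call_table`, `toy_dispatch`, and both handlers are hypothetical names invented for illustration, and nothing here crosses a privilege boundary. It only shows the shape of "bounds-check the number, then index the table".

```c
#include <errno.h>
#include <stddef.h>

/* Toy model of the dispatch step in the syscall entry path.
 * Every name here is hypothetical; this is plain userspace code. */
typedef long (*toy_handler)(long a0, long a1, long a2);

long toy_getpid(long a0, long a1, long a2) {
    (void)a0; (void)a1; (void)a2;
    return 4242;                      /* pretend PID */
}

long toy_add(long a0, long a1, long a2) {
    (void)a2;
    return a0 + a1;
}

static const toy_handler toy_call_table[] = { toy_getpid, toy_add };

#define TOY_NR_CALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* nr plays the role of rax on entry; the return value plays rax on exit. */
long toy_dispatch(unsigned long nr, long a0, long a1, long a2) {
    if (nr >= TOY_NR_CALLS)
        return -ENOSYS;               /* unknown number: never index past the table */
    return toy_call_table[nr](a0, a1, a2);
}
```

Calling `toy_dispatch(1, 2, 3, 0)` runs the second handler, while an out-of-range number comes back as `-ENOSYS` rather than reading past the end of the array.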
The ABI — which register holds what.
The contract between userspace and the kernel is fixed down to individual registers, and it
is deliberately not the same as the ordinary function-call ABI. On x86_64, a normal C call
passes its first six integer arguments in rdi, rsi,
rdx, rcx, r8, r9. A syscall passes them
in rdi, rsi, rdx, r10, r8,
r9 — the fourth argument moves from rcx to r10. The
reason is mechanical: the SYSCALL instruction clobbers rcx to
stash the return address, so the kernel ABI cannot use rcx for an argument. The
syscall number goes in rax, and the return value comes back in rax.
The return value carries the error too. A successful call returns a non-negative number: a
byte count, a file descriptor, zero. A failure returns a small negative number whose
absolute value is the error code: -EBADF is -9, -ENOENT is -2.
There is no separate error channel inside the kernel; the sign of rax is the
whole signal. This is tight enough that a raw syscall takes one instruction to issue and one
comparison to check, which is exactly what you want at the bottom of every I/O path.
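The register contract and the negative-return convention can both be seen in a few lines of C with inline assembly. This is a minimal sketch assuming GCC or Clang on Linux; `raw_syscall3` and `raw_is_error` are hypothetical helper names, and the non-x86_64 branch is a fallback that rebuilds the kernel-style value from libc's `syscall()` rather than issuing the instruction itself.

```c
#include <errno.h>
#include <sys/syscall.h>   /* SYS_* numbers for the build architecture */
#include <unistd.h>

/* Issue a three-argument syscall and return the kernel's raw result:
 * a non-negative value on success, or -errno on failure. */
long raw_syscall3(long nr, long a0, long a1, long a2) {
#if defined(__x86_64__)
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)                           /* result comes back in rax */
                     : "a"(nr), "D"(a0), "S"(a1), "d"(a2)  /* rax, rdi, rsi, rdx */
                     : "rcx", "r11", "memory");            /* SYSCALL overwrites rcx and r11 */
    return ret;
#else
    /* Fallback: libc's syscall() already translated the result into
     * -1 plus errno, so convert it back to the kernel convention. */
    long ret = syscall(nr, a0, a1, a2);
    return ret == -1 ? -(long)errno : ret;
#endif
}

/* Kernel convention: values in [-4095, -1] are error codes. */
int raw_is_error(long ret) {
    return ret < 0 && ret >= -4095;
}
```

For instance, `raw_syscall3(SYS_close, -1, 0, 0)` returns `-EBADF` directly, with no `errno` involved on the x86_64 path.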
libc wrappers, raw syscalls, and errno.
The negative-return convention is the kernel's, not C's, and the gap between them is what the
libc wrapper papers over. POSIX programs expect a failing call to return -1 and leave the
real code in the thread-local variable errno. So glibc's read()
issues the syscall, and if rax comes back in the error range (-4095 to -1) it flips the
sign, stores it in errno, and returns -1. That translation is most of what the wrappers do; the
rest is setting up arguments, occasionally splitting a 64-bit value across two registers on
32-bit targets, and picking the right syscall number for the architecture it was built for.
You can skip the wrapper. syscall(SYS_read, fd, buf, n) issues the call by
number, and Go, Rust, and Zig often emit the instruction sequence inline rather than linking
libc at all. The catch is that you then own the translation: a static Go binary carries its
own copy of the syscall numbers, and a number that differs between architectures is a bug
the runtime has to handle itself. Raw syscalls are also how you reach calls that libc never
wrapped — for years gettid() had no glibc wrapper and you had to call
syscall(SYS_gettid) by hand. The rule of thumb is plain: portable code asks
libc; code that needs a brand-new or deliberately unwrapped syscall goes raw and accepts the
responsibility.
This split also explains a class of confusing bugs. A program that checks errno
after a call that succeeded reads a stale value, because the wrappers only set
errno on failure. And a program that calls a raw syscall and then expects
errno to be populated gets nothing, because the kernel never touched it — the
error was sitting in the return value the whole time. The boundary is precise; the confusion
comes from forgetting which side of it set what.
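The stale-`errno` trap is easy to reproduce. The sketch below uses only the standard `close()` and `getppid()` wrappers; `close_checked` and `stale_errno_after_success` are hypothetical names. The second function relies on glibc's behaviour of leaving `errno` untouched on success, which POSIX permits but does not require.

```c
#include <errno.h>
#include <unistd.h>

/* Correct pattern: consult errno only when the wrapper reported failure.
 * Returns 0 on success or the errno value on failure. */
int close_checked(int fd) {
    if (close(fd) == -1)
        return errno;   /* valid: the wrapper set it just now */
    return 0;           /* success: errno may be stale, so ignore it */
}

/* The trap: a failing call sets errno, and a later successful call does
 * not clear it. Returns errno as observed after the successful call. */
int stale_errno_after_success(void) {
    errno = 0;
    close(-1);          /* fails with EBADF and sets errno */
    (void)getppid();    /* cannot fail, and leaves errno alone on glibc */
    return errno;       /* still holds the value from close(-1) */
}
```

Code that inspects the value returned by `stale_errno_after_success()` without first checking the call's return value would conclude that `getppid()` failed.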
The syscall table.
Linux on x86_64 exposes roughly 370 syscalls. They live in
arch/x86/entry/syscalls/syscall_64.tbl, a plain text file mapping a number, an
ABI tag, a name, and an implementation symbol. read is 0, write
is 1, open is 2, close is 3, and so on up through the modern
additions. Build scripts turn the table into sys_call_table at compile time;
the kernel never reparses it at runtime.
Almost no application code calls syscalls directly. Instead it links against
libc (glibc, musl, bionic), and libc provides a thin wrapper for each:
read() in libc puts 0 in rax, the user-supplied arguments in the
right registers, executes SYSCALL, and translates a negative return into
errno. Languages with their own runtimes — Go, Rust, Zig — often skip libc and
emit the syscall sequence directly, which is why a statically linked Go binary contains its
own copy of the syscall numbers and breaks when a kernel ABI quirk turns out to differ
between architectures.
Different architectures have different tables. ARM64 has its own
unistd.h numbering, RISC-V another. The user-visible name (read) is
the same; the integer (SYS_read) is not. Portable code asks libc, not the
constant.
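A short sketch makes the per-architecture numbering visible at compile time. It assumes a Linux toolchain whose `<sys/syscall.h>` defines the `SYS_*` macros; the `nr_*` accessors are hypothetical names. The static assertions hold only on the architectures named in the guards.

```c
#include <sys/syscall.h>   /* expands SYS_* for the architecture being compiled */

/* The same names resolve to different integers depending on the target. */
#if defined(__x86_64__) && !defined(__ILP32__)
_Static_assert(SYS_read == 0 && SYS_write == 1 && SYS_close == 3,
               "x86_64 numbering from syscall_64.tbl");
#elif defined(__aarch64__)
_Static_assert(SYS_read == 63 && SYS_write == 64 && SYS_close == 57,
               "arm64 uses the generic unistd.h numbering");
#endif

/* Accessors that return whatever the build architecture assigned. */
long nr_read(void)  { return SYS_read; }
long nr_write(void) { return SYS_write; }
long nr_close(void) { return SYS_close; }
```

Hard-coding `0` for `read` would compile everywhere and be wrong everywhere except x86_64, which is the argument for using the `SYS_*` constants or, better, the libc wrapper.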
Cost breakdown — where the nanoseconds go.
For a trivial syscall like getpid() there is essentially no work in the
handler; almost the entire cost is the boundary crossing itself. On a recent Skylake or
later Intel part with mitigations off, the floor is roughly 30 ns for the
privilege switch, plus another ~20 ns for argument validation, the table dispatch, and the
return path. Total: ~50 ns for the cheapest case.
Then the mitigations land. KPTI (Kernel Page Table Isolation), shipped in
early 2018 in response to Meltdown, gave the kernel its own set of page tables and forced a
CR3 write on every entry and exit. On Skylake that alone added roughly
100 ns per syscall before Intel's PCID optimisations clawed some of it
back; on older parts without PCID the cost roughly doubled. Spectre v2 mitigations
(IBRS, retpoline) added another tax on indirect branches.
Together they raised the practical floor on a busy server to closer to
200 ns per syscall.
vDSO — syscalls without the trap.
A handful of "syscalls" don't actually need the kernel's help on the hot path. The current time, for instance, lives in a couple of words of kernel memory that the timer interrupt updates regularly. Asking for it doesn't require any privileged work — it just requires reading those words and doing some arithmetic. So Linux maps that arithmetic, plus a small read-only window onto the time data, directly into every process's address space.
That mapping is the vDSO (virtual dynamic shared object). It looks like a
tiny .so linked into every program, exporting symbols like
__vdso_clock_gettime, __vdso_gettimeofday, and
__vdso_getcpu. glibc's clock_gettime() wrapper checks for the
vDSO symbol at startup and, if present, jumps to it instead of issuing a syscall. No
SYSCALL instruction, no ring transition, no KPTI tax — just a function call.
A real clock_gettime(CLOCK_MONOTONIC) syscall on a KPTI-patched kernel takes roughly
200–400 ns. The vDSO path completes in 15–25 ns — a single function call
plus a few loads from a kernel-maintained data page. For any program reading the clock in a
hot loop (latency histograms, tracing, request timing), the difference is the gap between
"this is free" and "we need to batch this". This is also why gettimeofday()
never shows up in strace output for most modern programs: it never went near
the kernel.
Why this matters at scale.
A single syscall at 100 ns sounds negligible. A busy HTTP server serving
100 000 requests/sec through an event loop with maybe 10 syscalls per
request — accept, read, epoll_wait, a couple of
writev, close — is 1 000 000 syscalls/sec. At
100 ns each that is 100 ms of CPU per wall-clock second burned on
syscall overhead alone, before any of the actual request work. On a 16-core box that's 10%
of one core gone to boundary crossings.
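The arithmetic above is simple enough to keep as a helper when sizing a service. `syscall_core_fraction` is a hypothetical name, and the model is deliberately crude: it counts only the fixed boundary cost per call and ignores the work each syscall does.

```c
/* Fraction of one core consumed by syscall entry/exit alone.
 * requests/sec * syscalls/request * ns/syscall gives nanoseconds of
 * overhead per wall-clock second; dividing by 1e9 gives core-seconds. */
double syscall_core_fraction(double requests_per_sec,
                             double syscalls_per_request,
                             double ns_per_syscall) {
    double overhead_ns = requests_per_sec * syscalls_per_request * ns_per_syscall;
    return overhead_ns / 1e9;
}
```

With the figures from the text, `syscall_core_fraction(100000, 10, 100)` works out to 0.10, and raising the per-call cost to 200 ns doubles it to 0.20.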
Add KPTI and the number doubles. Add a tracing wrapper or a strace and it goes up another
order of magnitude. This is exactly the cost io_uring was designed to
collapse: by batching dozens or hundreds of operations into a single
io_uring_enter() call (and with IORING_SETUP_SQPOLL, even zero
syscalls in steady state), the per-operation share of the boundary tax goes to near zero.
See the I/O models deep dive for how
the ring buffers actually work.
The same pressure shows up elsewhere. Userspace networking (DPDK, AF_XDP) exists partly to eliminate per-packet syscalls. The vDSO exists to eliminate the timing syscalls. Memory-mapped I/O exists to eliminate per-byte read/write syscalls. Every time you see a modern kernel API that looks unusual, the question to ask is usually "what syscall is this trying to avoid?"
strace, ltrace, and bpftrace — watching syscalls happen.
strace attaches to a process via ptrace and logs every
syscall with its arguments and return value. It is the first tool to reach for when a
program is mysteriously failing — almost always you'll see the failing open()
with the wrong path, the connect() to the wrong host, or the
read() that returned -1 EAGAIN when the code expected data.
ltrace does the same for library calls.
The catch is the cost. ptrace stops the traced process at every syscall entry
and exit, copies registers and memory in and out of the tracer, and formats the result.
Slowdowns of 10x to 100x are normal; for a syscall-heavy workload it can
be worse. strace on a production database is a denial-of-service tool.
$ strace -e trace=openat,read,close -C curl -s example.com > /dev/null
openat(AT_FDCWD, "/etc/ssl/certs/ca-certificates.crt", O_RDONLY|O_CLOEXEC) = 5
read(5, "-----BEGIN CERTIFICATE-----\nMII"..., 200704) = 200704
read(5, "", 4096) = 0
close(5) = 0
openat(AT_FDCWD, "/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 5
read(5, "nameserver 1.1.1.1\n", 4096) = 19
close(5) = 0
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.21    0.001823          11       163         1 read
 31.04    0.001198          14        83           openat
 21.75    0.000840          10        82           close
The modern answer for production is bpftrace (and the wider eBPF
ecosystem). An eBPF program attached to a tracepoint:syscalls:sys_enter_*
runs in the kernel, in the syscall path, with overhead in the tens of nanoseconds per
event. You can count syscalls per process, histogram their latencies, capture argument
values, or sample only the slow ones — all on a running production system. This is what
Netflix and Facebook's production tracing is built on; Brendan Gregg's
bcc/bpftrace tools (execsnoop,
opensnoop, syscount) are the canonical examples.
seccomp — locking the syscall surface down.
A process that has gained a foothold in your system has, by default, the full Linux syscall table available to it. Most programs need maybe twenty syscalls. The other 350 are pure attack surface — old, obscure, occasionally buggy code paths that the program will never legitimately touch.
seccomp (secure computing) lets a process restrict which syscalls it can
make. The modern form, seccomp-bpf, takes a BPF program that inspects the
syscall number and arguments and returns an action: SECCOMP_RET_ALLOW,
SECCOMP_RET_ERRNO, SECCOMP_RET_KILL_PROCESS, or trap to a
supervisor. Once installed the filter is inherited across execve and cannot
be loosened.
Docker installs a seccomp profile by default that blocks roughly 40 syscalls
(keyctl, reboot, the older set_thread_area, raw
BPF, mount, swapon, and friends). The Chrome renderer process runs under
a much tighter seccomp filter — a couple of dozen syscalls — because a compromised
renderer is the most likely path into the rest of the machine. systemd
services can opt into SystemCallFilter=. OpenSSH's privilege-separated child
uses seccomp to lock itself to a minimal set.
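A seccomp-bpf filter is small enough to write by hand. The sketch below is a deny-list of one syscall, chosen only because it is harmless to block: `getppid` is made to fail with `EPERM` and everything else is allowed. `deny_getppid` and `getppid_denied_in_child` are hypothetical names. The filter is installed in a forked child so the calling process stays unrestricted, and for brevity it omits the check of `seccomp_data.arch` that a production filter should perform before trusting the syscall number.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter: getppid -> EPERM, everything else allowed.
 * Simplification: no seccomp_data.arch check. Returns 0 on success. */
int deny_getppid(void) {
    struct sock_filter prog[] = {
        /* A = seccomp_data.nr (the syscall number) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if A == SYS_getppid run the next instruction, else skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof prog / sizeof prog[0]),
        .filter = prog,
    };
    /* Unprivileged processes must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Returns 0 if getppid was denied with EPERM in the child, 1 if it was
 * not, 2 if the filter could not be installed, -1 if fork/wait failed. */
int getppid_denied_in_child(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_getppid() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit(r == -1 && errno == EPERM ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

An allow-list, as used by the tighter sandboxes mentioned above, inverts the default: the final instruction returns a kill or errno action and each permitted syscall gets its own comparison.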
Capabilities — the end of all-or-nothing root.
Classic Unix has two privilege levels: root, which can do anything, and not-root, which can do nothing privileged. That binary is too coarse. A web server needs to bind to port 80 but otherwise should not be able to mount filesystems, load kernel modules, or read arbitrary process memory.
Linux capabilities split root's omnipotence into roughly 40 distinct
powers. CAP_NET_BIND_SERVICE allows binding to ports below 1024 but nothing
else. CAP_NET_RAW allows opening raw and packet sockets — what
ping and tcpdump need. CAP_SYS_ADMIN is the
"basically root" capability and the one most container escapes orbit around;
CAP_DAC_OVERRIDE bypasses filesystem permission checks;
CAP_SYS_PTRACE permits attaching to other processes.
A modern container or systemd service is configured with a minimal capability set —
CapabilityBoundingSet=, the Docker --cap-drop ALL --cap-add
NET_BIND_SERVICE pattern — so that a compromise gives the attacker only the
specific powers the workload actually needed. Capabilities and seccomp are complementary:
seccomp removes syscalls, capabilities remove the privileges that some syscalls check
for.
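A process can inspect its own capability set without any extra library by reading the `CapEff` line of `/proc/self/status`. The sketch below assumes `/proc` is mounted and that the kernel headers provide `<linux/capability.h>`; `read_effective_caps` and `has_cap` are hypothetical helper names.

```c
#include <linux/capability.h>   /* CAP_* bit numbers */
#include <stdio.h>

/* Read the effective capability mask of the calling process from
 * /proc/self/status. Returns 0 on success, -1 if it could not be read. */
int read_effective_caps(unsigned long long *mask) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    int rc = -1;
    while (fgets(line, sizeof line, f)) {
        /* The line looks like "CapEff:\t00000000a80425fb". */
        if (sscanf(line, "CapEff: %llx", mask) == 1) {
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}

/* Test one capability bit, e.g. has_cap(mask, CAP_NET_BIND_SERVICE). */
int has_cap(unsigned long long mask, int cap) {
    return (int)((mask >> cap) & 1ULL);
}
```

Inside a container started with `--cap-drop ALL --cap-add NET_BIND_SERVICE`, `has_cap(mask, CAP_NET_BIND_SERVICE)` would be expected to return 1 and `has_cap(mask, CAP_SYS_ADMIN)` to return 0.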
The Linux ABI promise — and the historical baggage.
Linux has an unusual stance on the kernel-userspace interface, articulated repeatedly by Linus: "we do not break userspace." Once a syscall is added with a given number, arguments, and semantics, it stays that way forever. Static binaries from the mid-1990s still run on a 2025 kernel. A glibc compiled for Linux 2.0 still works. The kernel has changed almost everything internally a dozen times over; the interface has not.
The trade-off is that the syscall table is full of historical baggage. There are five
different stat variants because the struct stat layout grew over
time (32-bit inodes, 64-bit inodes, nanosecond timestamps, statx). There are old
select versions with 32-bit fd sets, the original socket calls glued into a
single multiplexed socketcall on some architectures, and curiosities like
getpmsg (a STREAMS holdover that was reserved but never implemented). New
variants get added — openat replaces open,
preadv2 generalises pread — but the old ones cannot be removed.
The payoff is real. Steam runs ancient game binaries. RHEL ships 10-year support windows because the kernel doesn't break the libc the userspace was built against. The gnarliness of the syscall table is the price of that compatibility.
Recent additions — the table still grows.
Even with the no-break-userspace rule, the kernel keeps adding new syscalls. Each one fixes a category of race, perf cost, or security gap that the existing API couldn't.
io_uring (5.1, May 2019) is the big one — Jens Axboe's rings-based
interface that batches dozens of operations per syscall and runs many of them
asynchronously. pidfd (5.3, Sept 2019) gives processes a file descriptor
that refers to a specific PID, closing the long-standing race where a PID can be reused
between the time you look it up and the time you signal it — pidfd_send_signal
and pidfd_open let modern process supervisors (systemd, Kubernetes' OCI
shims) avoid the classic kill-the-wrong-process bug.
openat2 (5.6, March 2020) adds a resolve flags field with
options like RESOLVE_NO_SYMLINKS and RESOLVE_BENEATH, finally
making it possible to safely open a file under a directory without traversing out via a
symlink — the bug class that has caused dozens of CVEs in container runtimes.
close_range (5.9, Oct 2020) closes an inclusive range of file
descriptors in one call, replacing the for (i = 3; i < sysconf(_SC_OPEN_MAX);
i++) close(i); loop that every fork-exec helper used to run after fork — a loop
that took milliseconds when file descriptor limits got raised to a million.
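The close_range replacement for that loop can be sketched as follows. `close_fds` is a hypothetical helper name. It goes through `syscall()` so it does not depend on a libc wrapper being present, and it falls back to the old loop if the call is unavailable, whether because the headers or kernel predate 5.9 or because a sandbox refuses it.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Close every descriptor in [first, last], both ends inclusive.
 * Tries close_range(2) first; on any failure falls back to a loop
 * bounded by the process's open-file limit. Always returns 0. */
int close_fds(unsigned int first, unsigned int last) {
#ifdef SYS_close_range
    if (syscall(SYS_close_range, first, last, 0U) == 0)
        return 0;
    /* Not available or not permitted here: fall through to the loop. */
#endif
    long open_max = sysconf(_SC_OPEN_MAX);
    unsigned long long end = (unsigned long long)last + 1;   /* exclusive bound */
    if (open_max > 0 && end > (unsigned long long)open_max)
        end = (unsigned long long)open_max;

    for (unsigned long long fd = first; fd < end; fd++)
        close((int)fd);   /* EBADF on unused slots is expected and ignored */
    return 0;
}
```

A fork-exec helper would typically call `close_fds(3, ~0U)` in the child just before `execve`, which costs one syscall on a recent kernel regardless of how high the descriptor limit is set.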
| Syscall | Purpose | Typical latency |
|---|---|---|
| getpid | return current PID | ~50–100 ns (cheapest real syscall) |
| clock_gettime (vDSO) | monotonic / wall clock | ~15–25 ns (no trap) |
| clock_gettime (real) | e.g. CLOCK_BOOTTIME | ~200–400 ns |
| read (cached) | read from page cache | ~300 ns–1 us |
| write (buffered) | write into kernel buffer | ~300 ns–1 us |
| epoll_wait | reap ready fds | ~200 ns–microseconds |
| mmap / munmap | map / unmap memory | ~1–5 us (TLB shootdowns) |
| fork | clone a process | ~50–200 us |
| execve | replace process image | ~100 us–ms |
| io_uring_enter | submit N ops, reap N | ~100 ns per call, <10 ns per op |
Further reading.
- Linux man-pages — syscalls(2) — the canonical list of every Linux syscall, the kernel version it appeared in, and the architectures that support it. The first place to look for any syscall question.
- LWN — Anatomy of a system call (David Drysdale) — a two-part walkthrough of exactly what happens between the userspace SYSCALL and the kernel handler, including the entry trampolines and the register conventions.
- Brendan Gregg — KPTI/KAISER Meltdown initial performance regressions — measured numbers for the KPTI cost across workloads, the reference for the "syscalls got slower in 2018" story.
- Brendan Gregg — Learn eBPF Tracing — the entry point to eBPF and bcc/bpftrace, the modern production replacement for strace.
- Linux man-pages — vdso(7) — how the vDSO is laid out, which symbols it exports per architecture, and how to call them directly.
- Linux man-pages — seccomp(2) — the kernel API for installing a seccomp filter; the basis of every container runtime's syscall sandbox.
- Jens Axboe — Efficient IO with io_uring — the whitepaper from the author of the kernel's biggest syscall-batching effort.
- LWN — The rapid growth of io_uring — how io_uring expanded from a disk-IO interface into a near-universal kernel ABI.
- Semicolony — I/O models — the deep dive on how syscall-heavy designs (blocking, select, epoll) gave way to the batched, near-syscall-free io_uring model.