I/O models

A process wants to read from a socket or a disk; the data may or may not be there. What the kernel does in the gap — block the thread, return an error, queue a notification, or hand back a completion — is the single decision that shapes how many connections one box can hold and how cheaply it can hold them. The history of server software is mostly the history of this choice getting better.

The setup — a syscall, and nothing there yet.

A process calls read(fd, buf, n). The file descriptor points at a TCP socket waiting on the other side of the planet, or a disk block that hasn't come back from the SSD yet, or a pipe whose writer is busy. The data isn't ready. What is the kernel supposed to do with the calling thread until it is?

Every I/O model is an answer to that single question. The thread can be put to sleep until the data arrives. It can be sent back immediately with an error telling it to try again. It can register interest with the kernel and be told later when something is ready. Or it can hand the kernel a whole batch of operations and pick up the results when they're done. Each choice changes the cost of supporting one more concurrent connection, and it's that per-connection cost that ultimately decides whether a single box serves a thousand clients or ten million.

The progression below is roughly chronological. Blocking is what every Unix shipped with in the 1970s. Non-blocking + select() arrived with BSD in the early 80s. epoll showed up in the Linux 2.5 development series (2002), shipped in stable form with 2.6, and ate the server market in five years. io_uring landed in 5.1 (2019) and is still in the process of doing the same to disk I/O.

The path from app to device.

Before picking a waiting model, it helps to see where the bytes actually travel. When your code calls read(), that call is a system call: a controlled trap into the kernel that switches the CPU from user mode to kernel mode, validates the arguments, and routes the request to the right subsystem. For a file the request goes through the virtual filesystem layer, the page cache, then the block layer and the device driver. For a socket it goes through the protocol stack and the network driver. Either way the application never touches the hardware itself; the kernel sits in the middle of every transfer.

The hardware does not copy bytes one register at a time. The driver hands the device a descriptor pointing at a region of physical memory and lets direct memory access (DMA) move the data. The NIC or the disk controller writes straight into RAM over the memory bus while the CPU does other work. When the transfer finishes the device raises an interrupt; the CPU stops what it is doing, jumps to the driver's handler, and the handler marks the I/O complete and wakes whatever was waiting. High-rate devices soften the interrupt cost with coalescing (one interrupt per batch) and NAPI-style polling, where the driver switches from interrupts to a poll loop once traffic is heavy enough that the interrupts themselves become the bottleneck.

One read: a syscall trap down, a DMA transfer into RAM, an interrupt back up, then the data lands in the user buffer.

That last hop matters for the rest of this page. After DMA lands the data in a kernel buffer, a normal read() still copies it once more into your userspace buffer. That extra copy, plus the mode switch on every syscall, is the overhead that zero-copy and io_uring exist to shave off. Hold onto the picture: trap down, DMA in, interrupt up, copy out.

Blocking I/O — one thread per connection.

The default for any newly opened file descriptor. read() on an empty socket puts the thread into TASK_INTERRUPTIBLE; the kernel parks it on the socket's wait queue and runs something else. When a packet arrives, the network softirq wakes every thread on that queue and the scheduler picks one to run. From the application's perspective, read() just returned with bytes in the buffer; the wait was invisible.

This is the easiest model to write. The code reads top to bottom — accept a connection, read a request, do some work, write a response — and the kernel handles the waiting. It's how Apache's prefork MPM worked, how every CGI script worked, and how most tutorials still teach sockets. The problem is that one connection costs one thread, and one thread costs ~8 MB of address space for its default stack plus a task_struct and assorted kernel bookkeeping.

At a thousand connections you're fine. At ten thousand you're paying ~80 GB of virtual memory, fighting the scheduler's per-CPU runqueue, and burning serious time on context switches. This was the wall Dan Kegel named in 1999 as the C10K problem: how do you handle ten thousand simultaneous clients on one machine? Blocking I/O can't. Every model that follows is a way around it.

Non-blocking I/O — EAGAIN and a busy loop.

Open the descriptor with O_NONBLOCK (or add the flag afterwards with fcntl(fd, F_SETFL, flags | O_NONBLOCK), where flags comes from F_GETFL, so the existing flags survive) and read() changes character. If data is available it copies it and returns; if not it returns -1 immediately and sets errno to EAGAIN (or EWOULDBLOCK, the same value on Linux). The thread never sleeps inside the syscall.

On its own this just shifts the waiting upstairs. An application managing N sockets has to walk all of them, call read() on each, ignore the EAGAINs, and try again — a busy loop that pegs a CPU at 100% to discover that nothing has happened. The version that works needs a way to ask the kernel "tell me which of these descriptors actually have something for me". That is what select() and poll() are for, and what epoll later made cheap.

Non-blocking on its own still has a role: paired with an event-loop multiplexer it's essential, because the loop has to be able to drain a ready socket without blocking once the socket's receive buffer runs dry. The standard idiom is while ((n = read(fd, buf, sizeof buf)) > 0) process(buf, n); which stops when read() returns -1 with errno set to EAGAIN.
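The drain idiom can be sketched as follows. The helper names set_nonblocking and drain are hypothetical, and the sketch simply counts bytes where a real loop would call its own process(buf, n); it assumes a Linux or other POSIX system.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Add O_NONBLOCK while preserving the descriptor's other status flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read until the kernel has nothing more for us right now.
 * Returns bytes consumed, or -1 on a real error.
 * *eof is set to 1 if the peer closed, 0 otherwise. */
ssize_t drain(int fd, int *eof)
{
    char buf[4096];
    ssize_t total = 0;

    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;          /* a real loop would process(buf, n) here */
            continue;
        }
        if (n == 0) {
            *eof = 1;            /* orderly shutdown from the other side */
            return total;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;        /* buffer is empty: stop, do not spin */
        return -1;
    }
}
```

Distinguishing "empty for now" (EAGAIN) from "closed" (a zero return) matters: the first means go back to the multiplexer, the second means tear the connection down.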

select() and poll() — O(n) over the FD set.

select(), inherited from 4.2BSD, takes three bitmaps (read, write, exception) of file descriptors and blocks until any of them is ready or a timeout fires. The kernel walks the entire bitmap on entry, queues the calling thread on every relevant socket, then on wakeup walks the bitmap again to report which fds were the ones that triggered. Cost is O(n) per call, both in kernel and userspace, where n is the highest fd number plus one.

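The historical kicker is FD_SETSIZE, the compile-time cap on the bitmap, almost always 1024. The limit is on the descriptor's numeric value, not on how many you watch: pass any fd numbered 1024 or higher to FD_SET and the macro writes past the end of the bitmap, silently corrupting memory. That footgun took out plenty of servers in the 1990s.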
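A defensive wrapper makes the FD_SETSIZE hazard visible. This is a sketch only: select_readable is a made-up helper that watches a single descriptor, and it refuses any fd the bitmap cannot represent instead of letting FD_SET scribble out of bounds.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Returns 1 if fd is readable, 0 on timeout, -1 on error
 * or if fd does not fit in an fd_set. */
int select_readable(int fd, int timeout_ms)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                       /* FD_SET would write outside the bitmap */

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    struct timeval tv = {
        .tv_sec  = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };

    /* First argument is the highest fd plus one: the kernel scans that many bits. */
    int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc <= 0)
        return rc;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

Note that the set has to be rebuilt before every call, because select() overwrites it with the result.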

poll() (System V, standardised in POSIX.1-2001) replaced the bitmaps with an array of struct pollfd. No more FD_SETSIZE cap, no rebuilding the input set every call. The complexity, though, is still O(n): kernel walks the array on every call, userspace walks it again on return. At 10 000 connections each poll() invocation copies 10 000 structs in and 10 000 out and scans both — minutes of CPU per day for the privilege of finding out which ten sockets ticked.
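The two linear scans are easy to see in code. The sketch below assumes a hypothetical helper, poll_readable, with an arbitrary cap of 64 descriptors chosen only to keep the example short; poll() itself has no such limit.

```c
#include <poll.h>
#include <stddef.h>

#define MAX_WATCHED 64   /* sketch-only cap, not a poll() limit */

/* Wait for any of fds[0..n) to become readable.
 * Writes the indices of ready descriptors into ready_idx and returns
 * how many there are (0 on timeout, -1 on error). */
int poll_readable(const int *fds, size_t n, int timeout_ms, size_t *ready_idx)
{
    struct pollfd pfds[MAX_WATCHED];
    if (n > MAX_WATCHED)
        return -1;

    for (size_t i = 0; i < n; i++) {     /* build the full array on every call */
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    int rc = poll(pfds, (nfds_t)n, timeout_ms);   /* kernel walks all n entries */
    if (rc <= 0)
        return rc;

    int found = 0;
    for (size_t i = 0; i < n; i++)       /* userspace walks all n entries again */
        if (pfds[i].revents & (POLLIN | POLLHUP))
            ready_idx[found++] = i;
    return found;
}
```

Even when only one descriptor is ready, both loops touch every entry, which is the O(n) cost the next section removes.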

epoll, kqueue, IOCP — O(1) event multiplexing.

The fix in all three major kernels was the same idea: stop passing the whole set in and out every time. Register the descriptors with the kernel once, let the kernel keep the watch list, and only return the ones that are actually ready.

On Linux this is epoll, introduced in kernel 2.5.44 (2002) and stabilised in 2.6. You call epoll_create1() to get an epoll fd, then epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) once per descriptor to register it. epoll_wait() then returns just the subset of descriptors that have fired, in O(number-of-ready-events). Internally the kernel keeps the watched fds in a red-black tree and the ready ones in a linked list updated by the same wakeup that the blocking model uses.

BSD's answer was kqueue (Jonathan Lemon, FreeBSD 4.1, 2000), a more general mechanism that handles sockets, files, signals, timers, and process events through one kevent structure. Windows had I/O Completion Ports (IOCP) from NT 3.5 (1994) onwards, which take a step further into completion-based I/O rather than readiness-based. macOS uses kqueue.

Why C10K became C10M with epoll. nginx (released 2004) built its event loop on epoll/kqueue and demonstrated tens of thousands of concurrent connections per worker process — a regime Apache prefork couldn't reach without horizontal scaling. Redis, HAProxy, Node.js, and Envoy all followed the same shape. A well-tuned modern Linux box with enough RAM for socket buffers and the right fs.nr_open/net.core.somaxconn sysctls will hold 10 million+ idle TCP connections on a single epoll fd, the regime Errata Security and WhatsApp's engineering posts have documented as C10M.

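/* Linux: register a non-blocking socket with epoll and reap ready events */
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

void event_loop(int sock)   /* sock must already be O_NONBLOCK */
{
    char buf[4096];
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        return;

    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,   /* edge-triggered */
        .data.fd = sock,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev) < 0) {
        close(epfd);
        return;
    }

    struct epoll_event events[256];
    for (;;) {
        int n = epoll_wait(epfd, events, 256, -1);   /* blocks */
        if (n < 0 && errno != EINTR)
            break;
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            ssize_t r;
            /* drain until EAGAIN (edge-triggered contract) */
            while ((r = read(fd, buf, sizeof buf)) > 0) {
                /* handle r bytes in buf */
            }
            if (r == 0 || (r < 0 && errno != EAGAIN))
                close(fd);   /* peer closed or hard error */
        }
    }
    close(epfd);
}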

Two modes matter. Level-triggered (the default, like poll) reports a descriptor as ready every time you wait, as long as it has data. Edge-triggered (EPOLLET) reports it once when it becomes ready, and you must drain it completely or you won't hear about the leftover data until more arrives. ET can save redundant wakeups and is what nginx uses, but it is not universal: Redis and libuv (and so Node.js) run level-triggered. ET is also where the famous "drain to EAGAIN" idiom comes from.
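The difference between the two modes shows up in a few lines. The helper below, count_reports, is a hypothetical name for a small experiment: it registers a descriptor, calls epoll_wait() twice without reading anything, and counts how often the descriptor is reported.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* extra_flags is 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns how many times fd was reported across two waits, or -1 on error. */
int count_reports(int fd, uint32_t extra_flags)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        return -1;

    struct epoll_event ev = {
        .events = EPOLLIN | extra_flags,
        .data.fd = fd,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    int reports = 0;
    for (int round = 0; round < 2; round++) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: check and return */
        if (n < 0) {
            close(epfd);
            return -1;
        }
        reports += n;   /* the data is deliberately left unread */
    }
    close(epfd);
    return reports;
}
```

With one unread byte sitting in a pipe, the level-triggered registration is reported on both waits and the edge-triggered one only on the first, which is why an ET loop that stops reading early goes quiet.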

Thread-per-connection vs event loop.

With epoll in place the architectural question becomes: do you keep one thread per connection and let the kernel block them (now cheaper, since you can pick lighter threads or coroutines), or do you keep one thread per CPU running an event loop that demultiplexes thousands of connections?

Apache prefork and most JVM servlet containers picked threads. The model is straightforward, blocking calls work, and per-request state lives on the stack. The cost is the memory and context-switch overhead — modest with a few hundred threads, painful past a few thousand, and incidentally why Go's goroutines and Java's virtual threads (Project Loom, JDK 21) exist: they keep the blocking programming model while making the "thread" a ~2 KB stackful userspace object that the runtime multiplexes onto OS threads over an epoll loop.

nginx, Redis, Node.js, HAProxy, and Envoy went the other way: one event loop per CPU, everything non-blocking, request state lives in heap-allocated state machines. The win is that idle connections cost only the per-fd entry in the epoll tree (~hundred bytes); the pain is that any accidentally blocking call — a synchronous disk read, a DNS lookup, a malloc() under memory pressure — freezes the whole loop. Node.js's reputation for getting "stuck" almost always traces back to a blocking call sneaking into the event-loop thread.

io_uring — one syscall, many operations.

Even with epoll, every read and write is still its own syscall, and every syscall pays a user-to-kernel mode-switch tax that the Meltdown/Spectre mitigations roughly doubled (KPTI alone added ~500 ns on Skylake). At a million ops/sec that's serious money.

io_uring, added by Jens Axboe in Linux 5.1 (May 2019), restructures the interface entirely. The kernel and the application share two ring buffers via mmap: a submission queue (SQ) where the app writes operation descriptors, and a completion queue (CQ) where the kernel writes results. The application fills entries into the SQ and calls io_uring_enter() once to tell the kernel "there's new work". The kernel processes the entries — possibly thousands — and writes outcomes to the CQ. The app reads completions out of the CQ without another syscall.

With IORING_SETUP_SQPOLL a kernel thread polls the SQ continuously, and the application doesn't even need to call io_uring_enter() in the common case — zero syscalls per operation in steady state. With IORING_REGISTER_BUFFERS you pin a set of userspace buffers once and refer to them by index, skipping the get_user_pages dance on every I/O.

io_uring is where new I/O code is going. It handles network sockets, files (buffered or O_DIRECT), pipes, timers, accept, connect, splice, fallocate, openat — anything that used to be a syscall. ScyllaDB rebuilt its disk path on io_uring and saw 2–3x lower tail latency vs libaio. Ceph BlueStore, Cassandra 5, QEMU, RocksDB, and Cloudflare's quiche all ship io_uring backends. Linking I/O operations (IOSQE_IO_LINK) gives a chain — accept then read then write — submitted in one go, the closest mainstream Linux has come to a kernel-side coroutine.

Buffered vs direct I/O.

There is a second axis that cuts across all the waiting models: whether reads and writes go through the kernel's page cache. By default they do. A buffered read() on a file that is already cached returns instantly from RAM with no device access at all; a buffered write() usually just marks the page dirty and returns, leaving the actual disk write to a background flusher thread that batches dirty pages and writes them out later. The page cache is why a second read of the same file is so much faster than the first, and why a machine with plenty of free memory feels quick even on a slow disk.

That caching is a gift for general workloads and a problem for one specific kind of program: the database. A database already keeps its own buffer pool tuned to its access pattern, so the kernel's page cache just double-buffers the same data, wasting RAM and adding an unpredictable copy. Opening a file with O_DIRECT tells the kernel to skip the page cache and DMA straight between the device and an aligned userspace buffer. The application gives up free caching and readahead, and takes on strict alignment rules, in exchange for predictable latency and control over exactly what is in memory. PostgreSQL, MySQL InnoDB, and most storage engines offer a direct-I/O mode for precisely this reason.

Buffered I/O stages data in the page cache and pays an extra copy. Direct I/O DMAs straight to the application's aligned buffer.

Neither mode is strictly better. Buffered I/O wins for anything that benefits from shared caching and readahead, which is most software. Direct I/O wins when the application knows its own data better than the kernel can guess and needs the latency to be steady rather than merely fast on average. The choice is a statement about who owns the cache, not about speed alone.

Zero-copy — stop touching the data.

Think about the most ordinary server task: read a file and send it down a socket. Done the naive way that is read() into a userspace buffer then write() to the socket, and it copies the same bytes four times: DMA from disk into the page cache, a CPU copy up into the app buffer, a CPU copy back down into the socket buffer, and DMA out to the NIC. The two CPU copies each cross the user-kernel boundary, and the two syscalls each cost a round trip into the kernel and back. The application never even looks at the data; it is pure ferrying.

sendfile() collapses that. It tells the kernel to move data from a file descriptor to a socket descriptor without ever surfacing it to userspace. The bytes go from the page cache to the socket buffer inside the kernel, and on hardware that supports scatter-gather DMA the kernel can hand the NIC a pointer into the page cache directly, so the only copies left are the two DMA transfers the hardware has to do anyway. This is how nginx, Apache, and Kafka serve static content and log segments at near line rate. Kafka's throughput story is largely a zero-copy story: messages land in the page cache on write and are sent to consumers with sendfile(), never passing through the broker's heap.

The naive path copies the same bytes four times; sendfile keeps them in the kernel and leaves only the DMA transfers.

splice() generalises the idea. It moves data between two descriptors, at least one of which must be a pipe, through the pipe's kernel buffer. With a pipe as the intermediate hop you can wire a file to a socket, a socket to a file, or one socket to another without the bytes touching userspace. Its sibling vmsplice() maps userspace pages into a pipe, and tee() duplicates a pipe's contents without consuming them. Together these let a proxy shovel data between two connections at very low cost. io_uring exposes splice as an operation too, so a zero-copy transfer can be batched into the same ring as everything else and submitted with no per-operation syscall at all.

Why old AIO (libaio) disappointed.

Linux had an asynchronous I/O interface long before io_uring — io_submit() and io_getevents(), generally accessed through libaio. On paper it looked similar: submit operations in a batch, reap completions later. In practice almost nothing used it.

It was O_DIRECT only — buffered file I/O secretly became synchronous, defeating the whole point for anyone not doing direct disk access. The API was crusty (iocb structs with union-typed fields, no symmetry with the rest of POSIX), network I/O was unsupported, and metadata operations like fsync would block anyway. The result was that libaio only ever made sense for databases doing raw block I/O (MySQL InnoDB, PostgreSQL with extensions), and even there the codebases tended to keep a fallback thread pool because too many edge cases hit a blocking path.

io_uring was Axboe's deliberate redesign. It handles buffered and direct I/O equally, extends to every syscall worth batching, and gives back completions through a uniform ring. The libaio interface is now deprecated in spirit (still in the kernel for ABI reasons) and new code should reach for io_uring directly or via liburing.

Userspace networking — leave the kernel behind.

At the very top end — high-frequency trading, telco DPI, hyperscale load balancers — even io_uring is too much. Every packet still traverses the kernel's TCP/IP stack, the netfilter chains, the qdisc, the socket buffers. That's hundreds of nanoseconds per packet at minimum, dominated by cacheline bouncing.

DPDK (Intel, 2010) hands the NIC's RX/TX rings directly to a userspace poll-mode driver, bypassing the kernel entirely. A core pinned with isolcpus busy-polls the ring and pulls packets straight into application memory. Per-packet cost drops to ~50 ns, which works out to about 20 million packets per second per core: enough to saturate a 100 Gbps NIC at full-size frames, though not at minimum-size ones. The trade is total: you give up the kernel's TCP/IP stack, sockets API, firewall, and observability tooling. Most DPDK users either run a userspace stack (mTCP, F-Stack, VPP) or only do L2/L3 forwarding.

AF_XDP (Linux 4.18, 2018) is the kernel community's answer — a socket family that lets userspace allocate a UMEM ring shared with the kernel, with an eBPF program at the XDP hook redirecting selected packets straight to the ring. You get most of DPDK's throughput without giving up the kernel stack for everything else; Cilium and Cloudflare both lean on it.

The canonical examples are Google's Maglev and Facebook's Katran, both L4 load balancers that forward millions of packets per second per core: Maglev through its own kernel-bypass packet path, Katran through XDP/eBPF. See the load balancing deep dive in the networking stack for how that fits together with consistent hashing and DSR.

Model	Key syscall	Scalability	Used by
Blocking	`read()`	~10K fds, 1 thread each	Apache prefork, CGI
Non-blocking + select	`select()`	O(n), FD_SETSIZE = 1024	legacy daemons
Non-blocking + poll	`poll()`	O(n), no fd cap	older portable servers
epoll / kqueue / IOCP	`epoll_wait()`	O(1), 10M+ fds	nginx, Redis, Envoy, Node.js
io_uring	`io_uring_enter()`	0–1 syscalls per batch	ScyllaDB, Ceph, QEMU, Cassandra 5
Kernel bypass	none (PMD)	line-rate, per-core	DPDK, AF_XDP, Maglev, Katran