I/O models
A process wants to read from a socket or a disk; the data may or may not be there. What the kernel does in the gap — block the thread, return an error, queue a notification, or hand back a completion — is the single decision that shapes how many connections one box can hold and how cheaply it can hold them. The history of server software is mostly the history of this choice getting better.
The setup — a syscall, and nothing there yet.
A process calls read(fd, buf, n). The file descriptor points at a TCP socket
waiting on the other side of the planet, or a disk block that hasn't come back from the
SSD yet, or a pipe whose writer is busy. The data isn't ready. What is the kernel supposed
to do with the calling thread until it is?
Every I/O model is an answer to that single question. The thread can be put to sleep until the data arrives. It can be sent back immediately with an error telling it to try again. It can register interest with the kernel and be told later when something is ready. Or it can hand the kernel a whole batch of operations and pick up the results when they're done. Each choice changes the cost of supporting one more concurrent connection, and it's that per-connection cost that ultimately decides whether a single box serves a thousand clients or ten million.
The progression below is roughly chronological. Blocking is what every
Unix shipped with in the 1970s. Non-blocking + select()
arrived with BSD in the early 80s. epoll showed up in the Linux 2.5 development series (2002), shipped in 2.6,
and ate the server market in five years. io_uring landed in 5.1 (2019)
and is still in the process of doing the same to disk I/O.
The path from app to device.
Before picking a waiting model, it helps to see where the bytes actually travel. When your
code calls read(), that call is a system call: a controlled
trap into the kernel that switches the CPU from user mode to kernel mode, validates the
arguments, and routes the request to the right subsystem. For a file the request goes through
the virtual filesystem layer, the page cache, then the block layer and the device driver. For
a socket it goes through the protocol stack and the network driver. Either way the application
never touches the hardware itself; the kernel sits in the middle of every transfer.
The hardware does not copy bytes one register at a time. The driver hands the device a descriptor pointing at a region of physical memory and lets direct memory access (DMA) move the data. The NIC or the disk controller writes straight into RAM over the memory bus while the CPU does other work. When the transfer finishes the device raises an interrupt; the CPU stops what it is doing, jumps to the driver's handler, and the handler marks the I/O complete and wakes whatever was waiting. High-rate devices soften the interrupt cost with coalescing (one interrupt per batch) and NAPI-style polling, where the driver switches from interrupts to a poll loop once traffic is heavy enough that the interrupts themselves become the bottleneck.
That last hop matters for the rest of this page. After DMA lands the data in a kernel buffer,
a normal read() still copies it once more into your userspace buffer. That extra
copy, plus the mode switch on every syscall, is the overhead that zero-copy and io_uring exist
to shave off. Hold onto the picture: trap down, DMA in, interrupt up, copy out.
Blocking I/O — one thread per connection.
The default for any newly opened file descriptor. read() on an empty socket
puts the thread into TASK_INTERRUPTIBLE; the kernel parks it on the socket's
wait queue and runs something else. When a packet arrives, the network softirq wakes every
thread on that queue and the scheduler picks one to run. From the application's
perspective, read() just returned with bytes in the buffer; the wait was
invisible.
This is the easiest model to write. The code reads top to bottom — accept a connection,
read a request, do some work, write a response — and the kernel handles the waiting. It's
how Apache's prefork MPM worked, how every CGI script worked, and how most
tutorials still teach sockets. The problem is that one connection costs one thread, and
one thread costs ~8 MB of address space for its default stack plus a
task_struct and assorted kernel bookkeeping.
At a thousand connections you're fine. At ten thousand you're paying ~80 GB of virtual memory, fighting the scheduler's per-CPU runqueue, and burning serious time on context switches. This was the wall Dan Kegel named in 1999 as the C10K problem: how do you handle ten thousand simultaneous clients on one machine? Blocking I/O can't. Every model that follows is a way around it.
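A minimal sketch of what one blocking connection handler looks like, assuming a plain echo protocol. The names handle_connection and demo_blocking_echo are hypothetical, and a socketpair stands in for an accepted TCP socket. In a thread-per-connection server, each accepted socket would get its own thread running this function.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Echo everything received on fd back to the peer using only blocking calls.
 * Returns the total number of bytes echoed, or -1 on error. */
ssize_t handle_connection(int fd)
{
    char buf[4096];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);   /* thread sleeps here until data or EOF */
        if (n == 0) return total;                /* peer closed its write side */
        if (n < 0) {
            if (errno == EINTR) continue;        /* interrupted by a signal: retry */
            return -1;
        }
        ssize_t off = 0;
        while (off < n) {                        /* write() may accept fewer bytes */
            ssize_t w = write(fd, buf + off, (size_t)(n - off));
            if (w < 0) return -1;
            off += w;
        }
        assert(off == n);
        total += n;
    }
}

/* Usage sketch: returns the number of bytes echoed, or -1 on failure. */
ssize_t demo_blocking_echo(void)
{
    int sv[2];
    char out[8];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    if (write(sv[0], "ping", 4) != 4) return -1;
    shutdown(sv[0], SHUT_WR);                    /* lets the handler see EOF */
    ssize_t echoed = handle_connection(sv[1]);
    ssize_t back = read(sv[0], out, sizeof out); /* the echo comes back here */
    close(sv[0]);
    close(sv[1]);
    return (echoed == 4 && back == 4) ? echoed : -1;
}
```

The handler holds no explicit state machine: the position in the protocol is simply the program counter, which is why this model is so easy to write and why each connection needs its own stack.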
Non-blocking I/O — EAGAIN and a busy loop.
Open the descriptor with O_NONBLOCK (or fcntl(fd, F_SETFL,
O_NONBLOCK) after the fact) and read() changes character. If data is
available it copies it and returns; if not it returns -1 immediately and sets
errno to EAGAIN (or EWOULDBLOCK, the same value on
Linux). The thread never sleeps inside the syscall.
On its own this just shifts the waiting upstairs. An application managing N sockets has
to walk all of them, call read() on each, ignore the EAGAINs,
and try again — a busy loop that pegs a CPU at 100% to discover that nothing has
happened. The version that works needs a way to ask the kernel "tell me which of
these descriptors actually have something for me". That is what
select() and poll() are for, and what epoll later
made cheap.
Non-blocking on its own still has a role: paired with an event-loop multiplexer it's
essential, because a readiness notification says only that some data is there, not how
much, and the loop has to be able to drain a ready socket without blocking once the data
runs out. The standard idiom is
while ((n = read(fd, buf, sizeof buf)) > 0) process(buf, n); stopping on
EAGAIN, with a return of 0 meaning the peer has closed the connection.
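The drain idiom can be sketched as follows, assuming a pipe in place of a socket. The helper names set_nonblocking, drain and demo_drain are hypothetical. The point is the three-way split on read()'s result: data, end of file, or EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Switch an already-open descriptor to non-blocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read until the kernel has nothing more to give. Returns the bytes drained,
 * sets *eof when the peer has closed, and returns -1 on a real error. */
ssize_t drain(int fd, int *eof)
{
    char buf[4096];
    ssize_t total = 0;
    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += n; continue; }      /* got data, keep going */
        if (n == 0) { *eof = 1; return total; }   /* writer closed */
        if (errno == EINTR) continue;             /* signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;  /* empty for now */
        return -1;
    }
}

/* Usage sketch: empty pipe, then 5 bytes, then writer closed. Returns 0 on success. */
int demo_drain(void)
{
    int p[2], eof;
    if (pipe(p) != 0 || set_nonblocking(p[0]) != 0) return -1;
    ssize_t a = drain(p[0], &eof);                /* nothing written yet: EAGAIN */
    assert(a == 0 && !eof);
    if (write(p[1], "hello", 5) != 5) return -1;
    ssize_t b = drain(p[0], &eof);
    close(p[1]);
    ssize_t c = drain(p[0], &eof);                /* read() now returns 0 */
    close(p[0]);
    return (b == 5 && c == 0 && eof) ? 0 : -1;
}
```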
select() and poll() — O(n) over the FD set.
select(), inherited from 4.2BSD, takes three bitmaps (read, write, exception)
of file descriptors and blocks until any of them is ready or a timeout fires. The kernel
walks the entire bitmap on entry, queues the calling thread on every relevant socket,
then on wakeup walks the bitmap again to report which fds were the ones that triggered.
Cost is O(n) per call, both in kernel and userspace, where n is the
highest fd number plus one.
The historical kicker is FD_SETSIZE, the compile-time cap on the bitmap,
almost always 1024. Open the 1025th fd and pass it to FD_SET
and you get memory corruption — a footgun that took out plenty of servers in the 1990s.
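One defensive habit is to bounds-check the descriptor before it reaches FD_SET. The wrapper below is a sketch of that habit, and checked_fd_set is a hypothetical name, not a libc function.

```c
#include <assert.h>
#include <sys/select.h>

/* Add fd to set only if it fits inside the fixed-size bitmap.
 * Returns 0 on success, -1 if fd would index past the end. */
int checked_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) return -1;
    FD_SET(fd, set);
    assert(FD_ISSET(fd, set));
    return 0;
}

/* Usage sketch: fd 0 is accepted, fd FD_SETSIZE is refused. Returns 0 on success. */
int demo_checked_fd_set(void)
{
    fd_set rd;
    FD_ZERO(&rd);
    int ok  = checked_fd_set(0, &rd);
    int bad = checked_fd_set(FD_SETSIZE, &rd);
    return (ok == 0 && bad == -1 && FD_ISSET(0, &rd)) ? 0 : -1;
}
```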
poll() (System V, standardised in POSIX.1-2001) replaced the bitmaps with an
array of struct pollfd. No more FD_SETSIZE cap, no rebuilding the input set
every call. The complexity, though, is still O(n): kernel walks the array on every call,
userspace walks it again on return. At 10 000 connections each poll()
invocation copies 10 000 structs in and 10 000 out and scans both — minutes of CPU per
day for the privilege of finding out which ten sockets ticked.
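To make the two O(n) walks concrete, here is a small sketch built on poll(). The function readable_now and its fixed limit of 64 descriptors are assumptions made for the example. Note that the array is filled on the way in and scanned again on the way out, however few descriptors are ready.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_WATCH 64   /* arbitrary cap for this sketch */

/* Ask the kernel which of nfds descriptors are readable right now.
 * Writes the ready fds into out[] and returns how many there are, or -1. */
int readable_now(const int *fds, int nfds, int *out)
{
    struct pollfd pfds[MAX_WATCH];
    if (nfds < 0 || nfds > MAX_WATCH) return -1;
    for (int i = 0; i < nfds; i++) {          /* O(n) to build the input array */
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    int n = poll(pfds, (nfds_t)nfds, 0);      /* timeout 0: report and return */
    if (n < 0) return -1;
    int k = 0;
    for (int i = 0; i < nfds; i++)            /* O(n) again to find the ready ones */
        if (pfds[i].revents & POLLIN) out[k++] = pfds[i].fd;
    assert(k <= n);
    return k;
}

/* Usage sketch: three pipes, only the second has data. Returns the ready count. */
int demo_poll(void)
{
    int a[2], b[2], c[2], ready[3];
    if (pipe(a) != 0 || pipe(b) != 0 || pipe(c) != 0) return -1;
    int fds[3] = { a[0], b[0], c[0] };
    int before = readable_now(fds, 3, ready);
    if (write(b[1], "x", 1) != 1) return -1;
    int after = readable_now(fds, 3, ready);
    int ok = (before == 0 && after == 1 && ready[0] == b[0]);
    close(a[0]); close(a[1]); close(b[0]); close(b[1]); close(c[0]); close(c[1]);
    return ok ? after : -1;
}
```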
epoll, kqueue, IOCP — O(1) event multiplexing.
The fix in all three major kernels was the same idea: stop passing the whole set in and out every time. Register the descriptors with the kernel once, let the kernel keep the watch list, and only return the ones that are actually ready.
On Linux this is epoll, introduced in kernel 2.5.44 (2002) and
stabilised in 2.6. You call epoll_create1() to get an epoll fd, then
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) once per descriptor to register
it. epoll_wait() then returns just the subset of descriptors that have
fired, in O(number-of-ready-events). Internally the kernel keeps the watched fds in a
red-black tree and the ready ones in a linked list updated by the same wakeup that the
blocking model uses.
BSD's answer was kqueue (Jonathan Lemon, FreeBSD 4.1, 2000) — a more
general mechanism that handles sockets, files, signals, timers, and process events
through one kevent structure. Windows had I/O Completion Ports
(IOCP) from NT 3.5 (1994) onwards, which take a step further into completion-based I/O
rather than readiness-based. macOS uses kqueue.
fs.nr_open/net.core.somaxconn sysctls will hold
10 million+ idle TCP connections on a single epoll fd, the regime Errata
Security and WhatsApp's engineering posts have documented as C10M.
```c
/* Linux: register a socket with epoll and reap ready events */
#include <sys/epoll.h>
#include <unistd.h>

/* sock must already be non-blocking, or the drain loop below will hang */
void event_loop(int sock)
{
    char buf[4096];
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,   /* edge-triggered */
        .data.fd = sock,
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

    struct epoll_event events[256];
    for (;;) {
        int n = epoll_wait(epfd, events, 256, -1);   /* blocks */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            ssize_t got;
            /* drain until EAGAIN (edge-triggered contract) */
            while ((got = read(fd, buf, sizeof buf)) > 0) {
                /* handle the got bytes now sitting in buf */
            }
        }
    }
}
```
Two modes matter. Level-triggered (the default, like poll)
reports a descriptor as ready every time you wait, as long as it has data. Edge-triggered
(EPOLLET) reports it once when it becomes ready, and you must drain
it completely, because no further notification arrives until new data does. ET cuts redundant
wakeups and is what nginx uses, though fast loops such as libuv and Redis stay level-triggered;
it's also where the famous "drain to EAGAIN" idiom comes from.
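The difference between the two modes is easiest to see by waiting twice on a descriptor that has unread data. The sketch below does that on a pipe. The names wait_twice and demo_epoll_modes are hypothetical, and it assumes a Linux system with epoll available.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register rfd for EPOLLIN plus extra_flags, then wait twice without reading.
 * Stores how many events each wait reported. Returns 0, or -1 on setup error. */
int wait_twice(int rfd, uint32_t extra_flags, int *first, int *second)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) return -1;
    struct epoll_event ev = { .events = EPOLLIN | extra_flags, .data.fd = rfd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, rfd, &ev) != 0) {
        close(epfd);
        return -1;
    }
    struct epoll_event out[4];
    *first  = epoll_wait(epfd, out, 4, 0);   /* timeout 0: report and return */
    *second = epoll_wait(epfd, out, 4, 0);   /* nothing was read in between */
    close(epfd);
    return 0;
}

/* Usage sketch: one unread byte in a pipe, observed in both modes.
 * Returns 0 if level-triggered reports twice and edge-triggered once. */
int demo_epoll_modes(void)
{
    int p[2], lt1, lt2, et1, et2;
    if (pipe(p) != 0 || write(p[1], "x", 1) != 1) return -1;
    if (wait_twice(p[0], 0, &lt1, &lt2) != 0) return -1;        /* level-triggered */
    if (wait_twice(p[0], EPOLLET, &et1, &et2) != 0) return -1;  /* edge-triggered */
    close(p[0]);
    close(p[1]);
    assert(lt1 == 1 && et1 == 1);   /* both modes report the first readiness */
    return (lt2 == 1 && et2 == 0) ? 0 : -1;
}
```

The second edge-triggered wait comes back empty even though the byte is still sitting in the pipe, which is exactly the situation a loop that stops reading early would find itself in.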
Thread-per-connection vs event loop.
With epoll in place the architectural question becomes: do you keep one thread per connection and let the kernel block them (now cheaper, since you can pick lighter threads or coroutines), or do you keep one thread per CPU running an event loop that demultiplexes thousands of connections?
Apache prefork and most JVM servlet containers picked threads. The model is
straightforward, blocking calls work, and per-request state lives on the stack. The cost
is the memory and context-switch overhead — modest with a few hundred threads, painful
past a few thousand, and incidentally why Go's goroutines and Java's virtual threads
(Project Loom, JDK 21) exist: they keep the blocking programming model while making the
"thread" a ~2 KB stackful userspace object that the runtime multiplexes onto OS threads
over an epoll loop.
nginx, Redis, Node.js, HAProxy, and Envoy went the other way: one event loop per CPU,
everything non-blocking, request state lives in heap-allocated state machines. The win is
that idle connections cost only the per-fd entry in the epoll tree (~hundred bytes); the
pain is that any accidentally blocking call — a synchronous disk read, a DNS lookup, a
malloc() under memory pressure — freezes the whole loop. Node.js's
reputation for getting "stuck" almost always traces back to a blocking call sneaking into
the event-loop thread.
io_uring — one syscall, many operations.
Even with epoll, every read and write is still its own syscall, and every syscall pays a mode-switch tax that the Meltdown/Spectre mitigations roughly doubled (KPTI alone added ~500 ns on Skylake). At a million ops/sec that's serious money.
io_uring, added by Jens Axboe in Linux 5.1 (May 2019), restructures the
interface entirely. The kernel and the application share two ring buffers via
mmap: a submission queue (SQ) where the app writes
operation descriptors, and a completion queue (CQ) where the kernel
writes results. The application fills entries into the SQ and calls
io_uring_enter() once to tell the kernel "there's new work". The
kernel processes the entries — possibly thousands — and writes outcomes to the CQ. The
app reads completions out of the CQ without another syscall.
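The ring protocol itself is simple enough to model in a few lines. What follows is a toy, single-threaded model of the submission and completion queues. It is not the kernel ABI and does not call io_uring at all: the struct layouts, the toy_ names and the "double the operand" operation are all invented for illustration, and the real rings also need memory barriers because two parties touch them concurrently.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                    /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct toy_sqe { uint64_t user_data; int operand; };   /* "please do this" */
struct toy_cqe { uint64_t user_data; int result; };    /* "here is what happened" */

struct toy_ring {
    uint32_t sq_head, sq_tail;             /* consumer moves head, producer moves tail */
    uint32_t cq_head, cq_tail;
    struct toy_sqe sq[RING_ENTRIES];
    struct toy_cqe cq[RING_ENTRIES];
};

/* Application side: queue one operation. Nothing resembling a syscall happens. */
int toy_submit(struct toy_ring *r, uint64_t user_data, int operand)
{
    if (r->sq_tail - r->sq_head == RING_ENTRIES) return -1;   /* SQ is full */
    r->sq[r->sq_tail & RING_MASK] = (struct toy_sqe){ user_data, operand };
    r->sq_tail++;
    return 0;
}

/* Stand-in for the kernel after a single enter call: consume every queued
 * entry and post one completion for each. Returns how many were processed. */
int toy_enter(struct toy_ring *r)
{
    int done = 0;
    while (r->sq_head != r->sq_tail && r->cq_tail - r->cq_head < RING_ENTRIES) {
        struct toy_sqe e = r->sq[r->sq_head & RING_MASK];
        r->sq_head++;
        r->cq[r->cq_tail & RING_MASK] = (struct toy_cqe){ e.user_data, e.operand * 2 };
        r->cq_tail++;
        done++;
    }
    return done;
}

/* Application side: reap one completion. Returns 1 if one was available. */
int toy_reap(struct toy_ring *r, struct toy_cqe *out)
{
    if (r->cq_head == r->cq_tail) return 0;
    *out = r->cq[r->cq_head & RING_MASK];
    r->cq_head++;
    return 1;
}

/* Usage sketch: three submissions, one enter, three completions. */
int demo_toy_ring(void)
{
    struct toy_ring r = {0};
    for (int i = 1; i <= 3; i++)
        if (toy_submit(&r, (uint64_t)(100 + i), i) != 0) return -1;
    int processed = toy_enter(&r);          /* the only "kernel crossing" */
    assert(processed == 3);
    struct toy_cqe c;
    int sum = 0;
    while (toy_reap(&r, &c)) sum += c.result;
    return sum;                             /* 2 + 4 + 6 */
}
```

Head and tail are free-running counters masked on access, so a full ring and an empty ring are distinguishable without wasting a slot. The user_data field is how a completion is matched back to the request that caused it, since completions need not arrive in submission order in the real interface.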
With IORING_SETUP_SQPOLL a kernel thread polls the SQ continuously, and the
application doesn't even need to call io_uring_enter() in the common case —
zero syscalls per operation in steady state. With IORING_REGISTER_BUFFERS
you pin a set of userspace buffers once and refer to them by index, skipping the
get_user_pages dance on every I/O.
IOSQE_IO_LINK) gives a chain — accept then read then write — submitted in
one go, the closest mainstream Linux has come to a kernel-side coroutine.
Buffered vs direct I/O.
There is a second axis that cuts across all the waiting models: whether reads and writes go
through the kernel's page cache. By default they do. A buffered read() on a file
that is already cached returns instantly from RAM with no device access at all; a buffered
write() usually just marks the page dirty and returns, leaving the actual disk
write to a background flusher thread that batches dirty pages and writes them out later. The
page cache is why a second read of the same file is so much faster than the first, and why a
machine with plenty of free memory feels quick even on a slow disk.
That caching is a gift for general workloads and a problem for one specific kind of program:
the database. A database already keeps its own buffer pool tuned to its access pattern, so the
kernel's page cache just double-buffers the same data, wasting RAM and adding an unpredictable
copy. Opening a file with O_DIRECT tells the kernel to skip the page cache and
DMA straight between the device and an aligned userspace buffer. The application gives up free
caching and readahead, and takes on strict alignment rules, in exchange for predictable
latency and control over exactly what is in memory. PostgreSQL, MySQL InnoDB, and most
storage engines offer a direct-I/O mode for precisely this reason.
Neither mode is strictly better. Buffered I/O wins for anything that benefits from shared caching and readahead, which is most software. Direct I/O wins when the application knows its own data better than the kernel can guess and needs the latency to be steady rather than merely fast on average. The choice is a statement about who owns the cache, not about speed alone.
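The alignment rules are the part of O_DIRECT that most often trips people up, so here is a hedged sketch of a single direct read. The function name direct_read_first_block is hypothetical, and the 4096-byte alignment is an assumption: the real requirement depends on the device and filesystem, and some filesystems reject O_DIRECT outright, in which case the function simply returns -1.

```c
#define _GNU_SOURCE                 /* O_DIRECT is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define DIO_ALIGN 4096              /* assumed; query the device in real code */

/* Read the first DIO_ALIGN bytes of path, bypassing the page cache.
 * On success returns the byte count and hands back a buffer the caller frees.
 * Returns -1 if the file cannot be opened or read with O_DIRECT. */
ssize_t direct_read_first_block(const char *path, void **buf_out)
{
    void *buf = NULL;
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0) return -1;
    assert(((uintptr_t)buf % DIO_ALIGN) == 0);    /* buffer address is aligned */

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    ssize_t n = pread(fd, buf, DIO_ALIGN, 0);     /* length and offset aligned too */
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *buf_out = buf;
    return n;
}
```

Three things have to be aligned at once: the buffer address, the transfer length, and the file offset. Getting any one of them wrong typically shows up as EINVAL from the read rather than from the open.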
Zero-copy — stop touching the data.
Think about the most ordinary server task: read a file and send it down a
socket. Done the naive way that is
read() into a userspace buffer then write() to the socket, and it
moves the same bytes four times (DMA from disk into the page cache, a CPU copy up into the
app buffer, a CPU copy back down into the socket buffer, DMA out to the NIC), crossing the
user-kernel boundary twice, and it costs two syscalls with a mode switch into and out of the
kernel for each. The application never even looks at the data; it is pure ferrying.
sendfile() collapses that. It tells the kernel to move data from a file
descriptor to a socket descriptor without ever surfacing it to userspace. The bytes go from
the page cache to the socket buffer inside the kernel, and on hardware that supports
scatter-gather DMA the kernel can hand the NIC a pointer into the page cache directly, so the
only copies left are the two DMA transfers the hardware has to do anyway. This is how nginx,
Apache, and Kafka serve static content and log segments at near line rate. Kafka's throughput
story is largely a zero-copy story: messages land in the page cache on write and are sent to
consumers with sendfile(), never passing through the broker's heap.
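A minimal sketch of the sendfile() path, assuming Linux, a temporary file under /tmp, and a socketpair standing in for a client connection. The names send_whole_file and demo_sendfile are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the first len bytes of in_fd (a regular file) to out_fd without
 * copying them through a userspace buffer. Returns bytes sent, or -1. */
ssize_t send_whole_file(int out_fd, int in_fd, size_t len)
{
    off_t off = 0;                  /* sendfile advances this, not the file position */
    size_t left = len;
    while (left > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &off, left);
        if (n < 0) return -1;
        if (n == 0) break;          /* file was shorter than len */
        left -= (size_t)n;
    }
    assert((size_t)off == len - left);
    return (ssize_t)(len - left);
}

/* Usage sketch: returns the number of bytes that arrived intact, or -1. */
int demo_sendfile(void)
{
    const char msg[] = "static content";          /* 14 bytes without the NUL */
    const size_t len = sizeof msg - 1;
    char path[] = "/tmp/sendfile-demo-XXXXXX";
    char got[32];
    int sv[2];

    int file = mkstemp(path);
    if (file < 0) return -1;
    unlink(path);                                 /* file lives until closed */
    if (write(file, msg, len) != (ssize_t)len) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    ssize_t sent = send_whole_file(sv[0], file, len);
    ssize_t n = read(sv[1], got, sizeof got);     /* the "client" side */
    close(file);
    close(sv[0]);
    close(sv[1]);
    if (sent != (ssize_t)len || n != (ssize_t)len) return -1;
    return memcmp(got, msg, len) == 0 ? (int)n : -1;
}
```

The server code never declares a data buffer at all; the only buffer in the example belongs to the receiving side. The loop matters because sendfile(), like write(), may transfer fewer bytes than requested.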
splice() generalises the idea. It moves data between two descriptors through a
kernel pipe buffer, so you can wire a file to a socket, a socket to a file, or one socket to
another without the bytes touching userspace. Its sibling vmsplice() maps
userspace pages into a pipe, and tee() duplicates a pipe's contents without
consuming them. Together these let a proxy shovel data between two connections at very low
cost. io_uring exposes splice as an operation too, so a zero-copy transfer can be
batched into the same ring as everything else and submitted with no per-operation syscall at
all.
Why old AIO (libaio) disappointed.
Linux had an asynchronous I/O interface long before io_uring — io_submit()
and io_getevents(), generally accessed through libaio. On paper
it looked similar: submit operations in a batch, reap completions later. In practice
almost nothing used it.
It was O_DIRECT only — buffered file I/O secretly became synchronous,
defeating the whole point for anyone not doing direct disk access. The API was crusty
(iocb structs with union-typed fields, no symmetry with the rest of POSIX),
network I/O was unsupported, and metadata operations like fsync would block
anyway. The result was that libaio only ever made sense for databases doing raw block I/O
(MySQL InnoDB, Oracle, ScyllaDB), and even there the codebases tended to keep a
fallback thread pool because too many edge cases hit a blocking path.
io_uring was Axboe's deliberate redesign. It handles buffered and direct I/O equally,
extends to every syscall worth batching, and gives back completions through a uniform
ring. The libaio interface is now deprecated in spirit (still in the kernel for ABI
reasons) and new code should reach for io_uring directly or via liburing.
Userspace networking — leave the kernel behind.
At the very top end — high-frequency trading, telco DPI, hyperscale load balancers — even io_uring is too much. Every packet still traverses the kernel's TCP/IP stack, the netfilter chains, the qdisc, the socket buffers. That's hundreds of nanoseconds per packet at minimum, dominated by cacheline bouncing.
DPDK (Intel, 2010) hands the NIC's RX/TX rings directly to a userspace
poll-mode driver, bypassing the kernel entirely. A core pinned with isolcpus
busy-polls the ring and pulls packets straight into application memory. Per-packet cost
drops to ~50 ns, or about 20 million packets per second per core: enough to saturate a
100 Gbps NIC at full-size frames, though not at minimum-size ones. The trade is total: you give up
the kernel's TCP/IP stack, sockets API, firewall, and observability tooling. Most DPDK
users either run a userspace stack (mTCP, F-Stack, VPP) or only do L2/L3 forwarding.
AF_XDP (Linux 4.18, 2018) is the kernel community's answer: a socket family that lets userspace allocate a UMEM region and rings shared with the kernel, with an eBPF program at the XDP hook redirecting selected packets straight to them. You get most of DPDK's throughput without giving up the kernel stack for everything else. Cloudflare has written about using it, and Cilium builds on the same XDP hook.
The canonical examples are Google's Maglev, which bypasses the kernel with its own userspace packet path, and Facebook's Katran, which is built on XDP and eBPF. Both are L4 load balancers that forward millions of packets per second per core. See the load balancing deep dive in the networking stack for how that fits together with consistent hashing and DSR.
| Model | Key syscall | Scalability | Used by |
|---|---|---|---|
| Blocking | read() | ~10K fds, 1 thread each | Apache prefork, CGI |
| Non-blocking + select | select() | O(n), FD_SETSIZE = 1024 | legacy daemons |
| Non-blocking + poll | poll() | O(n), no fd cap | older portable servers |
| epoll / kqueue / IOCP | epoll_wait() | O(1), 10M+ fds | nginx, Redis, Envoy, Node.js |
| io_uring | io_uring_enter() | 0–1 syscalls per batch | ScyllaDB, Ceph, QEMU, Cassandra 5 |
| Kernel bypass | none (PMD) | line-rate, per-core | DPDK, AF_XDP, Maglev, Katran |
Further reading.
- Dan Kegel — The C10K Problem (1999) — the page that named the problem and surveyed every contemporary answer. Still the clearest framing of why blocking I/O hits a wall.
- Jens Axboe — Efficient IO with io_uring — the whitepaper from the author. Walks through the ring layout, the submission and completion paths, and SQPOLL.
- Shuveb Hussain — Lord of the io_uring — the most thorough tutorial outside the whitepaper, with working examples.
- Jonathan Lemon — Kqueue: A generic and scalable event notification facility (2001) — the design paper for BSD's answer to the same problem, predating epoll by a year.
- ScyllaDB — How io_uring and eBPF will revolutionise programming in Linux — production-grade analysis from a database that rebuilt its disk path on io_uring.
- NGINX — Inside NGINX, how we designed for performance and scale — the canonical writeup of an epoll-driven event-loop server.
- Semicolony — Event loops — how the userspace half of epoll-based servers actually works.
- Semicolony — Go's netpoller — how Go hides epoll behind the blocking net.Conn API using goroutine-park / wake.
- Semicolony — Load balancing (Maglev, Katran) — where kernel bypass meets consistent hashing at hyperscale.