Inter-process communication

Once you split work across processes — for isolation, for fault tolerance, for a language boundary — they have to talk. The kernel offers half a dozen mechanisms ranging from a one-line pipe() call to shared pages and lock-free ring buffers. Picking the right one is mostly a question of how much you care about latency, decoupling, and the cost of writing the synchronisation yourself.

Why IPC exists in the first place.

A single process with a few threads is the cheapest way to share data — every thread sees the same address space, and a pointer is a pointer. The moment you split that work across separate processes, those pointers stop meaning anything to anyone else. The OS gives each process its own page tables, its own heap, its own file-descriptor table. Sharing now requires the kernel's help.

People split work across processes for good reasons. Isolation: a crash in one doesn't take the other down (this is most of why Chrome runs a process per tab). Privilege separation: one process holds the credentials, another handles untrusted input. Language boundaries: Python here, Rust there. Independent deployment: restart the request handler without restarting the database. Once you've made that choice, IPC is the bill you pay.

The kernel offers a small menu. Shared memory is fastest — once mapped, there are no syscalls, just memory accesses. Pipes are simplest — a single call sets one up. Unix domain sockets are the most general — full bidirectional byte streams with credentials passing. Message queues are the most decoupled — sender and receiver don't have to be alive at the same instant. Signals are the cheapest control plane. Everything below is which to reach for and when.

It helps to hold one picture in your head before the details. Each process is a private address space the kernel set up, and the page tables that translate its virtual addresses to physical frames belong to that process alone. A pointer like 0x7ffd4c00 in process A maps to one frame; the same number in process B maps to a different frame, or to nothing. So "sending data" between two processes is really one of two physical acts. Either the kernel copies bytes out of A's frames into its own buffers and then out again into B's frames, which is what pipes, sockets, and queues do, or the kernel arranges for one physical frame to appear in both page tables at once, which is what shared memory and memory-mapped files do. The first family is easy and safe and costs you copies. The second family is fast because nothing is copied, and costs you the synchronisation you now have to write yourself. Almost every trade-off on this page is a version of that one choice.

The two physical models. The copy model is simple and safe; the share model is fast and hands you the locking.

Pipes and FIFOs — a one-way byte hose.

pipe(int fds[2]) returns a pair of descriptors: fds[0] for reading, fds[1] for writing. Whatever bytes you write into the write end come out in order at the read end. The kernel sits in the middle with a circular buffer — 64 KB by default on Linux (fcntl(F_SETPIPE_SZ) can grow it up to /proc/sys/fs/pipe-max-size). The writer blocks when the buffer is full, the reader blocks when it's empty. Close the write end and the reader sees EOF.

Anonymous pipes only work between processes that share the descriptor. In practice that means parent and child: you call pipe(), then fork(), and both processes inherit both ends. Each closes the end it doesn't need, and you've got one-way communication. The shell does exactly this for ls | grep foo — ls's stdout is dup2'd onto the pipe's write end, grep's stdin onto the read end.
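The pipe-then-fork sequence above can be sketched in a few lines. This is a minimal illustration rather than production code: the helper name pipe_roundtrip and its signature are made up for this example, and error handling is deliberately thin.

```c
#define _GNU_SOURCE            /* keeps POSIX declarations visible under strict -std= modes */
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes msg into a pipe, the parent reads
   it back into out. Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t cap) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (pid == 0) {                 /* child: the writer */
        close(fds[0]);              /* drop the read end it does not need */
        ssize_t w = write(fds[1], msg, strlen(msg));
        close(fds[1]);              /* last write end closed: reader gets EOF */
        _exit(w < 0 ? 1 : 0);
    }

    close(fds[1]);                  /* parent: the reader; if it kept the write
                                       end open, EOF would never arrive */
    size_t total = 0;
    ssize_t n;
    while (total < cap && (n = read(fds[0], out + total, cap - total)) > 0)
        total += (size_t)n;         /* loop until EOF or the buffer is full */

    close(fds[0]);
    waitpid(pid, NULL, 0);          /* reap the child */
    return (ssize_t)total;
}
```

The detail that most often goes wrong is the parent's close(fds[1]): as long as any process still holds a write end, the reader blocks instead of seeing EOF.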

Named pipes, also called FIFOs (mkfifo(path, mode)), give the pipe a path in the filesystem. Any process with the right permissions can open it; the kernel buffers between them. Useful when the two processes have no parent-child relationship: logging daemons, ad-hoc shell glue, plugin endpoints. The byte stream has no message boundaries, so writers and readers have to agree on framing themselves.

Unix domain sockets — TCP without the network.

A Unix domain socket (AF_UNIX) speaks the standard sockets API — socket(), bind(), listen(), accept(), connect(), send(), recv() — but the endpoints are filesystem paths instead of IP addresses, and the traffic never leaves the kernel. No checksums, no fragmentation, no routing, no congestion control. Round-trip latency is single-digit microseconds; throughput on a single connection regularly clears 10 GB/s on modern hardware.

Three socket types are available; the first two mirror TCP and UDP. SOCK_STREAM gives a reliable, ordered byte stream with backpressure, and is the workhorse. SOCK_DGRAM preserves message boundaries and, unlike UDP, is reliable too (the kernel does not drop a datagram it has accepted), which is useful when you want one syscall to send and one to receive an entire message. SOCK_SEQPACKET combines both: a connection-oriented, ordered, reliable socket that also preserves message boundaries.
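To make the message-boundary point concrete, here is a small sketch using socketpair() with SOCK_SEQPACKET. The helper name seqpacket_boundaries is invented for this example; both ends live in one process only to keep it short.

```c
#define _GNU_SOURCE            /* keeps POSIX declarations visible under strict -std= modes */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send two messages over an AF_UNIX SOCK_SEQPACKET
   pair, then receive twice into a buffer big enough for both. Stores the
   size of each recv() in sizes[0] and sizes[1]. Returns 0 on success. */
int seqpacket_boundaries(ssize_t sizes[2]) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0) return -1;

    int rc = -1;
    char buf[64];
    if (send(sv[0], "ab", 2, 0) == 2 && send(sv[0], "cde", 3, 0) == 3) {
        /* Each recv() returns exactly one message, even though the
           buffer could hold both. */
        sizes[0] = recv(sv[1], buf, sizeof buf, 0);
        sizes[1] = recv(sv[1], buf, sizeof buf, 0);
        rc = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

With SOCK_STREAM in place of SOCK_SEQPACKET, the first recv() would be free to return all five bytes at once, which is why stream protocols need their own framing.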

The killer feature is ancillary data via sendmsg/recvmsg. You can pass file descriptors between processes (SCM_RIGHTS): the kernel installs a fresh fd in the receiver pointing at the same open file. You can pass peer credentials (SCM_CREDENTIALS on Linux, getpeereid() on BSD), so a server can authenticate the caller without a handshake. This is what makes Unix sockets the default local IPC: Docker's daemon listens on /var/run/docker.sock, systemd's socket activation can listen on them on a service's behalf, X11 can use one, and PostgreSQL / Redis / MySQL all prefer the Unix-socket path for local clients over loopback TCP. The usual figure quoted is roughly 30% lower latency, because the traffic skips the loopback's per-packet TCP/IP processing.
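Descriptor passing is easier to follow in code than in prose. The sketch below sends the read end of a pipe across a socketpair using SCM_RIGHTS and then reads through the received copy. The function names send_fd, recv_fd, and fd_passing_demo are illustrative, not part of any library, and the demo's error paths leak descriptors for brevity.

```c
#define _GNU_SOURCE            /* keeps POSIX declarations visible under strict -std= modes */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket. */
int send_fd(int sock, int fd) {
    char byte = 'F';   /* at least one byte of ordinary data travels with it */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    memset(&ctl, 0, sizeof ctl);
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl.buf,
                          .msg_controllen = sizeof ctl.buf };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd, or -1 on error. */
int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl.buf,
                          .msg_controllen = sizeof ctl.buf };
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;         /* a fresh descriptor number for the same open file */
}

/* Demo: pass a pipe's read end across a socketpair and read through the
   received copy. Returns the number of bytes read, or -1 on error. */
int fd_passing_demo(char *out, size_t cap) {
    int sv[2], p[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    if (pipe(p) < 0) return -1;
    if (write(p[1], "hi", 2) != 2) return -1;
    if (send_fd(sv[0], p[0]) < 0) return -1;
    int got = recv_fd(sv[1]);
    if (got < 0) return -1;
    ssize_t n = read(got, out, cap);
    close(got); close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
    return (int)n;
}
```

In a real system the two socket ends would sit in different processes; the kernel does the same work either way, installing a new descriptor in whoever calls recvmsg.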

Message queues — prioritised, persistent, mostly historical.

Two flavours ship with Linux. System V message queues (msgget, msgsnd, msgrcv) date to early Unix and identify queues by an integer key. Messages have a type field; readers can ask for "any message" or "type ≤ N", which gives a crude form of prioritisation. POSIX message queues (mq_open, mq_send, mq_receive) are the modern incarnation — names look like /myqueue, queues live under /dev/mqueue, and each message carries an explicit 0–32767 priority that the kernel uses to order delivery.

Both are bounded — /proc/sys/fs/mqueue/msg_max caps the depth, default 10 messages per queue on stock Linux, raisable to 65536. Once full, mq_send either blocks or fails with EAGAIN. Messages persist until read or until the queue is unlinked, so a producer can write before the consumer exists.
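The System V type field is the less obvious of the two prioritisation schemes, so a sketch may help. It assumes System V IPC is enabled on the host; the struct demo_msg layout and the function name sysv_priority_order are invented for this example. Passing a negative msgtyp to msgrcv asks for the message with the lowest type that is less than or equal to its absolute value.

```c
#define _GNU_SOURCE            /* keeps SysV IPC declarations visible under strict -std= modes */
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* The kernel requires a long type field first, with a value > 0. */
struct demo_msg { long mtype; char text[16]; };

/* Hypothetical demo: enqueue types 3, 1, 2, then dequeue with msgtyp = -3.
   Writes the types in the order received to order[0..2]. Returns 0 on success. */
int sysv_priority_order(long order[3]) {
    int q = msgget(IPC_PRIVATE, 0600);     /* private queue, no key needed */
    if (q < 0) return -1;

    const long types[3] = {3, 1, 2};
    int rc = 0;
    for (int i = 0; i < 3 && rc == 0; i++) {
        struct demo_msg m = { .mtype = types[i] };
        strcpy(m.text, "payload");
        /* The size argument counts the payload only, not mtype. */
        if (msgsnd(q, &m, sizeof m.text, 0) < 0) rc = -1;
    }
    for (int i = 0; i < 3 && rc == 0; i++) {
        struct demo_msg m;
        if (msgrcv(q, &m, sizeof m.text, -3, IPC_NOWAIT) < 0) rc = -1;
        else order[i] = m.mtype;
    }
    msgctl(q, IPC_RMID, NULL);  /* SysV queues outlive the process unless removed */
    return rc;
}
```

The explicit IPC_RMID at the end is the other half of the persistence story: nothing cleans the queue up when the process exits.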

In practice almost nobody reaches for these in new code. Application-level queues — Redis, RabbitMQ, Kafka, NATS, SQS — give you network reach, real durability, replay, and tooling that POSIX queues never grew. Kernel message queues survive in tightly embedded systems and a handful of real-time codebases that need the priority-based delivery guarantee without a broker in the loop.

Shared memory — the same physical page in two processes.

The fastest IPC mechanism, by a wide margin, because once it's set up there is no kernel involvement at all. Two processes call shm_open("/name", ...) to get an fd backed by a tmpfs file under /dev/shm, size it with ftruncate, and then mmap(... MAP_SHARED, fd, 0). Both mappings point at the same physical frames; a store in one process is visible to the other as soon as the cache line propagates. Latency is nanoseconds, not microseconds.

The older System V interface (shmget, shmat, shmdt) still works and a lot of legacy code uses it; the POSIX/shm_open path is the modern recommendation because it composes with the rest of the fd-based API. memfd_create on Linux gives you an anonymous shared-memory fd you can pass over a Unix socket without ever touching the filesystem.

Why shm is fastest. Every other mechanism copies bytes through the kernel — a pipe write traps into ring 0, the kernel copies your buffer into its own circular buffer, the reader's read() traps back in and copies it out again. That's two copies and at least one round-trip across the kernel boundary, with a ~100 ns syscall floor on modern Linux and the cache-line ping-pong on top. Shared memory replaces all of that with a single store to a cache line both cores already have access to. The catch is that you now own the synchronisation — no kernel means no automatic ordering. Place a pthread_mutex_t with PTHREAD_PROCESS_SHARED inside the mapping, or a futex, or a lock-free ring with atomic load/store and explicit memory fences. The fast path is fast precisely because the kernel isn't watching.

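/* Linux: two processes share a 4 KiB page via shm_open + mmap.
   Link with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Create (or open) a named object under /dev/shm and size it. */
    int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0) { perror("shm_open/ftruncate"); return 1; }

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping stays valid after the fd is closed */

    /* fork(); both halves now see the same physical page at *p.
       No syscall on the steady-state read/write path, just
       load and store. Lock with a PTHREAD_PROCESS_SHARED mutex
       or an atomic placed inside the mapping. */
    __atomic_store_n((int *)p, 42, __ATOMIC_RELEASE);

    munmap(p, 4096);
    shm_unlink("/demo");  /* the name persists in /dev/shm until unlinked */
    return 0;
}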
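To see the "you own the synchronisation" point in action, here is a sketch where a parent and child both increment one counter in a shared page. It uses an anonymous MAP_SHARED mapping instead of shm_open so that no name has to be cleaned up, and C11 atomics instead of a mutex. The function name shared_counter is made up, and the static assertion encodes the assumption that atomic_long is lock-free on the target, which is what makes it safe across processes.

```c
#define _GNU_SOURCE            /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: a lock-free atomic_long, so the atomicity comes from the CPU
   and not from a hidden per-process lock. */
static_assert(ATOMIC_LONG_LOCK_FREE == 2, "atomic_long must be lock-free");

/* Hypothetical demo: parent and child each add `iters` to one counter that
   lives in a shared page. Returns the final value, or -1 on error. */
long shared_counter(int iters) {
    /* MAP_SHARED | MAP_ANONYMOUS: no name, no fd, inherited across fork(). */
    atomic_long *ctr = mmap(NULL, sizeof *ctr, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ctr == MAP_FAILED) return -1;
    atomic_init(ctr, 0);

    pid_t pid = fork();
    if (pid < 0) { munmap(ctr, sizeof *ctr); return -1; }
    if (pid == 0) {                                   /* child */
        for (int i = 0; i < iters; i++) atomic_fetch_add(ctr, 1);
        _exit(0);
    }
    for (int i = 0; i < iters; i++) atomic_fetch_add(ctr, 1);  /* parent */
    waitpid(pid, NULL, 0);

    long total = atomic_load(ctr);
    munmap(ctr, sizeof *ctr);
    return total;
}
```

Replace atomic_fetch_add with a plain `(*ctr)++` on a non-atomic long and the total will usually come up short under contention, which is the silent data race the prose warns about.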

Copy cost versus zero-copy.

It is worth slowing down on the number that decides most of these choices: the cost of a copy. When you write a megabyte into a pipe, the kernel reads it out of your buffer and writes it into its own. When the reader calls read, the kernel reads it out of that buffer and writes it into the reader's. The data crossed the user/kernel boundary twice and was physically copied twice. At a few gigabytes per second of memory bandwidth per core, a megabyte is hundreds of microseconds of pure copying, plus the two syscall traps at roughly a hundred nanoseconds each, plus whatever cache pollution the copies cause. For small control messages none of this matters. For a video frame, a model weight tensor, or a database page moving sixty times a second, it is the whole budget.

Shared memory makes that cost vanish because there is nothing to copy. The producer writes the frame once, into a page the consumer already maps, and the consumer reads it in place. This is what people mean by zero-copy: the bytes are produced and consumed in the same physical frame, and the only thing that crosses between the processes is a small notification saying "frame ready." That notification can itself be an eventfd write or a byte on a pipe; the heavy payload never moves. The cost you pay instead is that two processes are now reading and writing the same memory with no kernel arbitrating, so you have to build the ordering yourself, which is the subject of the synchronisation page.

The common high-throughput pattern: bulk data in a shared-memory ring, a tiny eventfd doorbell to wake the reader.

This split — bulk over shared memory, notification over something cheap — is the shape of nearly every high-performance local pipeline. The kernel's own splice and vmsplice calls chase the same goal from the other direction, moving pages between a pipe and a file without copying them through user space. io_uring pushes it further, letting a process submit reads and writes through a ring it shares with the kernel, so even the syscall trap mostly disappears. The throughline is that fast IPC is the steady removal of copies and traps until the only thing left moving is the data itself.
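A stripped-down version of the bulk-plus-doorbell pattern might look like the following. It is a single-slot sketch rather than a ring: one shared 1 MiB frame, one eventfd. The names doorbell_demo and FRAME_BYTES are invented for the example, and it relies on the eventfd write and read to order the producer's stores before the consumer's loads.

```c
#define _GNU_SOURCE            /* for MAP_ANONYMOUS under strict -std= modes */
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { FRAME_BYTES = 1 << 20 };   /* 1 MiB "frame"; size chosen for illustration */

/* Hypothetical demo: the child fills a shared frame and rings an eventfd;
   the parent blocks on the eventfd, then reads the frame in place.
   Returns the last byte the consumer saw in the frame, or -1 on error. */
int doorbell_demo(void) {
    unsigned char *frame = mmap(NULL, FRAME_BYTES, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int bell = eventfd(0, 0);
    if (frame == MAP_FAILED || bell < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                                 /* producer */
        memset(frame, 0xAB, FRAME_BYTES);           /* payload written once, in place */
        uint64_t one = 1;
        /* Eight bytes are all that cross the kernel boundary. */
        ssize_t w = write(bell, &one, sizeof one);
        _exit(w == (ssize_t)sizeof one ? 0 : 1);
    }

    uint64_t count = 0;                             /* consumer */
    if (read(bell, &count, sizeof count) != (ssize_t)sizeof count) return -1;
    int seen = frame[FRAME_BYTES - 1];              /* read in place, no copy */

    waitpid(pid, NULL, 0);
    close(bell);
    munmap(frame, FRAME_BYTES);
    return seen;
}
```

A real ring would add head and tail indices in the shared region so the producer can run several frames ahead; the doorbell stays the same.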

Signals — a single bit, asynchronously.

A signal is the lowest-bandwidth IPC the kernel offers: one integer, delivered asynchronously to a target process by kill(pid, signum). It interrupts whatever the target was doing and runs a handler, or kills it, or stops it, depending on disposition. The vocabulary is tiny — about 30 numbered signals, plus the SIGRTMIN..SIGRTMAX realtime range that queues rather than coalescing.

Signals are the right tool for control-plane messages: SIGTERM to ask a daemon to shut down cleanly, SIGHUP to reload config (the convention predates restarts being cheap), SIGUSR1/SIGUSR2 for whatever the application wants. nginx, for example, uses SIGUSR1 to reopen log files and SIGUSR2 to start a new binary for a zero-downtime upgrade. The payload is the signal number plus, if the sender uses sigqueue, a single union sigval word (an int or a pointer) that the receiver finds in its siginfo_t.

Signal handlers are heavily restricted. The only functions you may safely call inside one are async-signal-safe — a short list pinned by POSIX that excludes most of the C library, including malloc, printf, and anything locking. The idiomatic modern pattern is to do nothing in the handler except set a volatile sig_atomic_t flag (or write a byte to a self-pipe / eventfd) and let the main loop notice. Anything else risks deadlock or memory corruption.
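The flag-only handler pattern is short enough to show in full. In this sketch the process signals itself to stand in for an external `kill`, and a bounded polling loop stands in for a real main loop; the names got_term, on_term, and flag_handler_demo are illustrative.

```c
#define _GNU_SOURCE            /* for sigaction and usleep under strict -std= modes */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_term = 0;

/* The handler does exactly one async-signal-safe thing: set a flag. */
static void on_term(int signo) {
    (void)signo;
    got_term = 1;
}

/* Hypothetical demo: install the handler, send SIGTERM to ourselves, and
   let the "main loop" notice the flag. Returns 1 if the flag was seen. */
int flag_handler_demo(void) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTERM, &sa, &old) < 0) return -1;

    got_term = 0;
    kill(getpid(), SIGTERM);        /* stand-in for an external `kill <pid>` */

    /* Stand-in for the main loop: check the flag between units of work.
       Real code would block in poll() or epoll_wait() instead of sleeping. */
    for (int i = 0; i < 100 && !got_term; i++)
        usleep(1000);

    sigaction(SIGTERM, &old, NULL); /* restore the previous disposition */
    return got_term ? 1 : 0;
}
```

All the actual shutdown work (flushing buffers, closing connections, logging) belongs in the code that observes the flag, where the full C library is safe to call.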

eventfd, signalfd, timerfd — everything is a file descriptor.

Linux's epoll loop only knows how to wait on file descriptors, so the kernel grew a family of fd-backed primitives to fit. They turn out-of-band events into something you can drop into the same epoll_wait as your sockets.

eventfd (Linux 2.6.22) wraps a 64-bit counter behind a single fd. Any thread or process holding the fd can write an 8-byte integer to it (a uint64_t holding 1, say) to bump the counter and wake the waiter; read drains it. It replaces the older self-pipe trick (the standard hack of writing a byte to your own pipe() from a signal handler to wake an event loop) with a primitive that costs one fd instead of two and doesn't have to actually move bytes through a buffer. signalfd turns a signal mask into a readable fd: instead of installing a handler you block the signals and read signalfd_siginfo structs from the fd, safely, in your main loop. timerfd does the same for setitimer/POSIX timers: an fd that becomes readable when the timer fires, with the count of expirations available via read.

This "everything is an fd" pattern is what lets a modern Linux server multiplex sockets, IPC, signals, and timers through a single epoll_wait call. It is also why libraries like libuv and tokio's mio look so symmetric on Linux: their dispatcher is just one epoll loop, because every event source has been coerced into an fd.
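As a sketch of what that single dispatcher looks like, the function below registers an eventfd and a timerfd with one epoll instance and waits until both have fired. The function name epoll_two_sources and the 10 ms timer are choices made for this example only.

```c
#define _GNU_SOURCE            /* Linux-specific interfaces throughout */
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical demo: one epoll instance waits on an eventfd and a timerfd.
   Returns a bitmask: bit 0 set if the eventfd fired, bit 1 if the timer did. */
int epoll_two_sources(void) {
    int ep  = epoll_create1(0);
    int efd = eventfd(0, EFD_NONBLOCK);
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    if (ep < 0 || efd < 0 || tfd < 0) return -1;

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = efd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev) < 0) return -1;
    ev.data.fd = tfd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, tfd, &ev) < 0) return -1;

    /* Arm a one-shot 10 ms timer and bump the eventfd counter. */
    struct itimerspec its = { .it_value = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 } };
    if (timerfd_settime(tfd, 0, &its, NULL) < 0) return -1;
    uint64_t one = 1;
    if (write(efd, &one, sizeof one) != (ssize_t)sizeof one) return -1;

    int seen = 0;
    while (seen != 3) {
        struct epoll_event out[4];
        int n = epoll_wait(ep, out, 4, 1000);   /* 1 s safety timeout */
        if (n <= 0) break;
        for (int i = 0; i < n; i++) {
            uint64_t v;   /* both fd types deliver an 8-byte counter */
            if (read(out[i].data.fd, &v, sizeof v) != (ssize_t)sizeof v) continue;
            seen |= (out[i].data.fd == efd) ? 1 : 2;
        }
    }
    close(tfd);
    close(efd);
    close(ep);
    return seen;
}
```

Adding a socket or a signalfd to the same loop is one more epoll_ctl call; the dispatch code does not change shape.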

Memory-mapped files — shared state via the filesystem.

mmap(... MAP_SHARED, fd, 0) on a regular file (not shm_open) gives you the file's contents addressable as memory, with stores eventually flushed back to disk by the page cache. Two processes that map the same file get the same physical pages — instant shared memory, persistent across reboots, browsable from the shell.

This is how LMDB works end-to-end: the database is a single mmap'd file, reads are pointer chases through B-tree pages, and writes are serialised through a single writer at a time, each commit ending with a flush to disk (msync or fsync). Readers never take a lock against the writer; they coordinate through copy-on-write versioning of pages. RocksDB, SQLite in mmap mode, and BoltDB (etcd's storage engine) all lean on the same model for some subset of their hot paths. Plan 9 took the file idea further still: almost every system facility was a file, and opening the right path gave you a view of the kernel's state.

The trade-off is that the page cache is now in the loop. A miss costs a page fault and possibly a disk read; an msync is the kernel's job and its latency is whatever the storage stack gives you. For truly cross-process state that doesn't need persistence, shm_open + mmap is cleaner; for state that naturally lives on disk anyway, mapping the file beats reading it.
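The relationship between a file mapping and ordinary file I/O can be checked directly. This sketch stores through a MAP_SHARED mapping of a temporary file, flushes with msync, and then reads the same bytes back with pread. The function name mmap_file_demo and the /tmp path template are assumptions made for the example.

```c
#define _GNU_SOURCE            /* for mkstemp and pread under strict -std= modes */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: write through a file mapping, then read the bytes
   back with plain file I/O. Returns 0 if they match, -1 otherwise. */
int mmap_file_demo(void) {
    char path[] = "/tmp/mmap-demo-XXXXXX";   /* illustrative temp-file template */
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);                            /* the file lives until fd is closed */
    if (ftruncate(fd, 4096) < 0) { close(fd); return -1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    memcpy(p, "hello", 5);                   /* a store, not a write() */
    int rc = msync(p, 4096, MS_SYNC);        /* push the dirty page to storage */

    char buf[5] = {0};
    if (rc == 0 && pread(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf) rc = -1;
    if (rc == 0 && memcmp(buf, "hello", 5) != 0) rc = -1;

    munmap(p, 4096);
    close(fd);
    return rc;
}
```

On Linux the pread would see the new bytes even without the msync, because the mapping and the read path share the same page cache; msync is about durability on disk, not visibility between processes.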

Which to reach for — a comparison.

The rough decision rule. Local request/response with structured data: Unix domain socket, almost always. Bulk one-way streaming inside a process tree: pipe. Sharing a hot data structure between cooperating processes at near-thread speed: shared memory plus a process-shared mutex or lock-free ring. Asking a daemon to reload or quit: signal. Wanting to wait on "something happened" inside an existing epoll loop: eventfd. Decoupled producer/consumer that has to survive the consumer being briefly absent: an application-level queue (Redis, NATS) rather than POSIX mqueues. Reach for shm only when a profile says the kernel boundary is the bottleneck — the synchronisation tax you take on is real.

Mechanism	Latency	Throughput	Decoupling	Complexity
Pipe / FIFO	~1–5 µs	~5 GB/s (64 KB buffer)	Both ends must be alive	Trivial
Unix domain socket	~3–10 µs round trip	10+ GB/s, full-duplex	Listener can outlive client	Low (same API as TCP)
Shared memory (mmap)	~10–100 ns (cache line)	Memory bandwidth, ~30 GB/s	None; both must coordinate live	High (you write the locks)
POSIX message queue	~5–20 µs	~1 GB/s, bounded depth	Persists across reader absence	Medium (extra mount, ulimits)
Signals	~1 µs delivery	~0 (control plane only)	Fire-and-forget	Low syntax, high handler care

Where you actually see this in production.

Docker's client talks to its daemon over /var/run/docker.sock, a Unix domain socket, and ordinary file permissions on that socket (owned by root, group docker) decide who may connect, which is why membership of the docker group is enough. Kubernetes' container runtimes (containerd, CRI-O) expose their CRI APIs the same way; containerd's socket is /run/containerd/containerd.sock.

PostgreSQL defaults to a Unix domain socket for local connections (/var/run/postgresql/.s.PGSQL.5432 on Debian-style packaging); tools like psql pick it over loopback TCP whenever the host argument is omitted. Same story for Redis (unixsocket /tmp/redis.sock in the conf) and MySQL. The latency saving over loopback is commonly quoted at around 30% for short queries.

Chrome's multi-process architecture uses Mojo, an IPC system layered on top of Unix domain sockets for the control messages and shared memory for the bulk payloads (frame buffers, mostly). Each tab is a separate process; they exchange small control messages over the socket and hand large bitmaps via shm so the GPU process can read them without a copy.

nginx's master process and its worker pool share configuration and cache state through shared-memory zones declared in the conf: limit_req_zone, proxy_cache_path, and the OCSP stapling cache all live in shm with rwlocks. systemd's socket activation creates the listening socket itself and hands it to the spawned service as an inherited fd (announced through the LISTEN_FDS environment variable), so a service can be started lazily on the first connection and the listening socket survives restarts. PulseAudio and JACK move audio frames through shared-memory ring buffers because anything slower drops samples.

How this scales up to microservices and containers.

The same trade-off you just walked through at the level of two processes is the trade-off the whole industry made at the level of services. A microservice is, at bottom, a process you chose to isolate so hard that it runs on a different host and talks only over the network. You took the IPC boundary and stretched it across a wire. The benefits are exactly the ones from the first section, scaled up: a crash, a bad deploy, or a memory leak in one service can't corrupt another, teams can ship on their own schedule, and you can write each service in whatever language fits. The bill is also the same one, scaled up: every call that used to be a function call is now a serialise, a copy into a kernel buffer, a trip through the socket layer and the network, and a deserialise on the far side. Network IPC is the copy model with the copies made expensive.

That is why the patterns rhyme. Where a single machine reaches for shared memory to dodge the copy, a fleet reaches for batching, streaming protocols, and binary formats like protobuf to amortise it. Where a single machine uses a signal or eventfd to say "something happened" without sending the data, a fleet uses a lightweight event on a broker. And where two local processes use a message queue so the producer can run before the consumer exists, a distributed system uses a durable message queue like Kafka or SQS for the same decoupling, now with persistence and replay the kernel queues never had. The kernel mechanisms are the small, fast, in-the-box version; the distributed ones are the same ideas with durability and reach bolted on, paid for in latency.

Containers sit exactly on the seam. A container is not a virtual machine; it is ordinary processes on the host kernel, fenced off by namespaces and cgroups. Because they share the kernel, containers on one host can still use every mechanism on this page. The container runtime itself is built on them: Kubernetes talks to containerd over a Unix domain socket, containerd talks to the kernel, and a pod's containers share a network namespace so the loopback and Unix sockets between them behave as if they were on one host. When two containers in a pod need to move bulk data fast, a shared tmpfs volume backed by /dev/shm gives them real shared memory across the container boundary. The isolation is a policy the kernel enforces; the IPC underneath it is the same set of primitives, which is precisely why containers are cheap and a VM is not.

A short checklist before you choose.

When the question comes up in design review or an interview, a few quick passes settle it. Ask first whether the two parties even share a kernel. If they don't, you're doing network IPC and the local mechanisms are off the table; reach for a socket and a wire protocol. If they do share a kernel, ask how much data moves and how often. Small messages, occasionally: a Unix domain socket, because it is general, bidirectional, carries credentials, and is fast enough that the cost of the copy is negligible. Large payloads on a hot path: shared memory with a ring and a doorbell, and accept that you now own the locking.

Then ask about lifetime. If the producer must be able to run before the consumer exists, or survive it briefly, you want something that buffers independently of both, which rules out a plain pipe and points at a queue. Ask about direction: one-way bulk inside a process tree is the textbook pipe. Ask about the control plane separately from the data plane; "please shut down" or "reload your config" is a signal, not a message, and trying to push real data through signals is a classic mistake. Finally, ask whether you already have an epoll loop, in which case coercing every event source into a file descriptor with eventfd, signalfd, and timerfd keeps the whole thing in one dispatcher. Most real systems use three or four of these at once, each for the job it fits, rather than forcing everything through a single one.

Failure modes worth knowing.

Each mechanism has a way it bites you, and the bugs are distinctive enough to recognise on sight. Pipes give you SIGPIPE: write to a pipe whose read end has closed and the kernel kills your process by default, which is why long-running servers almost always ignore SIGPIPE and handle the EPIPE error from write instead. Pipes and sockets also surprise people with partial writes on a stream: a single write of 64 KB can return having written less, and code that assumes the whole buffer went out will silently truncate. Stream sockets and pipes carry no message boundaries at all, so two writes can arrive coalesced into one read; you must frame messages yourself with a length prefix or a delimiter.
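Two of those pipe and socket failure modes have a standard defensive idiom, sketched below: a write loop that tolerates partial writes, and SIGPIPE set to ignored so a vanished reader shows up as an EPIPE error. The names write_all and epipe_demo are illustrative; note that the demo changes the process-wide SIGPIPE disposition.

```c
#define _GNU_SOURCE            /* keeps POSIX declarations visible under strict -std= modes */
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer is out or a real error occurs.
   Returns len on success, -1 on error with errno set by write(). */
ssize_t write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted before any byte moved */
            return -1;                      /* EPIPE, EAGAIN, ... stay in errno */
        }
        p += n;                             /* partial write: advance and retry */
        left -= (size_t)n;
    }
    return (ssize_t)len;
}

/* Hypothetical demo: with SIGPIPE ignored, writing to a pipe that has no
   reader fails with EPIPE instead of killing the process.
   Returns the errno observed, or 0 if the write unexpectedly succeeded. */
int epipe_demo(void) {
    signal(SIGPIPE, SIG_IGN);               /* process-wide; servers do this at startup */
    int fds[2];
    if (pipe(fds) < 0) return -1;
    close(fds[0]);                          /* no reader left */
    errno = 0;
    ssize_t n = write_all(fds[1], "x", 1);
    int err = errno;
    close(fds[1]);
    return n < 0 ? err : 0;
}
```

Framing sits one layer above this: once write_all guarantees the bytes go out, a length prefix or delimiter tells the reader where each message ends.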

Shared memory's failure mode is the worst to debug because it doesn't announce itself. A missing memory fence or a mishandled lock gives you a data race that corrupts state under load and disappears when you attach a debugger. Worse, a process-shared mutex held by a process that then crashes leaves the lock held forever unless you mark it with PTHREAD_MUTEX_ROBUST and handle the EOWNERDEAD recovery. Message queues fail by filling up: once a bounded queue is full, the producer blocks or gets EAGAIN, and a system that doesn't handle backpressure will either stall or drop work. Signals fail through races inside the handler and through coalescing — send two standard SIGUSR1s in quick succession and the target may only see one, which is why the realtime signal range exists for cases that must queue. Knowing the failure mode of each tool is most of knowing when not to use it.
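Signal coalescing is easy to observe. The sketch below blocks a signal, sends it to the current process several times, and then counts how many instances the kernel actually kept pending. The helper name count_delivered is invented for this example, and it assumes a single-threaded caller.

```c
#define _GNU_SOURCE            /* for sigtimedwait under strict -std= modes */
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: block signo, send it to ourselves `sends` times,
   then count how many instances are pending. Returns the count, or -1. */
int count_delivered(int signo, int sends) {
    sigset_t set, old;
    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &old) < 0) return -1;

    int delivered = 0;
    for (int i = 0; i < sends; i++) {
        if (kill(getpid(), signo) < 0) { delivered = -1; break; }
    }

    if (delivered == 0) {
        struct timespec zero = {0, 0};      /* poll for pending signals, never wait */
        while (sigtimedwait(&set, NULL, &zero) == signo)
            delivered++;                    /* each call consumes one pending instance */
    }

    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the caller's mask */
    return delivered;
}
```

Calling count_delivered(SIGUSR1, 3) reports one pending instance because standard signals are tracked as a single bit, while count_delivered(SIGRTMIN, 3) reports three because realtime signals are queued.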