Inter-process communication
Once you split work across processes — for isolation, for fault tolerance, for a language
boundary — they have to talk. The kernel offers half a dozen mechanisms ranging from a
one-line pipe() call to shared pages and lock-free ring buffers. Picking the
right one is mostly a question of how much you care about latency, decoupling, and the cost
of writing the synchronisation yourself.
Why IPC exists in the first place.
A single process with a few threads is the cheapest way to share data — every thread sees the same address space, and a pointer is a pointer. The moment you split that work across separate processes, those pointers stop meaning anything to anyone else. The OS gives each process its own page tables, its own heap, its own file-descriptor table. Sharing now requires the kernel's help.
People split work across processes for good reasons. Isolation: a crash in one doesn't take the other down (this is most of why Chrome runs a process per tab). Privilege separation: one process holds the credentials, another handles untrusted input. Language boundaries: Python here, Rust there. Independent deployment: restart the request handler without restarting the database. Once you've made that choice, IPC is the bill you pay.
The kernel offers a small menu. Shared memory is fastest — once mapped, there are no syscalls, just memory accesses. Pipes are simplest — a single call sets one up. Unix domain sockets are the most general — full bidirectional byte streams with credentials passing. Message queues are the most decoupled — sender and receiver don't have to be alive at the same instant. Signals are the cheapest control plane. Everything below is which to reach for and when.
It helps to hold one picture in your head before the details. Each process is a private
address space the kernel set up, and the page tables that translate its virtual addresses
to physical frames belong to that process alone. A pointer like 0x7ffd4c00 in
process A maps to one frame; the same number in process B maps to a different frame, or to
nothing. So "sending data" between two processes is really one of two physical acts. Either
the kernel copies bytes out of A's frames into its own buffers and then out again into B's
frames, which is what pipes, sockets, and queues do, or the kernel arranges for one
physical frame to appear in both page tables at once, which is what shared memory and
memory-mapped files do. The first family is easy and safe and costs you copies. The second
family is fast because nothing is copied, and costs you the synchronisation you now have to
write yourself. Almost every trade-off on this page is a version of that one choice.
Pipes and FIFOs — a one-way byte hose.
pipe(int fds[2]) returns a pair of descriptors: fds[0] for
reading, fds[1] for writing. Whatever bytes you write into the write end come
out in order at the read end. The kernel sits in the middle with a circular buffer —
64 KB by default on Linux (fcntl(F_SETPIPE_SZ) can grow it
up to /proc/sys/fs/pipe-max-size). The writer blocks when the buffer is full,
the reader blocks when it's empty. Close the write end and the reader sees EOF.
Anonymous pipes only work between processes that share the descriptor.
In practice that means parent and child: you call pipe(), then
fork(), and both processes inherit both ends. Each closes the end it doesn't
need, and you've got one-way communication. The shell does exactly this for
ls | grep foo — ls's stdout is dup2'd onto the pipe's write
end, grep's stdin onto the read end.
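A minimal sketch of the pipe-then-fork pattern just described, assuming a POSIX system. The function name `pipe_from_child` and the message text are hypothetical; the point is which end each side closes and how EOF arrives.

```c
#include <assert.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the child writes a short message into the pipe and
   the parent reads until EOF. Returns the number of bytes received, or -1. */
ssize_t pipe_from_child(char *out, size_t cap) {
    int fds[2];
    if (pipe(fds) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                        /* child: the writer */
        close(fds[0]);                     /* drop the end it doesn't need */
        const char msg[] = "hello from child";
        if (write(fds[1], msg, strlen(msg)) < 0) _exit(1);
        close(fds[1]);                     /* last write end closed: EOF */
        _exit(0);
    }

    close(fds[1]);   /* parent must close its write end or EOF never arrives */
    size_t total = 0;
    ssize_t n;
    while (total < cap && (n = read(fds[0], out + total, cap - total)) > 0)
        total += (size_t)n;
    assert(total <= cap);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

Forgetting the `close(fds[1])` in the parent is the classic bug here: the parent itself still holds a write end, so its `read` never sees EOF.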
Named pipes, also called FIFOs
(mkfifo(path, mode)), give the pipe a path in the filesystem. Any process
with the right permissions can open it; the kernel buffers between them. Useful when the
two processes have no parent-child relationship — logging daemons, ad-hoc shell glue,
plugin sockets. The byte stream has no message boundaries, so writers and readers have to
agree on framing themselves.
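A FIFO can be sketched the same way. The path under `/tmp` and the name `fifo_roundtrip` are assumptions for illustration; the writer is a forked child only so the sketch fits in one file, and any process allowed to open the path would behave the same.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if four bytes travelled through the FIFO. */
int fifo_roundtrip(void) {
    char path[64];                         /* assumed scratch location */
    snprintf(path, sizeof path, "/tmp/demo-fifo-%ld", (long)getpid());
    if (mkfifo(path, 0600) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { unlink(path); return -1; }
    if (pid == 0) {                        /* writer */
        int wfd = open(path, O_WRONLY);    /* blocks until a reader opens */
        if (wfd < 0 || write(wfd, "ping", 4) != 4) _exit(1);
        close(wfd);
        _exit(0);
    }

    int rfd = open(path, O_RDONLY);        /* blocks until a writer opens */
    char buf[8] = {0};
    ssize_t n = -1;
    if (rfd >= 0) {
        n = read(rfd, buf, sizeof buf);
        assert(n <= (ssize_t)sizeof buf);
        close(rfd);
    } else {
        kill(pid, SIGKILL);                /* free the writer stuck in open() */
    }
    waitpid(pid, NULL, 0);
    unlink(path);          /* the name stays in the filesystem until removed */
    return (n == 4 && memcmp(buf, "ping", 4) == 0) ? 0 : -1;
}
```

Note that both `open` calls block until the other side shows up, which is the rendezvous behaviour that makes FIFOs handy for shell glue.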
Unix domain sockets — TCP without the network.
A Unix domain socket (AF_UNIX) speaks the standard sockets API —
socket(), bind(), listen(), accept(),
connect(), send(), recv() — but the endpoints are
filesystem paths instead of IP addresses, and the traffic never leaves the kernel. No
checksums, no fragmentation, no routing, no congestion control. Round-trip latency is
single-digit microseconds; throughput on a single connection regularly clears
10 GB/s on modern hardware.
Three socket types are available, and the first two mirror TCP and UDP. SOCK_STREAM gives
a reliable, ordered byte stream with backpressure, and is the workhorse. SOCK_DGRAM
preserves message boundaries and, unlike UDP, is reliable too (the kernel just doesn't drop
datagrams it accepted), useful when you want one syscall to send and one to receive an
entire message. SOCK_SEQPACKET combines both: a connection-oriented socket with
stream-like reliability and message boundaries.
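The difference between the socket types is easiest to see with `socketpair`. The sketch below (function name `first_recv_size` is hypothetical) sends two 2-byte messages back to back and then does one large `recv`. With SOCK_DGRAM or SOCK_SEQPACKET that single call delivers exactly one message; with SOCK_STREAM it may deliver both at once, because a stream has no boundaries to respect.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: returns how many bytes the first recv() delivered
   after two 2-byte sends on a socketpair of the given type, or -1. */
ssize_t first_recv_size(int type) {
    int sv[2];
    if (socketpair(AF_UNIX, type, 0, sv) < 0) return -1;

    if (send(sv[0], "ab", 2, 0) != 2 || send(sv[0], "cd", 2, 0) != 2) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    char buf[16];                          /* room for both messages */
    ssize_t n = recv(sv[1], buf, sizeof buf, 0);
    assert(n <= (ssize_t)sizeof buf);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

Calling it with `SOCK_SEQPACKET` or `SOCK_DGRAM` should return 2; with `SOCK_STREAM` the result is at least 2 and is not something to depend on.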
The killer feature is ancillary data via sendmsg/
recvmsg. You can pass file descriptors between processes
(SCM_RIGHTS) — the kernel sets up a fresh fd in the receiver pointing at the
same open file. You can pass peer credentials (SCM_CREDENTIALS on Linux,
getpeereid() on BSD), so a server can authenticate the caller without a
handshake. This is what makes Unix sockets the default local IPC: Docker's
daemon listens on /var/run/docker.sock, systemd's socket
activation passes fds through them, X11 can use one, and
PostgreSQL / Redis / MySQL all prefer
the Unix-socket path for local clients over loopback TCP — roughly 30% lower latency, and
it avoids the loopback's per-packet overhead.
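Descriptor passing is fiddly enough that a sketch helps. The helper names `send_fd`, `recv_fd`, and `fd_passing_demo` are hypothetical; the `sendmsg`/`recvmsg` calls and the `CMSG_*` macros are the standard ones. In this sketch the child creates a pipe after the fork, so the parent has no copy of it, and then hands the read end across a Unix socket.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send one descriptor over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd) {
    char byte = 'F';                /* at least one data byte must go along */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof u);
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof u.buf };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    assert(c != NULL);              /* the control buffer is large enough */
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the receiver's new fd number or -1. */
static int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof u.buf };
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Hypothetical demo: returns 0 if the parent could read from a pipe
   that only the child ever opened. */
int fd_passing_demo(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                          /* child */
        close(sv[0]);
        int p[2];
        if (pipe(p) < 0) _exit(1);           /* created after the fork */
        if (write(p[1], "x", 1) != 1) _exit(1);
        _exit(send_fd(sv[1], p[0]) == 0 ? 0 : 1);
    }
    close(sv[1]);
    int got = recv_fd(sv[0]);                /* same open pipe, new number */
    char ch = 0;
    ssize_t n = got >= 0 ? read(got, &ch, 1) : -1;
    if (got >= 0) close(got);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return (n == 1 && ch == 'x') ? 0 : -1;
}
```

The fd number the receiver gets is generally different from the sender's; what is shared is the underlying open file description.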
Message queues — prioritised, persistent, mostly historical.
Two flavours ship with Linux. System V message queues
(msgget, msgsnd, msgrcv) date to early Unix and
identify queues by an integer key. Messages have a type field; readers can ask for "any
message" or "type ≤ N", which gives a crude form of prioritisation. POSIX message
queues (mq_open, mq_send, mq_receive) are
the modern incarnation — names look like /myqueue, queues live under
/dev/mqueue, and each message carries an explicit 0–32767 priority that the
kernel uses to order delivery.
Both are bounded — /proc/sys/fs/mqueue/msg_max caps the depth, default
10 messages per queue on stock Linux, raisable to 65536. Once full,
mq_send either blocks or fails with EAGAIN. Messages persist
until read or until the queue is unlinked, so a producer can write before the consumer
exists.
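A small sketch of priority ordering with the POSIX API, assuming Linux with POSIX message queues enabled (on older glibc you may need to link with `-lrt`). The queue name and the function name `mq_priority_demo` are hypothetical. Two messages go in, low priority first, and the receive call returns the high-priority one.

```c
#include <assert.h>
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo: returns the priority of the first message received,
   or -1 if the queue could not be created in this environment. */
int mq_priority_demo(void) {
    char name[64];                         /* assumed queue name */
    snprintf(name, sizeof name, "/demo-mq-%ld", (long)getpid());
    struct mq_attr attr = { .mq_maxmsg = 4, .mq_msgsize = 32 };
    mqd_t q = mq_open(name, O_CREAT | O_EXCL | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) return -1;

    /* Enqueue the low-priority message first, the high-priority one second. */
    int sent = mq_send(q, "low", 3, 1) == 0 && mq_send(q, "high", 4, 9) == 0;

    char buf[32];                          /* must hold mq_msgsize bytes */
    unsigned prio = 0;
    ssize_t n = sent ? mq_receive(q, buf, sizeof buf, &prio) : -1;
    mq_close(q);
    mq_unlink(name);                       /* otherwise the queue persists */

    /* Highest priority comes out first, regardless of send order. */
    assert(n == 4 && memcmp(buf, "high", 4) == 0);
    return (int)prio;
}
```

Because the queue is a kernel object with a name, the sender and receiver here could just as well be two unrelated processes started at different times.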
In practice almost nobody reaches for these in new code. Application-level queues — Redis, RabbitMQ, Kafka, NATS, SQS — give you network reach, real durability, replay, and tooling that POSIX queues never grew. Kernel message queues survive in tightly embedded systems and a handful of real-time codebases that need the priority-based delivery guarantee without a broker in the loop.
Shared memory — the same physical page in two processes.
The fastest IPC mechanism, by a wide margin, because once it's set up there is no kernel
involvement at all. Two processes call shm_open("/name", ...) to get an fd
backed by a tmpfs file under /dev/shm, size it with ftruncate,
and then mmap(... MAP_SHARED, fd, 0). Both mappings point at the same
physical frames; a store in one process is visible to the other as soon as the cache line
propagates. Latency is nanoseconds, not microseconds.
The older System V interface (shmget, shmat, shmdt)
still works and a lot of legacy code uses it; the POSIX/shm_open path is the
modern recommendation because it composes with the rest of the fd-based API.
memfd_create on Linux gives you an anonymous shared-memory fd you can pass
over a Unix socket without ever touching the filesystem.
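Why the gap is so large: with a pipe or a socket, write() traps into the kernel, which
copies the data into its own buffer; read() traps back in and copies it out again.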
That's two copies and at least one round-trip across the kernel boundary, with a
~100 ns syscall floor on modern Linux and the cache-line ping-pong on
top. Shared memory replaces all of that with a single store to a cache line both cores
already have access to. The catch is that you now own the synchronisation — no kernel
means no automatic ordering. Place a pthread_mutex_t with
PTHREAD_PROCESS_SHARED inside the mapping, or a futex, or a lock-free ring
with atomic load/store and explicit memory fences. The fast path is fast precisely because
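the kernel isn't watching.
```c
/* Linux: two processes share a 4 KiB page (older glibc: link with -lrt) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void) {
    int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;
    int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    /* After fork(), both halves see the same physical page at *p.
       No syscall on the steady-state read/write path, just
       load and store. Lock with a PTHREAD_PROCESS_SHARED mutex
       or an atomic placed inside the mapping. */
    if (fork() == 0) {                       /* child publishes a value */
        __atomic_store_n(p, 42, __ATOMIC_RELEASE);
        _exit(0);
    }
    wait(NULL);                              /* parent reads it in place */
    printf("%d\n", __atomic_load_n(p, __ATOMIC_ACQUIRE));
    shm_unlink("/demo");                     /* remove the name when done */
    return 0;
}
```
Copy cost versus zero-copy.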
It is worth slowing down on the number that decides most of these choices: the cost of a
copy. When you write a megabyte into a pipe, the kernel reads it out of your buffer and
writes it into its own. When the reader calls read, the kernel reads it out
of that buffer and writes it into the reader's. The data crossed the user/kernel boundary
twice and was physically copied twice. At a few gigabytes per second of memory bandwidth
per core, a megabyte is hundreds of microseconds of pure copying, plus the two syscall
traps at roughly a hundred nanoseconds each, plus whatever cache pollution the copies cause.
For small control messages none of this matters. For a video frame, a model weight tensor,
or a database page moving sixty times a second, it is the whole budget.
Shared memory makes that cost vanish because there is nothing to copy. The producer writes the frame once, into a page the consumer already maps, and the consumer reads it in place. This is what people mean by zero-copy: the bytes are produced and consumed in the same physical frame, and the only thing that crosses between the processes is a small notification saying "frame ready." That notification can itself be an eventfd write or a byte on a pipe; the heavy payload never moves. The cost you pay instead is that two processes are now reading and writing the same memory with no kernel arbitrating, so you have to build the ordering yourself, which is the subject of the synchronisation page.
This split — bulk over shared memory, notification over something cheap — is the shape of
nearly every high-performance local pipeline. The kernel's own splice and
vmsplice calls chase the same goal from the other direction, moving pages
between a pipe and a file without copying them through user space. io_uring
pushes it further, letting a process submit reads and writes through a ring it shares with
the kernel, so even the syscall trap mostly disappears. The throughline is that fast IPC is
the steady removal of copies and traps until the only thing left moving is the data itself.
Signals — a single bit, asynchronously.
A signal is the lowest-bandwidth IPC the kernel offers: one integer, delivered
asynchronously to a target process by kill(pid, signum). It interrupts
whatever the target was doing and runs a handler, or kills it, or stops it, depending on
disposition. The vocabulary is tiny — about 30 numbered signals, plus the
SIGRTMIN..SIGRTMAX realtime range that queues rather than coalescing.
Signals are the right tool for control-plane messages: SIGTERM to ask a
daemon to shut down cleanly, SIGHUP to reload config (the convention
predates restarts being cheap), SIGUSR1/SIGUSR2 for
whatever the application wants — nginx uses SIGUSR1 to reopen log files and SIGUSR2 to
start a new binary for zero-downtime upgrade. The payload is the signal number plus, if
the sender uses sigqueue, a single union sigval (an int or a pointer) that arrives in the
receiver's siginfo_t.
Signal handlers are heavily restricted. The only functions you may safely call inside one
are async-signal-safe — a short list pinned by POSIX that excludes most
of the C library, including malloc, printf, and anything
locking. The idiomatic modern pattern is to do nothing in the handler except set a
volatile sig_atomic_t flag (or write a byte to a self-pipe / eventfd) and
let the main loop notice. Anything else risks deadlock or memory corruption.
eventfd, signalfd, timerfd — everything is a file descriptor.
Linux's epoll loop only knows how to wait on file descriptors, so the kernel grew a
family of fd-backed primitives to fit. They turn out-of-band events into something you can
drop into the same epoll_wait as your sockets.
eventfd (Linux 2.6.22) wraps a 64-bit counter behind a single fd. Any
thread or process holding the fd can write an 8-byte integer to it
(uint64_t one = 1; write(fd, &one, 8)) to bump the
counter and wake the waiter; read drains it. It replaces the older
self-pipe trick — the standard hack of pipe()-ing a byte to
yourself from a signal handler to wake an event loop — with a primitive that costs one fd
instead of two and doesn't have to actually move bytes through a buffer.
signalfd turns a signal mask into a readable fd: instead of installing a
handler you read signalfd_siginfo structs from the fd, safely, in your main
loop. timerfd does the same for setitimer/POSIX timers: an
fd that becomes readable when the timer fires, with the count of expirations available
via read.
This "everything is an fd" pattern is what lets a modern Linux server multiplex sockets,
IPC, signals, and timers through a single epoll_wait call. It is also why
libraries like libuv and tokio's mio look so symmetric on Linux: their dispatcher is just
one epoll loop, because every event source has been coerced into an fd.
Memory-mapped files — shared state via the filesystem.
mmap(... MAP_SHARED, fd, 0) on a regular file (not shm_open)
gives you the file's contents addressable as memory, with stores eventually flushed back
to disk by the page cache. Two processes that map the same file get the same physical
pages — instant shared memory, persistent across reboots, browsable from the shell.
This is how LMDB works end-to-end: the database is a single mmap'd file,
reads are pointer chases through B-tree pages, and writes are serialised through a single
writer at a time whose commit ends with a flush to disk. Readers do not block on the writer;
they coordinate through copy-on-write versioning of pages. RocksDB,
SQLite in mmap mode, and BoltDB
(etcd's storage engine) all lean on the same model for some subset of
their hot paths. Plan 9 took the file idea further still: almost every system facility was
a file, and opening the right path gave you a view of the kernel's state.
The trade-off is that the page cache is now in the loop. A miss costs a page fault and
possibly a disk read; writeback is the kernel's job, and the latency of an msync is
whatever the storage stack gives you. For truly cross-process state that doesn't need
persistence, shm_open + mmap is cleaner; for state that
naturally lives on disk anyway, mapping the file beats reading it.
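A minimal sketch of a file-backed shared mapping. The scratch path under `/tmp` and the name `mapped_file_demo` are assumptions. To stay in one process it maps the same file twice, which exercises the same page-cache sharing two separate processes would see.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if a store through one mapping is visible
   through a second mapping and through an ordinary pread(). */
int mapped_file_demo(void) {
    char path[] = "/tmp/demo-mmap-XXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    /* Give the file a size first: touching a page past EOF raises SIGBUS. */
    if (ftruncate(fd, 4096) < 0) { close(fd); unlink(path); return -1; }

    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }
    assert(a != b);                    /* two views at different addresses */

    memcpy(a, "state", 5);             /* a plain store, no write() call */
    int visible = memcmp(b, "state", 5) == 0;   /* same page cache page */

    /* msync pushes the dirty page to storage. It is about durability;
       other mappings of the file already see the store without it. */
    msync(a, 4096, MS_SYNC);

    char buf[5];
    ssize_t n = pread(fd, buf, sizeof buf, 0);
    munmap(a, 4096);
    munmap(b, 4096);
    close(fd);
    unlink(path);
    return (visible && n == 5 && memcmp(buf, "state", 5) == 0) ? 0 : -1;
}
```

The distinction worth noticing is that visibility between mappings is immediate, while durability is only as good as your last `msync` or the kernel's own writeback.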
Which to reach for — a comparison.
| Mechanism | Latency | Throughput | Decoupling | Complexity |
|---|---|---|---|---|
| Pipe / FIFO | ~1–5 µs | ~5 GB/s, 64 KB buffer | Both ends must be alive | Trivial |
| Unix domain socket | ~3–10 µs RTT | 10+ GB/s, full-duplex | Listener can outlive client | Low: same as TCP API |
| Shared memory (mmap) | ~10–100 ns (cache line) | memory bandwidth, ~30 GB/s | None: both must coordinate live | High: you write the locks |
| POSIX message queue | ~5–20 µs | ~1 GB/s, bounded depth | Persists across reader absence | Medium: extra mount, ulimits |
| Signals | ~1 µs delivery | ~0, control plane only | Fire-and-forget | Low syntax, high handler care |
Where you actually see this in production.
Docker's client talks to its daemon over
/var/run/docker.sock, a Unix domain socket, and the ownership and mode of
that socket file (root:docker, 0660 by default) are what make membership of the
docker group the gate. Kubernetes' container runtimes (containerd, CRI-O) expose
their CRI APIs the same way; containerd's default is /run/containerd/containerd.sock.
PostgreSQL defaults to a Unix domain socket for local connections, at
/var/run/postgresql/.s.PGSQL.5432 on Debian-style packages (upstream builds use /tmp); tools like
psql pick it over loopback TCP whenever the host argument is omitted. Same
story for Redis (unixsocket /tmp/redis.sock in the conf)
and MySQL. The latency saving over loopback is typically around 30% for
short queries.
Chrome's multi-process architecture uses Mojo, an IPC system layered on top of Unix domain sockets for the control messages and shared memory for the bulk payloads (frame buffers, mostly). Each tab is a separate process; they exchange small control messages over the socket and hand large bitmaps via shm so the GPU process can read them without a copy.
nginx's master process and its worker pool share configuration and
cache state through shared-memory zones declared in the conf — limit_req_zone,
proxy_cache_path, and the OCSP stapling cache all live in shm with
rwlocks. systemd's socket activation hands fds to spawned services via
Unix-socket ancillary data, so a service can be started lazily on the first connection
and keep the listening fd across restarts. PulseAudio and
JACK move audio frames through shared-memory ring buffers because
anything slower drops samples.
How this scales up to microservices and containers.
The same trade-off you just walked through at the level of two processes is the trade-off the whole industry made at the level of services. A microservice is, at bottom, a process you chose to isolate so hard that it runs on a different host and talks only over the network. You took the IPC boundary and stretched it across a wire. The benefits are exactly the ones from the first section, scaled up: a crash, a bad deploy, or a memory leak in one service can't corrupt another, teams can ship on their own schedule, and you can write each service in whatever language fits. The bill is also the same one, scaled up: every call that used to be a function call is now a serialise, a copy into a kernel buffer, a trip through the socket layer and the network, and a deserialise on the far side. Network IPC is the copy model with the copies made expensive.
That is why the patterns rhyme. Where a single machine reaches for shared memory to dodge the copy, a fleet reaches for batching, streaming protocols, and binary formats like protobuf to amortise it. Where a single machine uses a signal or eventfd to say "something happened without sending the data," a fleet uses a lightweight event on a broker. And where two local processes use a message queue so the producer can run before the consumer exists, a distributed system uses a durable message queue like Kafka or SQS for the same decoupling, now with persistence and replay the kernel queues never had. The kernel mechanisms are the small, fast, in-the-box version; the distributed ones are the same ideas with durability and reach bolted on, paid for in latency.
Containers sit exactly on the seam. A container is not a virtual machine; it is ordinary
processes on the host kernel, fenced off by namespaces and cgroups. Because they share the
kernel, containers on one host can still use every mechanism on this page. The container
runtime itself is built on them: Kubernetes talks
to containerd over a Unix domain socket, containerd talks to the kernel, and a pod's
containers share a network namespace so the loopback and Unix sockets between them behave as
if they were on one host. When two containers in a pod need to move bulk data fast, a shared
tmpfs volume, the same kind of filesystem that backs /dev/shm, gives them real shared memory across
the container boundary. The isolation is a policy the kernel enforces; the IPC underneath it
is the same set of primitives, which is precisely why containers are cheap and a VM is not.
A short checklist before you choose.
When the question comes up in design review or an interview, a few quick passes settle it. Ask first whether the two parties even share a kernel. If they don't, you're doing network IPC and the local mechanisms are off the table; reach for a socket and a wire protocol. If they do share a kernel, ask how much data moves and how often. Small messages, occasionally: a Unix domain socket, because it is general, bidirectional, carries credentials, and is fast enough that the copy is free. Large payloads on a hot path: shared memory with a ring and a doorbell, and accept that you now own the locking.
Then ask about lifetime. If the producer must be able to run before the consumer exists, or survive it briefly, you want something that buffers independently of both, which rules out a plain pipe and points at a queue. Ask about direction: one-way bulk inside a process tree is the textbook pipe. Ask about the control plane separately from the data plane; "please shut down" or "reload your config" is a signal, not a message, and trying to push real data through signals is a classic mistake. Finally, ask whether you already have an epoll loop, in which case coercing every event source into a file descriptor with eventfd, signalfd, and timerfd keeps the whole thing in one dispatcher. Most real systems use three or four of these at once, each for the job it fits, rather than forcing everything through a single one.
Failure modes worth knowing.
Each mechanism has a way it bites you, and the bugs are distinctive enough to recognise on
sight. Pipes give you SIGPIPE: write to a pipe whose read end has closed and
the kernel kills your process by default, which is why long-running servers almost always
ignore SIGPIPE and handle the EPIPE error from write instead. Pipes
and sockets also surprise people with partial writes on a stream: a single
write of 64 KB can return having written less, and code that assumes the whole
buffer went out will silently truncate. Stream sockets and pipes carry no message
boundaries at all, so two writes can arrive coalesced into one read; you must frame
messages yourself with a length prefix or a delimiter.
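Both stream pitfalls, partial writes and missing boundaries, are handled by the same two small loops. This is a sketch under stated assumptions: the 4-byte big-endian length prefix is one common framing choice rather than anything the kernel prescribes, and the names `write_all`, `read_all`, `send_frame`, and `recv_frame` are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Keep calling write() until every byte is out or a real error occurs. */
static int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted: just retry */
            return -1;
        }
        p += n;                            /* short write: send the rest */
        len -= (size_t)n;
    }
    return 0;
}

/* Keep calling read() until len bytes have arrived. */
static int read_all(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0) return -1;             /* EOF in the middle of a frame */
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Frame = 4-byte big-endian length, then the payload (assumed format). */
int send_frame(int fd, const void *msg, uint32_t len) {
    assert(msg != NULL || len == 0);
    uint32_t be = htonl(len);
    if (write_all(fd, &be, sizeof be) < 0) return -1;
    return write_all(fd, msg, len);
}

/* Returns the payload length, or -1 on error or if the frame exceeds cap. */
ssize_t recv_frame(int fd, void *out, size_t cap) {
    uint32_t be;
    if (read_all(fd, &be, sizeof be) < 0) return -1;
    uint32_t len = ntohl(be);
    if (len > cap) return -1;              /* never trust the peer's length */
    if (read_all(fd, out, len) < 0) return -1;
    return (ssize_t)len;
}
```

Two `send_frame` calls may well arrive as one blob in the kernel buffer, but `recv_frame` still hands them back one at a time because the boundaries now live in the data.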
Shared memory's failure mode is the worst to debug because it doesn't announce itself. A
missing memory fence or a mishandled lock gives you a data race that corrupts state under load
and disappears when you attach a debugger. Worse, a process-shared mutex held by a process
that then crashes leaves the lock held forever unless you mark it with
PTHREAD_MUTEX_ROBUST and handle the EOWNERDEAD recovery. Message
queues fail by filling up: once a bounded queue is full, the producer blocks or gets
EAGAIN, and a system that doesn't handle backpressure will either stall or drop
work. Signals fail through races inside the handler and through coalescing — send two
standard SIGUSR1s in quick succession and the target may only see one, which is why the
realtime signal range exists for cases that must queue. Knowing the failure mode of each tool
is most of knowing when not to use it.
Further reading.
- W. Richard Stevens & Stephen Rago — Advanced Programming in the UNIX Environment (3rd ed.) — APUE. Chapters 15 (pipes/FIFOs), 17 (Unix domain sockets, fd passing), and 10 (signals) are the canonical reference for everything in this page.
- W. Richard Stevens — UNIX Network Programming, Vol. 2: IPC — the dedicated IPC volume. Covers System V and POSIX message queues, semaphores, and shared memory with worked examples.
- Andrew S. Tanenbaum — Modern Operating Systems — Chapter 2 on processes and IPC for the conceptual model; the MINIX message-passing kernel is a clean illustration of microkernel-style IPC.
- Linux man-pages — pipe(7), unix(7), shm_overview(7), mq_overview(7), signal(7) — Michael Kerrisk's man pages are exhaustive and accurate; the section-7 overview pages are the right place to start for each mechanism.
- Michael Kerrisk — The Linux Programming Interface (TLPI) — the modern successor to Stevens for Linux specifically. Chapters 43–58 cover IPC in depth; the sections on signalfd and timerfd are among the clearest writeups of the fd-based primitives.
- LWN — Passing file descriptors with SCM_RIGHTS — a focused tour of fd passing over Unix sockets, including the subtle ownership and close-on-exec semantics.
- Chromium — Mojo IPC overview — how a large multi-process application layers a real IPC system on top of Unix sockets and shared memory.
- Semicolony — I/O models — how epoll and io_uring tie the fd-based IPC primitives back into a single event loop.
- Semicolony — Synchronisation — up next: the locks, futexes, and lock-free patterns shared-memory IPC actually rests on.