Processes
A process is three things bundled together: an address space, at least one thread of
execution, and a set of operating-system resources the kernel holds on its behalf. It is
the unit of isolation in Unix. Each one has its own virtual memory, its own file descriptor
table, its own credentials, and its own signal mask. The kernel keeps all of this in a
struct called task_struct. Here's what's in it, how a process moves through its
states, how it starts, and how it goes away.
A program is not a process.
A program is a passive thing: a file on disk, an ELF binary with some machine code, some
constants, and a header that says where to start. It does nothing on its own. A process is
that program brought to life. The kernel reads the binary, lays its sections out in memory,
gives it a stack, points the CPU at its entry point, and starts running it. The same program
can back many processes at once. Run cat three times and you have three
processes, each with its own memory and its own place in the program's execution, all
sharing the one read-only copy of the code on disk.
So a process is the running instance, and it is made of three parts. The first is an address space: a private map of virtual memory holding the code, the global data, the heap, and one or more stacks. The second is at least one thread of execution: the CPU registers, the program counter that says which instruction is next, and the stack pointer. The third is the set of operating-system resources the kernel holds for it: open file descriptors, the current working directory, signal handlers, credentials, timers, and accounting. Take away any one of these and the abstraction stops being a process. A thread alone has no private address space. An address space alone never runs. The kernel binds them together and tracks the bundle.
Two design goals drive the whole abstraction. Isolation: one process should not be able to read or corrupt another's memory by accident or on purpose, which is what the private address space buys. Multiplexing: many processes should share one machine and take turns on the CPUs without knowing about each other, which is what the separate thread of execution and the kernel's bookkeeping buy. Everything below is the machinery that delivers those two properties.
What lives in a task_struct.
The kernel's record of a process is its process control block. On Linux that record is the
task_struct, and it is the single source of truth for everything the kernel
knows about a running task. When the scheduler decides what to run, when a signal arrives,
when a page fault happens, when the process exits — all of it reads from or writes to this
struct. Understanding what is inside it is most of understanding what a process is.
Linux doesn't have separate process and thread structures. It has one
task_struct, defined in include/linux/sched.h. A "process" is
a task whose thread group leader is itself. A "thread" is a task that shares
mm, files, fs, and signal with its
thread group leader. Same struct, different sharing rules.
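The pid/tgid split is visible from user space. The following is a minimal sketch, assuming Linux with glibc; `is_thread_group_leader` is a hypothetical helper name, not a kernel or libc function. It relies on `getpid()` reporting the thread group ID and the raw `gettid` system call reporting the per-task ID.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the calling task leads its thread group.
 * getpid() reports the kernel's tgid; SYS_gettid reports the per-task pid. */
int is_thread_group_leader(void) {
    pid_t tgid = getpid();
    pid_t tid = (pid_t)syscall(SYS_gettid);
    return tid == tgid;
}
```

Called from the initial thread of a program, this returns 1. Called from any additional thread, the two IDs differ and it returns 0.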
The fields that matter most:
| Field | What it points to |
|---|---|
| mm | The memory descriptor: page tables, VMAs, brk, stack base |
| files | The file descriptor table: array of struct file * |
| fs | cwd, root, umask |
| signal / sighand | Signal handlers, blocked mask, pending signals |
| cred | uid, gid, capabilities, security labels |
| pid / tgid | Process ID and thread group ID |
| parent | The parent task that will receive SIGCHLD on exit |
| state | RUNNING, INTERRUPTIBLE, UNINTERRUPTIBLE, ZOMBIE, STOPPED |
The list keeps going. There is scheduling data — priority, the scheduling class, the runqueue this task sits on, and the accumulated CPU time used. There are the saved registers and kernel stack pointer that let the scheduler put a task to sleep and resume it later exactly where it stopped. There are pointers that thread the task into several lists at once: the list of its children, its siblings, the global task list, and the wait queue it is parked on if it is blocked. The struct is large because it has to answer every question the kernel might ask about a task, from any subsystem, at any time.
Two things are worth pulling out. Most of the fields are pointers to shared
sub-structures rather than the data itself, which is what makes the
process-versus-thread distinction so cheap to express: two threads in the same process point
their mm and files at the same objects, while two separate
processes each get their own. And the kernel keeps these structs allocated even after a
process has finished running, because a parent may still need to read the exit status. That
lingering, half-dead struct is the zombie, which we get to below.
Address space — the most important piece.
Every process gets its own virtual address space, an illusion of contiguous memory. On x86-64 with four-level paging, virtual addresses are 48 bits wide and user space gets the lower half, from 0 to 2⁴⁷. The kernel tracks the space as a set of virtual memory areas (VMAs), each a contiguous range with a backing (file or anonymous), permissions, and flags.
$ cat /proc/self/maps
55cd28e7c000-55cd28e84000 r--p 00000000 fd:01 1234567 /usr/bin/cat
55cd28e84000-55cd28e8e000 r-xp 00008000 fd:01 1234567 /usr/bin/cat
55cd28e8e000-55cd28e91000 r--p 00012000 fd:01 1234567 /usr/bin/cat
55cd28e91000-55cd28e93000 r--p 00014000 fd:01 1234567 /usr/bin/cat
55cd28e93000-55cd28e95000 rw-p 00016000 fd:01 1234567 /usr/bin/cat
7f8e2c4d2000-7f8e2c4f5000 r--p 00000000 fd:01 7654321 /usr/lib/libc.so.6
...
7ffd9b8a7000-7ffd9b8c8000 rw-p 00000000 00:00 0 [stack]
7ffd9bb1a000-7ffd9bb1d000 r--p 00000000 00:00 0 [vvar]
7ffd9bb1d000-7ffd9bb1f000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
Every line is one VMA. The first column is the address range; the second is permissions
(rwxp, where the last character is p for private or s for shared);
next is the offset in the backing file; then the device, inode, and path.
This is the record of where every byte of your process's memory came from.
When you mmap something, a new VMA appears here. When you fork,
the child gets a copy with the same VMAs but its own page tables, with copy-on-write
set on the writable ones.
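One way to watch a VMA appear is to map a page and then look for its address in /proc/self/maps. This is a rough sketch, assuming Linux with /proc mounted; `map_one_page` and `address_is_mapped` are hypothetical helper names introduced for illustration, and the parser reads only the address range at the start of each line.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps one anonymous, private, read-write page. */
void *map_one_page(void) {
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: returns 1 if addr falls inside a VMA listed in
 * /proc/self/maps, 0 if it does not, -1 if the file cannot be read. */
int address_is_mapped(const void *addr) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return -1;
    unsigned long target = (unsigned long)addr;
    unsigned long start, end;
    char line[512];
    int found = 0;
    while (fgets(line, sizeof line, maps)) {
        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
            target >= start && target < end) {
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

The kernel may merge the new page into an adjacent anonymous VMA, so the sketch checks that the address is covered rather than counting lines.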
It helps to picture the layout. At low addresses sit the read-only text (the code) and the
read-only constants, mapped straight from the binary. Above them is the initialised and
zero-filled global data. Then the heap, which grows upward as the process asks for memory
through brk or, more often now, mmap. Far up near the top of the
user range is the stack, which grows downward toward the heap. The shared libraries the
program links against — libc and friends — are mapped somewhere in the middle.
The wide gap between heap and stack is not wasted; it is virtual address space, costing
nothing until a page is actually touched and backed by physical memory.
Two properties matter for the rest of this page. The address space is private:
address 0x4000 in one process and the same address in another point at
different physical pages, or at nothing. And it is lazy: the kernel hands out
virtual ranges freely and only allocates a real page when the process faults on it. Both
properties are what make fork affordable, and both rest on the page tables and
TLB underneath, which the
virtual memory page
covers from the hardware side.
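The lazy part can be observed with `mincore`, which reports whether a page is resident in physical memory. The sketch below assumes Linux; `page_is_resident` and `touch_makes_resident` are hypothetical names. It maps one anonymous page, asks about residency, writes one byte, and asks again.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the page holding addr is resident, 0 if not,
 * -1 on error. */
int page_is_resident(void *addr) {
    long page = sysconf(_SC_PAGESIZE);
    unsigned long base = (unsigned long)addr & ~((unsigned long)page - 1);
    unsigned char vec = 0;
    if (mincore((void *)base, (size_t)page, &vec) != 0) return -1;
    return vec & 1;
}

/* Hypothetical demo: records residency before and after the first write. */
int touch_makes_resident(int *before, int *after) {
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    *before = page_is_resident(p);
    *(volatile char *)p = 1;   /* first touch: the fault backs the page */
    *after = page_is_resident(p);
    munmap(p, (size_t)page);
    return 0;
}
```

On a typical Linux kernel `before` comes back 0 and `after` comes back 1, though the untouched result can vary with the environment.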
fork() — the cleverest syscall.
fork() creates a new process that is an almost-exact copy of the caller. Same
address space contents, same open file descriptors, same cwd, same signal mask, same
everything. What differs: a new PID, a new parent, zeroed CPU-time accounting (the
tms_* fields), and the return value (0 in the child, the child's PID in the
parent).
Done naively this would be unworkable, copying gigabytes of address space. The trick is copy-on-write: the kernel duplicates the page tables but marks every writable page read-only in both processes. The first write from either side traps to the kernel, which allocates a fresh page, copies the original, marks it writable in the writer's page table, and returns. Pages neither side ever writes stay shared forever.
pid_t pid = fork();
if (pid < 0) {
    // fork failed; errno says why
} else if (pid == 0) {
    // we are the child
    execve("/usr/bin/ls", argv, envp);
    _exit(127);  // reached only if execve failed
} else {
    // we are the parent; pid is the child's PID
    int status;
    waitpid(pid, &status, 0);
}
The single call to fork() returns twice, once in each process, and the return
value is the only way each side knows who it is. Both processes then continue from the same
instruction, with the same open files, the same variables, the same everything, diverging
only as they follow different branches of the if. This is the part that
trips people up the first time: there is no "main" copy that keeps going while a "child" copy
starts fresh. There are two equal processes, distinguished by a number.
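A small experiment shows that the two processes start with identical memory and then diverge privately. This is a minimal sketch; `parent_value_after_child_write` is a hypothetical name. The child overwrites a heap value it inherited, and the parent checks its own copy once the child has exited.

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the parent's view of a heap value after the
 * child has overwritten its own copy, or -1 on error. */
int parent_value_after_child_write(void) {
    int *value = malloc(sizeof *value);
    if (!value) return -1;
    *value = 1;

    pid_t pid = fork();
    if (pid < 0) { free(value); return -1; }
    if (pid == 0) {
        *value = 2;   /* the write faults and the child gets a private page */
        _exit(*value == 2 ? 0 : 1);
    }

    int status;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        seen = *value;   /* the parent's page was never modified */
    free(value);
    return seen;
}
```

The function returns 1: the child saw 2 in its copy, while the parent's copy is untouched.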
Most fork() calls are followed
right away by exec(), which throws away the entire address space. So why
fork at all? Because between the two, the child can do anything the parent can do.
It can chdir, setuid, redirect file descriptors, or install signal
handlers, all without touching the parent's state. It's the cleanest way Unix has to set up a new
environment for a child program.
Copy-on-write is not free, even when nothing is copied. The kernel still has to duplicate
the page tables and walk every writable mapping to flip it read-only, and the first write to
each shared page takes a minor fault. For a process with a small address space this is
cheap. For a multi-gigabyte process — a JVM heap, a big in-memory cache — the page-table
work alone can be slow, and a child that then writes widely will trigger a storm of
copy faults. This is why a large server process that needs to spawn helpers often uses
posix_spawn or vfork+exec, or keeps a small
pre-forked helper around, rather than forking the whole heap. It is also why thread-based
concurrency, covered on the threads
page, sidesteps the address-space copy entirely.
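For the posix_spawn route, a minimal sketch might look like the following. `spawn_and_wait` is a hypothetical wrapper name; it passes no file actions or spawn attributes and simply forwards the caller's environment.

```c
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical wrapper: runs argv[0] (looked up on PATH) and returns its
 * exit code, or -1 if it could not be started or did not exit normally. */
int spawn_and_wait(char *const argv[]) {
    pid_t pid;
    /* posix_spawnp returns an errno value on failure, not -1. */
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0) return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The caller never runs its own code in the child, so there is no window for the child to touch the parent's heap. Setup that would otherwise sit between fork and exec has to be expressed through the file-actions and attribute arguments, which this sketch leaves as NULL.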
exec() — replacing the whole program.
execve(path, argv, envp) doesn't create a new process. It replaces the
current one's address space with the contents of path, sets up a new stack
with argv and envp, and jumps to the new program's entry point.
The PID stays the same. Open file descriptors stay open, unless they have
FD_CLOEXEC set. Signal handlers reset to default; the signal mask is kept.
Modern Linux supports two formats for the executable: ELF (the default) and
#! interpreters, which the kernel rewrites internally to run the
interpreter on the script. The binfmt_misc mechanism lets userspace
register handlers for other file formats, which is how java -jar works
as a one-liner on most distros.
What survives an exec is the small set of things that belong to the process
rather than the program: the PID, the parent relationship, open file descriptors without
FD_CLOEXEC, the working directory, and the credentials. What does not survive is
everything that belonged to the old program's running state: the heap, the stack, the loaded
code, any memory mappings, and the registered signal handlers, which reset to default because
the new program never installed them. The continuity of file descriptors across exec is what
makes shell redirection work. The shell forks, the child opens or dups the file descriptors
it wants, and then execs the command, which inherits them.
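Whether a descriptor crosses an exec is controlled by a single per-descriptor flag. A short sketch of setting and checking it with `fcntl`, using the hypothetical helper names `set_cloexec` and `is_cloexec`:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: marks fd close-on-exec while preserving any other
 * descriptor flags. Returns 0 on success, -1 on error. */
int set_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Hypothetical helper: 1 if fd will be closed by exec, 0 if it will be
 * inherited, -1 on error. */
int is_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    return flags < 0 ? -1 : (flags & FD_CLOEXEC) != 0;
}
```

A descriptor from a plain `pipe()` starts without the flag and would be inherited by the new program. After `set_cloexec` the kernel closes it during exec.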
wait(), zombies, and orphans.
When a process exits, its address space is torn down, its file descriptors are closed,
and its task_struct moves to the ZOMBIE state. That's a stub
holding only the exit status, the killing signal, and timing accounting. The parent has
to call wait() or waitpid() to read those out and reap the
child. Until it does, the zombie sits there holding a PID slot.
Reaping matters because PID slots are a finite resource. The zombie holds nothing but its
exit status, yet it keeps its entry in the process table until reaped. A parent that forks
children in a loop and never calls wait() will slowly fill the table, and
eventually fork() itself starts failing with EAGAIN. The fix is
either to call wait()/waitpid() for every child, or to handle the
SIGCHLD signal the kernel sends on each child's exit and reap from there. A
third option is to tell the kernel you do not care about the exit status by setting
SIGCHLD to SIG_IGN, after which children are reaped automatically
and never become zombies.
Processes form a tree. Every process has exactly one parent, recorded in its
task_struct, and the tree is rooted at PID 1. When you run a pipeline in a
shell, the shell is the parent of each command; when a server accepts a connection and forks
a worker, the server is the worker's parent. The pstree command draws this
tree, and the relationships in it decide who is responsible for reaping whom.
If the parent dies first, the child is reparented to the nearest ancestor
that called prctl(PR_SET_CHILD_SUBREAPER, 1), falling back to PID 1
(init). Init's only job is to wait() for orphans in a loop. This is why
containers without a real init (the most common production failure mode for
docker run my-binary) pile up zombies forever once their child
processes spawn grandchildren.
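The subreaper mechanism can be tried out in a few lines. This is a sketch assuming Linux 3.4 or later; `adopt_orphan_demo` is a hypothetical name. The caller marks itself a subreaper, its child exits at once and orphans a grandchild, and the caller then reaps both.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns how many descendants the caller reaped
 * (expected 2: the child and the adopted grandchild), or -1 on error.
 * Note that the caller remains a subreaper afterwards. */
int adopt_orphan_demo(void) {
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0) return -1;

    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        pid_t grandchild = fork();
        if (grandchild == 0) {
            usleep(50 * 1000);   /* outlive the parent and become an orphan */
            _exit(0);
        }
        _exit(grandchild < 0 ? 1 : 0);   /* exit first, orphaning the grandchild */
    }

    int reaped = 0;
    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);
        if (pid > 0) { reaped++; continue; }
        if (errno == EINTR) continue;
        break;   /* ECHILD: nothing left to wait for */
    }
    return reaped;
}
```

Without the `prctl` call the grandchild would be reparented further up the tree, and the loop would reap only one process.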
It is worth being precise about the two failure shapes, because they look similar and have opposite causes. An orphan is a live process whose parent has died; it is fine, because it gets reparented to a subreaper or to PID 1, which will reap it when it eventually exits. A zombie is a dead process whose parent is still alive but has not reaped it; it is the leak. A pile of orphans is usually harmless. A pile of zombies means some living parent is not doing its job, and the parent is the bug to fix, not the zombie.
For the container case, the fix is to run with docker run --init, or to write your
application to handle SIGCHLD itself.
Process states.
A process spends its life moving between a handful of states, and almost every interesting
thing the kernel does is a transition between them. The textbook model has five: a process is
new while it is being set up, ready when it could run but
is waiting for a CPU, running when it is actually on a CPU,
blocked when it is waiting for something like disk or a network reply, and
terminated when it has finished. Linux splits and renames these — its ready
and running states are both TASK_RUNNING, and blocked is split into
interruptible and uninterruptible sleep — but the shape is the same.
Notice who owns each edge. The two edges between ready and running belong to the
scheduler: dispatching a ready
task onto a CPU, and preempting a running task back to the ready queue when its time slice
ends or a higher-priority task wakes. The edges into and out of blocked belong to events: a
task calls read() on a socket with no data and blocks, then an interrupt
signals the data arrived and the task is moved back to ready. A process can only be running
on as many CPUs as it has runnable threads, and the number of ready-but-not-running tasks is
the run-queue depth. Linux's load average counts those tasks along with the running ones and the tasks in uninterruptible sleep.
| State | What it means | How to wake |
|---|---|---|
| R (running) | On a CPU or runqueue | — |
| S (interruptible) | Waiting on something, will wake on signal | Event or signal |
| D (uninterruptible) | Waiting on disk I/O, will not wake on signal | I/O completes |
| T (stopped) | Sent SIGSTOP/SIGTSTP | SIGCONT |
| Z (zombie) | Exited; awaiting parent's wait() | Reaped |
| X (dead) | Reaped; about to be freed | — |
A process stuck in D state for any real length of time is worth
investigating. Usually it's a stuck NFS mount, a hung block device, or a kernel bug. Run
ps auxf and look for D in the STAT column.
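The same state letters can be read programmatically from /proc/&lt;pid&gt;/stat. The sketch below assumes Linux with /proc mounted; `process_state` and `state_of_unreaped_child` are hypothetical names. The second function uses `waitid` with WNOWAIT to wait for a child's exit without reaping it, which leaves the child a zombie long enough to inspect.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the state letter of pid, or '?' on failure. */
char process_state(pid_t pid) {
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) return '?';
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    /* Format is "pid (comm) S ...". comm may itself contain ')' or spaces,
     * so locate the last ')' and take the character two past it. */
    char *close_paren = strrchr(buf, ')');
    if (!close_paren || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}

/* Hypothetical demo: state of a child that has exited but is not yet reaped. */
char state_of_unreaped_child(void) {
    pid_t pid = fork();
    if (pid < 0) return '?';
    if (pid == 0) _exit(0);
    siginfo_t info;
    /* WNOWAIT: block until the child exits, but leave it unreaped. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0) return '?';
    char state = process_state(pid);
    waitpid(pid, NULL, 0);   /* now reap it */
    return state;
}
```

A process asking about itself sees R, since it is on a CPU while it reads the file. The unreaped child reports Z.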
The file descriptor table.
Every open file, socket, pipe, and device a process holds is reached through a small integer:
a file descriptor. The descriptor is an index into the process's file descriptor table, the
files field on the task_struct. Slot 0 is standard input, 1 is
standard output, 2 is standard error by convention, and everything a process opens after
that gets the lowest free slot. The integer means nothing on its own; it is a per-process
handle that the kernel resolves to a real kernel object.
There are actually three layers here, and the distinction explains a lot of Unix behaviour.
The descriptor is per-process. It points into a system-wide open file table, where
each entry holds the current file offset and the access mode. That entry in turn points at
the inode, the kernel's record of the actual file. When you dup a
descriptor, or when a child inherits descriptors across fork, the two
descriptors point at the same open file table entry, so they share the offset:
a write through one advances the position for the other. When two processes
open the same file independently, they get separate open file entries and
separate offsets, even though both reach the same inode.
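The shared-offset behaviour is easy to check. In this sketch `compare_offsets` is a hypothetical name. It writes five bytes through one descriptor, then reads the current offset through a `dup` of it and through an independent `open` of the same file.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo: after writing 5 bytes through fd, reports the offset
 * seen via a dup of fd and via a separate open of the same path.
 * Returns 0 on success, -1 on error. */
int compare_offsets(const char *path, off_t *via_dup, off_t *via_reopen) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    int copy = dup(fd);                 /* same open file entry as fd */
    int other = open(path, O_RDONLY);   /* separate open file entry */
    int rc = -1;
    if (copy >= 0 && other >= 0 && write(fd, "hello", 5) == 5) {
        *via_dup = lseek(copy, 0, SEEK_CUR);
        *via_reopen = lseek(other, 0, SEEK_CUR);
        rc = 0;
    }
    if (other >= 0) close(other);
    if (copy >= 0) close(copy);
    close(fd);
    return rc;
}
```

The dup reports offset 5 because it shares the entry that the write advanced. The independent open reports 0, even though both reach the same inode.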
The table is what makes shell redirection and pipes work. To run cmd > out.txt
the shell forks, opens out.txt, uses dup2 to make that descriptor
become slot 1, then execs cmd — which writes to standard output as usual,
unaware it is now going to a file. The size of the table is capped by
RLIMIT_NOFILE, the per-process open-file limit, which is the thing you raise
when a busy server logs "too many open files." Each descriptor is small, but they are not
free, and a process that opens connections without closing them leaks slots until it hits
the cap.
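The redirection steps translate almost line for line into C. This is a sketch of what a shell might do for cmd > out.txt, not any particular shell's implementation; `run_with_stdout_to` is a hypothetical name, and the exit codes 126 and 127 follow the common shell convention.

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv with standard output sent to outfile.
 * Returns the command's exit code, or -1 on error. */
int run_with_stdout_to(const char *outfile, char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: rearrange descriptors, then become the command. */
        int fd = open(outfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) _exit(126);
        if (dup2(fd, STDOUT_FILENO) < 0) _exit(126);
        if (fd != STDOUT_FILENO) close(fd);   /* slot 1 now refers to the file */
        execvp(argv[0], argv);
        _exit(127);   /* reached only if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent's own descriptor table is never modified. All of the rearranging happens in the child between fork and exec.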
Processes versus threads.
The line between a process and a thread comes down to what gets shared. A process has its own
address space; the threads inside it share that one address space. Concretely, when Linux
creates a thread it makes a new task_struct that points its mm,
files, fs, and signal at the same sub-structures as the
leader, instead of getting fresh copies. The threads share memory, open files, and the
working directory; each keeps only its own registers, its own stack, and its own scheduling
state. That is why one thread can corrupt another's data with a stray pointer, while one
process cannot reach into another's memory at all.
The trade-off follows directly. Threads are cheap to create and cheap to switch between, because there is no address space to copy and no page tables to swap, and they share data for free — which is exactly why they need locks, because that shared data races. Processes are more expensive and communicate only through explicit channels like pipes, sockets, or shared memory, but the isolation they buy means a crash in one does not take down the others. Most servers pick a blend: a small number of processes for fault isolation, many threads inside each for throughput. The threads page covers the kernel thread model in full, and the thread pools page covers the standard pattern for reusing threads instead of paying creation cost per task.
What a process costs.
A process is not a free abstraction, and it pays at three different moments. Creation costs
the page-table duplication and the copy-on-write setup we saw with fork, plus
the work of building the new task_struct and threading it into the kernel's
lists. For a small program this is microseconds; for a huge address space it is meaningfully
more. Existence costs memory: the kernel keeps the task_struct, the page tables,
and the kernel stack for the lifetime of the process, regardless of whether it is running.
Thousands of mostly-idle processes still consume kernel memory.
The third cost is the context switch, paid every time the CPU moves from one process to another. The kernel saves the outgoing process's registers, swaps the page-table root, and restores the incoming process's registers. Swapping the address space flushes much of the TLB, so the new process takes a wave of TLB misses as it warms back up, and its working set may have been evicted from cache while it was away. These indirect costs — cold cache, cold TLB — usually dwarf the direct register-save work. Switching between two threads of the same process is cheaper precisely because the address space does not change, so the TLB and much of the cache survive. How the kernel decides who runs next, and how it tries to keep switches productive, is the scheduling page.
PID 1 — the init process.
PID 1 is special. The kernel drops any signal PID 1 has not installed a handler for, and
SIGKILL can never have a handler, so kill -9 1 from inside its PID namespace does nothing. It
adopts orphans. If it dies, the kernel panics, by design. On modern Linux distros PID 1
is systemd; on container hosts it's the kernel's init; inside
containers it's whatever your ENTRYPOINT resolves to, which is the source
of the zombie problem above.
Two practical points. First, inside a container your application is PID 1 and inherits its duties, including signal forwarding to children, which Bash and Python don't do by default. Second, outside a container, talking to systemd is the supported way to start, restart, and watch long-running processes.
Further reading.
- fork(2), execve(2), wait(2): the man pages, and the most authoritative explanation of corner cases.
- linux/sched.h: the task_struct definition.
- OSTEP, Process API: the cleanest textbook treatment of fork/exec/wait.
- Kerrisk, TLPI ch. 24–26: the canonical Linux programming reference on processes.
- Docker and the PID 1 zombie problem: the Phusion post that made everyone start shipping tini.
- Semicolony, Virtual memory and the TLB: the hardware below the OS process abstraction. Page tables, the four-level walk, huge pages, TLB shootdowns, and Meltdown.
- Semicolony, Power-on, firmware, boot: how the first process (PID 1) ever gets started. Reset vector, UEFI, kernel handoff.