Processes

A process is three things bundled together: an address space, at least one thread of execution, and a set of operating-system resources the kernel holds on its behalf. It is the unit of isolation in Unix. Each one has its own virtual memory, its own file descriptor table, its own credentials, and its own signal mask. The kernel keeps all of this in a struct called task_struct. Here's what's in it, how a process moves through its states, how it starts, and how it goes away.

A program is not a process.

A program is a passive thing: a file on disk, an ELF binary with some machine code, some constants, and a header that says where to start. It does nothing on its own. A process is that program brought to life. The kernel reads the binary, lays its sections out in memory, gives it a stack, points the CPU at its entry point, and starts running it. The same program can back many processes at once. Run cat three times and you have three processes, each with its own memory and its own place in the program's execution, all sharing the one read-only copy of the code on disk.

So a process is the running instance, and it is made of three parts. The first is an address space: a private map of virtual memory holding the code, the global data, the heap, and one or more stacks. The second is at least one thread of execution: the CPU registers, the program counter that says which instruction is next, and the stack pointer. The third is the set of operating-system resources the kernel holds for it: open file descriptors, the current working directory, signal handlers, credentials, timers, and accounting. Take away any one of these and the abstraction stops being a process. A thread alone has no private address space. An address space alone never runs. The kernel binds them together and tracks the bundle.

Two design goals drive the whole abstraction. Isolation: one process should not be able to read or corrupt another's memory by accident or on purpose, which is what the private address space buys. Multiplexing: many processes should share one machine and take turns on the CPUs without knowing about each other, which is what the separate thread of execution and the kernel's bookkeeping buy. Everything below is the machinery that delivers those two properties.

What lives in a task_struct.

The kernel's record of a process is its process control block. On Linux that record is the task_struct, and it is the single source of truth for everything the kernel knows about a running task. When the scheduler decides what to run, when a signal arrives, when a page fault happens, when the process exits — all of it reads from or writes to this struct. Understanding what is inside it is most of understanding what a process is.

Linux doesn't have separate process and thread structures. It has one task_struct, defined in include/linux/sched.h. A "process" is a task whose thread group leader is itself. A "thread" is a task that shares mm, files, fs, and signal with its thread group leader. Same struct, different sharing rules.

The fields that matter most:

Field	What it points to
`mm`	The memory descriptor — page tables, VMAs, brk, stack base
`files`	The file descriptor table — array of `struct file *`
`fs`	cwd, root, umask
`signal` / `sighand`	Signal handlers, blocked mask, pending signals
`cred`	uid, gid, capabilities, security labels
`pid` / `tgid`	Process ID and thread group ID
`parent`	The parent task that will receive SIGCHLD on exit
`state`	RUNNING, INTERRUPTIBLE, UNINTERRUPTIBLE, STOPPED; ZOMBIE lives in the separate `exit_state` field (the field is named `__state` since Linux 5.14)

The list keeps going. There is scheduling data — priority, the scheduling class, the runqueue this task sits on, and the accumulated CPU time used. There are the saved registers and kernel stack pointer that let the scheduler put a task to sleep and resume it later exactly where it stopped. There are pointers that thread the task into several lists at once: the list of its children, its siblings, the global task list, and the wait queue it is parked on if it is blocked. The struct is large because it has to answer every question the kernel might ask about a task, from any subsystem, at any time.

Two things are worth pulling out. Most of the fields are pointers to shared sub-structures rather than the data itself, which is what makes the process-versus-thread distinction so cheap to express: two threads in the same process point their mm and files at the same objects, while two separate processes each get their own. And the kernel keeps these structs allocated even after a process has finished running, because a parent may still need to read the exit status. That lingering, half-dead struct is the zombie, which we get to below.

The task_struct mostly holds pointers to sub-structures. Sharing or copying those pointers is the difference between a thread and a process.

Address space — the most important piece.

Every process gets its own virtual address space, an illusion of contiguous memory. On x86-64 with 4-level paging, virtual addresses are 48 bits wide and user space gets the lower half, from 0 up to 2⁴⁷. The kernel tracks the address space as a set of virtual memory areas (VMAs), each a contiguous range with a backing (file or anonymous), permissions, and flags.

$ cat /proc/self/maps
55cd28e7c000-55cd28e84000 r--p 00000000 fd:01 1234567 /usr/bin/cat
55cd28e84000-55cd28e8e000 r-xp 00008000 fd:01 1234567 /usr/bin/cat
55cd28e8e000-55cd28e91000 r--p 00012000 fd:01 1234567 /usr/bin/cat
55cd28e91000-55cd28e93000 r--p 00014000 fd:01 1234567 /usr/bin/cat
55cd28e93000-55cd28e95000 rw-p 00016000 fd:01 1234567 /usr/bin/cat
7f8e2c4d2000-7f8e2c4f5000 r--p 00000000 fd:01 7654321 /usr/lib/libc.so.6
...
7ffd9b8a7000-7ffd9b8c8000 rw-p 00000000 00:00 0       [stack]
7ffd9bb1a000-7ffd9bb1d000 r--p 00000000 00:00 0       [vvar]
7ffd9bb1d000-7ffd9bb1f000 r-xp 00000000 00:00 0       [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]

Every line is one VMA. The first column is the address range. The second is the permissions (rwx, followed by p for a private mapping or s for a shared one). Next comes the offset into the backing file, then the device, the inode, and the path.

This is the record of where every byte of your process's memory came from. When you mmap something, a new VMA appears here. When you fork, the child gets a copy with the same VMAs but its own page tables, with copy-on-write set on the writable ones.

It helps to picture the layout. At low addresses sit the read-only text (the code) and the read-only constants, mapped straight from the binary. Above them is the initialised and zero-filled global data. Then the heap, which grows upward as the process asks for memory through brk or, more often now, mmap. Far up near the top of the user range is the stack, which grows downward toward the heap. The shared libraries the program links against — libc and friends — are mapped somewhere in the middle. The wide gap between heap and stack is not wasted; it is virtual address space, costing nothing until a page is actually touched and backed by physical memory.

Two properties matter for the rest of this page. The address space is private: address 0x4000 in one process and the same address in another point at different physical pages, or at nothing. And it is lazy: the kernel hands out virtual ranges freely and only allocates a real page when the process faults on it. Both properties are what make fork affordable, and both rest on the page tables and TLB underneath, which the virtual memory page covers from the hardware side.

fork() — the cleverest syscall.

fork() creates a new process that is an almost-exact copy of the caller. Same address space contents, same open file descriptors, same cwd, same signal mask. What differs: a new PID, a different parent (the caller), zeroed tms_* CPU-time accounting, an empty set of pending signals, and the return value (0 in the child, the child's PID in the parent).

Done naively this would be unworkable, copying gigabytes of address space. The trick is copy-on-write: the kernel duplicates the page tables but marks every writable page read-only in both processes. The first write from either side traps to the kernel, which allocates a fresh page, copies the original, marks it writable in the writer's page table, and returns. Pages neither side ever writes stay shared forever.

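#include <stdio.h>      // perror
#include <sys/wait.h>   // waitpid
#include <unistd.h>     // fork, execve, _exit

// argv and envp are the argument and environment vectors for the new program
pid_t pid = fork();
if (pid < 0) {
    perror("fork");               // fork failed; no child was created
} else if (pid == 0) {
    // we are the child
    execve("/usr/bin/ls", argv, envp);
    perror("execve");             // reached only if execve failed
    _exit(127);                   // do not fall through into the parent's code
} else {
    // we are the parent; pid is the child's PID
    int status;
    waitpid(pid, &status, 0);
}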

The single call to fork() returns twice, once in each process, and the return value is the only way each side knows who it is. Both processes then continue from the same instruction, with the same open files and the same variables, diverging only as they each follow different branches of the if. This is the part that trips people up the first time: there is no "main" copy that keeps going while a "child" copy starts fresh. There are two equal processes, distinguished by a number.

Why fork looks redundant. Most fork() calls are followed right away by exec(), which throws away the entire address space. So why fork at all? Because between the two, the child can do anything the parent can do. It can chdir, setuid, redirect file descriptors, or change its signal mask and its ignored signals, all without touching the parent's state. (Installing handlers at this point is pointless, because exec resets them.) It's the cleanest way Unix has to set up a new environment for a child program.

Copy-on-write is not free, even when nothing is copied. The kernel still has to duplicate the page tables and walk every writable mapping to flip it read-only, and the first write to each shared page takes a minor fault. For a process with a small address space this is cheap. For a multi-gigabyte process — a JVM heap, a big in-memory cache — the page-table work alone can be slow, and a child that then writes widely will trigger a storm of copy faults. This is why a large server process that needs to spawn helpers often uses posix_spawn or vfork+exec, or keeps a small pre-forked helper around, rather than forking the whole heap. It is also why thread-based concurrency, covered on the threads page, sidesteps the address-space copy entirely.

exec() — replacing the whole program.

execve(path, argv, envp) doesn't create a new process. It replaces the current one's address space with the contents of path, sets up a new stack with argv and envp, and jumps to the new program's entry point. The PID stays the same. Open file descriptors stay open, unless they have FD_CLOEXEC set. Signal handlers reset to default; the signal mask is kept.

Modern Linux supports two formats for the executable out of the box: ELF (the default) and #! interpreters, which the kernel rewrites internally to run the interpreter on the script. The binfmt_misc mechanism lets userspace register handlers for other file formats, which is how a .jar file or a foreign-architecture binary can be run directly as a command on systems that register a handler for it.

What survives an exec is the small set of things that belong to the process rather than the program: the PID, the parent relationship, open file descriptors without FD_CLOEXEC, the working directory, and the credentials. What does not survive is everything that belonged to the old program's running state: the heap, the stack, the loaded code, any memory mappings, and the registered signal handlers, which reset to default because the new program never installed them. The continuity of file descriptors across exec is what makes shell redirection work. The shell forks, the child opens or dups the file descriptors it wants, and then execs the command, which inherits them.

The fork / exec / wait triad. The parent forks, the child execs a new program, the parent waits to read the exit status and free the zombie.

wait(), zombies, and orphans.

When a process exits, its address space is torn down, its file descriptors are closed, and its task_struct moves to the ZOMBIE state. That's a stub holding only the exit status, the killing signal, and timing accounting. The parent has to call wait() or waitpid() to read those out and reap the child. Until it does, the zombie sits there holding a PID slot.

Reaping matters because PID slots are a finite resource. The zombie holds nothing but its exit status, yet it keeps its entry in the process table until reaped. A parent that forks children in a loop and never calls wait() will slowly fill the table, and eventually fork() itself starts failing with EAGAIN. The fix is either to call wait()/waitpid() for every child, or to handle the SIGCHLD signal the kernel sends on each child's exit and reap from there. A third option is to tell the kernel you do not care about the exit status by setting SIGCHLD to SIG_IGN, after which children are reaped automatically and never become zombies.

Processes form a tree. Every process has exactly one parent, recorded in its task_struct, and the tree is rooted at PID 1. When you run a pipeline in a shell, the shell is the parent of each command; when a server accepts a connection and forks a worker, the server is the worker's parent. The pstree command draws this tree, and the relationships in it decide who is responsible for reaping whom.

If the parent dies first, the child is reparented to the nearest ancestor that called prctl(PR_SET_CHILD_SUBREAPER, 1), falling back to PID 1 (init). One of init's core jobs is to wait() for orphans in a loop. This is why a container started with a plain docker run my-binary, with no real init at PID 1, piles up zombies once its child processes spawn grandchildren: the orphans are handed to a PID 1 that never reaps them. It is the most common production failure mode for that setup.

It is worth being precise about the two failure shapes, because they look similar and have opposite causes. An orphan is a live process whose parent has died; it is fine, because it gets reparented to a subreaper or to PID 1, which will reap it when it eventually exits. A zombie is a dead process whose parent is still alive but has not reaped it; it is the leak. A pile of orphans is usually harmless. A pile of zombies means some living parent is not doing its job, and the parent is the bug to fix, not the zombie.

The "tini" reflex. Container images that run a single non-init process as PID 1 are the classic way to leak zombies. The fix is to put a tiny init like tini at PID 1, or use docker run --init, or write your application to handle SIGCHLD itself.

Process states.

A process spends its life moving between a handful of states, and almost every interesting thing the kernel does is a transition between them. The textbook model has five: a process is new while it is being set up, ready when it could run but is waiting for a CPU, running when it is actually on a CPU, blocked when it is waiting for something like disk or a network reply, and terminated when it has finished. Linux splits and renames these — its ready and running states are both TASK_RUNNING, and blocked is split into interruptible and uninterruptible sleep — but the shape is the same.

The state machine. The scheduler owns the ready ↔ running edges; blocking and waking are driven by I/O and events.

Notice who owns each edge. The two edges between ready and running belong to the scheduler: dispatching a ready task onto a CPU, and preempting a running task back to the ready queue when its time slice ends or a higher-priority task wakes. The edges into and out of blocked belong to events: a task calls read() on a socket with no data and blocks, then an interrupt signals the data arrived and the task is moved back to ready. A process can only be running on as many CPUs as it has runnable threads, and the number of ready-but-not-running tasks is the run-queue depth. That depth feeds the load average, although on Linux the load average also counts tasks that are currently running and tasks in uninterruptible sleep.

State	What it means	How to wake
R (running)	On a CPU or runqueue	—
S (interruptible)	Waiting on something, will wake on signal	Event or signal
D (uninterruptible)	Waiting on disk I/O, will not wake on signal	I/O completes
T (stopped)	Sent SIGSTOP/SIGTSTP	SIGCONT
Z (zombie)	Exited; awaiting parent's `wait()`	Reaped
X (dead)	Reaped; about to be freed	—

A process stuck in D state for any real length of time is worth investigating. Usually it's a stuck NFS mount, a hung block device, or a kernel bug. Run ps auxf and look for D in the STAT column.

The file descriptor table.

Every open file, socket, pipe, and device a process holds is reached through a small integer: a file descriptor. The descriptor is an index into the process's file descriptor table, the files field on the task_struct. Slot 0 is standard input, 1 is standard output, 2 is standard error by convention, and everything a process opens after that gets the lowest free slot. The integer means nothing on its own; it is a per-process handle that the kernel resolves to a real kernel object.

There are actually three layers here, and the distinction explains a lot of Unix behaviour. The descriptor is per-process. It points into a system-wide open file table, where each entry holds the current file offset and the access mode. That entry in turn points at the inode, the kernel's record of the actual file. When you dup a descriptor, or when a child inherits descriptors across fork, the two descriptors point at the same open file table entry, so they share the offset: a write through one advances the position for the other. When two processes open the same file independently, they get separate open file entries and separate offsets, even though both reach the same inode.

Three layers. Descriptors that share an open file table entry share the file offset; independent opens do not.

The table is what makes shell redirection and pipes work. To run cmd > out.txt the shell forks, opens out.txt, uses dup2 to make that descriptor become slot 1, then execs cmd — which writes to standard output as usual, unaware it is now going to a file. The size of the table is capped by RLIMIT_NOFILE, the per-process open-file limit, which is the thing you raise when a busy server logs "too many open files." Each descriptor is small, but they are not free, and a process that opens connections without closing them leaks slots until it hits the cap.

Processes versus threads.

The line between a process and a thread comes down to what gets shared. A process has its own address space; the threads inside it share that one address space. Concretely, when Linux creates a thread it makes a new task_struct that points its mm, files, fs, and signal at the same sub-structures as the leader, instead of getting fresh copies. The threads share memory, open files, and the working directory; each keeps only its own registers, its own stack, and its own scheduling state. That is why one thread can corrupt another's data with a stray pointer, while one process cannot reach into another's memory at all.

The trade-off follows directly. Threads are cheap to create and cheap to switch between, because there is no address space to copy and no page tables to swap, and they share data for free — which is exactly why they need locks, because that shared data races. Processes are more expensive and communicate only through explicit channels like pipes, sockets, or shared memory, but the isolation they buy means a crash in one does not take down the others. Most servers pick a blend: a small number of processes for fault isolation, many threads inside each for throughput. The threads page covers the kernel thread model in full, and the thread pools page covers the standard pattern for reusing threads instead of paying creation cost per task.

What a process costs.

A process is not a free abstraction, and it pays at three different moments. Creation costs the page-table duplication and the copy-on-write setup we saw with fork, plus the work of building the new task_struct and threading it into the kernel's lists. For a small program this is microseconds; for a huge address space it is meaningfully more. Existence costs memory: the kernel keeps the task_struct, the page tables, and the kernel stack for the lifetime of the process, regardless of whether it is running. Thousands of mostly-idle processes still consume kernel memory.

The third cost is the context switch, paid every time the CPU moves from one process to another. The kernel saves the outgoing process's registers, swaps the page-table root, and restores the incoming process's registers. Swapping the address space flushes much of the TLB, so the new process takes a wave of TLB misses as it warms back up, and its working set may have been evicted from cache while it was away. These indirect costs — cold cache, cold TLB — usually dwarf the direct register-save work. Switching between two threads of the same process is cheaper precisely because the address space does not change, so the TLB and much of the cache survive. How the kernel decides who runs next, and how it tries to keep switches productive, is the scheduling page.

PID 1 — the init process.

PID 1 is special. The kernel does not deliver to it any signal it has not installed a handler for, so kill -9 1 from inside its own PID namespace does nothing. It adopts orphans. If the host's PID 1 dies, the kernel panics, by design. On modern Linux distros PID 1 is systemd; on a container host it is the host's own init; inside containers it's whatever your ENTRYPOINT resolves to, which is the source of the zombie problem above.

Two practical points. First, inside a container your application is PID 1 and inherits its duties, including signal forwarding to children, which Bash and Python don't do by default. Second, outside a container, talking to systemd is the supported way to start, restart, and watch long-running processes.
