Threads
A thread is the unit of CPU execution. A process is the box it runs inside.
On Linux every thread is a kernel-scheduled task_struct that shares
the process address space with its peers but keeps its own stack and registers.
On top of that sits a stack of abstractions: pthreads, futexes, thread-local
storage, goroutines, async/await, each one trading kernel transitions for
scheduling complexity. This page walks the whole path from what a thread owns
to why a thread pool exists.
Thread vs process.
Start with the distinction every other idea on this page hangs off. A process is a container. It owns an address space, a set of open file descriptors, a working directory, a signal-handler table, and a security identity. A thread is a flow of execution running inside that container. One process can hold many threads, and those threads all run inside the same address space, which is the whole point: they can see each other's memory directly, with no copying and no message passing.
That sharing is the line between threads and separate processes. Two processes are isolated. If you want them to cooperate you reach for a pipe, a socket, shared memory you set up on purpose, or some other channel the kernel mediates. Two threads in one process cooperate by writing to the same variable. The first model is safe by default and slow to pass data through. The second is fast and dangerous by default, because nothing stops two threads from touching the same byte at the same time. Most of the rest of this page is about living with that trade.
A useful way to hold it: a process answers "what is isolated from what," and a thread answers "what is running right now." When a program "does two things at once" on one machine, it is almost always running two threads, not two processes, because the threads can share state cheaply and the kernel can put them on two cores at the same time. The cost of that convenience is that you, the engineer, become responsible for keeping the shared state consistent.
What a thread owns vs shares.
The clean way to think about a thread is by what is private to it and what it borrows from the process. Get this list right and the rest of threading stops being mysterious.
A thread owns three things. First, a stack: its own
region of memory for local variables, function arguments, return addresses, and saved
registers across calls. Each thread needs a separate stack because each thread is at a
different point in a different call chain. Second, its registers,
including the program counter that says which instruction is next and the stack pointer
that says where its stack currently is. Third, a thread control block
(the TCB, or on Linux the task_struct), the small kernel record holding the
thread's id, its saved register state when it is not running, its scheduling state and
priority, and a pointer back to the process it belongs to.
Everything else is shared. The address space, so the
heap and global variables and program code are common to every thread; a pointer that is
valid in one thread is valid in all of them. The open file descriptors,
so a file or socket opened by one thread can be read or written by another, and closing it
in one closes it for all. The signal-handler table, the current working directory, the
user and group identity, and the environment. When one thread calls malloc
and another calls free on the same pointer, that works precisely because the
heap is shared. When one thread corrupts the heap, every thread crashes, for the same reason.
This split is exactly why a thread is cheaper than a process. Creating a process means building a fresh address space and (with copy-on-write) preparing to duplicate page tables. Creating a thread means allocating one stack and one control block and pointing them at an address space that already exists. The kernel work is far smaller, which is the first half of why threads are the default tool for on-machine concurrency.
User-level vs kernel-level threads.
There are two places a thread can be scheduled from, and the choice shapes everything else.
A kernel-level thread is known to the kernel. It has a real
task_struct, the kernel's scheduler decides when it runs, and it can be placed
on any core. Because the kernel sees it, a kernel thread can run truly in parallel with its
siblings on a multi-core machine, and the kernel can preempt it at any instant on a timer.
The price is that every operation that touches the scheduler — create, block, wake, switch —
is a trip into the kernel, and that trip costs real time.
A user-level thread is invisible to the kernel. A library inside the process keeps a list of these threads and a small scheduler, and switches between them by saving and restoring registers entirely in user space. The kernel sees one thread and schedules that one thread; the rest are a fiction the runtime maintains. Creating and switching such a thread is very cheap, because no syscall is involved. But two problems follow. The runtime cannot place its user threads on different cores, because the kernel only gave it one, so pure user threads cannot run in parallel. And if any user thread makes a blocking syscall — a plain disk read, say — the kernel blocks the single real thread, and every user thread on top of it freezes.
The M:N model exists to take the cheap part of user threads and add real parallelism back. The runtime keeps M user threads and multiplexes them onto N kernel threads, where N is usually around the number of cores. The runtime schedules user threads onto kernel threads in user space, gets fast creation and switching, and still runs N of them in parallel because the kernel will place those N kernel threads on N cores. To dodge the blocking-syscall trap, an M:N runtime turns blocking calls into non-blocking ones under the hood, parks the user thread, and lets the kernel thread pick up other work until the I/O is ready. Go's goroutines are the cleanest production example; the cost is a complicated runtime and a few corner cases around signals and true preemption.
Three threading models.
| Model | Mapping | Used by |
|---|---|---|
| 1:1 (kernel threads) | Each user thread = one kernel-scheduled task | Linux NPTL, Windows, macOS — every mainstream OS today |
| N:1 (green threads) | All user threads multiplexed onto one kernel thread by a userspace scheduler | Old Java green threads, Erlang BEAM (per scheduler), Ruby MRI's GVL |
| M:N (hybrid) | M user threads multiplexed onto N kernel threads | Go runtime, Rust async, Erlang BEAM (overall), older Solaris |
Linux picked 1:1 with the NPTL rewrite in 2003. Every pthread_create
calls clone() with the right flags and produces a kernel-scheduled task.
This makes the kernel's scheduler the single source of truth for what runs when. That
plays well with priority and CPU affinity, but caps thread count at "however many
task_structs your kernel can hold" (low millions in practice).
clone() — what pthread_create actually does.
clone() is Linux's superpower. It's fork() with a flag set
deciding which kernel resources to share with the parent: address space (CLONE_VM),
file descriptors (CLONE_FILES), filesystem info (CLONE_FS), signal handlers
(CLONE_SIGHAND), and so on.
A POSIX thread is a clone() with all of those flags set:
clone(
flags = CLONE_VM | CLONE_FILES | CLONE_FS |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
CLONE_SETTLS | CLONE_PARENT_SETTID |
CLONE_CHILD_CLEARTID,
stack = ...,
ptid = &thread_id,
tls = &tls_block,
ctid = &futex_word
);The new task has its own kernel stack and register state, but every other resource is shared with the original. From the kernel's view, threads are processes that opted into the same address space.
Futexes — the fast path.
A naive mutex requires a syscall on every lock and unlock. For uncontended cases, which is most of them, that is a 1 μs round trip you don't need.
A futex (Fast Userspace muTEX) is the trick. The mutex is a single 32-bit integer in shared memory. Lock and unlock operate on it with atomic compare-and-swap in userspace. The kernel only enters the picture when there's contention:
// pthread_mutex_lock pseudo:
if (atomic_cas(&mutex, 0, 1)) return; // uncontended fast path
syscall(futex_wait, &mutex, 1); // contended: park the thread
// pthread_mutex_unlock pseudo:
atomic_store(&mutex, 0);
if (waiters) syscall(futex_wake, &mutex, 1);Lock + unlock with no contention: ~25 ns. With contention: a syscall, which is where mutex performance falls off a cliff. This is also why "thread sleeping on a condition" is fast in glibc; it is just a parked futex.
Thread-local storage.
Every thread needs a small private region, for errno, for memory allocator state, for
whatever you put in __thread or thread_local. The runtime
allocates a TLS block per thread, points the FS or GS register at it, and the compiler
emits %fs:offset for every TLS access.
Two flavours:
- Static TLS. Variables declared
__threadin the main program or libraries loaded at startup. Allocated en bloc with the thread. Single-instruction access. - Dynamic TLS.
__threadindlopen()'d libraries. Allocated on first access, indirected through a per-thread DTV array. Slower; opt out where possible by linking everything statically or pre-loading.
Stack growth and cost.
Each pthread defaults to an 8 MB stack on Linux glibc (you can tune this with
pthread_attr_setstacksize). The 8 MB is reserved virtually, not
committed. The kernel only allocates pages as the stack actually grows. So 1000
threads cost 1000 × few KB committed, plus 1000 × 8 MB of address space.
On a 64-bit process this is fine. On 32-bit you'd run out of address space at a few thousand threads. The "C10K problem" was partly this. Modern solutions (epoll + coroutines, goroutines, Tokio tasks) sidestep it by giving each "logical thread" a tiny userspace stack instead of an 8 MB OS stack.
| What | Linux pthread cost |
|---|---|
| Create | ~10–30 μs (clone + setup) |
| Context switch (same core) | ~1 μs |
| Context switch (cross-core, cold caches) | ~5–20 μs |
| Address space (idle) | 8 MB virtual / few KB committed |
| Mutex lock/unlock (uncontended) | ~25 ns |
| Mutex lock/unlock (contended) | ~1 μs (syscall) |
The context switch, and what it costs.
Only one thread runs on a core at a time. Putting a different thread on that core is a context switch, and it is the operation that makes concurrency feel like it "just works" while quietly setting a ceiling on how much of it you can afford.
The mechanics are short to state. The kernel decides to switch, either because the running thread blocked (it asked for I/O, took a lock that was held, or slept) or because its time slice ran out and the timer fired. The kernel saves the running thread's registers — the program counter, the stack pointer, the general registers — into that thread's control block. It picks the next thread from the run queue. It loads that thread's saved registers back into the CPU, and if the new thread belongs to a different process it also swaps the address space, which means changing the page-table root and, on older hardware, flushing the TLB. Then it returns, and the new thread continues exactly where it left off, with no idea it was ever paused.
The direct cost of the save-and-restore is around a microsecond. The bigger, sneakier cost is indirect. The thread that just got the core arrives to find its data no longer in the L1 and L2 caches, because the previous thread evicted it, so the first few thousand instructions stall on memory while the cache refills. If the switch also crossed CPU sockets, the NUMA penalty on every miss makes it worse still. This is why a microsecond on paper can behave like five or twenty in a real workload, and why a system thrashing between thousands of runnable threads spends a painful fraction of its time switching rather than working. The fix is rarely "switch faster" and usually "switch less": keep the number of busy threads near the number of cores, and keep each thread on its data.
Thread pools — why they exist.
Put the create cost and the switch cost together and the case for a thread pool writes itself. Creating a thread costs tens of microseconds and an 8 MB stack reservation. If you spawn a fresh thread per incoming request, you pay that on every request, and at high request rates you end up with thousands of threads all contending for a handful of cores, which means the machine spends its time context-switching instead of doing your work.
A thread pool fixes both problems at once. You create a fixed set of worker threads up front, sized to roughly match the hardware, and hand them a shared queue of tasks. Each worker loops: pull a task, run it to completion, pull the next. Submitting work becomes pushing onto the queue rather than asking the kernel for a thread, so the per-task overhead drops to almost nothing. And because the number of threads is capped, the number of runnable threads stays near the core count, so the switch storm never happens. Under a flood of work, tasks pile up in the queue rather than spawning an unbounded swarm of threads — the system degrades by adding latency instead of falling over.
Sizing the pool is the one real decision. For CPU-bound work the right size is close to the number of cores, because more threads than cores just add switching with no extra throughput. For I/O-bound work, where threads spend most of their time parked waiting on the network or disk, you want more threads than cores so the cores stay busy while others wait — but this is exactly the regime where M:N runtimes and async I/O do better, because they get the same effect without a real OS thread sitting blocked for each outstanding request.
Shared memory is the hazard.
Everything good about threads — direct, copy-free access to common state — is also the thing that will hurt you. Because threads share the address space, two of them can read and write the same variable at the same time, and the order in which their individual machine instructions interleave is up to the scheduler, not you. When the result depends on that order, you have a race condition.
The classic example is the one that looks too simple to break. Two threads each run
counter = counter + 1 a million times, starting from zero. You expect two million.
You get less, and a different less each run. The reason is that counter = counter + 1
is not one step. It is read the value into a register, add one, write it back. If thread A reads
41, and thread B reads 41 before A writes, both compute 42, both write 42, and one increment
vanishes. The hardware did nothing wrong; the two threads' instructions simply interleaved in a
way your single-threaded reading of the code never imagined.
The fix is to make the read-modify-write atomic — indivisible from any other thread's point of view — by protecting it with a lock. A thread takes the lock, does its update, releases the lock, and while it holds the lock no other thread can be inside the same critical section. That restores correctness, and it introduces the new costs the rest of systems programming spends its time managing: contention when many threads want the same lock, deadlock when threads wait on each other in a cycle, and the simple slowdown of serialising work that you wrote as parallel. This is a deep enough subject to have its own page — synchronization covers locks, atomics, condition variables, and the memory model in full. The thing to carry away here is narrow and important: the moment two threads touch the same mutable data, that access needs a plan, and "it worked when I tested it" is not one.
Green threads, goroutines, and OS threads.
With the cost model in hand, the lighter-weight thread types make sense as engineering, not magic.
An OS thread is the kernel-scheduled task_struct this page started
with: real parallelism, real preemption, clean priority, and an 8 MB stack plus tens of
microseconds to create. Excellent up to thousands; painful at hundreds of thousands.
A green thread is a user-space thread scheduled by a runtime rather than the kernel, the N:1 model from the table above. It is cheap to create and switch, but classic green threads cannot use more than one core and stall completely on a blocking syscall, which is why most languages moved on from them. A goroutine is the modern answer: an M:N thread that keeps the cheapness of green threads — a stack that starts at about 2 KB and grows on demand, creation in a couple hundred nanoseconds, a switch in tens of nanoseconds — while the Go runtime multiplexes many of them across a small pool of OS threads and converts blocking calls into parking, so a program can hold hundreds of thousands of goroutines on a handful of cores.
| Kind | Scheduled by | Create / switch | Parallel? |
|---|---|---|---|
| OS thread (1:1) | Kernel | ~10–30 μs / ~1 μs | Yes, across cores |
| Green thread (N:1) | Runtime, one OS thread | Cheap / cheap | No, one core only |
| Goroutine (M:N) | Runtime, N OS threads | ~200 ns / ~50 ns | Yes, up to N cores |
The trade is always the same shape. The kernel gives you correctness corners for free — true preemption, signal targeting, scheduling fairness across the whole machine — at a cost the kernel cannot lower much. A user-space runtime gives those corners back in exchange for the small stacks and fast switches it needs to run a problem with a hundred thousand concurrent things in it. Neither is better in the abstract; they are tuned for different scales of concurrency, and a senior engineer's job is to know which scale a given problem lives at.
Where user-space M:N comes back.
Goroutines, async/await, BEAM processes are all M:N. They give up some of what kernel threads provide (clean priority, signal targeting, real preemption from outside the runtime) to get back the things the kernel can't afford to: 2 KB stacks, ~200 ns creation, ~50 ns switch. None of this is magic. They are moving the scheduler into userspace and trading correctness corner cases for throughput.
Go's scheduler is the cleanest production example,
M:P:G in three letters. Rust's Tokio
executor is conceptually similar, on top of the OS thread pool. Both ultimately call
epoll or io_uring for I/O readiness; the magic is in the
scheduler that wakes the right task when bytes arrive.
Common mistakes.
- Spawning a thread per request. Works at low concurrency, falls over at hundreds. Use a worker pool, or move to a non-blocking model.
- Holding a mutex across a syscall. Suddenly your tail latency is disk-bound; everyone else queues. Read what you need, drop the lock, then do I/O.
- Ignoring NUMA on big servers. Threads bouncing between sockets
kill cache locality.
numactl, CPU affinity, or a runtime that respects sockets. - Relying on signal delivery to a specific thread. Linux delivers signals to "any thread that hasn't masked it"; pinning is your job.
Further reading.
- clone(2) — the syscall behind every Linux thread.
- pthreads(7) — overview of POSIX threads on Linux.
- Drepper — Futexes are tricky — Ulrich Drepper's correctness paper on glibc's futex use.
- Drepper — ELF Handling for Thread-Local Storage — how static and dynamic TLS work end to end.
- Kerrisk — TLPI ch. 29–34 — the canonical Linux programming reference on threads, futexes, TLS.
- Semicolony — Out-of-order execution — what each thread runs on. ~600-entry reorder buffer, register renaming, why "context switch cost" is partly a flushed-pipeline cost.
- Semicolony — CPU caches and false sharing
— two threads writing different fields of the same cache line is an 80× slowdown. The Linux
____cacheline_alignedmacro exists for this. - Semicolony — NUMA and memory bandwidth
— a thread that migrates to another socket pays 80% extra latency on every memory access. Why
taskset/numactlexist.