09 / 10

Internals / 09

Synchronization primitives

Two threads, one memory location, and no agreement about who writes first. The result is undefined: sometimes the right number, sometimes nonsense, occasionally a crash on a CPU you don't own. Every synchronization primitive in every language is some way of forcing those two threads into an order: a hardware atomic that does the read-modify-write in one shot, a lock that puts the loser to sleep, or a memory-model rule that says what the other CPU can see.

The problem — a race nobody scheduled.

Two threads execute counter++ at the same time. The CPU compiles that into three instructions: load the value into a register, add one, store it back. If both threads happen to load before either stores, both add one to zero, both write one back, and the counter that should now be two is one. The lost update is invisible: no exception, no signal, just a wrong number that surfaces hours later as a leaked file handle or a double-spent dollar.

This is a data race, and the C and C++ standards explicitly call its behaviour undefined. The compiler is allowed to assume it can't happen and optimise accordingly, which means in practice the symptoms can be far stranger than a lost increment. The only fix is to give the threads an ordering. Either you serialise them so one runs the read-modify-write before the other starts (a lock), or you use a single hardware instruction that the CPU guarantees is atomic with respect to other cores (an atomic operation). Every primitive in this article is one of those two things, or a higher-level structure built on top.

The window where the trouble lives has a name: the critical section. It is the stretch of code that reads and writes shared state and would break if two threads ran it at once. The whole discipline of synchronization is the work of drawing that boundary correctly. Draw it too wide and you serialise work that could have run in parallel, which throws away the cores you paid for. Draw it too narrow and you leave an invariant exposed: a moment where a data structure is half-updated and another thread can see the seam. The rule that makes a critical section work is mutual exclusion, at most one thread inside at a time, but the rule you actually care about is the invariant the section protects. A lock is not the goal; the consistent view of memory it buys you is the goal.

A correct critical section also needs two properties beyond exclusion. It needs progress: if no one holds the lock and someone wants it, one of the contenders gets in, so the section can't sit empty while threads wait outside. And it needs bounded waiting: a thread that asks for the lock eventually gets it, rather than being passed over forever while others cut the line. Most of the failure modes later in this page, starvation and livelock and lock convoys, are violations of one of these two, not of exclusion itself. Exclusion is the easy part. Fairness is where the bodies are buried.

Atomic operations — the hardware floor.

Every modern CPU offers a small set of instructions that the cache-coherence protocol treats as indivisible. On x86 they're the LOCK-prefixed forms: LOCK CMPXCHG (compare-and-swap), LOCK XADD (fetch-and-add), LOCK XCHG (atomic swap). On ARM and RISC-V the pattern is different: a paired load-linked / store-conditional (LL/SC) where the store only succeeds if no one else touched the line since the load.

Compare-and-swap is the foundational one. CAS reads a memory location, compares it to an expected value, and writes a new value only if the comparison succeeded — all as one observable step. Build a counter by reading the current value, computing value+1, and CAS-ing it back; if someone else won the race, your CAS fails, you reload and try again. Fetch-and-add is the unconditional sibling. It returns the old value and add a delta atomically. Both are single instructions, no kernel involved, and cost roughly 5–50 ns on a hot cache line; the cost balloons to hundreds of nanoseconds when the line bounces between sockets.

Atomics are the floor. std::atomic, Java's AtomicLong, Go's sync/atomic (covered in the Go sync primitives deep dive), and Rust's std::sync::atomic all compile down to these instructions. Every mutex, every channel, every garbage collector's write barrier ends up here.

The mental model worth keeping is the CAS retry loop. You read the current value, compute what you want it to become, and ask the hardware to write the new value only if the location still holds the old one. If another core changed it in between, your compare fails, you discard the work, and you loop. Under light contention almost every attempt succeeds on the first try and the loop is invisible. Under heavy contention the loop spins, the cache line ping-pongs, and the same operation that cost 10 ns idle now costs hundreds. That is the first hint of a theme this page keeps returning to: the instruction is cheap, the contention is not.

Compare-and-swap as a loop. Lock-free counters, Treiber's stack, and most mutex fast paths are this shape underneath.

Mutex — mutual exclusion, the workhorse.

A mutex guarantees that at most one thread holds it at a time; everyone else waits. The implementation has two paths. The fast path is a single atomic CAS on a lock word. If the lock is free, you flip it from 0 to 1 and proceed; if it's held, you fall through. Uncontended pthread_mutex_lock on Linux costs roughly 20–25 ns, almost entirely the CAS itself.

The slow path kicks in when the CAS fails. The thread parks itself on a kernel wait queue via a futex(FUTEX_WAIT) syscall and stays asleep until the unlocker issues futex(FUTEX_WAKE). The full contended round-trip is in the low microseconds, orders of magnitude more than the fast path, but the thread isn't burning CPU while it waits. Every modern Linux mutex (pthread_mutex_t, Java's synchronized, Go's sync.Mutex) follows this two-tier shape, and it's the right default for almost everything.

Spinlock — busy-wait, dangerously.

A spinlock skips the slow path entirely. Acquire is a CAS in a loop: if the lock is held, spin on a read of the lock word until it changes, then try the CAS again. No syscall, no sleep, no context switch, which is exactly what you want when the critical section is a handful of nanoseconds and the cost of going through the scheduler would dwarf the wait.

In the kernel, spinlocks are everywhere. They're the only choice in interrupt handlers and other contexts where sleeping isn't allowed. Naive test-and-set spinlocks suffer cache storms and unfairness, so production implementations use ticket locks (everyone takes a number) or MCS / queued locks (each waiter spins on a local cache line). Linux's qspinlock is an MCS variant tuned for the kernel.

Why spinlocks are dangerous in userspace. A userspace thread holding a spinlock can be preempted by the scheduler at any instruction. If that happens, every other thread spinning on the same lock burns a full timeslice, millions of cycles, for nothing, waiting for a runqueue decision they can't see. This is the lock holder preemption problem, and it's why blanket use of spinlocks in userspace usually makes throughput worse than a mutex. Reserve them for the cases where you control scheduling (pinned threads, real-time priority) or where the critical section is provably nanoseconds.

The choice between spinning and blocking comes down to one comparison: how long do you expect to wait, against how much it costs to go to sleep and be woken. A context switch is on the order of a microsecond once you count the syscall, the scheduler, and the cold cache the thread wakes up to. If the lock will free in less time than that, spinning wins, because you would spend more cycles sleeping than you would ever spend waiting. If the lock might be held for milliseconds, spinning is pure waste: you burn a whole core doing nothing while the holder does I/O or gets descheduled. Blocking wins there by a wide margin, because a parked thread costs nothing and frees the core for real work.

Real implementations refuse to pick one and instead do both. An adaptive mutex spins for a short, bounded number of iterations on the assumption the holder is on another core and about to release, and only then falls through to a blocking futex(FUTEX_WAIT). Glibc's pthread_mutex in adaptive mode, Go's sync.Mutex (which spins a few times before parking), and the Linux kernel's mutex all use this hybrid. The short spin catches the common case where the section is tiny; the fallback to sleep catches the case where it is not. You get the cheap path for cheap waits and you stop wasting a core on long ones.

Spin below a context switch, block above it. Adaptive mutexes straddle the line so you don't have to guess.

Reader-writer locks — many readers, one writer.

When the workload is read-heavy and writes are rare, a plain mutex serialises readers that didn't need to be serialised. A reader-writer lock (pthread_rwlock_t, Go's sync.RWMutex, Java's ReentrantReadWriteLock) splits the contract: any number of readers may hold the lock simultaneously, but a writer needs exclusive access and blocks until all readers drain.

The cost is in the bias. Reader-preferring rwlocks let new readers in even when a writer is waiting, which is great for throughput and can starve the writer indefinitely under heavy read load. Writer-preferring variants block new readers once a writer is queued, which is fairer but kills read throughput during a long writer queue. Most production rwlocks also cost more on the uncontended path than a plain mutex (two atomics instead of one), so the break-even is "lots of readers, long critical sections, rare writes", narrower than people think.

Condition variables — wait until something is true.

A mutex protects shared state; a condition variable lets you wait for that state to reach a particular shape. The pattern is paired: the waiter holds the mutex, checks a predicate, and if it isn't satisfied calls cond_wait() which atomically releases the mutex and parks the thread. A producer changes the state, then calls cond_signal() or cond_broadcast() to wake one or all waiters.

Two non-obvious rules. First, spurious wakeups are real. POSIX and Java both explicitly allow cond_wait to return without any signal being sent. The cause is usually the futex or kernel wait queue waking more threads than strictly necessary. The fix is always the same: check the predicate in a while loop, never an if. Second, the mutex must be held across the predicate check, not just the wait. Otherwise the producer can change the state and signal in the gap, and your waiter sleeps through the notification it was supposed to receive.

The reason the mutex and the wait have to be fused into one step is the heart of the whole pattern. Picture the gap if they were separate: the waiter checks the predicate, finds it false, releases the mutex, and then, before it can call wait, the producer runs, sets the state, and signals. The signal goes to an empty wait queue. The waiter, oblivious, now parks itself and sleeps forever on a condition that already came true. This is the lost wakeup, and the only defence is that cond_wait releases the mutex and enrolls on the wait queue as a single atomic operation, so the producer cannot squeeze in between. The producer, in turn, must hold the same mutex while changing the state, which is what keeps it from racing past the check.

signal wakes one waiter; broadcast wakes them all. Reach for broadcast when the state change might satisfy several distinct waiters with different predicates, or when you are not sure exactly who should wake. Reach for signal when the resource is a single unit and waking more than one waiter just means the extras wake, re-check, find nothing, and go back to sleep, which wastes cycles and, in the worst case, creates a lock convoy as they all stampede the mutex at once. The defensive rule covers either way: the predicate is re-checked on wakeup, so a broadcast that wakes too many is merely wasteful, never wrong.

The wait/signal pattern. Always check the predicate in a while loop, and always hold the mutex while you change the state and signal.

Semaphores — a counter with wait and signal.

A semaphore is a non-negative integer with two atomic operations: wait (decrement, blocking if the value is zero) and signal (increment, possibly waking a waiter). Dijkstra invented them in 1965 and the abstraction has aged remarkably well. A binary semaphore (initialised to 1) is functionally a mutex; counted semaphores generalise to anything that needs an integer's worth of capacity.

The production use cases are concrete. A bounded queue uses one semaphore to count available slots and another to count items present; producers wait on slots, consumers wait on items, both signal the other after the operation. A throttle caps concurrent operations. Initialise to the limit, acquire before the operation, release after; this is exactly Go's idiomatic channel-based worker-pool pattern. A resource pool (database connections, file handles) uses a semaphore counting available resources.

The Linux futex — fast userspace, slow kernel.

Before futex (Hubertus Franke and Rusty Russell, kernel 2.5.7, 2002), every pthread_mutex_lock on Linux was a syscall in the uncontended case, hundreds of nanoseconds for a flag flip that didn't need the kernel at all. Futex split the problem: the lock word lives in userspace and the common case is a plain atomic, while the kernel is only consulted when a thread actually needs to sleep or wake.

The interface is two operations. FUTEX_WAIT(addr, expected) atomically checks that *addr == expected and if so puts the calling thread to sleep on a kernel hash table keyed on addr. FUTEX_WAKE(addr, n) wakes up to n threads sleeping on that key. There is no kernel-side lock object at all in the uncontended case. The kernel hash bucket exists only while threads are queued.

Futex fast path is ~10 ns. An uncontended lock acquire is a single LOCK CMPXCHG on a userspace word: no syscall, no context switch, ~10 ns on a modern x86. Only when the CAS fails does the slow path issue FUTEX_WAIT, which takes a few microseconds for the syscall plus the scheduler round-trip. Glibc's pthread_mutex_t, Go's sync.Mutex, Java's lock optimisations (LockSupport.park), and Rust's std::sync::Mutex all build on this exact pattern. Long stop-the-world pauses (see Go's GC deep dive) interact directly with futex sleep queues, because a paused mutator holding a lock will leave every waiter on its futex wait list until the world restarts.

/* Atomic compare-and-swap on a counter — no lock needed. */
#include <stdatomic.h>

atomic_int n = 0;

void increment(void) {
    int old, new;
    do {
        old = atomic_load(&n);
        new = old + 1;
    } while (!atomic_compare_exchange_weak(&n, &old, new));
}

/* Or, when a critical section is wider than one word, a mutex. */
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long balance = 0;

void transfer(long delta) {
    pthread_mutex_lock(&m);     /* fast path: 1 CAS, ~20 ns */
    balance += delta;           /* contended path: futex sleep */
    pthread_mutex_unlock(&m);
}

Memory models — what the other CPU sees.

A mutex is half the story. The other half is the memory model: the rules about which writes from one core become visible to other cores, and in what order. Modern CPUs reorder loads and stores aggressively, and so do compilers. Without explicit ordering, a thread that "obviously" sees the data write before the flag write may in fact see the flag flip while the data write is still sitting in another core's store buffer.

C11 and C++11 std::atomic exposes three ordering tiers. Sequentially consistent (the default, memory_order_seq_cst) gives the strongest guarantee, since all threads observe all sequentially-consistent operations in a single total order, and the highest cost, since x86 needs an MFENCE and ARM needs a full DMB ISH. Acquire-release (memory_order_acquire on loads, memory_order_release on stores) pairs a load with a store and guarantees that everything before the release is visible after the acquire. This is the classic flag-publishing pattern, cheap on x86 (free on loads, a single store with implicit ordering). Relaxed (memory_order_relaxed) gives atomicity but no ordering at all; useful for counters where nobody else's view of order matters.

Why this matters on real hardware, and not just in the standard, comes down to the store buffer. When a core writes, the value lands first in a small per-core buffer and drains to the shared cache later. A second core reading the same addresses can see the writes drain in a different order than the first core issued them. The canonical demonstration is two flags: thread A sets x = 1 then reads y; thread B sets y = 1 then reads x. With plain stores, both can read zero, because each write is still parked in its own core buffer when the other reads. No interleaving of the source lines produces that result, yet the hardware does. A fence, or the implicit ordering an acquire-release pair carries, forces the buffer to drain at the right moment and rules the anomaly out.

Store buffers let writes become visible out of order. Memory ordering, fences and acquire-release, is the contract that tames it.

Java has the JMM, formalised in JSR-133. Go has its own memory model (the deep dive walks through it) which, since Go 1.19, follows the same acquire-release semantics as C++. The high-order bit is the same in every language: if multiple threads touch a variable and at least one writes, either use a lock or use an atomic with explicit ordering. Plain reads and writes are not a contract.

Lock-free and wait-free — non-blocking, but hard.

A lock-free algorithm guarantees that at least one thread always makes progress, even if others stall or are preempted. Wait-free is stronger: every thread makes progress in a bounded number of steps. Both ban locks and replace them with atomics, usually CAS in a retry loop.

The classic examples are small and famous. Treiber's stack (1986) is a lock-free LIFO: push CAS-es a new node into the head pointer, pop CAS-es the head pointer to next. Michael and Scott's queue (1996) is a lock-free FIFO using two CAS-es per enqueue. Both fit on a page; both are foundational.

Both are also much harder to get right than they look. The ABA problem, where a pointer reads A, gets preempted, the value changes to B and back to A, then the CAS succeeds against stale logic, bites every CAS-based structure and is usually patched with version counters or hazard pointers. Memory reclamation (when can you free a node another thread might still be reading?) is the real research problem. And the performance is often disappointing in practice: cache-line bouncing on the head pointer under contention can be slower than a well-tuned mutex. Reach for lock-free when you've measured the lock is the bottleneck and you've ruled out partitioning the data first.

Deadlock, four conditions, one cure.

Deadlock is the failure where a set of threads are all blocked forever, each waiting on a resource another holds. The textbook case is two locks and two threads: thread one grabs lock A and reaches for B, thread two grabs lock B and reaches for A, and neither will ever let go. Coffman's 1971 result says a deadlock needs four conditions to hold at once, and the practical value of the list is that breaking any single one of them makes deadlock impossible.

Mutual exclusion. The resource can't be shared; only one holder at a time. This is the property a lock exists to provide, so you rarely give it up.
Hold and wait. A thread keeps the locks it already has while blocking for more. Break it by taking all locks at once, or by releasing everything and retrying if you can't get them all.
No preemption. A lock can't be yanked away from its holder. Break it with try-lock plus back-off, so a thread that can't get the second lock drops the first and starts over.
Circular wait. There's a cycle in the "who waits for whom" graph. Break it by imposing a global lock order: every thread acquires locks in the same fixed sequence, so a cycle can never form.

The two strategies in production are prevention and avoidance. Prevention is static: pick one of the four conditions and design so it can never hold. The most common is the global lock order, breaking circular wait, because it costs nothing at runtime and is easy to audit. Address-ordered locking, where you always lock the lower-addressed object first, is the trick that makes operations like "transfer between two accounts" safe without thinking about which account is which. Avoidance is dynamic: the system tracks who holds what and refuses any acquisition that could lead to an unsafe state, the way Dijkstra's banker's algorithm does. Avoidance is rare outside textbooks because it needs to know every thread's future resource demands in advance, which real programs don't supply.

Detection is the third option: let deadlock happen, notice the cycle in the wait-for graph, and recover by killing or rolling back a victim. Databases do exactly this, since transactions take locks in an order nobody can predict, and aborting one deadlocked transaction is cheaper than forcing a global order on every query. The mutex and deadlock simulator lets you step two threads through the classic AB/BA cycle and watch the wait-for graph close, which makes the four conditions concrete in a way prose can't.

Livelock, busy but going nowhere.

Livelock is deadlock's noisier cousin. The threads aren't blocked, they're running, but they make no progress because each keeps reacting to the others. The hallway analogy is exact: two people meet, both step left to pass, both step right, both step left again, and they dance in place without ever getting past each other. A common software version is a naive try-lock back-off where every contender, on failing to get the second lock, releases and retries on the same schedule, so they collide again on the next round, and the next.

The cure is to break the symmetry. Add randomised back-off so the retries spread out in time and one thread wins, the way Ethernet's exponential back-off resolves collisions on a shared wire. Or impose an ordering, the same global lock order that kills circular wait, so there's a defined winner rather than two equals fighting. Livelock is harder to spot than deadlock because the process looks alive, CPU is busy and the thread is scheduled, yet the useful-work counter never moves. When a service is pinned at full CPU and throughput is zero, livelock is high on the suspect list.

The cost of contention, why locks aren't free.

An uncontended lock is cheap, tens of nanoseconds, almost all of it the atomic. The price appears when threads actually fight for it, and the cause is the cache-coherence protocol rather than the lock logic. A lock word lives in one 64-byte cache line. To acquire the lock, a core must own that line in exclusive state, which means invalidating every other core's copy and pulling the line across the interconnect. Under contention the line bounces from core to core on every acquire and release, and each bounce is a coherence transaction measured in hundreds of nanoseconds, climbing into the microseconds when the cores sit on different sockets.

This is why contention scales so badly. Add more threads to a hot lock and you don't just wait longer in the queue; you make every acquire more expensive, because the line is colder each time it comes back. Amdahl's law puts a ceiling on it: if a fixed fraction of the work must run inside one critical section, that fraction caps your speedup no matter how many cores you throw at the problem. The fixes are all about avoiding the fight rather than winning it faster. Shard the lock so different keys map to different locks and rarely collide. Partition the data so each thread owns a slice and needs no lock at all. Use per-CPU structures and aggregate lazily. Replace a read-mostly lock with an RCU scheme where readers take no lock and writers publish a new version. The fastest lock is the one you never have to take, and most real concurrency wins come from removing contention, not from a cleverer primitive.

Common bugs — the catalogue.

Synchronization has its own zoo of pathologies. Deadlock is the canonical one: two threads each holding a lock the other wants. The standard prevention is a global lock order; the standard detection is a wait-for graph in something like pthread_mutex's PTHREAD_MUTEX_ERRORCHECK mode or Go's race detector.

Priority inversion happens when a low-priority thread holds a lock that a high-priority thread needs, while a medium-priority thread keeps preempting the lock holder. The high-priority thread waits indefinitely on a chain it can't see. Mars Pathfinder famously hit this in 1997 on Mars; the fix is priority inheritance, which temporarily boosts the lock holder to the waiter's priority. Lock convoy is the contention pattern where many threads wake on a single signal, rush the lock, and serialise into a procession, a problem condition variables make worse with broadcast.

False sharing is the hardware-level bug: two variables that no thread actually shares happen to land on the same 64-byte cache line, and writes from different cores bounce the line back and forth as if there were real contention. The fix is to pad hot variables to cache-line size: Go's runtime.padding, Java's @Contended, C++'s alignas(std::hardware_destructive_interference_size). ABA in CAS we covered above. The tooling landscape is small but essential: Valgrind's Helgrind and DRD for happens-before analysis, ThreadSanitizer (TSan) for production-grade race detection and the default behind go test -race and Clang's -fsanitize=thread.

Primitive	Cost (uncontended)	Sleeps?	Use when
Atomic CAS / FAA	~5–50 ns	no	counters, flags, lock-free structures
Mutex (futex-based)	~20–25 ns	on contention	default for most critical sections
Spinlock	~10 ns	never	kernel, IRQ handlers, ns-scale sections
Reader-writer lock	~40–60 ns	on contention	read-heavy, long sections, rare writes
Condition variable	µs (signal + wake)	yes	wait for a predicate to become true
Semaphore	~30 ns + futex on 0	on zero	bounded queues, throttles, resource pools