How threads work.

Three threads, each adding 1 to a counter that starts at 0. Naive expectation: counter ends at 3. Reality without a lock: anything from 1 to 3 depending on scheduling. The bug isn\'t in any thread\'s code — it\'s in the fact that counter++ is three machine instructions, and the scheduler can stop you between any two of them. Watch the race happen, then watch a mutex fix it.

speed 1.6s

step 1 / 21 · counter = 0 · lost updates 0

phase race version · no lock · counter++ is 3 instructions single CPU core · pre-emptive scheduler · timer interrupts at any instruction boundary

RACE

Three threads · one shared counter · no protection

All three threads will run the same three-line function: load counter into a private register, add 1, store back. Each thread expects to add 1 to counter. If we run them carefully the counter should end at 3. Without coordination, it won't.

Thread: An independent execution flow with its own stack and registers, sharing the heap with siblings in the same process. The OS scheduler switches between them.
Shared memory: Memory accessible from multiple threads. The heap is shared by default; the stack is private per thread.
Race condition: A bug where program correctness depends on the order in which independent operations happen to be scheduled.

Why `counter++` is not one instruction

On x86 a high-level counter++ compiles to either one read-modify-write instruction (inc dword ptr [counter]) or, more commonly, three separate ops: load to register, increment register, store register back. Both forms are NOT atomic by default. The single-instruction form has a window between the load and store at the cache-coherence level. The three-instruction form has a much bigger window because the scheduler can preempt you between any two of them. Either way, two threads doing counter++ concurrently can lose updates.

You can ask for an actually-atomic increment with a lock prefix on x86: lock inc dword ptr [counter]. That tells the CPU to hold the cache line exclusively for the duration. Costs roughly 10-30x a regular store but it\'s the building block under std::atomic<int>::fetch_add, Java\'s AtomicInteger.incrementAndGet, and Go\'s atomic.AddInt64.

Why even a single-core system can race

You might think with one CPU core only one thread runs at a time, so races can\'t happen. They still can. The scheduler can interrupt a thread between any two instructions — that\'s exactly what we showed. T1 loaded, got preempted, T2 ran the whole sequence, T1 resumed with stale data. Multiple cores make this worse (true parallelism, plus cache coherence drama), but the fundamental problem is the gap between observing a value and acting on it.

Mutex implementations · cheaper than you\'d think

A naive mutex would call into the kernel on every acquire and release. That\'s a syscall — hundreds of nanoseconds even when uncontended. Modern mutex implementations (futex on Linux, SRW locks on Windows) are a fast path entirely in userspace using CAS, falling back to the kernel only when contention forces them to. Uncontended std::mutex::lock on x86 is about 25 ns. Contended, you\'re looking at a context switch and a wake-up: 1–5 µs.

Spinlocks skip the kernel entirely — they just CAS in a loop until the lock is free. Good when the critical section is tiny (a few hundred ns), bad when it\'s long. Most language runtimes ship adaptive locks: spin briefly, fall back to blocking. The right choice depends on contention shape.

Things mutexes don\'t fix

Deadlock. Two threads each hold one lock and want the other. Both wait forever. Cure: always acquire locks in a consistent global order. Or use timeouts.
Livelock. Threads keep retrying because they keep stepping on each other. No progress, but no one is blocked. Often happens with naive lock-free designs.
Priority inversion. A low-priority thread holds a lock that a high-priority thread needs. A medium-priority thread runs forever, starving the low one. The high-priority thread waits indefinitely. Mars Pathfinder famously hit this in 1997.
Convoy effect. One slow thread holding the lock causes a backed-up queue. Everyone proceeds at the slow thread\'s pace.
Memory ordering. Mutexes give you mutual exclusion but also implicit memory barriers — atomic primitives like atomic.Store need them too or other cores may see stale caches. Easy to get wrong if you roll your own.

Alternatives to grabbing a lock

Locks are the default tool but they\'re not the only one. Atomic operations (CAS) let you build lock-free counters, queues, and stacks — no thread ever blocks, but the code is harder to reason about. Message passing (channels in Go, actor mailboxes in Erlang/Akka) sidesteps shared state entirely: one thread owns the data, others ask it to do things. Sharding gives each thread its own data; they only coordinate at boundaries. Read-write locks let many readers in but only one writer. RCU (Read-Copy-Update) is what the Linux kernel uses for read-heavy data structures — readers see a consistent snapshot, writers update via a copy that\'s atomically swapped in.

What this visualisation simplifies

Single core. Real systems have many; race conditions get easier to hit, not harder, and cache coherence (MESI protocol) starts to matter.
No false sharing. Two threads writing different fields of the same cache line will hammer each other\'s caches even with no logical contention. A common cause of mysterious slowdowns.
Memory model handwaved. We pretended store-after-store-by-this-thread is visible immediately to other threads. Real CPUs (especially weakly-ordered ones like ARM) need explicit barriers.
Coarse instructions. Real load-inc-store has more substeps (cache line fetch, CAS handshake, store buffer drain). The story is the same: gaps where preemption can occur.

Go deeper

Concurrency in the OS Codex →

Memory models, atomic primitives, CAS-based data structures, deadlock detection, the actual cost of futex vs spinlock vs RW lock, the Linux kernel\'s RCU pattern.

Open the Codex →

Found this useful?