Context switching.

Each CPU core runs one thread at a time. When the OS wants to swap processes, it has to save every CPU register the current one was using, pick the next, load that one's register set back in, and jump. The mechanics are simple. The hidden cost (cold caches, a cold TLB, and cache-line bouncing when a process migrates between cores) is usually bigger than the switch itself.

A is on the CPU

Process A — say, vim — is currently executing. Its program counter, stack pointer, general-purpose registers, and the address-space pointer (CR3 on x86, TTBR on ARM) all point at A's state. B's context lives in B's task struct in kernel memory, dormant. The kernel sets a timer interrupt for somewhere between 1 ms and 10 ms from now.

Context: Everything the CPU needs to resume a process exactly where it left off: all general-purpose and floating-point registers, the program counter, the stack pointer, the page-table base, and a handful of bookkeeping fields.
Quantum / time slice: The maximum chunk of CPU time the scheduler gives one process before considering a switch. Modern Linux sizes it dynamically; 1-10 ms is typical.

What a switch actually costs

The direct work (read 16-30 registers, write 16-30 registers, change CR3) takes a microsecond or two. The indirect cost is bigger and harder to measure: B's working set isn't in the data caches, so the first few thousand memory accesses miss. B's code probably isn't in the instruction cache either. On a CPU without ASID tagging (PCID, on x86), the CR3 write flushes the TLB's non-global entries, so for a while the first touch of each page triggers a page-table walk. Empirically, once these effects are counted, a switch on a modern Linux server burns 3-20 µs of usable CPU time. At 1,000 switches/sec that's 3-20 ms out of every second, or 0.3-2% of the CPU, gone before you do anything useful.

Goroutines and green threads sidestep most of it

A Go goroutine switch costs roughly 200 ns. It doesn't involve the kernel: Go's runtime scheduler swaps the goroutine's stack pointer and program counter in userland. No CR3 change (same process, same page tables), no kernel trap, no IRET. The trade-off: goroutines only run in parallel across CPUs because the runtime multiplexes them onto OS threads, and a blocking syscall in one goroutine blocks the OS thread underneath it. Hence Go's netpoller, which uses non-blocking I/O (epoll on Linux) so a goroutine waiting on the network parks in userland instead of tying up an OS thread.

How to see it in production

Linux exposes per-process switch counts in /proc/<pid>/status (look for voluntary_ctxt_switches and nonvoluntary_ctxt_switches). Voluntary switches happen when the process blocks on I/O, a lock, or a sleep; a high nonvoluntary count means the CPU is saturated and your process keeps getting preempted. perf stat -a -e context-switches measures the system-wide rate. As a rough rule of thumb, at 10k+ switches per second per core latency starts to suffer, and it's worth investigating whether you have too many threads, lock contention causing waits, or undersized timeslices.

Go deeper

Operating Systems Codex

CFS internals, the EEVDF scheduler, preemption rules, real-time scheduling classes, why some workloads pin threads to specific cores.
