Goroutine Scheduler: G → P → M, live.

Goroutines (G) run on processors (P), backed by OS threads (M). Each P has a local run queue; idle Ps steal from busy ones.

Simulator snapshot: 12 goroutines in total, 0 of 4 Ps running.

Processors (P) and their run queues
P0 (idle), queue of 3: G1, G5, G9
P1 (idle), queue of 3: G2, G6, G10
P2 (idle), queue of 3: G3, G7, G11
P3 (idle), queue of 3: G4, G8, G12

What you're looking at

Four cards, one per processor (P). The green chip at the top of each card is the goroutine that P is running right now; the grey chips below it are that P's local run queue, waiting their turn. Each press of Tick advances the scheduler one step. A running goroutine either keeps going, yields back to its queue, or blocks and leaves the board entirely. The counters up top track total goroutines and how many of the four Ps are busy.

Press Seed 12, then Tick repeatedly and watch whichever P drains first. The moment a P has nothing running and an empty queue, it steals half of the busiest neighbour's queue: not one goroutine, not all of them, half. You will see the chips jump columns. The other thing to notice is what's missing. Blocked goroutines vanish from every queue, so the G total holds steady while the visible chips thin out. A parked goroutine sits in no queue and costs the scheduler nothing until it wakes.

What is the Go goroutine scheduler?

A million connections, eight cores.

The Go runtime scheduler multiplexes millions of goroutines onto a small number of OS threads using the GMP model — G (goroutine), M (OS thread, machine), P (logical processor, the scheduling unit). Dmitry Vyukov redesigned the scheduler in 2012; Go 1.14 (2020) added asynchronous preemption. Together, GMP and async preemption are why Go hits a million open connections per process without the configuration anguish other runtimes ask for.

Imagine you're writing a chat server. Every connected user gets a thread that waits for their next message and writes back replies. With ten users, that's ten threads — fine. With ten thousand, you're already past the point where the operating system is happy: each OS thread reserves an 8 MB virtual stack and a kernel scheduling slot, and the context switches between them start eating real CPU. With a million users, the math is absurd: 8 TB of address space and a kernel that spends more time switching than running.

The first instinct, “one thread per request”, doesn't scale because operating-system threads were designed for tens or hundreds, not millions. The kernel wasn't built to schedule a million customers; it was built to schedule a desktop. When Dan Kegel coined the term C10K in 1999 to describe the challenge of holding ten thousand TCP connections on one box, the standard answer was an event loop in C: select or poll (and, from 2002, epoll), callbacks, manual state machines. Fast, but the code was painful: every blocking call had to be rewritten as a continuation.

Go's bet was that you should keep writing code as if every connection had its own thread, and have the language runtime do the bookkeeping behind the scenes. A goroutine looks like a thread to the programmer — you spawn one with go f(), you block it on a channel, you read from a socket — but it isn't an OS thread. It's a 2 KB stack and a tiny struct. Behind the scenes, a small fixed-size pool of real OS threads (one per CPU core) takes turns running whatever goroutine has work to do. When a goroutine blocks on I/O, the runtime parks it and runs another; when the I/O completes, the runtime wakes it up.

This is the M:N scheduling trick: M goroutines run on N OS threads, where N is roughly your core count. A goroutine costs about 2 KB and 200 ns to spawn. An OS thread costs 8 MB and around 50 µs. The ratio is roughly four thousand to one in memory (8 MB / 2 KB = 4,096) and two hundred and fifty to one in spawn time (50 µs / 200 ns = 250), so a single Go process on an eight-core box happily holds a million parked goroutines waiting for sockets. Discord, Cloudflare, Uber, and the entire Kubernetes control plane lean on exactly this property.
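To make the "many parked goroutines, few threads" claim concrete, here is a minimal sketch that parks a large number of goroutines on one channel and then asks the runtime how many goroutines and Ps exist. The helper name parkMany and the count of 100,000 are illustrative choices, not anything from the runtime.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkMany starts n goroutines that all block on the same channel. It returns
// the goroutine count reported by the runtime once all of them have started,
// plus a release function that wakes them and waits for them to exit.
func parkMany(n int) (running int, release func()) {
	gate := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-gate // blocked here the goroutine holds no P and no M
		}()
	}
	started.Wait()
	return runtime.NumGoroutine(), func() { close(gate); done.Wait() }
}

func main() {
	running, release := parkMany(100_000)
	fmt.Printf("goroutines: %d, Ps: %d\n", running, runtime.GOMAXPROCS(0))
	release()
}
```

The goroutine count is five orders of magnitude larger than the P count, yet the program idles at essentially zero CPU while the goroutines are parked.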

The simulator above visualises the runtime's three core actors: G (a goroutine), M (a real OS thread), P (a logical processor that owns a run queue). A G runs on an M while the M holds a P. When a P's queue empties, an idle M steals work from a busy peer's queue. That's the entire fast path — and it's why a Go server can sleep nearly free between bursts of work.

[Figure: the M:N model. In user space, goroutines G1–G12 (about 2 KB each, any number at any time) are multiplexed onto GOMAXPROCS OS threads in the kernel, one per core: M0 (core 0) running G2, M1 (core 1) running G6, M2 (core 2) running G10, M3 (core 3) running G12. Parked goroutines cost almost nothing.]

Origins — from green threads to GMP

From green threads to GMP.

The idea that a language runtime can multiplex many user-space tasks onto a small pool of operating-system threads predates Go by decades. Sun's original 1995 Java threading model, Solaris LWPs, the GNU Pth library, and Erlang's BEAM scheduler all explored variations of M:N scheduling — many user threads onto N kernel threads. Java abandoned M:N in 2002 and reverted to 1:1 (every java.lang.Thread is an OS thread) because the JVM's monitors and JNI made user-space scheduling fragile. Go made the opposite bet, and got it right by design rather than retrofit.

Go 1.0 shipped in March 2012 with a primitive scheduler: a single global run queue protected by a global lock. Every M wanting work took the lock, dequeued a G, and ran it. With four cores this scaled adequately; with sixteen cores the lock became the bottleneck and benchmarks showed catastrophic contention. Dmitry Vyukov — a runtime engineer Google hired specifically for concurrency — wrote a design document in 2012 titled “Scalable Go Scheduler Design Doc” proposing a per-P local queue plus work-stealing, modeled on the Cilk research scheduler from MIT (Blumofe and Leiserson, 1994).

The original 1.0 scheduler was honest about its limits. Russ Cox's commit message and the early language FAQ explicitly listed the scheduler as “temporary”. Production users were already piling pressure on it: Docker (born 2013) used goroutines for every container probe; CockroachDB and InfluxDB built their internals on assumed cheap concurrency. The pre-1.1 scheduler couldn't service those workloads. That market pressure, more than any benchmark, is why Vyukov's design landed barely a year after 1.0's release — an unusually fast turnaround for a runtime change in any language.

The new scheduler shipped in Go 1.1 (May 2013). Three roles: G for goroutine, M for machine (an OS thread), P for processor (a logical scheduling context). A G runs on an M while the M holds a P. Each P has its own runnable queue, so the common path of dequeue-and-run is lock-free. Idle Ps steal work from busy ones; goroutines blocked on syscalls leave their P behind for another M to grab. The change moved Go from “decent on a laptop” to “competitive with hand-tuned C” on 64-core machines, and almost every concurrency improvement since (preemption, work conservation, per-P timers) has been incremental on top of Vyukov's design. NUMA awareness is the notable item still missing.

The closest contemporary relative is Erlang's BEAM scheduler (Joe Armstrong et al., from 1986 onward). BEAM also runs many lightweight processes per OS thread, also uses per-scheduler run queues, also work-steals. The differences are philosophical: BEAM's processes have isolated heaps and communicate solely by copy-message, which makes preemption trivial; Go's goroutines share memory, which made preemption hard until Go 1.14. Erlang treats processes as failure boundaries, supervised in a tree; Go treats goroutines as cheap routines, assumed not to crash. Both shapes work; they fit different application archetypes.

What Vyukov got right that previous M:N schedulers, including Java's pre-2002 attempt, got wrong: handling blocking syscalls cleanly. The runtime detaches a syscall-bound M from its P after a brief threshold, lets another M grab the P and continue scheduling, and gives the original M a P again when the syscall returns. Without that, a single read(2) on a slow disk would freeze 1/GOMAXPROCS of the application's parallelism for as long as the call lasted. Earlier attempts at M:N foundered exactly here; Go's clean separation between scheduler state (P) and kernel state (M) is the structural reason it works.


The GMP model — Goroutine, Machine, Processor

Three letters, one runtime.

A G is a goroutine: a stack, a program counter, scheduler bookkeeping. The struct lives in runtime/runtime2.go and weighs a few hundred bytes before the stack itself. Early stacks were segmented and, by Go 1.2, started at 8 KB; Go 1.3 (June 2014) replaced segmented stacks with contiguous stacks that copy on grow, and Go 1.4 (December 2014) dropped the initial size to 2 KB. A million goroutines therefore cost roughly 2 GB of stack plus a few hundred megabytes of bookkeeping, comfortably within the memory of a modern server, while a million OS threads at 8 MB each would demand 8 TB of address space.

Segmented stacks — the pre-1.4 design — allocated a small initial stack and chained additional segments on overflow via a special prologue. Cheap to allocate, but the “hot split” problem haunted them: a function that crossed a segment boundary every call paid a heap allocation for the new segment, used it briefly, freed it on return, and re-allocated on the next call. Microbenchmarks like a recursive Fibonacci with deep recursion thrashed at the segment boundary. Contiguous stacks, by contrast, allocate a slab and copy on grow; the copy cost is paid once per stack lifetime, not once per cross-boundary call. The change made Go's tight inner loops competitive with C for the first time.

An M is a machine: an OS thread, created via clone(2) on Linux, pthread_create elsewhere. The runtime maintains a pool of Ms and creates new ones on demand — default cap is 10,000, configurable via debug.SetMaxThreads. An M is expensive: a kernel TID, an 8 MB virtual stack, scheduler entries. The runtime tries to keep the count of busy Ms equal to GOMAXPROCS; idle Ms park on a futex.

A P is a processor: a logical scheduling context, holding the local run queue, a small cache for the memory allocator (mcache), and a deferred-function pool. There are exactly GOMAXPROCS Ps; the count is set at startup and rarely changed. Each P's local run queue is a 256-slot circular buffer; overflow spills onto the global run queue, which the scheduler periodically drains for fairness. To run, a goroutine needs an M holding a P. To block on a syscall, the M detaches its P (so another M can grab it) and waits in the kernel; on return, the M tries to reacquire any free P, otherwise it parks the G on the global queue.

GOMAXPROCS defaults to runtime.NumCPU() since Go 1.5 (August 2015) — before that, the default was 1, an embarrassing footgun for new users. Inside a container with a CPU limit lower than the host, GOMAXPROCS still saw the host count until Uber's automaxprocs library and finally Go 1.25 (2025) made cgroup-aware GOMAXPROCS the default. The lesson: a wrong GOMAXPROCS doesn't crash; it silently throttles. A 4-vCPU container running a Go program with GOMAXPROCS=64 (because the host has 64 cores) thrashes between Ps that the kernel refuses to schedule simultaneously, and benchmarks degrade by 30–60%.
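A quick way to see what the scheduler thinks it has to work with is to print both numbers at startup. This is a small sketch; the helper name schedulerWidth is made up for illustration. Note that runtime.NumCPU reports the CPUs usable by the process, not a cgroup CPU quota, so on older runtimes a container limit has to be applied explicitly, for example through the GOMAXPROCS environment variable.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerWidth reports the CPU count visible to the process and the number
// of Ps. GOMAXPROCS(0) only queries the current setting; it changes nothing.
func schedulerWidth() (cpus, procs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := schedulerWidth()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", cpus, procs)
}
```

Logging these two values once per process start is cheap and makes a silently oversized GOMAXPROCS visible in the first line of the logs.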

Stack growth deserves its own paragraph. Each goroutine begins with 2 KB; on entry to a function the prologue compares the stack pointer against the goroutine's stack guard; if the new frame would not fit, the runtime allocates a stack twice as large, copies the old contents, fixes up pointers, and resumes. The copy is cheap in absolute terms (a few microseconds for typical frames) but costly when a goroutine repeatedly grows and shrinks, a phenomenon called stack thrashing that early Go suffered from before the contiguous-stack rewrite. Modern Go retains a per-goroutine high-water mark to avoid shrinking aggressively, so the steady-state cost is a handful of doublings per stack lifetime.
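Stack growth is easy to observe indirectly: a recursion far deeper than a couple of kilobytes of stack could hold simply works, because the runtime keeps reallocating the stack underneath it. The sketch below assumes a 64-bit platform; sumTo and deepCall are illustrative names.

```go
package main

import "fmt"

// sumTo recurses n levels deep. Go does not eliminate the recursive call,
// so every level adds a frame and the stack has to grow as n increases.
func sumTo(n int64) int64 {
	if n == 0 {
		return 0
	}
	return n + sumTo(n-1)
}

// deepCall runs sumTo on a fresh goroutine, which starts with a small stack.
func deepCall(n int64) int64 {
	out := make(chan int64)
	go func() { out <- sumTo(n) }()
	return <-out
}

func main() {
	fmt.Println(deepCall(100_000)) // 5000050000
}
```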

A practical implication: pointer fix-up during stack copy is the reason Go cannot expose stack pointers to C code without ceremony. Cgo wraps every Go-to-C call in a thin shim that switches to a fixed C stack (a separately-allocated 8 MB block), preventing the moving Go stack from invalidating any C-side pointer. The shim costs roughly 50–200 ns per call, dwarfing native function-call overhead and explaining why CGO calls are expensive. For hot inner loops, the rule is “avoid CGO”; for occasional library calls, the cost is invisible. A related tool for thread-affine C APIs is runtime.LockOSThread, which pins a goroutine to a specific M until a matching runtime.UnlockOSThread (or, if it never unlocks, until the goroutine exits). It is necessary for OpenGL, OS event loops, and a few other thread-affine APIs.
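The usual shape of thread pinning is a dedicated goroutine that locks itself to its thread, does the thread-affine work, and unlocks on the way out. A minimal sketch, with onLockedThread as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"runtime"
)

// onLockedThread runs fn on a goroutine wired to a single OS thread for the
// duration of the call, then releases the thread back to the scheduler.
func onLockedThread(fn func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		fn()
	}()
	<-done
}

func main() {
	onLockedThread(func() {
		fmt.Println("running on a pinned OS thread")
	})
}
```

While the goroutine is locked, no other goroutine runs on that thread, so long-lived pinned goroutines effectively cost one M each.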

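[Figure: the GMP model. Goroutines (G) bind to a processor (P) held by a machine (M). P0, held by OS thread M0, has G1–G4 in its local run queue; P1, held by OS thread M1, is empty and steals half. The global queue (G7, G8) serves fairness and overflow and is consulted on every 61st scheduling round.]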

Work stealing — when an idle P refuses to sleep

An idle P refuses to sleep.

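The fast path of the Go scheduler is brutally short. Pop the head of the local run queue; run that G; when it yields, put it back on a run queue, and when it blocks, park it and pop the next one. Almost no synchronisation: the local queue is a fixed-size ring with a single producer and multiple consumers. The producer (the owning P) appends with a plain write followed by a store-release of the tail, with no lock and no compare-and-swap. Only the consuming end, where the owner's pop can race with a thief, advances the head with an atomic compare-and-swap.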

There is also a runnext slot per P: a single-G fast slot for a freshly spawned or freshly readied goroutine, scheduled in preference to anything else in the local queue. The pattern go work(); <-done (spawn a goroutine and immediately wait for it) is the canonical case. Because the spawned G goes into runnext, the P picks it up as soon as the spawning G deschedules on the channel receive. No queue manipulation and little cache traffic, just a hand-off on the same P. At roughly 200 ns per spawn, microbenchmarks of go f(); wait sustain on the order of five million spawns per second per core, helped by this single-slot optimisation.

When a P's local queue empties, the scheduler enters findRunnable. Its checks, in order: the local queue (empty by definition); the global run queue; the network poller (non-blocking probe); a victim P chosen at random. From the victim, the stealer takes half of the victim's runnable Gs: not one, not all, half. Half is the ratio that limits both starvation and ping-pong: stealing one means the next steal is imminent; stealing all means the victim immediately becomes the new thief. The Cilk work of Blumofe and Leiserson proved that randomised work stealing stays within a constant factor of optimal for fully strict fork-join computations; that analysis steals one task at a time, and taking half is a practical refinement layered on the same idea.
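The halving rule is simple enough to model with two slices. The sketch below is a toy, not the runtime's lock-free ring: stealHalf is a hypothetical helper, and rounding the half upward for odd lengths is an assumption of this sketch.

```go
package main

import "fmt"

// stealHalf moves the oldest half of victim's queue (rounded up) onto the end
// of thief's queue and returns both queues after the steal.
func stealHalf(thief, victim []int) ([]int, []int) {
	n := len(victim) - len(victim)/2 // half, rounded up
	thief = append(thief, victim[:n]...)
	return thief, victim[n:]
}

func main() {
	thief, victim := stealHalf(nil, []int{1, 5, 9, 13, 17, 21})
	fmt.Println("thief:", thief, "victim:", victim) // thief: [1 5 9] victim: [13 17 21]
}
```

After one steal both Ps hold the same amount of work, which is the property that keeps the next steal from following immediately.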

If no work is found anywhere, the M enters the spinning state: it polls all the queues again for a few hundred microseconds before releasing its P and parking. Spinning costs a CPU but avoids the latency of waking a parked thread (which can be 5–15 µs on Linux due to futex wake plus context switch). The runtime caps the number of spinning Ms to roughly GOMAXPROCS / 2 to avoid burning all cores on idle work.

Run-queue overflow

When a P's local queue holds 256 Gs and a new one is created, the scheduler spills half of the local queue plus the new G to the global queue in one atomic batch. This batched move keeps the global lock cold — a goroutine-spawning loop touches the global queue once per 128 spawns, not once per spawn.
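The batching arithmetic can be checked with a toy model: a 256-slot local queue that, when full, moves its oldest half plus the newcomer to a global slice and counts how often the global side is touched. The type toyP and its fields are hypothetical; the real queue is a ring buffer guarded by atomics rather than a slice.

```go
package main

import "fmt"

const localCap = 256 // capacity of the toy local queue, matching the text

// toyP models one P's local queue plus the shared global queue.
type toyP struct {
	local, global []int
	globalTouches int // how many times the global queue was written
}

// put enqueues g locally; when the local queue is full it moves the oldest
// half plus g to the global queue as one batch.
func (p *toyP) put(g int) {
	if len(p.local) < localCap {
		p.local = append(p.local, g)
		return
	}
	half := localCap / 2
	p.global = append(p.global, p.local[:half]...)
	p.global = append(p.global, g)
	p.local = append(p.local[:0], p.local[half:]...)
	p.globalTouches++
}

// spawn enqueues goroutine ids 1..n on a fresh toyP and returns it.
func spawn(n int) *toyP {
	p := &toyP{}
	for g := 1; g <= n; g++ {
		p.put(g)
	}
	return p
}

func main() {
	p := spawn(1000)
	fmt.Println(p.globalTouches, len(p.local), len(p.global)) // 6 226 774
}
```

A thousand spawns with nothing draining the queue touch the global side only six times in this model, which is the "once per batch, not once per spawn" behaviour the paragraph describes.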

Tokio — Rust's async runtime — uses a near-identical work-stealing scheduler since 0.2 (November 2019), with explicit credit to Go's design. Cilk Plus, Intel TBB, .NET's ThreadPool, and Java's ForkJoinPool are all variants of the same idea. Work-stealing has won the multi-core scheduling argument; the only debates left are tuning constants. The constants that matter empirically: how often to check the global queue, how many spinning Ms to allow, and how to bias steal-victim selection on NUMA machines.

A subtle but important rule: every 61st dequeue from the local queue is overridden to pull from the global queue instead. Without this fairness exception, a constantly-spawning local producer could starve goroutines that have spilled to the global queue indefinitely. Sixty-one is a prime, chosen by Vyukov so it doesn't accidentally synchronise with any other periodic operation in the runtime. This kind of detail — weird primes for fairness — recurs throughout the runtime code and is one of the reasons the scheduler resists naive optimisation.
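Reduced to its core, the fairness rule is a modulo check on a per-P tick counter. The sketch below models only that decision; globalFirst and countGlobalFirst are illustrative names, and the real scheduler does considerably more on each round.

```go
package main

import "fmt"

const fairnessPeriod = 61 // the period quoted in the text

// globalFirst reports whether scheduling round tick should look at the
// global queue before the local one.
func globalFirst(tick, globalLen int) bool {
	return tick%fairnessPeriod == 0 && globalLen > 0
}

// countGlobalFirst counts the rounds in 1..rounds that consult the global
// queue first, assuming the global queue is never empty.
func countGlobalFirst(rounds int) int {
	n := 0
	for tick := 1; tick <= rounds; tick++ {
		if globalFirst(tick, 1) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countGlobalFirst(610)) // 10
}
```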

NUMA awareness remains the open frontier. On dual-socket and quad-socket servers, memory access latency depends on which socket holds the page and which core the goroutine runs on. Stealing across sockets crosses an interconnect (UPI, Infinity Fabric) at typically 100 ns versus 25 ns for within-socket reads. Go's scheduler is NUMA-blind: a steal can move a goroutine from socket 0 to socket 1, after which every memory access through its closures and slices crosses the interconnect. Engineering teams running large Go services on bare metal (Cloudflare, Discord, Twitch) have reported 10–25% throughput gains by pinning processes to a single NUMA node with numactl. Native NUMA support is on the runtime team's wish-list but not on a release schedule; for now, deployment-time pinning is the workaround.


Cooperative to async preemption (Go 1.14)

From cooperative to async.

Until Go 1.14 (February 2020), goroutine preemption was strictly cooperative. The compiler inserted preemption checks at function prologues; a goroutine in a tight non-calling loop (for {}, for i := 0; i < 1e10; i++ { x++ }, or anything CPU-bound without function calls) could hold a P forever. The pathology had a name in production Go: scheduler stalls. A common symptom was an HTTP server stuck for tens of seconds because one goroutine was spinning in a call-free loop, such as a hand-inlined numeric or hashing kernel, never crossing a function boundary.

The pre-1.14 model had a second hazard called busy-loop denial-of-service: untrusted code, or even well-meaning code with a hashing bug, could pin every P. A 2018 incident at a large Kubernetes user manifested as etcd unable to make leader-election progress because one goroutine was inside a deterministic-but-pathological YAML parse. Diagnosing it required pprof samples and an educated reading of the runtime stack — the kind of bug that erodes trust in a runtime. Async preemption killed that bug class entirely; post-1.14 incidents of the same shape are vanishingly rare.

The fix shipped in 1.14 as asynchronous preemption via signals, designed by Austin Clements. The runtime's sysmon thread detects a goroutine that has held its P for more than 10 ms and sends SIGURG to the M running it. The Go runtime's signal handler saves the goroutine's register state, marks it preemptible, and returns; the scheduler then deschedules it. SIGURG was chosen because it's almost never used by application code; the rare exception is TCP urgent-data programs, which had to be patched. The design is similar in spirit to Java HotSpot's safepoint polling, except that Go interrupts the thread with a signal where HotSpot has compiled code poll a page that the VM makes unreadable when it wants threads to stop.
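The classic way to see asynchronous preemption is to restrict a program to one P and start a goroutine that spins without calling anything. The sketch below assumes Go 1.19 or later (for atomic.Bool) on a platform with signal-based preemption; resumeDelay is an illustrative name. Under a purely cooperative scheduler the call would never return.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// resumeDelay restricts the program to one P, starts a goroutine that spins
// without calling any function, and returns how long it took for the caller
// to run again after asking to sleep for 20 ms.
func resumeDelay() time.Duration {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var stop atomic.Bool
	go func() {
		for !stop.Load() { // the atomic load is normally inlined: no call here
		}
	}()

	start := time.Now()
	time.Sleep(20 * time.Millisecond) // parks the caller; the spinner takes the only P
	stop.Store(true)                  // reached only because the spinner was preempted
	return time.Since(start)
}

func main() {
	fmt.Println("caller resumed after", resumeDelay())
}
```

The reported delay is typically a little over the requested sleep, since the caller has to wait for sysmon to notice the spinner and signal it.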

Garbage collection benefits even more than scheduling. Go's concurrent collector needs every goroutine to reach a safepoint within a bounded time so the GC can scan stacks and proceed. Pre-1.14, a single uncooperative goroutine could push GC stop-the-world latency from microseconds to seconds — a horrible tail latency for a server. After 1.14, p99 GC pauses dropped below 500 µs on most workloads. The async-preemption work is, in practical terms, also async-GC work.

There are still corners. Goroutines holding the runtime's forEachP lock, those inside CGO calls, and those running on a syscall-detached M cannot be preempted by signal. CGO calls remain the leading cause of long “stalls” in modern Go programs — if you call into a C library that takes seconds to return, that's seconds the goroutine isn't preemptible. Profilers like runtime/trace visualise these clearly, and the standard advice is to put long CGO calls behind a worker pool sized to GOMAXPROCS.
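One way to follow that advice is a counting semaphore that caps how many long calls are in flight. The sketch below is an illustration under stated assumptions: runBounded is a made-up helper, and slowCall is a hypothetical stand-in (a sleep) for a long C call.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded runs every job on its own goroutine but allows at most limit
// jobs to execute at once. It returns the highest concurrency it observed.
func runBounded(limit int, jobs []func()) int {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int32

	for _, job := range jobs {
		sem <- struct{}{} // blocks while limit jobs are already running
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the job returns
			n := inFlight.Add(1)
			// Record a new peak unless another goroutine already saw a higher one.
			for p := peak.Load(); n > p && !peak.CompareAndSwap(p, n); p = peak.Load() {
			}
			job()
			inFlight.Add(-1)
		}(job)
	}
	wg.Wait()
	return int(peak.Load())
}

// slowCall is a hypothetical stand-in for a long CGO call.
func slowCall() { time.Sleep(5 * time.Millisecond) }

func main() {
	jobs := make([]func(), 32)
	for i := range jobs {
		jobs[i] = slowCall
	}
	fmt.Println("peak concurrency:", runBounded(runtime.GOMAXPROCS(0), jobs))
}
```

The submitter blocks on the semaphore rather than the workers, so a burst of requests queues up in Go code instead of multiplying threads stuck in C.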

There is also runtime.Gosched(): a manual yield that takes the current goroutine off its P and puts it on the global run queue, letting other goroutines run first. It's almost always the wrong tool. Code calling Gosched in a hot loop is implementing a poor-man's preemption that the runtime would do better. The legitimate uses are narrow: a benchmark hot path that wants to amortise scheduler overhead across iterations, or a debugger hook. Modern Go discourages it for application code; if you need cooperative yielding, channels and selects do the job with better semantics.

A subtle implication of preemption: function inlining changes preemption frequency. The compiler inlines small functions, which removes their preemption check; aggressive inlining of a hot loop body can extend a goroutine's uninterrupted run by orders of magnitude. The 1.14 async preemption work fixes the worst-case stalls but does not fix all tail-latency footguns; if you care about p99 GC pauses, profile-guided optimisation (PGO, stable in Go 1.21) and selective //go:noinline annotations are the tools to reach for. The number of practitioners who understand this remains embarrassingly small.

| Era | Mechanism | Trigger | Failure mode |
| --- | --- | --- | --- |
| Go 1.0–1.1 | none | global queue lock | core scaling cliff |
| Go 1.2–1.13 | cooperative | function-prologue check | tight loop = forever stall |
| Go 1.14+ | async (signals) | SIGURG from sysmon > 10 ms | CGO calls still un-preemptible |
| BEAM (Erlang) | reductions | ~4000 reds (~1 ms) | NIFs bypass the counter |
| Tokio (Rust) | explicit .await | programmer yield point | missing await stalls runtime |

// Spawn a goroutine. Cost ~200ns; stack starts at 2KB.
go func(id int) {
    for {
        select {
        case msg := <-work:
            handle(msg)
            // pre-1.14: tight non-calling loops were unpreemptible.
            // since 1.14: sysmon sends SIGURG after 10ms.
        case <-ctx.Done():
            return
        }
        runtime.Gosched() // explicit yield. usually unnecessary.
    }
}(workerID)

The threads you didn't write — netpoll, sysmon

The threads you didn't write.

Two daemons run beside the GMP machinery and account for most of Go's reputation as a network workhorse. The first is the network poller: a single shared event loop using epoll on Linux, kqueue on BSD and macOS, IOCP on Windows, and event ports on Solaris. Every net.Conn and every os.File that supports non-blocking mode (pipes, for example) registers with the poller. A goroutine that calls conn.Read on a not-yet-ready socket parks itself on the poller and yields; when the kernel signals readability, the runtime moves the G back to a P's run queue.

Crucially, the netpoller does not own a P. It is conceptually a singleton observer: any M can call netpoll(0) for a non-blocking probe, and exactly one M at a time owns the blocking call (epoll_wait with a timeout). When goroutines are running on every P, no M is parked on the netpoller; the poller is checked between scheduler iterations. When all Ps are idle, one M becomes the “net-poll-blocked” M and parks in the kernel; the runtime arranges for that M to be woken on any FD readiness or after the longest scheduled timer. The result is a runtime that uses zero CPU when there is no work and reacts within microseconds when an FD becomes readable.

Disk I/O is the inelegant exception. Linux's epoll doesn't really support regular files (every file is “ready” the moment you ask), so disk reads in Go go through actual blocking syscalls on an M. The runtime uses the same syscall-detach mechanism as for any other blocking call: the M parks in the kernel; its P is taken by another M; on return, the M tries to reacquire a P. This is fine at low concurrency but means a Go program reading from a slow disk can grow its M count rapidly. The Linux 5.1 io_uring interface offers a way out in principle, but the Go standard library does not use it at the time of writing; experiments live in third-party packages.

This is why a Go server can hold a million idle TCP connections on a four-core box: each connection is a parked G plus an FD in epoll, totalling roughly 4 KB. Compare that to a 1:1 thread-per-connection server, where each connection demands an OS thread with an 8 MB stack reservation: the decades-old “C10K” problem, solved in the runtime rather than the kernel. The network poller is checked opportunistically by every findRunnable call, and explicitly by sysmon every 10 ms in case no scheduler activity has happened recently.

Sysmon is the runtime's pacemaker: a dedicated OS thread without an attached P that sleeps between checks for as little as 20 µs when the program is busy and as long as 10 ms when it is idle. Its responsibilities: poll the network poller (in case findRunnable hasn't), retake Ps from Ms blocked in syscalls beyond a threshold, send async-preempt signals to goroutines that have run too long, and force a garbage collection if none has run for about two minutes. Sysmon never holds a P, so it doesn't compete with goroutines for cores; it's a kernel-thread overseer.

A useful diagnostic environment variable: GODEBUG=schedtrace=1000,scheddetail=1 emits a runtime summary every second to stderr — goroutine count by state, P run-queue depths, M states, sysmon ticks. Reading the output is a fast education in everything above. For deeper visibility, the execution tracer (go tool trace) records every scheduler event into a binary log and renders a graphical timeline of which G ran on which P at every instant. The tool is one of Go's underused superpowers.
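Capturing an execution trace from inside a program takes two calls from the runtime/trace package. The sketch below wraps them in a hypothetical helper, captureTrace, and writes the result to a file named sched.trace (an arbitrary name chosen for this example).

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/trace"
	"sync"
)

// captureTrace records scheduler events while work runs and returns the raw
// trace bytes, which go tool trace can open once they are written to a file.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop()
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace(func() {
		var wg sync.WaitGroup
		for i := 0; i < 8; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				runtime.Gosched() // give the tracer a few scheduling events
			}()
		}
		wg.Wait()
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	if err := os.WriteFile("sched.trace", data, 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("wrote %d bytes; inspect with: go tool trace sched.trace\n", len(data))
}
```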

Timer goroutines were once a sysmon side-quest and are now first-class. Pre-Go 1.10 every time.NewTimer heap-allocated and lived in a single global heap protected by a global mutex; under high time-out load the timer mutex was the largest contention site in the runtime. Go 1.10 sharded the timer heap per P; Go 1.14 moved timer firing into the scheduler's findRunnable path, eliminating the dedicated timer goroutine entirely. The result is a five-to-ten-fold throughput improvement on workloads that arm and disarm timers at network speed — almost every HTTP server in production.

[Figure: sysmon and async preemption timeline, 0 ms to 20 ms, with a 10 ms threshold. A goroutine runs a tight loop with no function calls; at the sysmon tick past 10 ms it receives SIGURG and becomes runnable on the local P queue. The sysmon thread holds no P and wakes every 20 µs–10 ms to handle netpoll, retake, and preempt.]

Scheduler-aware synchronisation — channels, mutexes, sync.Pool

Synchronisation, scheduler-aware.

Goroutines synchronise primarily through channels, and channels are not a library: they are runtime-internal structures (runtime/chan.go) tightly coupled to the scheduler. A channel holds a circular buffer (zero-length for unbuffered channels), a mutex, a queue of senders parked on a full buffer, and a queue of receivers parked on an empty one. Send and receive operations acquire the mutex and take one of three paths: hand the value directly to a counterpart that is already waiting, move it through the buffer if there is room (or data), or otherwise park the calling goroutine on the appropriate queue by calling gopark.

Closing a channel wakes every parked receiver in one batch. Each gets a zero value and a false “ok” from the comma-ok form. Closing twice panics, by design — close is a uni-directional state transition. Send on a closed channel also panics. These rules look like inconveniences but are part of why Go programmers write the done-channel pattern instead of using shared boolean flags: a closed channel is the safest possible broadcast primitive in the language because the runtime guarantees any number of waiting receivers wake up exactly once.
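The broadcast behaviour is easy to verify: start any number of receivers on one channel, close it once, and count how many observe the closed state. A minimal sketch, with broadcast as an illustrative helper name:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// broadcast starts n receivers on one channel, closes the channel once, and
// returns how many receivers saw ok == false from the comma-ok receive.
func broadcast(n int) int {
	done := make(chan struct{})
	var woke atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, ok := <-done; !ok {
				woke.Add(1)
			}
		}()
	}
	close(done) // one state transition releases every current and future receiver
	wg.Wait()
	return int(woke.Load())
}

func main() {
	fmt.Println(broadcast(100), "receivers woke") // 100 receivers woke
}
```

Receivers that reach the channel after the close return immediately with the same zero value and false, so the result does not depend on timing.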

A subtle scheduler interaction: a hot fan-out, fan-in pipeline of unbuffered channels can produce scheduler ping-pong — producer wakes consumer, consumer immediately blocks because no other producer has run, producer wakes consumer again, and so on. Each wake costs roughly 1–3 µs. Adding a small buffer (size 1 or 2) breaks the ping-pong and can improve throughput tenfold. Profiling tools like go tool trace visualise this pattern as alternating tiny G runs across cores. The fix is unintuitive but well-known to anyone who has tuned a Go data pipeline.

The scheduler integration matters: a parked sender stays attached to the channel's send queue, not to a P or M. When a receiver arrives, the channel code dequeues a sender, copies its value into the receiver's stack frame directly (no buffer hop for the unbuffered case), and calls goready to put the sender back on a P's run queue. The hand-off is direct, with no queue spill and no global lock, which is why an unbuffered channel ping-pong between two goroutines can sustain several million ops/sec on a single core (200–400 ns per hand-off). Buffered channels copy through the buffer instead, but skip the park and wake whenever the buffer has room or data, which is why the uncontended buffered case is the cheaper one.

Select is more involved. For an N-way select, the runtime evaluates all cases, randomly shuffles the cases that are immediately ready (avoiding starvation between them), and picks one. If none is ready, it parks the goroutine on all N channels' wait queues simultaneously, then yields. When any case fires, the wakeup code unparks the goroutine and removes it from the other N−1 wait queues. The cost of select is roughly proportional to N due to this multi-queue parking; reflect.Select is much slower because it dynamically allocates the case array.
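The random choice among ready cases can be observed directly by keeping two channels permanently ready and counting which case wins. A small sketch; pickCounts is an illustrative name, and the exact split varies from run to run.

```go
package main

import "fmt"

// pickCounts runs trials selects in which both cases are always ready and
// returns how often each case was chosen.
func pickCounts(trials int) (a, b int) {
	ca := make(chan int, 1)
	cb := make(chan int, 1)
	for i := 0; i < trials; i++ {
		ca <- 1
		cb <- 1 // both channels now hold a value, so both cases are ready
		select {
		case <-ca:
			a++
			<-cb // drain the other channel for the next round
		case <-cb:
			b++
			<-ca
		}
	}
	return a, b
}

func main() {
	a, b := pickCounts(10_000)
	fmt.Printf("case a: %d, case b: %d\n", a, b)
}
```

Neither case starves: over many trials the two counts come out close to even, regardless of the order in which the cases are written.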

The single most useful idiom built on top of channels is the context.Context cancellation tree. Every cancellation reduces to a close(ch); every “am I still alive” check reduces to a non-blocking select { case <-ctx.Done(): ... default: }. The combination of cheap goroutines, cheap channels, cheap select, and structured cancellation gives Go its reputation for being good at request-scoped concurrency — the workload nearly every microservice runs.
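The non-blocking liveness check reads almost exactly as the paragraph describes it. In the sketch below, stillAlive and demo are illustrative names; only context.WithCancel and ctx.Done come from the standard library.

```go
package main

import (
	"context"
	"fmt"
)

// stillAlive is the non-blocking "am I still alive" check: it returns false
// once ctx has been cancelled and true otherwise, without ever blocking.
func stillAlive(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return false
	default:
		return true
	}
}

// demo checks a context before and after cancellation.
func demo() (before, after bool) {
	ctx, cancel := context.WithCancel(context.Background())
	before = stillAlive(ctx)
	cancel() // closes the Done channel before returning
	after = stillAlive(ctx)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // true false
}
```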

Mutexes are the other primitive worth understanding. Go's sync.Mutex is a hybrid: it spins briefly (cheap when contention is brief), then parks the goroutine via the runtime's semacquire primitive, which integrates with the scheduler exactly like a channel park. A waiter joins a per-mutex FIFO queue. In normal mode the unlocker wakes the head waiter, which must still compete for the lock with goroutines that are already running; only in starvation mode is ownership handed straight to the head waiter. The uncontended path is a single atomic compare-and-swap, which is what makes Go mutexes roughly 30 ns when uncontested and roughly 1 µs when one goroutine waits: numbers competitive with the best C++ implementations, despite running through the Go runtime.

Go 1.9 added starvation mode to sync.Mutex. If a waiter has been queued for more than 1 ms, the mutex switches from “normal” mode (where a newly arriving goroutine that is already running on a CPU can grab the lock ahead of woken waiters, which favours throughput) to “starvation” mode (where the head of the FIFO queue always wins). This bounded-waiting guarantee fixed a class of pathological tail latencies in long-running services where one waiter would consistently lose to a stream of fresh arrivals. RWMutex and the standard sync.Map got similar treatments in subsequent releases. The lesson, again, is that production runtimes accumulate small “fairness in the tail” fixes nobody notices until they're missing.

Performance numbers worth remembering, measured on modern x86 with Go 1.22: a goroutine costs roughly 200 ns to spawn and 2 KB of memory; a function call inside an existing goroutine costs roughly 2 ns; a buffered channel hand-off costs 100–200 ns; an unbuffered channel hand-off costs 200–400 ns; a context-switch via park/unpark costs 1–3 µs. They mean a million goroutines cost roughly 2 GB and 200 ms of total spawn time; a million channel ops cost roughly 200 ms; a million blocking yields cost roughly 2 s. Whether that fits your workload is exactly the question the Go runtime asks every day — and the reason cheap concurrency is the wrong default for a CPU-bound batch job and the right default for a request/response server.

Buffered vs unbuffered

A small buffer (size 1 or 2) on a hot fan-out channel can break scheduler ping-pong and improve throughput tenfold — the producer no longer parks on every send. Unbuffered channels remain the right default for hand-offs that need synchronisation; buffered channels are right when the buffer reflects real backpressure tolerance.


Other runtimes solving the same problem — Erlang, Tokio, virtual threads

Other runtimes solving the same problem.

Erlang's BEAM scheduler, dating to 1986 and refactored for SMP in OTP R11B (2006), runs N scheduler threads (one per core), each with a per-priority run queue. Erlang processes have isolated heaps and message-passing semantics, so preemption is trivial: the runtime checks a reduction counter at every function entry and yields after roughly 4,000 reductions (~1 ms). Stealing operates between scheduler threads. The major BEAM-specific feature is heart-beat fairness across priority classes: even a low-priority process is guaranteed eventual time, which Go does not strictly promise.

Rust's Tokio, the de-facto async runtime since 2017, ships a multi-threaded scheduler with per-worker LIFO slots and global FIFO queues. Tasks are Future state machines compiled by the Rust compiler — not stacks — so a task is roughly 64 bytes plus the size of the state machine. Cancellation is by drop. The model is leaner than Go's per-goroutine 2 KB stack but pushes more complexity into application code: every .await is a manual yield point, and forgetting to make a long computation yieldable causes the same starvation Go fixed in 1.14.

Java's Project Loom shipped virtual threads in JDK 21 (September 2023) under JEP 444. A virtual thread is a continuation-backed M:N goroutine equivalent: cheap, blocking-friendly, no API change versus Thread. JDK 21 multiplexes virtual threads onto a ForkJoinPool; per-thread overhead is around 1 KB. The retrofit is impressive but constrained: synchronized blocks pin a virtual thread to its carrier (until JDK 24's pinning fix), and any long-running JNI call still blocks a carrier. Go's M:N model, in other words, finally landed on the JVM, eleven years after Vyukov's design doc.

.NET's Task / async/await pipeline, Kotlin's coroutines, Python's asyncio, JavaScript's microtask queue — each is a different point in the same design space. The recurring lesson: language runtimes that ignore concurrency become irrelevant within a decade. Go is currently the cleanest implementation of M:N scheduling in a mainstream language; whether it stays that way depends on whether Loom catches up before Go's authors retire.

Where BEAM excels and Go does not is in graceful degradation. WhatsApp's two-million-connections-per-server result, achieved on Erlang in 2012, was as much about supervised process trees and bounded mailboxes as about cheap concurrency: a memory blow-up in one process did not poison another. Go's shared heap and panic-as-stop semantics give better throughput in the median case but worse failure isolation. Conversely, where Go beats BEAM is on numerical and CPU-intensive work: Erlang's tagged-pointer arithmetic costs twice as much as native Go, and Go's contiguous slabs interact better with modern CPU caches. Neither is the universal winner.

One last comparison worth making: thread-per-request systems — Apache prefork, classic Tomcat, Ruby's MRI — treat each request as an OS thread, and the kernel handles scheduling. They scale to thousands of connections, not millions. Goroutine-style schedulers move scheduling into user space, treat connections as cheap parked tasks, and scale by orders of magnitude. The cost is that the runtime becomes a complex and load-bearing piece of code: bugs in runtime/proc.go show up as “my service stalls every Tuesday”, and the only path to fixing them is reading the runtime source. The scheduler is a beautiful piece of engineering, and an unforgiving one.

