The M:P:G scheduler
The Go scheduler maps thousands of goroutines onto a small number of OS threads. It's a
work-stealing, preemptive, two-level scheduler built around three structs: M
(OS thread), P (logical processor), and G (goroutine). Once you
know which one does what, the rest of the runtime stops being mysterious.
Three structs, one scheduler
| Struct | What it represents | How many |
|---|---|---|
| M | OS thread (kernel thread) | Up to ~10,000 (default) |
| P | Logical processor: execution context, holds a run queue | GOMAXPROCS (default = NumCPU) |
| G | Goroutine | Thousands to millions |
The invariants: a G only runs when bound to an M; an
M only runs Go code while holding a P. So the bottleneck on
parallelism is the number of Ps — that's why GOMAXPROCS caps
real parallelism. Ms without a P exist (in syscalls,
parked) but can't make progress on user code.
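To see the three counts in a live program, a minimal sketch along these lines may help. It relies only on standard-library calls (`runtime.GOMAXPROCS`, `runtime.NumCPU`, `runtime.NumGoroutine`, and the `threadcreate` profile from `runtime/pprof`); the `counts` struct and the `snapshot` helper are hypothetical names made up for illustration. The thread figure is the number of threads the runtime has created so far, which only approximates the number of Ms.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// counts is a hypothetical container for the three numbers discussed above.
type counts struct {
	Ps int // GOMAXPROCS
	Ms int // OS threads created so far (approximation of M count)
	Gs int // live goroutines
}

// snapshot starts n idle goroutines, samples the counts, then lets them exit.
func snapshot(n int) counts {
	var wg sync.WaitGroup
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // stay parked until the snapshot has been taken
		}()
	}
	c := counts{
		Ps: runtime.GOMAXPROCS(0), // 0 queries the value without changing it
		Ms: pprof.Lookup("threadcreate").Count(),
		Gs: runtime.NumGoroutine(),
	}
	close(release)
	wg.Wait()
	return c
}

func main() {
	c := snapshot(10000)
	fmt.Printf("NumCPU=%d Ps=%d Ms=%d Gs=%d\n", runtime.NumCPU(), c.Ps, c.Ms, c.Gs)
}
```

On a typical machine the goroutine count lands in the thousands while the thread count stays small, which is the mapping the scheduler exists to provide.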

    +-----+      +-----+      +-----+
    |  M  | ---- |  P  | ---- |  G  |   ← currently running
    +-----+      +-----+      +-----+
                    |
                    v
                +-------+
                | runq  |   ← per-P local run queue (256 G's)
                +-------+
                    ↑
          global runq, sched.runq (overflow)
                    ↑
          other P's runq (work stealing)

Run queues, in three layers
A P finds the next runnable G in three places, in order:
- Its own local run queue. A small ring buffer (256 slots) on the P struct. O(1) enqueue/dequeue, no locks needed because only the P's current M touches it (with atomic ops for stealers).
- The global run queue. A linked list on sched.runq, guarded by sched.lock. Used as overflow when local queues fill, and every 61st schedule tick the scheduler dequeues from the global queue first to keep it from starving.
- Work stealing. If both queues are empty, the P picks another random P and steals half its run queue. Repeats up to four times before parking.
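The lookup order can be sketched as a toy model. This is not the runtime's code: `toyP`, `toySched`, `pop`, and `findRunnable` are hypothetical names, queues are plain slices rather than a lock-free ring buffer, and victims are scanned in order instead of randomly so the result stays deterministic.

```go
package main

import "fmt"

// toyP is a hypothetical stand-in for the runtime's P; only the run queue is modeled.
type toyP struct {
	id   int
	runq []string // local run queue (the real one is a fixed 256-slot ring buffer)
}

// toySched is a hypothetical stand-in for the global scheduler state.
type toySched struct {
	global []string // global run queue
	ps     []*toyP
}

// pop removes and returns the head of a queue, if any.
func pop(q *[]string) (string, bool) {
	if len(*q) == 0 {
		return "", false
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g, true
}

// findRunnable returns the next goroutine for p and the place it came from.
func (s *toySched) findRunnable(p *toyP, tick int) (g, source string) {
	// Fairness: every 61st tick, look at the global queue before the local one.
	if tick%61 == 0 {
		if g, ok := pop(&s.global); ok {
			return g, "global (fairness tick)"
		}
	}
	if g, ok := pop(&p.runq); ok {
		return g, "local"
	}
	if g, ok := pop(&s.global); ok {
		return g, "global"
	}
	// Steal half (rounded up) of the first non-empty victim queue.
	for _, victim := range s.ps {
		if victim == p || len(victim.runq) == 0 {
			continue
		}
		n := (len(victim.runq) + 1) / 2
		stolen := victim.runq[:n]
		victim.runq = victim.runq[n:]
		p.runq = append(p.runq, stolen[1:]...) // keep the rest for later
		return stolen[0], fmt.Sprintf("stolen from P%d", victim.id)
	}
	return "", "none (park)"
}

func main() {
	p0 := &toyP{id: 0, runq: []string{"a"}}
	p1 := &toyP{id: 1, runq: []string{"c", "d", "e", "f"}}
	s := &toySched{global: []string{"b"}, ps: []*toyP{p0, p1}}
	for tick := 1; tick <= 5; tick++ {
		g, src := s.findRunnable(p0, tick)
		fmt.Printf("tick %d: %q from %s\n", tick, g, src)
	}
}
```

Running it shows P0 draining its own queue, then the global queue, and only then taking work from P1.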
The order favors the local queue, which holds the P's recent work. Stealing is a fallback. Most goroutines run on the P that created them, which keeps cache locality intact.

Syscalls and the P hand-off
A blocking syscall ties up the OS thread (M) until it returns. If we kept
the P attached, all the goroutines on that P's queue would be
blocked behind the syscall. So the runtime hands off the P:
- Goroutine enters a syscall via entersyscall.
- Runtime detaches the P from this M, leaving the M blocked in the syscall.
- If a free M exists, it grabs the P and continues running other goroutines. Otherwise the runtime starts a new one.
- Syscall returns. Goroutine calls exitsyscall and tries to re-acquire a P.
- If a P is available, run on it. Otherwise put the goroutine on the global queue and park the M.
This is why "1000 goroutines in a syscall" works fine. The Ms pile up
in the syscall but the Ps keep running other goroutines.
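The effect can be observed with a small experiment, sketched here under a few assumptions: it is Unix-only (it uses `syscall.Pipe` and raw `syscall.Read` so the reads bypass the netpoller and really block the thread), and `workWhileBlocked` is a hypothetical helper name. With GOMAXPROCS set to 1, the computation in the middle can only run if the single P was handed off by the blocked readers.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"syscall"
	"time"
)

// workWhileBlocked parks n goroutines in a raw read(2) on a pipe, then runs a
// small computation on the single available P and returns its result.
func workWhileBlocked(n int) (int, error) {
	prev := runtime.GOMAXPROCS(1) // one P, so a stuck P would stall everything
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, 1)
			for {
				// Raw syscall: the M blocks in the kernel until a byte arrives.
				_, err := syscall.Read(fds[0], buf)
				if err != syscall.EINTR {
					return
				}
			}
		}()
	}
	time.Sleep(20 * time.Millisecond) // give the readers time to enter read(2)

	// This loop needs a P. It runs even though n Ms are stuck in syscalls.
	sum := 0
	for i := 1; i <= 1000; i++ {
		sum += i
	}

	// Release the readers: one byte each.
	if _, err := syscall.Write(fds[1], make([]byte, n)); err != nil {
		return 0, err
	}
	wg.Wait()
	return sum, nil
}

func main() {
	sum, err := workWhileBlocked(8)
	fmt.Println(sum, err)
}
```

The program finishes promptly instead of deadlocking, which is the hand-off at work: eight threads sit in the kernel while the one P keeps scheduling.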
sysmon, the background watcher
One special M runs without a P in a tight loop: sysmon.
It wakes up every 20–10,000 microseconds (adaptive backoff) and does five things:
- Retake P from blocked syscalls. If a P has been in syscall > 20μs and there's other work, hand it off.
- Force preemption. If a goroutine has been on-CPU > 10ms, send SIGURG (since 1.14) for async preemption.
- Trigger GC. If GC hasn't run in 2 minutes, force it.
- Poll the network. Scrape any expired timers and netpoll-ready goroutines and put them on a run queue.
- Spawn an M. If all Ps have work but no M is available (rare).
sysmon never holds a lock for long, never holds a P, and never
blocks. It's the only OS thread that lives outside the M:P:G dance.
GOMAXPROCS — the knob that matters
GOMAXPROCS is the number of Ps the runtime creates. It caps
parallelism for user code. Default is runtime.NumCPU().
The two interesting cases:
- Containers with CPU limits. A pod limited to 2 CPUs on a 64-CPU host: NumCPU() returns 64, but the cgroup throttles you to 2 CPUs. On Go releases before 1.25, the scheduler creates 64 Ps, the kernel runs at most 2, and you get thrashing. Fix: set GOMAXPROCS manually, or use automaxprocs to read the cgroup quota. Go 1.25 and later lower the default to the cgroup CPU limit on Linux.
- Lock-heavy or syscall-heavy workloads. More Ps don't help if your goroutines spend most of their time waiting. Sometimes GOMAXPROCS = (CPUs / 2) reduces contention. Profile.
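For the container case, the manual fix can be sketched as follows. This is not how automaxprocs is implemented; it is a simplified illustration that assumes cgroup v2 with the quota exposed at `/sys/fs/cgroup/cpu.max` in the form `<quota> <period>` or `max <period>`. The helper name `procsFromCPUMax` is hypothetical, and rounding fractional quotas up is a choice made here, not a rule.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax turns the contents of a cgroup v2 cpu.max file into a
// GOMAXPROCS value. It returns fallback when there is no usable limit.
func procsFromCPUMax(content string, fallback int) int {
	fields := strings.Fields(content)
	if len(fields) != 2 || fields[0] == "max" {
		return fallback // no quota configured
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return fallback
	}
	n := int(math.Ceil(quota / period)) // assumption: round fractional CPUs up
	if n < 1 {
		n = 1
	}
	if n > fallback {
		n = fallback // never exceed the CPUs the host reports
	}
	return n
}

func main() {
	n := runtime.NumCPU()
	// Assumed path; it differs under cgroup v1 and some container runtimes.
	if data, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
		n = procsFromCPUMax(string(data), n)
	}
	runtime.GOMAXPROCS(n)
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```

For the pod in the example, a cpu.max of `200000 100000` yields 2, so the scheduler creates 2 Ps instead of 64.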
Reading GODEBUG=schedtrace
Set GODEBUG=schedtrace=1000 and the runtime prints a one-line summary every
second. Decoding it:

    SCHED 1011ms: gomaxprocs=8 idleprocs=3 threads=14 spinningthreads=0 idlethreads=4 runqueue=2 [0 0 1 0 0 5 0 0]

| Field | Means |
|---|---|
| gomaxprocs | Number of Ps |
| idleprocs | Ps with nothing to run |
| threads | Total Ms |
| spinningthreads | Ms actively trying to steal |
| idlethreads | Ms parked with nothing to do |
| runqueue | Length of the global queue |
| [a b c ...] | Length of each P's local queue |
Sustained large runqueue + non-zero local queues means you have more work
than CPU. Sustained imbalance across local queues means stealing isn't keeping up
(rare, but happens with very short-lived goroutines).
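When watching these lines over time, it can help to parse them rather than read them by eye. The following is a rough sketch that handles only the fields in the sample line; `schedSample`, `parseSchedTrace`, and `backlog` are hypothetical names, and any other fields a given Go release prints are ignored. The runtime writes these lines to standard error, so capture that stream when collecting them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// schedSample is a hypothetical struct holding the fields of one summary line.
type schedSample struct {
	GoMaxProcs, IdleProcs, Threads, SpinningThreads, IdleThreads, RunQueue int
	LocalQueues                                                            []int
}

// backlog is the total number of runnable goroutines waiting in any queue.
func (s schedSample) backlog() int {
	total := s.RunQueue
	for _, n := range s.LocalQueues {
		total += n
	}
	return total
}

// parseSchedTrace parses one summary line in the format shown above.
func parseSchedTrace(line string) (schedSample, error) {
	var s schedSample
	open, end := strings.Index(line, "["), strings.LastIndex(line, "]")
	if open < 0 || end < open {
		return s, fmt.Errorf("no per-P queue list in %q", line)
	}
	for _, f := range strings.Fields(line[open+1 : end]) {
		n, err := strconv.Atoi(f)
		if err != nil {
			return s, err
		}
		s.LocalQueues = append(s.LocalQueues, n)
	}
	targets := map[string]*int{
		"gomaxprocs":      &s.GoMaxProcs,
		"idleprocs":       &s.IdleProcs,
		"threads":         &s.Threads,
		"spinningthreads": &s.SpinningThreads,
		"idlethreads":     &s.IdleThreads,
		"runqueue":        &s.RunQueue,
	}
	for _, f := range strings.Fields(line[:open]) {
		key, val, ok := strings.Cut(f, "=")
		dst := targets[key]
		if !ok || dst == nil {
			continue // "SCHED", the timestamp, or a field this sketch ignores
		}
		n, err := strconv.Atoi(val)
		if err != nil {
			return s, err
		}
		*dst = n
	}
	return s, nil
}

func main() {
	line := "SCHED 1011ms: gomaxprocs=8 idleprocs=3 threads=14 spinningthreads=0 idlethreads=4 runqueue=2 [0 0 1 0 0 5 0 0]"
	s, err := parseSchedTrace(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("Ps=%d idle=%d backlog=%d\n", s.GoMaxProcs, s.IdleProcs, s.backlog())
}
```

A backlog that stays high while idleprocs is zero matches the "more work than CPU" pattern described above.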
Further reading
- runtime/proc.go — the scheduler implementation.
- Dmitry Vyukov — Scalable Go Scheduler Design — the original 2012 design doc that introduced M:P:G.
- Proposal — Non-cooperative preemption — how 1.14 added signal-based preemption.
- Knyszek — Scheduling in Go — recent talk from a runtime maintainer.
- uber-go/automaxprocs — the production fix for GOMAXPROCS in cgroups.