The M:P:G scheduler

The Go scheduler maps thousands of goroutines onto a small number of OS threads. It's a work-stealing, preemptive, two-level scheduler built around three structs: M (OS thread), P (logical processor), and G (goroutine). Once you know which one does what, the rest of the runtime stops being mysterious.

Three structs, one scheduler

Struct	What it represents	How many
`M`	OS thread (kernel thread)	Up to ~10,000 (default)
`P`	Logical processor — execution context, holds a run queue	`GOMAXPROCS` (default = NumCPU)
`G`	Goroutine	Thousands to millions

The invariants: a G only runs when bound to an M; an M only runs Go code while holding a P. So the bottleneck on parallelism is the number of Ps — that's why GOMAXPROCS caps real parallelism. Ms without a P exist (in syscalls, parked) but can't make progress on user code.
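To make the three counts concrete, here is a minimal sketch that reads them from a running program. It uses only the standard runtime package; the helper name snapshot is made up for this example, and the goroutine count of 10,000 is arbitrary.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// snapshot reports the scheduler-visible counts at the moment it is called.
// (Hypothetical helper name, not part of the runtime.)
func snapshot() (procs, cpus, goroutines int) {
	// GOMAXPROCS(0) queries the current value without changing it.
	return runtime.GOMAXPROCS(0), runtime.NumCPU(), runtime.NumGoroutine()
}

func main() {
	var wg sync.WaitGroup
	release := make(chan struct{})

	// Start many goroutines that immediately park on a channel receive.
	// A parked G occupies neither an M nor a P.
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release
		}()
	}

	procs, cpus, gs := snapshot()
	fmt.Printf("Ps=%d CPUs=%d Gs=%d\n", procs, cpus, gs)

	close(release) // wake every parked goroutine
	wg.Wait()
}
```

The number of Gs dwarfs the number of Ps, which is the point: goroutines are cheap to park, and only GOMAXPROCS of them can execute Go code at any instant.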

+-----+      +-----+      +-----+
|  M  | ---- |  P  | ---- |  G  |   ← currently running
+-----+      +-----+      +-----+
                  |
                  v
              +-------+
              | runq  |  ← per-P local run queue (256 G's)
              +-------+
                  ↑
            global runq, sched.runq (overflow)
                  ↑
            other P's runq (work stealing)

Run queues, in three layers

A P finds the next runnable G in three places, in order:

Its own local run queue. A fixed 256-slot ring buffer on the P struct. Enqueue and dequeue are O(1) and lock-free: only the M that currently holds the P adds to it, and stealers remove entries with atomic operations.
The global run queue. A linked list at sched.runq, guarded by sched.lock. It takes the overflow when a local queue fills up. To keep it from starving, the scheduler checks the global queue first on every 61st scheduling tick of a P.
Work stealing. If both queues are empty, the P visits the other Ps in random order and tries to steal half of a victim's local run queue. It makes up to four passes before giving up and parking its M.

Why local first. The local queue is cache-hot for this P's recent work. Stealing is a fallback. Most goroutines run on the P that created them, which keeps cache locality intact.
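The lookup order can be modeled in a few lines. This is a toy, single-threaded sketch and not the runtime's code: the types G, P, Sched and the method next are hypothetical names, the local queue is a slice instead of a 256-slot ring, and victims are scanned in index order instead of randomly so the behavior is deterministic.

```go
package main

import "fmt"

// G stands in for a goroutine; only its ID matters here.
type G struct{ ID int }

// P is a toy logical processor: a local queue plus a scheduling tick counter.
type P struct {
	runq []G
	tick int
}

// Sched holds the global queue and all Ps. The real global queue is guarded
// by a lock; this single-threaded model does not need one.
type Sched struct {
	global []G
	ps     []*P
}

// pop removes and returns the head of q, if any.
func pop(q *[]G) (G, bool) {
	if len(*q) == 0 {
		return G{}, false
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g, true
}

// next picks the next G for p and reports where it came from:
// global first on every 61st tick, then local, then global, then stealing.
func (s *Sched) next(p *P) (G, string, bool) {
	p.tick++
	if p.tick%61 == 0 {
		if g, ok := pop(&s.global); ok {
			return g, "global (fairness tick)", true
		}
	}
	if g, ok := pop(&p.runq); ok {
		return g, "local", true
	}
	if g, ok := pop(&s.global); ok {
		return g, "global", true
	}
	for _, victim := range s.ps {
		if victim == p || len(victim.runq) == 0 {
			continue
		}
		n := (len(victim.runq) + 1) / 2 // take half, rounded up
		stolen := victim.runq[:n]
		victim.runq = victim.runq[n:]
		// Run the first stolen G now; keep the rest in our local queue.
		p.runq = append(p.runq, stolen[1:]...)
		return stolen[0], "stolen", true
	}
	return G{}, "", false // nothing anywhere: the real M would park
}

func main() {
	p0 := &P{}
	p1 := &P{runq: []G{{1}, {2}, {3}, {4}}}
	s := &Sched{global: []G{{9}}, ps: []*P{p0, p1}}
	for i := 0; i < 3; i++ {
		g, from, _ := s.next(p0)
		fmt.Printf("G%d from %s\n", g.ID, from)
	}
}
```

Starting with an empty local queue, p0 first drains the global queue, then steals two of p1's four goroutines, runs one, and finds the other in its own local queue on the following call.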

Syscalls and the P hand-off

A blocking syscall ties up the OS thread (M) until it returns. If we kept the P attached, all the goroutines on that P's queue would be blocked behind the syscall. So the runtime hands off the P:

Goroutine enters a syscall via entersyscall.
The runtime detaches the P from this M, either immediately for calls known to block or once sysmon notices the syscall is taking too long.
If an idle M exists, it takes the P and keeps running other goroutines. Otherwise the runtime starts a new M.
Syscall returns. Goroutine calls exitsyscall and tries to re-acquire a P.
If a P is available, the goroutine runs on it. Otherwise the goroutine goes on the global run queue as runnable and the M parks itself until it is needed again.

This is why "1000 goroutines in a syscall" works fine. The Ms pile up in the syscall but the Ps keep running other goroutines.

sysmon, the background watcher

One special M runs without a P in its own loop: sysmon. Between iterations it sleeps for 20 µs to 10 ms (backing off adaptively while there is nothing to do), and on each wake-up it does five things:

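Retake P from blocked syscalls. If a P has been stuck in a syscall for more than one sysmon tick (at least 20 µs) and there's other work, hand it off to another M.
Force preemption. If a goroutine has been on-CPU for more than 10 ms, request preemption. Since Go 1.14 this includes sending SIGURG to the thread for asynchronous preemption.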
Trigger GC. If GC hasn't run in 2 minutes, force it.
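Poll the network. If nothing has polled the netpoller in the last 10 ms, collect the netpoll-ready goroutines and put them on a run queue. If timers are overdue, make sure something wakes up to run them.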
Spawn an M. If all Ps have work but no M is available (rare).

sysmon never holds a lock for long and never holds a P, so it keeps making progress even when every P is busy or stuck. It lives outside the usual M:P:G dance.
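The adaptive backoff is simple enough to model directly. The sketch below assumes the constants used in the runtime source (20 µs starting sleep, doubling after 50 consecutive idle wake-ups, 10 ms cap); the function name nextDelay is made up for this example and the real loop is not structured as a pure function.

```go
package main

import (
	"fmt"
	"time"
)

// nextDelay models sysmon's sleep interval. prev is the previous sleep and
// idle counts consecutive wake-ups in which sysmon found nothing to do.
// (Illustrative model only; the constants are assumptions stated above.)
func nextDelay(prev time.Duration, idle int) time.Duration {
	const (
		minDelay = 20 * time.Microsecond
		maxDelay = 10 * time.Millisecond
	)
	d := prev
	switch {
	case idle == 0:
		d = minDelay // found work last time: return to the short sleep
	case idle > 50:
		d *= 2 // roughly 1 ms of idle cycles has passed: start doubling
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	d := time.Duration(0)
	for idle := 0; idle <= 62; idle++ {
		d = nextDelay(d, idle)
		if idle == 0 || idle >= 50 {
			fmt.Printf("idle=%d sleep=%v\n", idle, d)
		}
	}
}
```

A busy program keeps sysmon at the short interval, so preemption and syscall retakes stay responsive. An idle program lets it drift to the cap, so the watcher itself costs almost nothing.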

GOMAXPROCS — the knob that matters

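GOMAXPROCS is the number of Ps the runtime creates, so it caps parallelism for user code. The default is runtime.NumCPU(); since Go 1.25 the runtime on Linux lowers that default when the process's cgroup has a smaller CPU limit.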

The two interesting cases:

Containers with CPU limits. Take a pod limited to 2 CPUs on a 64-CPU host: NumCPU() returns 64, but the cgroup quota grants only 2 CPUs' worth of run time. With a NumCPU-based default (Go releases before 1.25) the scheduler creates 64 Ps, the kernel throttles the process once the quota is spent, and you get stalls and latency spikes. Fix: set GOMAXPROCS manually, or use automaxprocs to read the cgroup quota.
Lock-heavy or syscall-heavy workloads. More Ps don't help if your goroutines spend most of their time waiting. Sometimes GOMAXPROCS = CPUs / 2 reduces contention. Profile before and after changing it.
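For the container case, the arithmetic behind "read the cgroup quota" looks roughly like this. The sketch assumes cgroup v2, where the cpu.max file holds "<quota> <period>" or "max <period>", and assumes the file is visible at /sys/fs/cgroup/cpu.max inside the container. The function name procsFromCPUMax is made up for this example, rounding up is a choice of this sketch, and none of this claims to mirror how automaxprocs or the runtime implement it.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax derives a GOMAXPROCS value from the contents of a
// cgroup v2 cpu.max file. It falls back to numCPU when there is no limit
// or the input cannot be parsed. (Hypothetical helper for illustration.)
func procsFromCPUMax(cpuMax string, numCPU int) int {
	fields := strings.Fields(cpuMax)
	if len(fields) != 2 || fields[0] == "max" {
		return numCPU // no quota configured
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return numCPU
	}
	n := int(math.Ceil(quota / period)) // 1.5 CPUs of quota -> 2 Ps
	if n < 1 {
		n = 1
	}
	if n > numCPU {
		n = numCPU // never exceed the CPUs that actually exist
	}
	return n
}

func main() {
	// A missing file yields empty data, which falls back to NumCPU.
	data, _ := os.ReadFile("/sys/fs/cgroup/cpu.max")
	n := procsFromCPUMax(string(data), runtime.NumCPU())
	prev := runtime.GOMAXPROCS(n) // returns the previous setting
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, n)
}
```

For the 2-CPU pod above, a quota of 200000 µs per 100000 µs period yields 2 Ps instead of 64.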

Reading GODEBUG=schedtrace

Set GODEBUG=schedtrace=1000 and the runtime prints a one-line summary to standard error every 1000 ms. Decoding it:

SCHED 1011ms: gomaxprocs=8 idleprocs=3 threads=14 spinningthreads=0 idlethreads=4 runqueue=2 [0 0 1 0 0 5 0 0]

Field	Means
`gomaxprocs`	Number of `P`s
`idleprocs`	Ps with nothing to run
`threads`	Total Ms
`spinningthreads`	Ms actively trying to steal
`idlethreads`	Ms parked with nothing to do
`runqueue`	Length of the global queue
`[a b c ...]`	Length of each P's local queue

Sustained large runqueue + non-zero local queues means you have more work than CPU. Sustained imbalance across local queues means stealing isn't keeping up (rare, but happens with very short-lived goroutines).
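If you want to watch these numbers over time rather than eyeball them, the line is easy to parse. Below is a minimal sketch that assumes the format shown in the sample line above; the type SchedTrace and the function parseSchedTrace are hypothetical names. Any key=value pair is accepted, so extra fields printed by other Go versions end up in the map.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SchedTrace holds one parsed schedtrace line.
type SchedTrace struct {
	Fields map[string]int // key=value pairs such as gomaxprocs, runqueue
	Local  []int          // per-P local run queue lengths
}

// parseSchedTrace splits a line into its key=value fields and the
// bracketed list of local queue lengths.
func parseSchedTrace(line string) (SchedTrace, error) {
	st := SchedTrace{Fields: map[string]int{}}
	lo, hi := strings.Index(line, "["), strings.LastIndex(line, "]")
	if lo < 0 || hi < lo {
		return st, fmt.Errorf("no per-P queue list in %q", line)
	}
	for _, tok := range strings.Fields(line[:lo]) {
		key, val, ok := strings.Cut(tok, "=")
		if !ok {
			continue // skips "SCHED" and the timestamp
		}
		n, err := strconv.Atoi(val)
		if err != nil {
			return st, fmt.Errorf("field %s: %w", key, err)
		}
		st.Fields[key] = n
	}
	for _, tok := range strings.Fields(line[lo+1 : hi]) {
		n, err := strconv.Atoi(tok)
		if err != nil {
			return st, fmt.Errorf("local queue length %q: %w", tok, err)
		}
		st.Local = append(st.Local, n)
	}
	return st, nil
}

func main() {
	line := "SCHED 1011ms: gomaxprocs=8 idleprocs=3 threads=14 " +
		"spinningthreads=0 idlethreads=4 runqueue=2 [0 0 1 0 0 5 0 0]"
	st, err := parseSchedTrace(line)
	if err != nil {
		panic(err)
	}
	backlog := st.Fields["runqueue"]
	for _, n := range st.Local {
		backlog += n
	}
	fmt.Printf("Ps=%d idle=%d total runnable backlog=%d\n",
		st.Fields["gomaxprocs"], st.Fields["idleprocs"], backlog)
}
```

Summing the global and local queue lengths gives a single backlog figure, which is convenient to plot when checking for sustained overload.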
