Goroutines
A goroutine is a stackful coroutine that the Go runtime multiplexes onto OS threads. The
runtime tracks each one with a small g struct, gives it a tiny initial stack,
and schedules it against thousands of siblings. The cost of a single go f() is
small but not zero, and there are a handful of failure modes worth knowing about before
you write the first one.
What a goroutine actually is
In runtime/runtime2.go there is a struct called g — about 80
fields, the most important being stack, m (the OS thread it's
currently running on, if any), sched (the saved register state when it's
parked), and atomicstatus (one of idle, runnable, running, syscall,
waiting, dead).
When you write go f(x, y), the compiler emits a call to
runtime.newproc. That function allocates a fresh g, gives it a
2 KB stack out of the per-P stack cache, sets up the saved program counter to point at
f, queues it on the local run-queue of the current P, and
returns. Total cost: a few hundred nanoseconds.
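Seen from user code, that whole path hides behind one keyword. The following is a minimal sketch, not runtime code; the helper name spawnSquares is made up for illustration. It relies on two things the language guarantees: the arguments of a go statement are evaluated in the spawning goroutine, and a sync.WaitGroup can be used to wait for the spawned goroutines to finish.

```go
package main

import (
	"fmt"
	"sync"
)

// spawnSquares (hypothetical helper) starts one goroutine per input and
// waits for all of them before returning the results.
func spawnSquares(inputs []int) []int {
	out := make([]int, len(inputs))
	var wg sync.WaitGroup
	for i, v := range inputs {
		wg.Add(1)
		// i and v are evaluated here, in the spawning goroutine, before
		// the new goroutine is queued to run.
		go func(i, v int) {
			defer wg.Done()
			out[i] = v * v // each goroutine writes a distinct element
		}(i, v)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(spawnSquares([]int{1, 2, 3})) // prints [1 4 9]
}
```

Each go statement here returns as soon as the new g is queued; only wg.Wait() blocks the caller.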

    // runtime/proc.go (simplified)
    func newproc(fn *funcval) {
        gp := getg()
        pc := getcallerpc()
        systemstack(func() {
            newg := newproc1(fn, gp, pc)
            runqput(getg().m.p.ptr(), newg, true)
            if mainStarted {
                wakep()
            }
        })
    }

Stacks that grow
Through Go 1.2, stacks were segmented: the runtime would link a new chunk on overflow and unlink it on return, which produced the "hot split" pathology when a call repeatedly straddled the boundary. Since Go 1.3, stacks are contiguous and copying: when a function's prologue detects insufficient stack, the runtime allocates a stack twice the size, copies the old stack into it, walks the goroutine's frames to rewrite pointers into the old stack, and frees the old one.
Initial stack: 2 KB. Maximum stack: 1 GB on 64-bit (250 MB on 32-bit) by default, adjustable with
SetMaxStack in runtime/debug. The growth check is a short compare-and-branch inserted at the
top of every non-leaf function, comparing SP against
g.stackguard0.
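Stack growth is easy to observe indirectly. The sketch below is illustrative only: the helpers recurse and deepInGoroutine are invented names, and the 64 MB limit is an arbitrary value chosen for the demo. It recurses far deeper than a 2 KB stack could hold, which only works because the runtime keeps copying the goroutine onto larger stacks.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// recurse burns a little stack in every frame and returns the depth reached.
func recurse(n int) int {
	var pad [64]byte // gives each frame some real stack to occupy
	pad[n%len(pad)] = byte(n)
	if n == 0 {
		return int(pad[0])
	}
	return 1 + recurse(n-1)
}

// deepInGoroutine (hypothetical helper) runs recurse on a fresh goroutine,
// which starts on a small stack and has to grow many times.
func deepInGoroutine(depth int) int {
	done := make(chan int)
	go func() { done <- recurse(depth) }()
	return <-done
}

func main() {
	// debug.SetMaxStack returns the previous limit in bytes.
	prev := debug.SetMaxStack(64 << 20)
	fmt.Println("previous max stack (bytes):", prev)
	fmt.Println("depth reached:", deepInGoroutine(100000))
}
```

If the recursion were unbounded, the goroutine would hit the configured maximum and the program would crash with a stack overflow instead of consuming all memory.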
Parking and unparking
A goroutine is "parked" any time it cannot make progress: blocked on a channel, waiting on
a mutex, sleeping, or waiting in the netpoller for network I/O. The runtime
calls gopark, which transitions the g from
running to waiting and switches the m to the next runnable
g. A blocking syscall is tracked separately: the g sits in the syscall state rather than waiting, and its P can be handed to another m.
When the wait condition is satisfied, goready moves the g back
to runnable and queues it on a P's run-queue. The actual register
state — PC, SP, BP, callee-saved registers — lives in g.sched the whole time.
Switching costs are dominated by the cache effects of touching the new stack, not the
handful of register loads.
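Whether a channel operation would park can be probed from user code with a select that has a default case. This is a small sketch under that assumption (trySend is a made-up helper, not a runtime function): the send succeeds when it could complete immediately and reports false in the cases where a plain send would have parked the goroutine.

```go
package main

import "fmt"

// trySend (hypothetical helper) attempts a send without ever parking.
// It returns false when a plain `ch <- v` would have had to wait.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	buffered := make(chan int, 1)
	fmt.Println(trySend(buffered, 1)) // true: the buffer has room
	fmt.Println(trySend(buffered, 2)) // false: buffer full, a plain send would park

	unbuffered := make(chan int)
	fmt.Println(trySend(unbuffered, 3)) // false: no receiver is waiting
}
```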

    // Channel send is the canonical parking site:
    // 1. acquire hchan.lock
    // 2. if a receiver is waiting, hand off, unlock & return
    // 3. if buffer has room, copy, unlock & return
    // 4. otherwise: enqueue self on hchan.sendq, gopark() (which releases hchan.lock)
    // 5. on wake: the receiver has already taken the value; clean up & return

Preemption, in three eras
Pre-1.14: cooperative only. The compiler inserted preemption checks at every function call. A goroutine in a tight loop that didn't call any function could run forever and starve the scheduler. Real production stalls happened.
1.14+: signal-based async preemption. The runtime's sysmon
thread sends SIGURG to threads running goroutines that have been on-CPU more
than 10 ms. The signal handler suspends the goroutine at a safe point, lets the
scheduler run, and resumes. CPU-bound goroutines are now preempted reliably.
The seam that still leaks. Async preemption is only safe at points where
the runtime can identify all live pointers — i.e. at "safepoints", which the compiler
emits liberally but not universally. Hand-tuned assembly without the right pragmas, and
some cgo entry points, are still preemption holes.
Goroutine leaks
A goroutine leak is just a parked goroutine that never gets unparked. The classic shapes:
- Channel send to nobody. Producer sends on an unbuffered channel; the consumer panics or returns; producer parks forever.
- Receive from nobody. Consumer reads from a channel; producer hits an error path that doesn't close the channel; consumer parks forever.
- Forgotten context. Long-running goroutine doesn't watch ctx.Done(); caller cancels; goroutine keeps running.
- Mutex held by a dead goroutine. Goroutine A locks; A panics in a library that doesn't defer Unlock(), and the panic is recovered further up; B parks forever on Lock.
To detect a leak, sample runtime.NumGoroutine() after a known
steady state, then again after a workload, then again after the workload "should have"
drained. If the count climbs and never falls, you're leaking. The pprof goroutine profile (/debug/pprof/goroutine)
gives you the stacks of every goroutine, parked ones included; the duplicates are usually the leak.

The real cost of go f()
| What | Cost |
|---|---|
| Allocate g + 2KB stack | ~200 ns (cached), ~1 μs (cold) |
| Park / unpark on a channel | ~150 ns each side |
| Stack growth (2KB → 4KB) | ~1 μs + memcpy |
| Context switch between G's on same M | ~50 ns + cache effects |
| OS thread context switch (M handoff) | ~1–2 μs |
| Goroutine running on its own M (syscall) | ~OS thread cost |
Numbers are order-of-magnitude on a 2024-era x86 server. Run go test -bench
on your own hardware if it matters; the numbers shift between architectures and Go
versions.
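For a quick number without setting up a test file, the testing package can also be driven from an ordinary program. The sketch below is one rough way to do that (spawnNsPerOp is a made-up helper); it measures spawning a goroutine that returns immediately, so the figure includes scheduling and WaitGroup overhead and should be read as an upper bound for the spawn itself.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// spawnNsPerOp (hypothetical helper) benchmarks "spawn a goroutine that
// returns immediately" and reports nanoseconds per spawn.
func spawnNsPerOp() int64 {
	testing.Init() // registers the testing flags; no effect if already called
	res := testing.Benchmark(func(b *testing.B) {
		var wg sync.WaitGroup
		for i := 0; i < b.N; i++ {
			wg.Add(1)
			go func() { wg.Done() }()
		}
		wg.Wait()
	})
	return res.NsPerOp()
}

func main() {
	fmt.Printf("go f(): ~%d ns per spawn on this machine\n", spawnNsPerOp())
}
```

The same function body dropped into a BenchmarkSpawn(b *testing.B) in a _test.go file gives the go test -bench form.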
When not to spawn one
- Per-CPU work. If you have N CPU cores and your work is parallelizable, runtime.GOMAXPROCS(0) goroutines is the ceiling that gives you any speedup. More just adds scheduling overhead.
- Per-request fanout, unbounded. A goroutine per fanout call from a request handler can produce a goroutine cliff under load. Bound it with a semaphore or a worker pool.
- Tiny work units. If f() takes 100 ns to execute, go f() at least doubles the cost just in the spawn. Inline it.
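Bounding a fanout with a semaphore takes only a buffered channel. The sketch below is one possible shape, not the only one; boundedFanout and its peak counter are invented for illustration, and using runtime.GOMAXPROCS(0) as the limit is an assumption that suits CPU-bound work.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// boundedFanout (hypothetical helper) applies work to every item using at
// most limit goroutines at a time. It also returns the peak number of
// goroutines that were in flight, purely so the bound can be checked.
func boundedFanout(items []int, limit int, work func(int) int) ([]int, int) {
	results := make([]int, len(items))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for i, item := range items {
		sem <- struct{}{} // blocks the spawner once limit workers are running
		wg.Add(1)
		go func(i, item int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when finished

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			results[i] = work(item)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(i, item)
	}
	wg.Wait()
	return results, peak
}

func main() {
	items := []int{1, 2, 3, 4, 5, 6, 7, 8}
	limit := runtime.GOMAXPROCS(0)
	results, peak := boundedFanout(items, limit, func(n int) int { return n * 2 })
	fmt.Println(results, "peak in flight:", peak, "limit:", limit)
}
```

Acquiring the slot before the go statement means the request handler itself slows down under load, rather than queuing an unbounded number of parked goroutines.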
Further reading
- runtime/proc.go: scheduler, newproc, gopark, goready.
- runtime/runtime2.go: the g, m, and p struct definitions.
- Proposal, Non-cooperative preemption: the design doc behind 1.14's signal-based async preemption.
- research!rsc, The Go memory model: Russ Cox's three-part series on the model goroutines run inside.
- Cox, Inside the runtime: a one-hour tour of the runtime, including goroutine creation and scheduling.