Goroutines

A goroutine is a stackful coroutine that the Go runtime multiplexes onto OS threads. The runtime tracks each one with a small g struct, gives it a tiny initial stack, and schedules it against thousands of siblings. The cost of a single go f() is small but not zero, and there are a handful of failure modes worth knowing about before you write the first one.


What a goroutine actually is

In runtime/runtime2.go there is a struct called g with dozens of fields. The most important are stack, m (the OS thread it is currently running on, if any), sched (the saved register state while it is parked), and atomicstatus (idle, runnable, running, syscall, waiting, or dead, plus a few transitional states).

When you write go f(x, y), the compiler emits a call to runtime.newproc. That function obtains a g (recycling a dead one from the free list when it can, allocating a fresh one otherwise), gives it a 2 KB stack, usually out of the per-P stack cache, points the saved program counter at f, queues it on the local run-queue of the current P, and returns. Total cost: a few hundred nanoseconds.

// runtime/proc.go (simplified)
func newproc(fn *funcval) {
    gp := getg()
    pc := getcallerpc()
    systemstack(func() {
        newg := newproc1(fn, gp, pc)
        runqput(getg().m.p.ptr(), newg, true)
        if mainStarted {
            wakep()
        }
    })
}
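
Seen from user code, the visible consequence of this design is that the function value and its arguments are evaluated in the calling goroutine, at the go statement, before the new g is queued. The sketch below illustrates that; spawnWithArgs is a hypothetical helper name, not something from the runtime.

```go
package main

import (
	"fmt"
	"sync"
)

// spawnWithArgs starts one goroutine and returns the value the goroutine
// saw alongside the final value of the caller's variable.
func spawnWithArgs() (seen, final int) {
	var wg sync.WaitGroup
	x := 1

	wg.Add(1)
	go func(v int) { // v is evaluated here, in the caller, before the spawn
		defer wg.Done()
		seen = v
	}(x)

	x = 2     // too late to affect v: the argument was already copied
	wg.Wait() // the goroutine's write to seen happens before Wait returns
	return seen, x
}

func main() {
	seen, final := spawnWithArgs()
	fmt.Println("goroutine saw:", seen, "caller ended with:", final)
}
```

Passing values as arguments rather than capturing variables in the closure is the usual way to make this snapshot explicit.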

Stacks that grow

Through Go 1.2 stacks were segmented: the runtime would link a new chunk on overflow and unlink it on return, which produced "hot split" pathology when a function repeatedly straddled the boundary. Since 1.3 stacks are contiguous and copying: when a function's prologue detects insufficient stack, the runtime allocates a stack twice the size, copies the old stack into it, walks the goroutine's frames to rewrite pointers, and frees the old one.

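Initial stack: 2 KB. Maximum stack: 1 GB on 64-bit systems (250 MB on 32-bit), adjustable with debug.SetMaxStack from the runtime/debug package. The growth check is a short compare-and-branch in the prologue of nearly every function, comparing SP against g.stackguard0. It is omitted only for functions marked nosplit, which includes small leaf functions whose frames fit in the guard area.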

Why this matters. A goroutine that allocates 8 KB of stack-local data is not 4× more expensive than one that allocates 2 KB: it pays for a growth and copy the first time it needs the space, then runs on the larger stack. A goroutine that recurses deeply pays repeatedly, once per doubling. Iterative algorithms keep stacks shallow.
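
To see growth in action, the sketch below recurses 100,000 levels inside a fresh goroutine, which forces the stack far past its initial 2 KB through a series of doublings. The names depth and depthInNewGoroutine are illustrative, and the 64 MB cap passed to debug.SetMaxStack is an arbitrary value chosen for the demonstration.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// depth recurses n levels. Go does not eliminate tail calls, so every
// level keeps a frame on the goroutine's stack until the recursion unwinds.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

// depthInNewGoroutine runs depth on a fresh goroutine, which starts with a
// small stack and must grow it several times to hold n frames.
func depthInNewGoroutine(n int) int {
	out := make(chan int)
	go func() { out <- depth(n) }()
	return <-out
}

func main() {
	// SetMaxStack returns the previous limit; exceeding the limit is fatal.
	prev := debug.SetMaxStack(64 << 20)
	fmt.Println("previous max stack (bytes):", prev)
	fmt.Println("reached depth:", depthInNewGoroutine(100000))
}
```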

Parking and unparking

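A goroutine is "parked" whenever it cannot make progress: blocked on a channel, waiting on a mutex, sleeping, or waiting in the netpoller for network I/O. The runtime calls gopark, which transitions the g from running to waiting and switches the m to the next runnable g. A blocking syscall is handled differently: the g moves to the syscall state and keeps its m, and the P is handed to another m if the call runs long.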

When the wait condition is satisfied, goready moves the g back to runnable and queues it on a P's run-queue. The saved context (PC, SP, and BP) lives in g.sched the whole time; because gopark is reached through an ordinary call, no other registers need to be preserved. Switching costs are dominated by the cache effects of touching the new stack, not the handful of register loads.

// Channel send is the canonical parking site:
//   1. acquire hchan.lock
//   2. if a receiver is waiting, hand off, unlock & return
//   3. if buffer has room, copy, unlock & return
//   4. otherwise: enqueue self on hchan.sendq, gopark() (which releases hchan.lock)
//   5. on wake: a receiver has already taken the value; clean up & return
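
The fast paths and the parking path in the outline above can be probed from user code with a non-blocking send. A select with a default case takes the same hand-off and buffer paths, but reports failure instead of parking at step 4. This is a minimal sketch; trySend is a hypothetical helper name.

```go
package main

import "fmt"

// trySend attempts to send v on ch without parking. It returns false in
// exactly the situation where a plain send would have to park.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	buffered := make(chan int, 1)
	fmt.Println(trySend(buffered, 1)) // true: the buffer has room (step 3)
	fmt.Println(trySend(buffered, 2)) // false: buffer full, a plain send would park (step 4)

	unbuffered := make(chan int)
	fmt.Println(trySend(unbuffered, 3)) // false: no receiver is waiting
}
```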

Preemption, in three eras

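Pre-1.2: no preemption at all. A goroutine gave up its thread only when it blocked or called runtime.Gosched.

1.2 to 1.13: cooperative only. The preemption check piggybacked on the stack-growth check in function prologues, so it fired only at (non-inlined) function calls. A goroutine in a tight loop that called no function could run forever and starve the scheduler. Real production stalls happened.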

1.14+: signal-based async preemption. The runtime's sysmon thread sends SIGURG to threads running goroutines that have been on-CPU more than 10 ms. The signal handler suspends the goroutine at a safe point, lets the scheduler run, and resumes. CPU-bound goroutines are now preempted reliably.
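
The difference is easy to demonstrate. The sketch below pins the program to a single P and starts a busy loop that makes no function calls; the caller can only regain the P to set the stop flag if the spinner is preempted. It assumes Go 1.14 or later on a platform with async preemption enabled; on older releases, or with GODEBUG=asyncpreemptoff=1, it would be expected to hang. The function name spinUntilStopped is illustrative.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinUntilStopped runs a call-free busy loop on a single P and returns
// true once the calling goroutine has managed to stop it.
func spinUntilStopped() bool {
	prev := runtime.GOMAXPROCS(1) // one P: spinner and caller must share it
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	done := make(chan struct{})
	go func() {
		for atomic.LoadInt32(&stop) == 0 {
			// Tight loop: no function calls, so no cooperative yield point.
		}
		close(done)
	}()

	time.Sleep(20 * time.Millisecond) // parks the caller; the spinner takes the P
	atomic.StoreInt32(&stop, 1)       // reached only after the spinner is preempted
	<-done
	return true
}

func main() {
	fmt.Println("spinner stopped:", spinUntilStopped())
}
```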

The seam that still leaks. Async preemption is only safe at points where the runtime can identify all live pointers — i.e. at "safepoints", which the compiler emits liberally but not universally. Hand-tuned assembly without the right pragmas, and some cgo entry points, are still preemption holes.

Goroutine leaks

A goroutine leak is just a parked goroutine that never gets unparked. The classic shapes:

  • Channel send to nobody. Producer sends on an unbuffered channel; the consumer panics or returns; producer parks forever.
  • Receive from nobody. Consumer reads from a channel; producer hits an error path that doesn't close; consumer parks forever.
  • Forgotten context. Long-running goroutine doesn't watch ctx.Done(); caller cancels; goroutine keeps running.
  • Mutex held by a dead goroutine. Goroutine A locks, then exits without unlocking: it returns early on an error path, or panics in code that doesn't defer Unlock() while a recover further up keeps the process alive. B parks forever on Lock.

The detection trick. Read runtime.NumGoroutine() at a known steady state, then again after a workload, then again after the workload "should have" drained. If the count climbs and never falls, you're leaking. The goroutine profile served by net/http/pprof at /debug/pprof/goroutine gives you the stack of every goroutine, grouped by identical stack; the groups with large counts are usually the leak.
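
A minimal sketch of the first leak shape and of the counting trick follows. The helper names (leakySend, safeSend, settledDelta) are hypothetical, and the 200 ms settling window is an arbitrary choice for the demonstration.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakySend starts a producer that sends on an unbuffered channel nobody
// ever reads from. The producer parks forever.
func leakySend() {
	ch := make(chan int)
	go func() {
		ch <- 42 // no receiver will ever arrive
	}()
}

// safeSend gives the channel one slot of buffer, so the send completes and
// the goroutine exits even though nobody receives.
func safeSend() {
	ch := make(chan int, 1)
	go func() {
		ch <- 42
	}()
}

// settledDelta runs f, waits up to 200 ms for the goroutine count to fall
// back to its starting value, and returns how many goroutines remain extra.
func settledDelta(f func()) int {
	before := runtime.NumGoroutine()
	f()
	deadline := time.Now().Add(200 * time.Millisecond)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("safe delta: ", settledDelta(safeSend))
	fmt.Println("leaky delta:", settledDelta(leakySend))
}
```

A buffer of one works here because there is exactly one send; with a long-lived producer, a context or a done channel checked in a select is the more general fix.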

The real cost of go f()

What                                        Cost
Allocate g + 2 KB stack                     ~200 ns (cached), ~1 μs (cold)
Park / unpark on a channel                  ~150 ns each side
Stack growth (2 KB → 4 KB)                  ~1 μs + memcpy
Context switch between Gs on the same M     ~50 ns + cache effects
OS thread context switch (M handoff)        ~1–2 μs
Goroutine running on its own M (syscall)    ~OS thread cost

Numbers are order-of-magnitude on a 2024-era x86 server. Run go test -bench on your own hardware if it matters; the numbers shift between architectures and Go versions.
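
For a quick measurement without setting up a _test.go file, testing.Benchmark can be called from an ordinary program. The sketch below times a spawn plus exit and a channel round trip; benchSpawn and benchChanRoundTrip are illustrative names, and the figures it prints will differ from the table above depending on hardware and Go version.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// benchSpawn measures starting a goroutine that immediately exits.
func benchSpawn(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go wg.Done()
	}
	wg.Wait()
}

// benchChanRoundTrip measures one ping and one pong over unbuffered
// channels, i.e. a park and an unpark on each side per iteration.
func benchChanRoundTrip(b *testing.B) {
	ping, pong := make(chan struct{}), make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()
	for i := 0; i < b.N; i++ {
		ping <- struct{}{}
		<-pong
	}
	close(ping) // lets the echo goroutine exit
}

func main() {
	fmt.Println("spawn:          ", testing.Benchmark(benchSpawn))
	fmt.Println("chan round trip:", testing.Benchmark(benchChanRoundTrip))
}
```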

When not to spawn one

  • Per-CPU work. If your work is CPU-bound and parallelizable, runtime.GOMAXPROCS(0) goroutines (by default, one per CPU core) is the most that gives you any speedup. More just adds scheduling overhead.
  • Per-request fanout, unbounded. A goroutine per fanout call from a request handler can produce a goroutine cliff under load. Bound it with a semaphore or a worker pool.
  • Tiny work units. If f() takes 100 ns to execute, the roughly 200 ns spawn costs more than the work itself. Call it inline.
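
For the unbounded fanout case, a buffered channel used as a counting semaphore is one common way to cap concurrency. The sketch below is a minimal version under assumed names (boundedSquares, with squaring standing in for the real per-item work); it also records the peak number of goroutines in flight so the bound can be checked.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// boundedSquares squares each input on its own goroutine, with at most
// limit goroutines in flight. It returns the sum of squares and the peak
// concurrency it observed.
func boundedSquares(inputs []int, limit int) (sum int64, peak int32) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var running int32

	for _, v := range inputs {
		sem <- struct{}{} // blocks while limit goroutines are in flight
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot (runs before wg.Done)

			n := atomic.AddInt32(&running, 1)
			for { // raise peak to n if n is a new maximum
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&sum, int64(v*v))
			atomic.AddInt32(&running, -1)
		}(v)
	}
	wg.Wait()
	return sum, peak
}

func main() {
	inputs := make([]int, 100)
	for i := range inputs {
		inputs[i] = i + 1
	}
	sum, peak := boundedSquares(inputs, 4)
	fmt.Println("sum:", sum, "peak concurrency:", peak)
}
```

Acquiring the slot before the go statement, rather than inside the goroutine, is what keeps the number of live goroutines bounded and not just the number doing work.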
