The runtime package

The runtime package is the standard-library bridge to the scheduler and the heap. It exposes a small, carefully curated subset of internals: a few knobs you can turn, a few counters you can read, and a handful of hooks you probably shouldn't touch. Knowing which is which is most of the value.

What runtime is

Most of what the Go runtime does — scheduling, garbage collection, stack growth, memory allocation — happens behind the language. The runtime package is the small doorway the standard library leaves open. Through it you can read a handful of counters, turn a few knobs, and register a few callbacks.

The API divides roughly into three groups. Diagnostic: ReadMemStats, Stack, NumGoroutine, NumCgoCall — read-only views into runtime state. Operational: GOMAXPROCS, GC, SetCPUProfileRate — actually change runtime behaviour. Dangerous: SetFinalizer, KeepAlive, Goexit, LockOSThread — correct uses exist but most code that reaches for them shouldn't.

The rule of thumb. If a function in runtime has an obvious equivalent elsewhere — defer file.Close() instead of SetFinalizer, a metric library instead of a hand-rolled ReadMemStats loop — prefer the equivalent. The runtime package is a fallback, not a default.

GOMAXPROCS

GOMAXPROCS controls the number of P structures the scheduler creates, which caps how many goroutines can be executing user code in parallel. From Go 1.5 the default was runtime.NumCPU(); Go 1.25 and later also lower it to the cgroup CPU limit on Linux when one is set. You can override it at startup with the GOMAXPROCS environment variable, or change it at runtime by calling runtime.GOMAXPROCS(n); a value below 1 leaves the setting alone and just returns the current one.

The catch in containers, on Go 1.24 and earlier: NumCPU counts the logical CPUs the process is allowed to run on (it honours the affinity mask) but ignores the cgroup CPU quota. A process running with a 2-core quota on a 64-core host will spin up 64 P's, oversubscribe the cgroup, and pay the latency tax of constant throttling. On those versions the standard fix is uber-go/automaxprocs: a blank import that reads the cgroup CPU limit and sets GOMAXPROCS accordingly.

import _ "go.uber.org/automaxprocs"

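// or set it explicitly in the container spec. Derive the value from the
// CPU limit; nproc is no help here, because it also ignores the CFS quota:
//   GOMAXPROCS=2
//
// or programmatically, where cgroupCPUQuota is a hypothetical helper
// that reads the cgroup limit:
//   runtime.GOMAXPROCS(cgroupCPUQuota())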
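One possible shape for such a helper, as a sketch only. It assumes cgroup v2 with the unified hierarchy mounted at /sys/fs/cgroup, and the function name parseCPUMax is made up for this example. Rounding the quota up to whole CPUs is an arbitrary choice here; automaxprocs and the runtime make their own rounding decisions.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseCPUMax interprets the contents of a cgroup v2 cpu.max file,
// which holds "<quota> <period>" in microseconds, or "max <period>"
// when no limit is set. It returns the quota rounded up to whole CPUs.
func parseCPUMax(contents string) (cpus int, limited bool) {
	fields := strings.Fields(contents)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return 0, false
	}
	return int(math.Ceil(quota / period)), true
}

func main() {
	// Assumed path: the unified (v2) hierarchy as seen inside a container.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("no cgroup v2 CPU limit readable:", err)
		return
	}
	// Only ever lower the setting; never raise it above NumCPU.
	if n, ok := parseCPUMax(string(data)); ok && n < runtime.NumCPU() {
		runtime.GOMAXPROCS(n)
	}
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```

This ignores cgroup v1 and nested hierarchies, which is a large part of why a maintained library is usually the better choice.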

runtime.NumGoroutine

NumGoroutine returns the number of goroutines that currently exist. It's cheap: it reads a few scheduler counters, takes no locks, and does not stop the world, so it is safe to call on a hot path. Most metric exporters publish it as a gauge.

The useful diagnostic pattern is leak detection by delta. Snapshot the count when the process reaches steady state, run a workload that should fully drain, snapshot again. If the count climbs and never falls back, you have a leak — every iteration spawned a goroutine that never exited. A goroutine pprof profile then gives you the parked stacks; the ones with the highest counts are usually the culprits.

before := runtime.NumGoroutine()
runWorkload()
waitForDrain()
after := runtime.NumGoroutine()
if after-before > slack {
    // leak suspected — dump pprof goroutine profile
}

runtime.Stack and stack traces

runtime.Stack(buf, all) writes a formatted stack trace into buf. With all = false, it dumps the calling goroutine. With all = true, it walks every live goroutine and writes them all — the same output you get from a SIGQUIT (kill -3) and from an unrecovered panic with GOTRACEBACK=all.

The all-goroutines variant is heavy. The runtime briefly stops the world, walks every goroutine's stack, formats each frame. On a process with tens of thousands of goroutines this is hundreds of milliseconds of pause. Useful for crash dumps and one-off investigations; not something to wire into a /healthz endpoint.

Sizing the buffer. A truncated dump is worthless. The conventional shape is a loop that doubles the buffer until Stack returns less than its length — start at 64 KB and grow. Or use runtime/pprof.Lookup("goroutine").WriteTo(w, 2), which handles the buffering for you and gives the same format.
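The doubling loop described above might look like the following sketch. The function name allStacks is an assumption; the only runtime call is runtime.Stack.

```go
package main

import (
	"fmt"
	"runtime"
)

// allStacks returns the formatted stacks of every goroutine, doubling
// the buffer until the dump fits. Stack reports how many bytes it wrote;
// a result equal to len(buf) means the output may have been truncated.
func allStacks() []byte {
	buf := make([]byte, 64<<10) // start at 64 KB
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	dump := allStacks()
	fmt.Printf("captured %d bytes\n%s", len(dump), dump)
}
```

Each retry stops the world again, so a generous starting size is cheaper than many small doublings on a process with a lot of goroutines.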

MemStats and gctrace

runtime.ReadMemStats(&m) fills a MemStats struct with about thirty allocator and GC counters. The fields that matter most:

Field	Meaning
`HeapAlloc`	bytes of allocated heap objects: reachable ones plus garbage the GC has not yet freed
`HeapSys`	bytes obtained from the OS for the heap
`HeapInuse`	bytes in spans currently in use
`HeapIdle`	bytes in idle spans, returnable to OS
`NextGC`	target `HeapAlloc` for next GC
`NumGC`	cumulative count of completed GC cycles
`PauseNs`	circular buffer (256 entries) of recent stop-the-world pause times, in nanoseconds

GODEBUG=gctrace=1 is the human-readable version: one line per GC cycle on stderr, with wall-clock duration, CPU fraction, and heap sizes before and after. Useful during local development; in production, prefer scraping MemStats into your metric system.

pprof

runtime/pprof exposes the standard profilers: CPU, heap, goroutine, block, mutex, threadcreate, and a few others. net/http/pprof wraps them as HTTP handlers under /debug/pprof. Importing the latter as a blank import is enough to register the routes on http.DefaultServeMux:

import _ "net/http/pprof"

// /debug/pprof/profile?seconds=30  — CPU profile
// /debug/pprof/heap                — heap snapshot
// /debug/pprof/goroutine?debug=2   — all goroutine stacks
// /debug/pprof/block               — blocking profile (needs SetBlockProfileRate)
// /debug/pprof/mutex               — mutex contention (needs SetMutexProfileFraction)

In production these endpoints should sit behind auth or on a separate admin port — they leak source paths and can be expensive to serve. The analyser is go tool pprof, which can read either a file or a URL directly, and renders flame graphs in a browser via -http=:8080.

runtime/trace

runtime/trace.Start(w) turns on the execution tracer. Unlike pprof, which samples, the tracer captures every goroutine state transition, every GC phase, every scheduler event, every network poll, every syscall: a complete timeline. The format is binary; read it with go tool trace trace.out, which opens a browser view of the goroutine timeline, syscall and GC bands, and per-goroutine breakdowns.

The price is volume. A few minutes of trace from a busy server can produce hundreds of megabytes. Tracing also adds noticeable overhead — usually under 10% CPU but enough to matter under contention. Use it for one-off investigations of latency anomalies and scheduler weirdness, not as a continuous profile.

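// inside a function that returns error; imports os and runtime/trace
f, err := os.Create("trace.out")
if err != nil {
    return err
}
defer f.Close()

// Start fails if tracing is already enabled.
if err := trace.Start(f); err != nil {
    return err
}
defer trace.Stop()

// run the workload under investigation

// then: go tool trace trace.out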

SetFinalizer

runtime.SetFinalizer(obj, fn) arranges for fn(obj) to be called at some point after the GC determines obj is unreachable. The intended use is cleanup of non-Go resources — file descriptors, C handles, mmap'd regions — when explicit Close can't be guaranteed.

The pitfalls are unusual enough that most code should avoid finalizers entirely:

Not guaranteed to run. On normal exit, pending finalizers are skipped. If the program crashes or is killed, none run. Don't depend on a finalizer for correctness.
Runs on a runtime-owned goroutine. All finalizers in a program run sequentially on a single goroutine with no caller and no recovery. A panic in a finalizer crashes the program, and a slow finalizer delays every one queued behind it.
Resurrects the object briefly. The finalizer holds a reference, so the object survives one extra GC cycle. If a cycle of objects includes one with a finalizer, the cycle is not guaranteed to be collected and the finalizer is not guaranteed to run.

The better default. An explicit Close() method invoked by defer, plus a check in tests that callers actually call it. The finalizer is a safety net for things like os.File in the standard library, not a primary lifecycle mechanism.
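A minimal sketch of that arrangement, loosely modelled on how os.File pairs Close with a finalizer. The handle type and open function are hypothetical stand-ins for a wrapper around a non-Go resource, and the closed flag is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// handle is a hypothetical wrapper around some non-Go resource.
type handle struct {
	closed bool
}

func open() *handle {
	h := &handle{}
	// Safety net only: runs if a caller drops h without closing it.
	runtime.SetFinalizer(h, func(h *handle) { _ = h.Close() })
	return h
}

// Close releases the resource and clears the finalizer, so a properly
// closed handle is collected like any ordinary object.
func (h *handle) Close() error {
	if h.closed {
		return errors.New("handle: already closed")
	}
	h.closed = true
	runtime.SetFinalizer(h, nil)
	return nil
}

func main() {
	h := open()
	defer h.Close() // the primary lifecycle mechanism
	fmt.Println("working with handle")
}
```

Clearing the finalizer in Close avoids the extra GC cycle for the common, well-behaved path.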

Common pitfalls and operational gotchas

Calling runtime.GC() in production. It forces a full GC synchronously. Almost always wrong — the runtime's pacer is better at choosing when. The one legitimate use is right before ReadMemStats in benchmarks, to get deterministic numbers.
Reading MemStats too often. ReadMemStats stops the world while it gathers its counters, so every call briefly pauses all goroutines, and a tight polling loop turns that into measurable latency. Once per second is plenty for metrics.
Assuming GOTRACEBACK=all in production. The default is single: on crash, only the panicking goroutine's stack is printed. For postmortem debugging you almost always want all or system, set via env var on the container.
Forgetting that trace files are huge. Leaving trace.Start on for an hour on a busy service will fill the disk. Cap the duration and rotate.
Exposing /debug/pprof without auth. The blank import registers handlers on http.DefaultServeMux, which is often the same mux serving public traffic. Bind pprof to an admin port or wrap in middleware.

Production checklist

Setting	Recommendation
`GOMAXPROCS`	Set from cgroup quota — `automaxprocs` or env from orchestrator
`/debug/pprof`	Exposed on a separate admin port, or behind auth middleware
Goroutine count + MemStats	Scraped periodically as metrics (per minute is fine)
`runtime/trace`	One-off investigations only; bounded duration; rotate files
`GOTRACEBACK`	`all` in production for crash diagnostics
Cleanup	Explicit `Close()` via `defer`; finalizers as a last resort
`GOMEMLIMIT`	Soft limit; set to ~90% of container memory so the GC reins in heap growth before the container limit is hit

Most of these are one-line decisions made at service-template time. Getting them right once means every service inherits sane runtime behaviour without anyone thinking about it.
