
Cgo

Cgo lets Go programs call C and lets C call Go. The mechanism is real and shipping in production — sqlite drivers, image codecs, libpq, anything wrapping a system library — but every cgo call carries a couple of hundred nanoseconds of overhead and turns the calling goroutine into something the scheduler treats specially. Knowing why explains when cgo is the right tool and when it quietly becomes the bottleneck.


What cgo generates

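When you write import "C", the cgo tool runs before the Go compiler. It reads the preamble (the comment immediately preceding import "C", which holds the #include lines and any other C declarations) and the C symbols you reference, generates wrapper functions in both Go and C, and emits the glue that calls between them. Each Go source file that imports "C" is rewritten into a pair of files, name.cgo1.go and name.cgo2.c. The package as a whole gets _cgo_gotypes.go with Go declarations matching the C types, plus _cgo_export.c and _cgo_export.h for any Go functions exported to C.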

// hello.go
package main

// #include <stdio.h>
// #include <stdlib.h>
import "C"
import "unsafe"

func main() {
    cstr := C.CString("hello from C\n")
    C.fputs(cstr, C.stdout)
    C.free(unsafe.Pointer(cstr))   // C.CString allocates on the C heap
}

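The two heaps are separate. C code must not hold a pointer into Go memory past the end of a cgo call: the garbage collector only tracks references it can see, so it may free the object, and goroutine stacks can move when they grow. Passing a Go pointer for the duration of a single call is allowed as long as the memory it points to contains no Go pointers of its own (Go 1.21 added runtime.Pinner for cases that need more). Cgo checks this at runtime: hand C a pointer to Go memory that itself holds Go pointers and the program panics unless you've set GODEBUG=cgocheck=0. To give C memory it can keep, use C.CString, C.CBytes, or an explicit C.malloc, and release it with C.free.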

runtime.cgocall — the boundary crossing

Every Go-to-C call goes through runtime.cgocall. The goroutine enters a syscall-like state, switches from its own stack to the system stack of its M (OS thread), runs the C function there, and switches back. While the M is in C, the runtime treats it like a syscall-blocked thread: if the call runs long, the P (logical processor) is handed to another M, started if necessary, so other goroutines keep running.

The numbers worth keeping in your head: a no-op cgo call is commonly measured at roughly 180–200 ns on modern x86, compared to 2–5 ns for a normal Go function call. That's roughly two orders of magnitude, though the exact figure depends on the Go version and the hardware, so benchmark your own build. For a hot loop crossing the boundary millions of times per second, cgo is wrong on its face. For a once-per-request call into libpq, the overhead is invisible.

The non-obvious rule. Batch at the boundary. Wrap N C operations in a single cgo call, not N cgo calls. The overhead is per crossing, not per operation — the C side can loop a million times before the cost matters again.

C calling Go — the inverse problem

Going the other way is trickier. A Go function callable from C is marked with //export Name and gets a C wrapper that acquires an M, sets up a goroutine context, runs the function, and tears the context down. The cost is of the same kind but larger, around a microsecond, and the same restriction on pointer ownership applies in reverse.

The classic place this shows up is callbacks. A C library that wants to call your function back (libcurl progress callbacks, GLFW input handlers, audio device callbacks) needs the exported wrapper. If those callbacks fire from a C-owned thread the wrapper has extra work to do — registering the thread with the runtime via needm — and the latency goes up.

Stack switching

Go goroutines use small, growable stacks (starting at 2 KB). C code expects a large contiguous stack (megabytes). Crossing the boundary requires a stack switch: the runtime moves execution to the system stack of the current M (the stack the OS allocated for that thread), passes a pointer to the packed arguments, runs C on that stack, and switches back. This switch is part of the per-call overhead and is unavoidable.

It's also why long-running C code blocks the M from doing anything else: the C stack can't be preempted by the scheduler the way a Go stack can. If the C function takes a long time, the scheduler will eventually spin up a new M to keep the P busy, but the blocked M sits unavailable until C returns.

When cgo is the right answer

A handful of legitimate cases. Wrapping a library that has no pure-Go equivalent (sqlite, libsodium, ffmpeg, image codecs, ML inference runtimes). Calling into a system API not yet exposed by x/sys. Embedding Go inside a C/C++ host (the inverse case: a plugin or a shared library that Go exports via -buildmode=c-shared).

And the cases where it's wrong. A pure-Go alternative exists and is fast enough (most crypto, most parsing, most networking). A hot path that crosses the boundary millions of times per second. Code that must be easy to debug with standard Go tooling: C frames don't show up in Go stack traces, and the GDB/LLDB experience is uneven.
