Sync primitives
The sync package is small but every type inside it is doing something
clever. sync.Mutex spins briefly before parking, switches to starvation mode after
1 ms of unfairness, and degrades gracefully under contention. sync.Map keeps two
maps to avoid locking the read path. sync.Once is a double-checked atomic flag with
a fallback mutex. The package compiles down to roughly what you'd write by hand —
but only after you'd written it three times wrong.
sync.Mutex — fast path, slow path, starvation mode
A Mutex is a single 32-bit word plus a semaphore. The bottom bits encode locked state, woken state, and starvation mode. The fast path is a single CAS to flip the locked bit — when uncontended, that's the whole story and the cost is one atomic instruction (~25 ns on modern x86).
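As a rough illustration of the fast path described above, here is a toy lock that implements only the uncontended CAS. The type toyMutex and its methods are hypothetical; the constant names are modelled on the standard library's flags, but this is a sketch, not the real implementation, and it has no slow path at all.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Flag bits packed into the low bits of the state word (illustrative layout).
const (
	mutexLocked   int32 = 1 << iota // bit 0: the mutex is held
	mutexWoken                      // bit 1: a waiter has been woken
	mutexStarving                   // bit 2: starvation mode is on
)

// toyMutex is a hypothetical type that models only the fast path.
type toyMutex struct {
	state int32
}

// tryFastLock is the uncontended acquire: one CAS from 0 to mutexLocked.
func (m *toyMutex) tryFastLock() bool {
	return atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked)
}

// unlock clears the locked bit. A real Mutex would also wake a waiter here.
func (m *toyMutex) unlock() {
	atomic.AddInt32(&m.state, -mutexLocked)
}

func main() {
	var m toyMutex
	fmt.Println(m.tryFastLock()) // true: the word was 0
	fmt.Println(m.tryFastLock()) // false: the locked bit is already set
	m.unlock()
	fmt.Println(m.tryFastLock()) // true again after unlock
}
```

When tryFastLock fails, the real Mutex falls through to the spin-then-park logic rather than returning false.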
Under contention, the slow path kicks in. The acquiring goroutine spins for a few
iterations (gated by the sync.runtime_canSpin check) on the chance the holder
will release momentarily. If the mutex is still held after the spin budget, the
goroutine parks on the runtime semaphore, the same sleep-and-wake primitive that
sync.RWMutex and sync.WaitGroup use under the hood (channels park goroutines
through their own wait queues instead). A waiter that has failed to acquire the
mutex for more than 1 ms flips it into starvation mode: from then on, each Unlock
hands the mutex directly to the head of the wait queue instead of letting a newly
arrived goroutine barge in. The mutex returns to normal mode when the waiter that
receives it is the last one in the queue or has waited less than 1 ms.
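From the caller's side, all of this machinery sits behind Lock and Unlock. A minimal sketch of a contended counter, assuming a hypothetical counter type and a hammer helper invented for the example:

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical type whose field n is guarded by mu.
type counter struct {
	mu sync.Mutex
	n  int
}

// inc increments n inside the critical section.
func (c *counter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// value reads n under the same lock.
func (c *counter) value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// hammer starts `workers` goroutines that each call inc `perWorker` times
// and returns the final count.
func hammer(workers, perWorker int) int {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.inc()
			}
		}()
	}
	wg.Wait()
	return c.value()
}

func main() {
	fmt.Println(hammer(8, 1000)) // 8000: no increments are lost
}
```

Whether a given Lock call takes the fast path, spins, or parks depends on timing; the result is the same either way.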
sync.RWMutex — readers and writers
RWMutex layers reader counting on top of a regular Mutex. Readers atomically increment a counter. A writer first takes the underlying Mutex, which excludes other writers, then signals "I'm waiting" by flipping the count negative, and finally waits for the readers already inside to drain. The negative count is the trick: it causes new readers to block until the writer is done.
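The negative-count trick is easier to see as arithmetic. The sketch below models only the counter, with no blocking; the function names are hypothetical, and the offset of 1 << 30 is chosen to match the standard library's reader limit.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// maxReaders is the offset a writer subtracts to make the count negative.
const maxReaders = 1 << 30

// readerArrives increments the count and reports whether the reader
// would have to wait because a writer has announced itself.
func readerArrives(readerCount *int32) (mustWait bool) {
	return atomic.AddInt32(readerCount, 1) < 0
}

// announceWriter flips the count negative and returns how many readers
// were already inside, which is how many the writer must wait for.
func announceWriter(readerCount *int32) int32 {
	return atomic.AddInt32(readerCount, -maxReaders) + maxReaders
}

func main() {
	var rc int32
	fmt.Println(readerArrives(&rc))  // false: no writer pending
	fmt.Println(readerArrives(&rc))  // false: two readers are now active
	fmt.Println(announceWriter(&rc)) // 2: the writer waits for both
	fmt.Println(readerArrives(&rc))  // true: the count is negative
}
```

The real RWMutex adds semaphores on both sides so that the waiting reader and writer actually sleep and get woken.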
RWMutex is faster than Mutex only when reads dominate and the critical section is long enough to amortise the extra atomic operations. Benchmarks consistently show that for short critical sections (a few hundred ns), a plain Mutex wins because RWMutex's bookkeeping cost exceeds the win from parallel reads. Reach for RWMutex when read times are in the microseconds, not nanoseconds.
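For reference, the usage pattern looks like this. The settings type and its methods are hypothetical, and a map lookup this short is exactly the kind of critical section where a plain Mutex may well be faster, so treat it as a shape to benchmark rather than a recommendation.

```go
package main

import (
	"fmt"
	"sync"
)

// settings is a hypothetical read-mostly store.
type settings struct {
	mu sync.RWMutex
	m  map[string]string
}

func newSettings() *settings {
	return &settings{m: make(map[string]string)}
}

// get takes the read lock, so any number of readers can run in parallel.
func (s *settings) get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

// set takes the write lock, which excludes readers and other writers.
func (s *settings) set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
}

func main() {
	s := newSettings()
	s.set("region", "eu-west")

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.get("region") // concurrent readers share the lock
		}()
	}
	wg.Wait()

	v, ok := s.get("region")
	fmt.Println(v, ok) // eu-west true
}
```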
sync.Once — the double-checked atomic flag
Once.Do guarantees its function runs exactly once across any number of concurrent callers. The implementation is six lines, and reading them is a small lesson in Go's memory model.
    // sync/once.go (simplified)
    type Once struct {
        done uint32 // set to 1 once f has completed
        m    Mutex  // serialises callers on the slow path
    }

    func (o *Once) Do(f func()) {
        // Fast path: one atomic load, no lock.
        if atomic.LoadUint32(&o.done) == 0 {
            o.doSlow(f)
        }
    }

    func (o *Once) doSlow(f func()) {
        o.m.Lock()
        defer o.m.Unlock()
        if o.done == 0 { // re-check under the lock
            defer atomic.StoreUint32(&o.done, 1) // runs after f returns
            f()
        }
    }

Fast path: one atomic load. If done is set, return immediately. Slow path:
grab the mutex, check again (the second check covers the case where another caller
was in doSlow at the same time), run f, then atomically set
done. The atomic store happens after f returns and is paired with the
fast-path atomic load, which gives the happens-before relationship that makes
subsequent callers see f's side effects.
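In application code this usually shows up as lazy initialisation. A minimal sketch, assuming a hypothetical lazyConfig type and a concurrentLoads helper that exists only to exercise it:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lazyConfig is a hypothetical holder that builds its value on first use.
type lazyConfig struct {
	once  sync.Once
	value map[string]string
	loads int32 // how many times the init function ran
}

// get initialises value at most once and returns it. Reading l.value
// after Do is safe because Do returns only after f has completed.
func (l *lazyConfig) get() map[string]string {
	l.once.Do(func() {
		atomic.AddInt32(&l.loads, 1)
		l.value = map[string]string{"mode": "prod"}
	})
	return l.value
}

// concurrentLoads calls get from n goroutines and reports how many
// times the init function actually ran.
func concurrentLoads(n int) int32 {
	var l lazyConfig
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = l.get()["mode"]
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&l.loads)
}

func main() {
	fmt.Println(concurrentLoads(50)) // 1
}
```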
sync.Map — when a regular map + Mutex isn't enough
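sync.Map is built for two specific workloads: keys written once and read many
times (caches that only grow), or goroutines that read, write, and overwrite
entries for disjoint sets of keys. In its classic implementation (Go 1.24
reimplemented it on top of a concurrent hash trie) it keeps two maps: a read-only
read map served lock-free, and a dirty map protected by a
Mutex.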
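Reads check the read map first; if the key is there, no lock is taken. Misses fall through to the dirty map under lock. When too many misses accumulate, the dirty map is promoted to be the new read map. Updates to a key already in the read map are done in place with an atomic operation on its entry; new keys go to the dirty map under the lock and become lock-free to read only after the next promotion.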
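A write-once, read-many cache is the textbook fit. The sketch below uses only Load, LoadOrStore, and Range from the sync.Map API; the cache type and its methods are hypothetical. Note that under a race the compute function may run more than once for the same key, but only one result is ever stored.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical grow-only cache keyed by string.
type cache struct {
	m sync.Map
}

// get returns the cached value for key, computing and storing it on a miss.
func (c *cache) get(key string, compute func(string) int) int {
	if v, ok := c.m.Load(key); ok {
		return v.(int) // hit: served without computing
	}
	// LoadOrStore keeps whichever value was stored first.
	v, _ := c.m.LoadOrStore(key, compute(key))
	return v.(int)
}

// size counts entries by walking the map with Range.
func (c *cache) size() int {
	n := 0
	c.m.Range(func(_, _ any) bool {
		n++
		return true // keep iterating
	})
	return n
}

func main() {
	var c cache
	length := func(s string) int { return len(s) }
	fmt.Println(c.get("hello", length)) // 5, computed
	fmt.Println(c.get("hello", length)) // 5, from the cache
	fmt.Println(c.size())               // 1
}
```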
Use it when the access pattern is read-heavy and the key set churns slowly. For
everything else — write-heavy, range-heavy, or small maps — a regular
map[K]V with a sync.RWMutex is faster and uses less memory.
sync.Map is not a drop-in faster replacement for map + Mutex: the read/dirty
split adds overhead that's only worth paying for a specific access pattern.
Benchmark before swapping in.
sync.WaitGroup — counter with a wakeup
WaitGroup is conceptually trivial: a counter, an atomic Add, and Wait that blocks until the counter hits zero. The implementation packs the counter and a waiter count into a single 64-bit word, so Add is a single atomic add and Wait registers itself with a short CAS loop before sleeping.
The one rule that catches people: Add must happen before Wait
is called, not after. Calling wg.Add(1) inside the goroutine you just
launched is a race — the parent's wg.Wait() may see a zero counter and
return immediately. Always Add before go.
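The safe ordering looks like this. The sumSquares function is a hypothetical example; the point is only where wg.Add(1) sits relative to the go statement.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares squares each number in its own goroutine and sums the results.
func sumSquares(nums []int) int {
	var wg sync.WaitGroup
	results := make([]int, len(nums))
	for i, n := range nums {
		wg.Add(1) // in the parent, before the goroutine exists
		go func(i, n int) {
			defer wg.Done()
			results[i] = n * n // each goroutine writes its own slot
		}(i, n)
	}
	wg.Wait() // cannot return early: every Add has already happened

	total := 0
	for _, r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3, 4})) // 30
}
```

Moving wg.Add(1) to the first line inside the goroutine would compile and often appear to work, which is what makes the bug easy to miss.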
sync/atomic — when to skip the mutex entirely
An atomic operation costs one CAS or one memory barrier, roughly 5–25 ns on modern x86.
An uncontended Mutex acquisition costs about the same; a contended one, much more. For a
single counter, a single flag, or a single pointer, sync/atomic is the
right tool.
Anything more — two pieces of state that must update together, or a counter plus a value that depend on each other — needs a Mutex. The trap is "I'll do two atomic ops back to back"; without a lock, another goroutine can observe the intermediate state.
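A sketch of both cases side by side, assuming Go 1.19 or later for atomic.Int64. The stats type and the run helper are hypothetical: the hit counter stands alone, so an atomic is enough, while count and sum must change together, so they share a Mutex.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stats holds two fields that must always be updated together.
type stats struct {
	mu    sync.Mutex
	count int
	sum   int
}

// add updates both fields inside one critical section, so no reader
// can observe count without the matching sum.
func (s *stats) add(v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.count++
	s.sum += v
}

// snapshot returns a consistent pair.
func (s *stats) snapshot() (count, sum int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.count, s.sum
}

// run records the values 1..workers from separate goroutines.
func run(workers int) (hits int64, count, sum int) {
	var h atomic.Int64 // single independent counter: atomic is enough
	var s stats
	var wg sync.WaitGroup
	for i := 1; i <= workers; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			h.Add(1)
			s.add(v)
		}(i)
	}
	wg.Wait()
	count, sum = s.snapshot()
	return h.Load(), count, sum
}

func main() {
	hits, count, sum := run(10)
	fmt.Println(hits, count, sum) // 10 10 55
}
```

Replacing count and sum with two separate atomics would keep each field correct on its own but let a reader see a count that includes a value the sum does not.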