Profiling in production

The other methods predict where the bottleneck is. Profiling confirms it on real production state, with real cache contents, real GC behaviour, real lock contention. The tools change every few years; the discipline of measuring without distorting what you measure is what carries across them.

Sampling vs instrumentation

There are two ways to attribute time to code, and the choice between them shapes everything else about a profiling session. Sampling interrupts the program many times per second and records what's currently running on the stack; the result is statistically representative with low overhead. Instrumentation inserts hooks (manual or via a compiler pass) at function entry and exit; the result is exact but adds per-call overhead and can change what you measure.

The mental model for sampling is a stroboscope. You do not watch the program continuously. You take a photograph of the call stack many times a second and assume that over thousands of photographs, the fraction of frames in which a function appears equals the fraction of time the program spent in it. If parseObject is on top of the stack in 280 of 1,000 samples, you conclude that about 28% of the on-CPU time was spent in parseObject's own code. The accuracy is statistical and improves with the sample count, the same way a poll gets tighter with more respondents. A 30-second sampling run at 100 Hz gives you roughly 3,000 samples for each thread that stays on-CPU throughout, which is plenty to find anything that matters and far too few to trust a function that shows up twice.
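The polling analogy can be made concrete with a little arithmetic. The sketch below treats each sample as an independent yes/no draw (an assumption; real samples are somewhat correlated) and uses the normal approximation to the binomial for an approximate 95% margin. The function name `sample_share` is illustrative and not part of any profiler.

```python
import math

def sample_share(hits: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return (share, margin) for a function seen in `hits` of `total` samples.

    margin is the half-width of an approximate 95% confidence interval,
    using the normal approximation to the binomial distribution.
    """
    if total <= 0 or not 0 <= hits <= total:
        raise ValueError("need 0 <= hits <= total and total > 0")
    share = hits / total
    margin = z * math.sqrt(share * (1 - share) / total)
    return share, margin

# The example from the text: parseObject on top in 280 of 1,000 samples.
share, margin = sample_share(280, 1000)
print(f"parseObject: {share:.1%} +/- {margin:.1%}")

# A function seen twice in 3,000 samples: the margin exceeds the estimate,
# so the number says almost nothing.
rare_share, rare_margin = sample_share(2, 3000)
print(f"rare function: {rare_share:.3%} +/- {rare_margin:.3%}")
```

The approximation is crude at very small counts, but the direction of the result is the point: a wide box backed by hundreds of samples is trustworthy to within a few percentage points, while a function seen a handful of times is noise.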

Instrumentation works differently. Instead of asking "what is running right now," it records "I entered function f at time t1, I left at t2," for every single call. That gives you exact call counts and exact per-call timing, which sampling can never give you. The price is that the recording itself runs inside the hot path. A function called ten million times that does almost no work will have its true cost swamped by the timestamp read and the bookkeeping the instrumentation adds on every entry and exit. The profile becomes a measurement of the profiler. This is the trap that catches people who reach for cProfile in Python or a tracing wrapper in JavaScript and conclude that some tiny accessor is the bottleneck: it is the most-called function, so it absorbs the most instrumentation overhead, so it rises to the top whether or not it was ever the problem.
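A small experiment makes the trap visible. The sketch below wraps a trivial accessor in a hand-written timing decorator (the names `instrument`, `get_x`, and `run` are hypothetical, chosen for this illustration) and times the same loop with and without the wrapper. The call count it records is exact; the per-call time it records is mostly the cost of the recording itself.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function stats gathered by the hypothetical `instrument` decorator.
stats = defaultdict(lambda: {"calls": 0, "total_ns": 0})

def instrument(fn):
    """Record an exact call count and cumulative time for every call to fn."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = stats[fn.__name__]
            entry["calls"] += 1
            entry["total_ns"] += time.perf_counter_ns() - start
    return wrapper

def get_x(point):
    """A tiny accessor: almost no real work per call."""
    return point[0]

traced_get_x = instrument(get_x)

def run(accessor, n):
    """Call accessor n times and return the elapsed wall time in nanoseconds."""
    point = (1, 2)
    start = time.perf_counter_ns()
    for _ in range(n):
        accessor(point)
    return time.perf_counter_ns() - start

N = 200_000
plain_ns = run(get_x, N)
traced_ns = run(traced_get_x, N)

print(f"calls recorded: {stats['get_x']['calls']}")
print(f"plain loop:  {plain_ns / N:.0f} ns per call")
print(f"traced loop: {traced_ns / N:.0f} ns per call")
```

The exact numbers depend on the machine, so none are quoted here. On a typical run the traced loop is expected to be several times slower per call than the plain one, and that difference is overhead the profile would attribute to the accessor.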

	Sampling	Instrumentation
Overhead	1–5%, controllable by sampling rate	10–100%, depending on call rate
Accuracy	Statistical — needs enough samples	Exact within the instrumented scope
Best for	Production. Finding hot paths.	Dev/staging. Specific function timing.
Misses	Very short functions that fall between samples	Anything not instrumented
Tools	perf, async-profiler, pprof, eBPF	strace, DTrace probes, manual `time.Now()`

Default to sampling in production. The overhead is low enough that you can run it continuously. Instrumentation is better when you've narrowed the search to one function and need exact per-call timing, but most of the value of production profiling is "where is the time going across the whole process," which is exactly what sampling answers.

Sampling is a stroboscope. It never watches the program continuously; it counts how often each function is on the stack and turns counts into percentages.

On-CPU vs off-CPU

Sampling profilers split into two flavours that answer different questions. Both matter, and getting them confused is the most common mistake in profiling.

On-CPU profiling samples threads that are actively running on a CPU. Tells you where the process spends its computation. The flame graph everyone knows.
Off-CPU profiling samples threads that are blocked (on locks, on I/O, on conditions). Tells you where the process spends its waiting. The "I/O wait" flame graph; much less famous but often more useful.

A service whose CPU utilisation is 20% but whose throughput is half what it should be is almost certainly off-CPU bound. On-CPU profiling will show a small, evenly-distributed flame graph that doesn't explain the symptom. Off-CPU profiling will show a tall stack stuck in futex_wait or read or poll, which is the actual bottleneck.

The reason this split exists at all comes down to how the operating system runs threads. A thread is either on a CPU executing instructions, or it is off the CPU because the scheduler took it off, almost always because it asked to wait for something: a lock held by another thread, bytes from a socket, a page fault, a disk read, a condition variable. On-CPU profiling samples the first state and is blind to the second. If your wall-clock latency is 200 ms but only 8 ms of that is on-CPU, an on-CPU profile is explaining 4% of the problem and silently ignoring the rest. The other 192 ms is off-CPU, and you will not see it until you go looking for it specifically.
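The gap between wall-clock and on-CPU time is easy to observe from inside a process. This sketch uses Python's `time.perf_counter()` for wall time and `time.process_time()` for CPU time; the `waiting_request` function is a stand-in that sleeps to imitate a request blocked on a lock or a socket.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call to fn.

    perf_counter() is wall-clock time; process_time() counts only CPU time
    consumed by this process, so time spent sleeping or blocked is excluded.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def waiting_request():
    """Stands in for a request that mostly waits (lock, socket, disk)."""
    time.sleep(0.2)                      # off-CPU: the thread is descheduled
    sum(i * i for i in range(10_000))    # a little on-CPU work

wall, cpu = measure(waiting_request)
off_cpu = max(wall - cpu, 0.0)
print(f"wall {wall * 1e3:.0f} ms, on-CPU {cpu * 1e3:.0f} ms, "
      f"off-CPU {off_cpu * 1e3:.0f} ms")
```

An on-CPU profiler only ever sees the small second number. Everything in the third is invisible to it.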

Off-CPU profiling is harder to capture, which is part of why it stayed obscure for so long. On-CPU sampling can use a simple timer interrupt that fires on whatever CPU the thread is using. Off-CPU work happens precisely when the thread is not scheduled, so there is no timer of yours running on it. The standard technique instead hooks the scheduler itself: record a timestamp when a thread goes off-CPU, record another when it comes back, attribute the difference to the stack that was running when it blocked. On Linux this is what the offcputime tool from bcc does with a kprobe on the scheduler's context-switch path. The output is a flame graph where width is blocked time rather than CPU time, and the leaves are the calls that put the thread to sleep.
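The bookkeeping behind that technique is simple enough to sketch. The code below is a simulation over a hand-written event list, not a real scheduler hook: the event tuples, thread ids, and stack names are invented for illustration, and it makes no attempt to separate voluntary blocking from preemption.

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """Attribute blocked time to the stack that was running when a thread blocked.

    events: iterable of (timestamp_ms, thread_id, kind, stack), time-ordered,
    where kind is "switch_out" (thread leaves the CPU) or "switch_in"
    (thread is scheduled again). stack is only meaningful on "switch_out".
    """
    blocked_at = {}                 # thread_id -> (timestamp, stack)
    totals = defaultdict(float)     # stack -> total blocked milliseconds
    for ts, tid, kind, stack in events:
        if kind == "switch_out":
            blocked_at[tid] = (ts, stack)
        elif kind == "switch_in" and tid in blocked_at:
            start, blocked_stack = blocked_at.pop(tid)
            totals[blocked_stack] += ts - start
    return dict(totals)

# Hypothetical trace: two threads, stacks written root-first.
LOCK_STACK = ("main", "handleRequest", "cache.Get", "futex_wait")
READ_STACK = ("main", "handleRequest", "readBody", "read")
events = [
    (0,   1, "switch_out", LOCK_STACK),
    (5,   2, "switch_out", READ_STACK),
    (40,  2, "switch_in",  None),
    (150, 1, "switch_in",  None),
    (160, 1, "switch_out", LOCK_STACK),
    (202, 1, "switch_in",  None),
]

blocked = off_cpu_by_stack(events)
for stack, ms in sorted(blocked.items(), key=lambda kv: -kv[1]):
    print(f"{ms:6.0f} ms  {';'.join(stack)}")
```

The per-stack totals are exactly what an off-CPU flame graph draws as width, with the blocking call (here `futex_wait` or `read`) as the leaf.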

Wall-clock time is on-CPU plus off-CPU. When most of a request is spent waiting, the on-CPU flame graph points at the wrong function, and only off-CPU profiling explains the latency.

Flame graphs

Brendan Gregg's flame graph format took stack-trace data, the output every sampling profiler produces, and made it readable. Each box represents a function in a stack frame; the width is proportional to the percentage of samples that hit that frame; the y-axis is stack depth. Boxes stack on top of their callers, so a box sits directly above the function that called it. Reading from the bottom up, you are reading the call tree; reading the width of any box, you are reading the time spent in that function and everything it called.

The thing that makes the format work is that it collapses thousands of individual stack samples into one picture. A profiler captures stacks like main → handleRequest → decode → parseObject over and over. Many of those stacks share a prefix. The flame graph merges every sample that shares a prefix into the same box and makes that box as wide as the number of samples it represents. The x-axis is not time in the chronological sense; adjacent boxes are not "what ran next." The x-axis is the merged population of samples, sorted alphabetically so the picture is stable run to run. That is worth internalising, because the most common beginner mistake is to read a flame graph left to right like a timeline. It is a histogram of stacks, not a trace.
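The merge step can be sketched in a few lines. The code below folds a handful of made-up samples into the semicolon-separated "folded" text form that flame graph tooling commonly consumes, then counts how many samples sit under each call-path prefix, which is the width of the corresponding box. The sample data and function names are illustrative.

```python
from collections import Counter

def fold(samples):
    """Collapse identical stacks into 'frame;frame;frame count' lines.

    Lines are sorted alphabetically, which is what keeps the layout stable
    from run to run.
    """
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

def widths(samples):
    """Return the number of samples under each call-path prefix (box width)."""
    boxes = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            boxes[tuple(stack[:depth])] += 1
    return boxes

# Hypothetical samples, root-first, in the order they were captured.
samples = [
    ["main", "handleRequest", "decode", "parseObject"],
    ["main", "handleRequest", "writeResponse"],
    ["main", "handleRequest", "decode", "parseObject"],
    ["main", "handleRequest", "decode"],
    ["main", "handleRequest", "decode", "parseObject"],
]

for line in fold(samples):
    print(line)

boxes = widths(samples)
total = len(samples)
decode = ("main", "handleRequest", "decode")
print(f"decode box: {boxes[decode]}/{total} samples = {boxes[decode] / total:.0%}")
```

Capture order plays no part in either function, which is the histogram-not-timeline property in code form: shuffling `samples` produces the same folded lines and the same widths.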

A flame graph reading: handleRequest dominates, and inside it the tower of decode → parseObject → scanToken is the hot path. The fix lives in scanToken, not in the wide-but-shallow handleRequest that merely contains it.

Reading a flame graph quickly is a skill that pays off across every language and every profiler:

Wide is hot. A function whose box spans 40% of the chart's width appeared in 40% of samples. That's where time goes.
Tall isn't necessarily bad. Deep stacks just mean deep call chains. Width matters; depth doesn't.
Plateaus are bottlenecks. A flat top — same function across many adjacent samples — means the program is stuck in that function. That's usually the answer.
Look at the children of wide frames. If handleRequest is 80% wide, the answer is in its children — the actual leaf functions doing the work.

Differential flame graphs. Subtract a "before" profile from an "after" profile to see what changed. Boxes that grew are drawn in red and show regressions; boxes that shrank are drawn in a contrasting colour (blue in Brendan Gregg's original tooling, green in some UIs) and show wins. This is how you defend the impact of an optimisation without arguing about averages.
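The subtraction itself is straightforward once both profiles are in folded form. This is a minimal sketch, assuming each profile is a dict from folded stack to sample count; the stacks and counts are invented. It normalises to shares before subtracting so that captures of different lengths stay comparable, which is one reasonable design choice rather than the only one.

```python
def diff_profiles(before, after):
    """Compare two folded profiles given as {stack: sample_count} dicts.

    Counts are normalised to shares first, so captures of different lengths
    can be compared. Returns (stack, delta) pairs, largest regression first.
    """
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    deltas = {}
    for stack in set(before) | set(after):
        share_before = before.get(stack, 0) / total_before
        share_after = after.get(stack, 0) / total_after
        deltas[stack] = share_after - share_before
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical profiles before and after a deploy.
before = {"main;handle;decode": 600, "main;handle;cache.Get": 20,
          "main;handle;write": 380}
after = {"main;handle;decode": 900, "main;handle;cache.Get": 560,
         "main;handle;write": 540}

changes = diff_profiles(before, after)
for stack, delta in changes:
    label = "grew" if delta > 0 else "shrank" if delta < 0 else "same"
    print(f"{delta:+.1%}  {label:6}  {stack}")
```

Note that `decode` has more raw samples after the deploy yet a smaller share. Comparing shares rather than counts is what keeps a longer capture from looking like a regression everywhere.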

The tool landscape

The right profiler depends on the runtime and the platform. The list below covers the ones in active use in modern production environments. Two threads run through all of them and are worth understanding before the table, because they explain why one profiler works on your service and the next produces garbage.

The first is stack unwinding. To record a sample, a profiler has to walk the call stack from the current instruction back to the thread's entry point. There are a few ways to do that. The cheapest is to follow frame pointers, a register convention where each stack frame stores a pointer to its caller's frame, so walking the stack is just chasing a linked list. Compilers have traditionally omitted frame pointers in optimised builds to free up a register, which speeds up the code and breaks the profiler at the same time: without them, samples taken in optimised code cannot be unwound and show up as [unknown]. The alternative is DWARF-based unwinding, which reads debug metadata to reconstruct each frame; it is correct without frame pointers but far more expensive per sample, sometimes too expensive to run continuously. The modern compromise, used by eBPF profilers, is to ship the unwind tables into the kernel and walk them there. Whichever path you are on, the rule stands: if your flame graph is full of broken or truncated stacks, the unwinder is the first thing to check.
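The "chasing a linked list" description can be shown with a toy model. The sketch below simulates memory as a dict and symbol lookup as an exact-address table, both simplifications: a real unwinder reads raw stack words and resolves addresses against symbol ranges. All addresses and names are made up.

```python
def unwind(memory, symbols, pc, fp, max_depth=64):
    """Walk a simulated stack by following saved frame pointers.

    memory maps a frame-pointer address to (saved_caller_fp, return_address),
    mirroring the two words a frame-pointer convention keeps per frame.
    symbols maps an instruction address to a function name.
    Returns function names, innermost frame first.
    """
    frames = [symbols.get(pc, "[unknown]")]
    while fp != 0 and len(frames) < max_depth:
        if fp not in memory:
            # The chain is broken, e.g. a frame compiled without frame pointers.
            frames.append("[unknown]")
            break
        fp, return_address = memory[fp]
        frames.append(symbols.get(return_address, "[unknown]"))
    return frames

symbols = {0x4010: "scanToken", 0x3020: "parseObject", 0x2030: "decode",
           0x1040: "main", 0x0500: "_start"}
memory = {
    0x7F00: (0x7F40, 0x3020),   # scanToken's frame, called from parseObject
    0x7F40: (0x7F80, 0x2030),   # parseObject's frame, called from decode
    0x7F80: (0x7FC0, 0x1040),   # decode's frame, called from main
    0x7FC0: (0x0000, 0x0500),   # main's frame, called from _start
}

full = unwind(memory, symbols, pc=0x4010, fp=0x7F00)
print(" <- ".join(full))

# Simulate one frame that did not maintain the chain.
broken_memory = {addr: words for addr, words in memory.items() if addr != 0x7F40}
broken = unwind(broken_memory, symbols, pc=0x4010, fp=0x7F00)
print(" <- ".join(broken))
```

The second walk stops as soon as the chain breaks, and everything above the break is lost. That is the truncated-stack symptom described above.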

The second thread is eBPF. The kernel can now run small, verified programs in response to events, including the timer interrupt that drives sampling and the scheduler events that drive off-CPU profiling. This is what lets a single low-overhead agent profile every process on a host, kernel and userspace alike, without modifying any of them. The bcc and bpftrace toolkits expose this directly for ad-hoc work, like a one-line off-CPU profile or a latency histogram of a specific syscall, and the continuous-profiling systems build on the same foundation under the hood. The learning curve is real, but eBPF is why production profiling stopped requiring a special build of your binary.

Tool	Targets	Notes
perf	Linux, any language with frame pointers or DWARF	The kernel-provided baseline. Captures CPU samples, hardware counters, kernel and user stacks. `perf record` + `perf script` + flamegraph.pl is the canonical pipeline.
async-profiler	JVM (Java, Kotlin, Scala, Clojure)	The JVM profiler everyone uses. Uses AsyncGetCallTrace; safe in production; produces both on-CPU and off-CPU flame graphs. Built-in lock and allocation profiling.
pprof	Go, plus others via the format	Go's standard profiler, surfaced via `net/http/pprof`. The pprof file format is now lingua franca — read by many UIs.
py-spy / Austin	CPython	Sampling profilers that work without modifying the process. Critical because cProfile is instrumenting and skews results.
BPF / bcc / bpftrace	Linux kernel + userspace	Programmable kernel-level tracing. Great for off-CPU profiling, lock tracing, latency histograms. Steep learning curve; high payoff.
VTune / Linux perf annotate	x86, low-level	For when sampling stack traces isn't enough and you need cycle-level or microarchitectural detail. Pair with top-down.
Pyroscope / Parca / Polar Signals	Cross-language continuous profilers	Run all the time, store profiles centrally, query like metrics. The continuous-profiling category — increasingly the default in cloud-native shops.

Continuous profiling

The newer wave of profilers runs all the time, at a low sampling rate (typically 19–100 Hz), shipping profiles to a central store. This changes how profiling fits into operations: instead of "page fires, capture a profile" you have "page fires, look at the profile from the moment it started". The historical view also lets you compare a regression to last week's baseline without having to reproduce the issue.

Practical considerations: overhead is about 1% at typical sampling rates, storage requires only a few MB per host per day with the pprof format's compression, and the query model is the same shape as metrics, with labels (service, version, instance) and time ranges. The trade-off is that you can't sample at arbitrarily high frequency without overhead growing; for one-off deep investigation, on-demand profiling at higher rates still beats continuous.

It helps to place continuous profiling next to the rest of your observability. Metrics tell you that a service is slow or hot. Traces tell you which span inside a request is slow and which downstream call it was waiting on. Neither tells you which lines of code inside that span are burning the CPU. Profiles fill exactly that gap: they are often called the fourth signal, after metrics, traces, and logs, and they are the one that resolves a slow span down to a function and a call stack. The newest systems are starting to stitch these together, attaching a trace or span id as a label on profile samples so you can pivot from a slow trace straight into the flame graph for just the requests in that trace. The operational picture, when it works, is a straight line: an alert fires on a metric, you find the slow span in a trace, and you land on the exact function in a profile, all without reproducing anything or attaching a debugger to production.
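The pivot from trace to profile is, at its core, a filter on labels. The sketch below assumes each profile sample carries a dict of labels alongside its stack; the label key `trace_id`, the ids, and the stacks are hypothetical, and real systems differ in how they attach and index these labels.

```python
from collections import Counter

def flame_for_trace(samples, trace_id):
    """Aggregate only the samples whose labels carry the given trace id.

    samples: iterable of (labels_dict, stack_tuple) pairs.
    Returns a Counter of folded stacks, the input a flame graph needs.
    """
    folded = Counter()
    for labels, stack in samples:
        if labels.get("trace_id") == trace_id:
            folded[";".join(stack)] += 1
    return folded

# Hypothetical labelled samples from one service.
samples = [
    ({"service": "api", "trace_id": "t-1"}, ("main", "handleRequest", "decode")),
    ({"service": "api", "trace_id": "t-2"}, ("main", "handleRequest", "writeResponse")),
    ({"service": "api", "trace_id": "t-1"}, ("main", "handleRequest", "decode")),
    ({"service": "api", "trace_id": "t-1"}, ("main", "handleRequest", "cache.Get")),
]

slow_trace = flame_for_trace(samples, "t-1")
for stack, count in slow_trace.most_common():
    print(count, stack)
```

A single trace rarely yields many samples at continuous-profiling rates, so in practice the filter is more useful across a group of similar slow traces than for one request.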

Continuous profiling also changes what "profile" means socially on a team. When profiling is a special event you run during an incident, it is a skill only a few people exercise and the data is gone the moment the incident ends. When profiles are always on and queryable by anyone, they become part of normal review: you can look at the flame graph for a service the way you look at its latency chart, compare this week to last, and catch a slow creep in allocation rate before it becomes a page. The historical baseline is the whole point. Without it, every regression starts from "we think it got slower"; with it, every regression starts from "this function went from 2% to 28% of CPU on the deploy at 14:03."

The observer effect

Every measurement changes what's measured to some extent. Profilers are no exception, and the failure mode is subtle — the profile becomes internally consistent but misleading. Patterns to watch for:

The profiler shifts the hot path. Instrumentation that adds 1 µs per function call will make functions called millions of times look hot when they aren't. Sampling avoids this; instrumenting profilers don't.
The profiler hides the hot path. If the profiler can't unwind through optimised code (no frame pointers), samples that should attribute to a hot function show up as [unknown] at the top of the stack. Compile with -fno-omit-frame-pointer when profiling.
The profiler changes scheduling. Sampling at very high rates can perturb the scheduler enough to change which threads contend. Keep sample rates modest (≤ 100 Hz) for off-CPU and lock profiling.
The profiler doesn't see what you think. JIT'd code, foreign-function calls, kernel code, and GPU work each need a specific profiler that knows how to unwind them. A pure-userspace profiler cannot see into the kernel, so it charges kernel time to the userspace frame that made the syscall and shows nothing of what happened beneath it.

A worked profiling session

A real example: an HTTP service whose P99 latency degraded by 30% after a deploy. Metrics didn't point at any obvious cause.

# Step 1: capture a 30-second CPU profile (sampling, low overhead).
# The URL is quoted so the shell does not treat "?" as a glob character.
curl -o cpu.pb.gz \
  "http://prod-host:6060/debug/pprof/profile?seconds=30"

go tool pprof -http=:8080 cpu.pb.gz
# → flame graph shows a wide plateau on
#   "encoding/json.(*decodeState).object", at 28% of CPU samples.
# This is suspicious because JSON parsing wasn't touched in the deploy.

# Step 2: capture an off-CPU (block) profile to see what's blocked.
# Go's block profile stays empty unless the service has called
# runtime.SetBlockProfileRate with a non-zero rate.
curl -o block.pb.gz \
  "http://prod-host:6060/debug/pprof/block?seconds=30"

go tool pprof -http=:8081 block.pb.gz
# → most off-CPU time is sync.(*Mutex).Lock under a request handler.
# Not the JSON parser: a shared map.

# Step 3: capture mutex contention specifically.
# This likewise needs runtime.SetMutexProfileFraction to be non-zero.
curl -o mutex.pb.gz \
  "http://prod-host:6060/debug/pprof/mutex?seconds=30"

go tool pprof -http=:8082 mutex.pb.gz
# → 71% of mutex wait time on cache.shared.Get, which acquired
#   a global lock that the new deploy added to track hit/miss counters.

# Step 4: diff against last week's profile (continuous profiling)
# → the cache.shared.Get function went from ~2% of CPU to ~28% post-deploy.
# Cause: the hit/miss counters added in the deploy are updated under
#   that global lock on every cache lookup.

# Fix: per-shard counters, aggregated periodically. P99 returns to baseline.

Three takeaways from the session. First, the on-CPU profile pointed at JSON — which was downstream of the actual cause. Second, the off-CPU profile revealed the lock contention that was the real bottleneck. Third, continuous profiling (or any historical baseline) is what made it possible to attribute the regression to the deploy quickly, without having to reproduce.
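The fix named in the session, per-shard counters aggregated periodically, has a simple shape. The sketch below is an illustration of that structure and not the service's actual code; the class name and shard count are assumptions. It is written in Python for consistency with the other examples, and in CPython the global interpreter lock serialises bytecode anyway, so it shows the structure rather than a measurable speed-up.

```python
import threading

class ShardedCounter:
    """Hit counter split across shards so threads rarely share a lock."""

    def __init__(self, shards: int = 16):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def incr(self, key) -> None:
        # Pick a shard from the key's hash; only that shard's lock is taken.
        i = hash(key) % len(self._counts)
        with self._locks[i]:
            self._counts[i] += 1

    def total(self) -> int:
        # Aggregation is the rare path: take each shard lock briefly in turn.
        total = 0
        for i, lock in enumerate(self._locks):
            with lock:
                total += self._counts[i]
        return total

hits = ShardedCounter()

def worker(worker_id: int, lookups: int) -> None:
    for n in range(lookups):
        hits.incr(worker_id * lookups + n)   # stand-in for a cache key

threads = [threading.Thread(target=worker, args=(w, 10_000)) for w in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(hits.total())
```

The hot path (`incr`) touches one shard; only the periodic reader pays for visiting all of them. In a runtime with true parallelism, that is what removes the single contended lock from every cache lookup.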

Profiling allocation, locks, and GC

CPU profiling is the most common kind, but production performance questions often need others. The right profile type matches the question, and reaching for the wrong one wastes a debugging session.

Memory and allocation profiling deserves special attention because it answers a question CPU profiling cannot. A heap profile is a snapshot: it tells you what is alive right now and which call site allocated it, which is the tool for a leak or a memory ceiling you keep hitting. An allocation profile is a rate: it tells you which call sites are churning the most bytes over a window, even if those bytes are short-lived and freed immediately. The two diverge constantly. A function that allocates a million tiny slices per second and frees them just as fast shows almost nothing in a heap snapshot, because nothing survives to be counted, yet it can dominate an allocation profile and quietly drive your garbage collector into the ground. When CPU time mysteriously sits inside the runtime, in functions with names like mallocgc or gcBgMarkWorker, the on-CPU flame graph is telling you to go look at an allocation profile, because the real cause is upstream of the collector: something is making too much garbage.
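The divergence between the two views shows up clearly in a toy event stream. The sketch below replays invented alloc/free events and builds both profiles from them; the call-site names and sizes are made up, and a real profiler samples allocations rather than recording every one.

```python
from collections import Counter

def profiles(events):
    """Build both views from a stream of (op, site, object_id, size) events.

    op is "alloc" or "free". Returns (allocated, live):
      allocated: total bytes ever allocated per call site (allocation profile)
      live:      bytes still alive per call site at the end (heap snapshot)
    """
    allocated = Counter()
    live = Counter()
    owner = {}                      # object_id -> (site, size)
    for op, site, obj, size in events:
        if op == "alloc":
            allocated[site] += size
            live[site] += size
            owner[obj] = (site, size)
        elif op == "free" and obj in owner:
            freed_site, freed_size = owner.pop(obj)
            live[freed_site] -= freed_size
    return allocated, +live         # unary + drops sites with nothing left alive

events = []
# A hot loop that allocates 1,000 short-lived 64-byte buffers and frees each one.
for i in range(1000):
    events.append(("alloc", "parseObject", f"tmp{i}", 64))
    events.append(("free", "parseObject", f"tmp{i}", 64))
# A cache that allocates ten 4 KiB entries and keeps them.
for i in range(10):
    events.append(("alloc", "cache.Put", f"entry{i}", 4096))

allocated, live = profiles(events)
print("allocation profile:", dict(allocated))
print("heap snapshot:     ", dict(live))
```

The churning call site leads the allocation profile and is absent from the heap snapshot, while the cache is the only thing the snapshot shows. Which view is useful depends on whether the question is about garbage collector pressure or about memory that never comes back.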

Lock and mutex profiling is the off-CPU story made specific. Rather than "this thread was blocked," a mutex profile says "this thread was blocked waiting for this lock, acquired from these call sites, for this total time." That is usually enough to identify a single contended data structure, which is what most scaling cliffs come down to: a shared map, a global counter, a connection pool with one coarse lock. The fix is almost always to make the lock finer-grained or to remove the sharing, and the mutex profile is what tells you which lock is worth the effort.

Question	Profile type	Tools
Where is CPU going?	On-CPU sampling	perf, pprof, async-profiler
Where are we blocked?	Off-CPU sampling	async-profiler, bcc offcputime, pprof block
Where is memory being allocated?	Heap / allocation profile	pprof heap, async-profiler alloc, Java Flight Recorder
Why is GC pausing so long?	GC log + allocation profile	JFR, async-profiler alloc, language-specific GC logs
What lock is contended?	Lock / mutex profile	pprof mutex, async-profiler lock, bpftrace
What syscall is slow?	System-call latency histogram	bpftrace, bcc
What CPU pipeline stage is the issue?	Hardware-counter-based	perf with PMU events, top-down

Production checklist

Default to sampling, not instrumenting. Sampling profilers are safe in production; instrumenting profilers are not.
Compile with frame pointers. -fno-omit-frame-pointer for C/C++, -XX:+PreserveFramePointer for JVM. Without them, any profiler that unwinds by following frame pointers produces broken or truncated stacks.
Capture both on-CPU and off-CPU. The first tells you where CPU goes; the second tells you what's blocked. Most production puzzles need both.
Read flame graphs for width, not height. Wide plateaus are bottlenecks. Tall stacks are just deep call chains.
Run a continuous profiler if you ship to a fleet. Pyroscope, Parca, Polar Signals. The historical baseline is the difference between "we have a regression" and "this deploy caused this regression on this function."
Match the profile type to the question. Allocation profiles for GC investigations, mutex profiles for contention, syscall latency for I/O issues.
Always diff against a baseline. A single flame graph shows where time goes; a diff shows what changed. The latter is what you want during incident response.
