Profiling in production
The other methods predict where the bottleneck is. Profiling confirms it on real production state, with real cache contents, real GC behaviour, real lock contention. The tools change every few years; the discipline of measuring without distorting what you measure is what carries across them.
Sampling vs instrumentation
There are two ways to attribute time to code, and the choice between them shapes everything else about a profiling session. Sampling interrupts the program many times per second and records what's currently running on the stack; the result is statistically representative with low overhead. Instrumentation inserts hooks (manual or via a compiler pass) at function entry and exit; the result is exact but adds per-call overhead and can change what you measure.
The mental model for sampling is a stroboscope. You do not watch the program
continuously. You take a photograph of the call stack many times a second
and assume that over thousands of photographs, the fraction of frames in
which a function appears equals the fraction of time the program spent in
it. If parseObject is on top of the stack in 280 of 1,000
samples, you conclude it took about 28% of the on-CPU time. The accuracy is
statistical and improves with the sample count, the same way a poll gets
tighter with more respondents. A 30-second sampling run at 100 Hz gives you
roughly 3,000 samples per thread, which is plenty to find anything that
matters and far too few to trust a function that shows up twice.
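The poll analogy can be made concrete with a few lines of arithmetic. The sketch below is a rough illustration, not part of any profiler: it assumes samples are independent and uses the normal approximation to a binomial proportion, which is crude for very rare functions. The function name `on_cpu_share` is made up for the example.

```python
import math

def on_cpu_share(hits: int, total: int, z: float = 1.96):
    """Return (estimate, margin) for the fraction of samples containing a function.

    Uses the normal approximation to a binomial proportion, the same
    arithmetic behind a poll's margin of error. z=1.96 gives roughly 95%.
    """
    p = hits / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, margin

# The example from the text: 280 of 1,000 samples.
share, margin = on_cpu_share(280, 1000)
print(f"parseObject: {share:.1%} +/- {margin:.1%}")

# A function that shows up twice in a 3,000-sample run.
rare, rare_margin = on_cpu_share(2, 3000)
print(f"rare function: {rare:.3%} +/- {rare_margin:.3%}")
```

For the 280-sample case the margin is a few percentage points, which is tight enough to act on. For the two-sample case the margin is larger than the estimate itself, which is the numerical form of "too few to trust".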
Instrumentation works differently. Instead of asking "what is running right
now," it records "I entered function f at time t1, I left at t2," for every
single call. That gives you exact call counts and exact per-call timing,
which sampling can never give you. The price is that the recording itself
runs inside the hot path. A function called ten million times that does
almost no work will have its true cost swamped by the timestamp read and the
bookkeeping the instrumentation adds on every entry and exit. The profile
becomes a measurement of the profiler. This is the trap that catches people
who reach for cProfile in Python or a tracing wrapper in
JavaScript and conclude that some tiny accessor is the bottleneck: it is the
most-called function, so it absorbs the most instrumentation overhead, so it
rises to the top whether or not it was ever the problem.
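A small model shows how the distortion arises. The numbers below are invented for illustration (a 20 ns accessor, a 2 ms parser, 1 µs of instrumentation cost per call), and `apparent_shares` is a hypothetical helper, not a profiler API.

```python
def apparent_shares(workload, overhead_ns):
    """workload maps name -> (calls, true_ns_per_call).

    Returns name -> (true_share, measured_share), where the measured time
    of every call is inflated by a fixed per-call instrumentation cost.
    """
    true_ns = {n: calls * cost for n, (calls, cost) in workload.items()}
    seen_ns = {n: calls * (cost + overhead_ns) for n, (calls, cost) in workload.items()}
    true_total = sum(true_ns.values())
    seen_total = sum(seen_ns.values())
    return {n: (true_ns[n] / true_total, seen_ns[n] / seen_total) for n in workload}

# Hypothetical workload: a tiny accessor called ten million times and a
# heavy parser called a thousand times.
workload = {
    "get_field": (10_000_000, 20),          # 0.2 s of real work
    "parse_document": (1_000, 2_000_000),   # 2.0 s of real work
}
shares = apparent_shares(workload, overhead_ns=1_000)
for name, (true_share, measured) in shares.items():
    print(f"{name:15s} true {true_share:5.1%}  measured {measured:5.1%}")
```

Under these assumptions the accessor accounts for about 9% of the real work but more than 80% of the instrumented profile, purely because it is called the most.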
| | Sampling | Instrumentation |
|---|---|---|
| Overhead | 1–5%, controllable by sampling rate | 10–100%, depending on call rate |
| Accuracy | Statistical — needs enough samples | Exact within the instrumented scope |
| Best for | Production. Finding hot paths. | Dev/staging. Specific function timing. |
| Misses | Very short functions that fall between samples | Anything not instrumented |
| Tools | perf, async-profiler, pprof, eBPF | strace, DTrace probes, manual time.Now() |
On-CPU vs off-CPU
Sampling profilers split into two flavours that answer different questions. Both matter, and getting them confused is the most common mistake in profiling.
- On-CPU profiling samples threads that are actively running on a CPU. Tells you where the process spends its computation. The flame graph everyone knows.
- Off-CPU profiling samples threads that are blocked (on locks, on I/O, on conditions). Tells you where the process spends its waiting. The "I/O wait" flame graph; much less famous but often more useful.
A service whose CPU utilisation is 20% but whose throughput is half what
it should be is almost certainly off-CPU bound. On-CPU profiling will show
a small, evenly-distributed flame graph that doesn't explain the symptom.
Off-CPU profiling will show a tall stack stuck in futex_wait
or read or poll, which is the actual bottleneck.
The reason this split exists at all comes down to how the operating system runs threads. A thread is either on a CPU executing instructions, or it is off the CPU because the scheduler took it off, almost always because it asked to wait for something: a lock held by another thread, bytes from a socket, a page fault, a disk read, a condition variable. On-CPU profiling samples the first state and is blind to the second. If your wall-clock latency is 200 ms but only 8 ms of that is on-CPU, an on-CPU profile is explaining 4% of the problem and silently ignoring the rest. The other 192 ms is off-CPU, and you will not see it until you go looking for it specifically.
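The gap between wall-clock time and CPU time is easy to observe directly. The sketch below uses Python's standard clocks in a single-threaded process; `handler` is a stand-in for a request, with a sleep representing any kind of wait.

```python
import time

def cpu_vs_wall(fn):
    """Run fn once and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def handler():
    # Stand-in for a request: a little computation, then a wait that
    # represents a lock, a socket read, or a disk read.
    sum(i * i for i in range(20_000))
    time.sleep(0.2)

wall, cpu = cpu_vs_wall(handler)
print(f"wall {wall * 1000:.0f} ms, on-CPU {cpu * 1000:.0f} ms, "
      f"off-CPU share {(1 - cpu / wall):.0%}")
```

Note that `time.process_time()` counts CPU time for the whole process, so the comparison is only meaningful here because nothing else is running in other threads. An on-CPU profile of this handler would describe only the small CPU slice and say nothing about the sleep.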
Off-CPU profiling is harder to capture, which is part of why it stayed
obscure for so long. On-CPU sampling can use a simple timer interrupt that
fires on whatever CPU the thread is using. Off-CPU work happens precisely
when the thread is not scheduled, so there is no timer of yours
running on it. The standard technique instead hooks the scheduler itself:
record a timestamp when a thread goes off-CPU, record another when it comes
back, attribute the difference to the stack that was running when it blocked.
On Linux this is what the offcputime tool from bcc does with a
kprobe on the scheduler's context-switch path. The output is a flame graph
where width is blocked time rather than CPU time, and the leaves are the
calls that put the thread to sleep.
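The bookkeeping behind that technique is simple once the scheduler events are available. The sketch below is a user-space model of the accounting only: the event tuples are a made-up shape for the example, whereas real tools collect the equivalent data from kernel probes.

```python
from collections import defaultdict

def off_cpu_time(events):
    """Attribute blocked time to the stack that was running when a thread blocked.

    events is an iterable of (timestamp_us, thread_id, kind, stack) tuples,
    where kind is "off" or "on" and stack is a tuple of frame names
    (only meaningful on "off" events). This event shape is hypothetical.
    """
    went_off = {}                    # thread_id -> (timestamp, stack)
    blocked_us = defaultdict(int)    # stack -> total blocked microseconds
    for ts, tid, kind, stack in events:
        if kind == "off":
            went_off[tid] = (ts, stack)
        elif kind == "on" and tid in went_off:
            start, blocked_stack = went_off.pop(tid)
            blocked_us[blocked_stack] += ts - start
    return dict(blocked_us)

events = [
    (0,     1, "off", ("main", "handleRequest", "cacheGet", "futex_wait")),
    (100,   2, "off", ("main", "handleRequest", "readBody", "read")),
    (1_500, 2, "on",  ()),
    (4_000, 1, "on",  ()),
    (4_200, 1, "off", ("main", "handleRequest", "cacheGet", "futex_wait")),
    (9_200, 1, "on",  ()),
]
totals = off_cpu_time(events)
for stack, us in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(";".join(stack), us)
```

The output is already in the shape a flame graph needs: one line per stack, weighted by blocked microseconds instead of sample count.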
Flame graphs
Brendan Gregg's flame graph format took stack-trace data, the output every sampling profiler produces, and made it readable. Each box represents a function in a stack frame; the width is proportional to the percentage of samples that hit that frame; the y-axis is stack depth. Boxes stack on top of their callers, so a box sits directly above the function that called it. Reading from the bottom up, you are reading the call tree; reading the width of any box, you are reading the time spent in that function and everything it called.
The thing that makes the format work is that it collapses thousands of
individual stack samples into one picture. A profiler captures stacks like
main → handleRequest → decode → parseObject over and over. Many
of those stacks share a prefix. The flame graph merges every sample that
shares a prefix into the same box and makes that box as wide as the number of
samples it represents. The x-axis is not time in the chronological sense;
adjacent boxes are not "what ran next." The x-axis is the merged population
of samples, sorted alphabetically so the picture is stable run to run. That
is worth internalising, because the most common beginner mistake is to read a
flame graph left to right like a timeline. It is a histogram of stacks, not a
trace.
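The merge step can be sketched in a few lines. The example below folds raw stacks into the semicolon-separated "stack count" lines that flamegraph.pl takes as input, then computes the width of each box as the number of samples passing through each stack prefix. The sample data is invented, reusing the function names from the text.

```python
from collections import Counter

def fold(samples):
    """Collapse raw stack samples into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

def frame_widths(samples):
    """Width of every box: how many samples pass through each stack prefix."""
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[tuple(stack[:depth])] += 1
    return widths

# Ten hypothetical samples, root frame first.
samples = (
    [("main", "handleRequest", "decode", "parseObject", "scanToken")] * 5
    + [("main", "handleRequest", "decode", "parseObject")] * 2
    + [("main", "handleRequest", "writeResponse")] * 2
    + [("main", "logFlush")] * 1
)

for line in fold(samples):
    print(line)

widths = frame_widths(samples)
print(widths[("main", "handleRequest")], "of", len(samples), "samples pass through handleRequest")
```

Nothing in either function looks at when a sample was taken. Order is discarded at the first step, which is why the x-axis cannot be read as a timeline.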
Figure: handleRequest dominates, and inside it the tower of decode → parseObject → scanToken is the hot path. The fix lives in scanToken, not in the wide-but-shallow handleRequest that merely contains it.
Reading a flame graph quickly is a skill that pays off across every language and every profiler:
- Wide is hot. A function 40% of the way across the chart appeared in 40% of samples. That's where time goes.
- Tall isn't necessarily bad. Deep stacks just mean deep call chains. Width matters; depth doesn't.
- Plateaus are bottlenecks. A flat top — same function across many adjacent samples — means the program is stuck in that function. That's usually the answer.
- Look at the children of wide frames. If handleRequest is 80% wide, the answer is in its children, the actual leaf functions doing the work.
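The distinction between a wide frame and the leaf doing the work is the distinction between total (inclusive) and self (exclusive) samples. A minimal sketch, assuming a folded profile held in a dictionary with invented counts:

```python
from collections import Counter

def self_and_total(folded):
    """folded maps 'a;b;c' -> sample count. Returns (self, total) Counters."""
    self_samples, total_samples = Counter(), Counter()
    for stack, n in folded.items():
        frames = stack.split(";")
        self_samples[frames[-1]] += n    # the leaf is what was actually executing
        for name in set(frames):         # set() avoids double-counting recursion
            total_samples[name] += n
    return self_samples, total_samples

# Hypothetical folded profile, root frame first.
folded = {
    "main;handleRequest;decode;parseObject;scanToken": 5,
    "main;handleRequest;decode;parseObject": 2,
    "main;handleRequest;writeResponse": 2,
    "main;logFlush": 1,
}
self_samples, total_samples = self_and_total(folded)
for name, total in total_samples.most_common():
    print(f"{name:15s} total {total:2d}  self {self_samples[name]:2d}")
```

In this data handleRequest has the second-largest total and a self count of zero: it is wide only because of what it calls. scanToken is where total and self coincide, which is what a plateau at the top of the graph looks like in numbers.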
The tool landscape
The right profiler depends on the runtime and the platform. The list below covers the ones in active use in modern production environments. Two threads run through all of them and are worth understanding before the table, because they explain why one profiler works on your service and the next produces garbage.
The first is stack unwinding. To record a sample, a profiler has to walk the
call stack from the current instruction back to main. There are
a few ways to do that. The cheapest is to follow frame pointers, a register
convention where each stack frame stores a pointer to its caller's frame, so
walking the stack is just chasing a linked list. Compilers omit frame
pointers by default to free up a register, which speeds up the code and
breaks the profiler at the same time: without them, samples taken in
optimised code cannot be unwound and show up as [unknown]. The
alternative is DWARF-based unwinding, which reads debug metadata to
reconstruct each frame; it is correct without frame pointers but far more
expensive per sample, sometimes too expensive to run continuously. The
modern compromise, used by eBPF profilers, is to ship the unwind tables into
the kernel and walk them there. Whichever path you are on, the rule stands:
if your flame graph is full of broken or truncated stacks, the unwinder is
the first thing to check.
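Frame-pointer unwinding really is a linked-list walk, and a toy model makes the failure mode visible. In the sketch below, memory is a dictionary, the addresses and the symbol table are invented, and the layout (saved frame pointer at `fp`, return address at `fp + 8`) mirrors the common x86-64 convention. A real unwinder resolves address ranges to symbols rather than using exact matches.

```python
def walk_frame_pointers(memory, ip, fp, symbols, max_depth=64):
    """Return frame names, leaf first, by chasing saved frame pointers.

    memory[fp] holds the caller's frame pointer and memory[fp + 8] holds
    the return address into the caller. ip is the sampled instruction.
    """
    stack = [symbols.get(ip, "[unknown]")]
    while fp != 0 and len(stack) < max_depth:
        return_addr = memory.get(fp + 8)
        if return_addr is None:          # chain is broken: stop, do not guess
            stack.append("[unknown]")
            break
        stack.append(symbols.get(return_addr, "[unknown]"))
        fp = memory.get(fp, 0)
    return stack

# Toy symbol table and stack memory (all addresses are made up).
symbols = {0x3000: "scanToken", 0x2010: "parseObject", 0x1010: "main", 0x0500: "_start"}
memory = {
    0x7000: 0x7040, 0x7008: 0x2010,   # scanToken's frame: caller is parseObject
    0x7040: 0x7080, 0x7048: 0x1010,   # parseObject's frame: caller is main
    0x7080: 0x0000, 0x7088: 0x0500,   # main's frame: caller is _start, chain ends
}

good = walk_frame_pointers(memory, ip=0x3000, fp=0x7000, symbols=symbols)
print(good)

# Without frame pointers the register holds unrelated data, so the walk
# starts from a value that is not a frame address at all.
broken = walk_frame_pointers(memory, ip=0x3000, fp=0xDEAD, symbols=symbols)
print(broken)
```

The second call is the [unknown] case from the text: the leaf is known from the instruction pointer, but nothing above it can be recovered.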
The second thread is eBPF. The kernel can now run small, verified programs in response to events, including the timer interrupt that drives sampling and the scheduler events that drive off-CPU profiling. This is what lets a single low-overhead agent profile every process on a host, kernel and userspace alike, without modifying any of them. The bcc and bpftrace toolkits expose this directly for ad-hoc work, like a one-line off-CPU profile or a latency histogram of a specific syscall, and the continuous-profiling systems build on the same foundation under the hood. The learning curve is real, but eBPF is why production profiling stopped requiring a special build of your binary.
| Tool | Targets | Notes |
|---|---|---|
| perf | Linux, any language with frame pointers or DWARF | The kernel-provided baseline. Captures CPU samples, hardware counters, kernel and user stacks. perf record + perf script + flamegraph.pl is the canonical pipeline. |
| async-profiler | JVM (Java, Kotlin, Scala, Clojure) | The JVM profiler everyone uses. Uses AsyncGetCallTrace; safe in production; produces both on-CPU and off-CPU flame graphs. Built-in lock and allocation profiling. |
| pprof | Go, plus others via the format | Go's standard profiler, surfaced via net/http/pprof. The pprof file format is now lingua franca — read by many UIs. |
| py-spy / Austin | CPython | Sampling profilers that work without modifying the process. Critical because cProfile is instrumenting and skews results. |
| BPF / bcc / bpftrace | Linux kernel + userspace | Programmable kernel-level tracing. Great for off-CPU profiling, lock tracing, latency histograms. Steep learning curve; high payoff. |
| VTune / Linux perf annotate | x86, low-level | For when sampling stack traces isn't enough and you need cycle-level or microarchitectural detail. Pair with top-down. |
| Pyroscope / Parca / Polar Signals | Cross-language continuous profilers | Run all the time, store profiles centrally, query like metrics. The continuous-profiling category — increasingly the default in cloud-native shops. |
Continuous profiling
The newer wave of profilers runs all the time, at a low sampling rate (typically 19–100 Hz), shipping profiles to a central store. This changes how profiling fits into operations: instead of "page fires, capture a profile" you have "page fires, look at the profile from the moment it started". The historical view also lets you compare a regression to last week's baseline without having to reproduce the issue.
Practical considerations: overhead is about 1% at typical sampling rates, storage requires only a few MB per host per day with the pprof format's compression, and the query model is the same shape as metrics, with labels (service, version, instance) and time ranges. The trade-off is that you can't sample at arbitrarily high frequency without overhead growing; for one-off deep investigation, on-demand profiling at higher rates still beats continuous.
It helps to place continuous profiling next to the rest of your observability. Metrics tell you that a service is slow or hot. Traces tell you which span inside a request is slow and which downstream call it was waiting on. Neither tells you which lines of code inside that span are burning the CPU. Profiles fill exactly that gap: they are the fourth signal alongside metrics, logs, and traces, the one that resolves a slow span down to a function and a call stack. The newest systems are starting to stitch these together, attaching a trace or span ID as a label on profile samples so you can pivot from a slow trace straight into the flame graph for just the requests in that trace. The operational picture, when it works, is a straight line: an alert fires on a metric, you find the slow span in a trace, and you land on the exact function in a profile, all without reproducing anything or attaching a debugger to production.
Continuous profiling also changes what "profile" means socially on a team. When profiling is a special event you run during an incident, it is a skill only a few people exercise and the data is gone the moment the incident ends. When profiles are always on and queryable by anyone, they become part of normal review: you can look at the flame graph for a service the way you look at its latency chart, compare this week to last, and catch a slow creep in allocation rate before it becomes a page. The historical baseline is the whole point. Without it, every regression starts from "we think it got slower"; with it, every regression starts from "this function went from 2% to 28% of CPU on the deploy at 14:03."
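The comparison against a baseline is a small computation once both profiles are on hand. The sketch below assumes each profile has been reduced to a map from function name to sample count. The counts are invented, and `diff_profiles` is a hypothetical helper, not the API of any continuous-profiling product.

```python
def diff_profiles(before, after):
    """before/after map function name -> sample count.

    Returns (name, share_before, share_after, delta) rows, largest change
    first. Shares are compared rather than raw counts so that windows of
    different length or load remain comparable.
    """
    b_total, a_total = sum(before.values()), sum(after.values())
    rows = []
    for name in set(before) | set(after):
        b = before.get(name, 0) / b_total
        a = after.get(name, 0) / a_total
        rows.append((name, b, a, a - b))
    return sorted(rows, key=lambda r: -abs(r[3]))

# Hypothetical sample counts for the same service, one week apart.
last_week = {"cache.shared.Get": 60, "json.decode": 900, "net.write": 2040}
this_week = {"cache.shared.Get": 840, "json.decode": 860, "net.write": 1300}

rows = diff_profiles(last_week, this_week)
for name, b, a, delta in rows:
    print(f"{name:18s} {b:6.1%} -> {a:6.1%}  ({delta:+.1%})")
```

Sorting by the size of the change rather than by current share is the design choice that matters: the function at the top of the diff is the one that moved, which is not always the one that is widest.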
The observer effect
Every measurement changes what's measured to some extent. Profilers are no exception, and the failure mode is subtle — the profile becomes internally consistent but misleading. Patterns to watch for:
- The profiler shifts the hot path. Instrumentation that adds 1 µs per function call will make functions called millions of times look hot when they aren't. Sampling avoids this; instrumenting profilers don't.
- The profiler hides the hot path. If the profiler can't unwind through optimised code (no frame pointers), samples that should attribute to a hot function show up as [unknown] at the top of the stack. Compile with -fno-omit-frame-pointer when profiling.
- The profiler changes scheduling. Sampling at very high rates can perturb the scheduler enough to change which threads contend. Keep sample rates modest (≤ 100 Hz) for off-CPU and lock profiling.
- The profiler doesn't see what you think. JIT'd code, foreign-function calls, kernel code, GPU work: each of these needs a specific profiler that knows how to unwind it. A pure-userspace profiler will charge kernel time to whichever userspace frame made the syscall, with no detail about what the kernel was doing.
A worked profiling session
A real example: an HTTP service whose P99 latency degraded by 30% after a deploy. Metrics didn't point at any obvious cause.
# Step 1: capture a CPU profile (sampling, low overhead)
curl -o cpu.pb.gz \
  "http://prod-host:6060/debug/pprof/profile?seconds=30"
go tool pprof -http=:8080 cpu.pb.gz
# → flame graph shows a wide plateau on
# "json.(*decodeState).object" — 28% of CPU samples.
# This is suspicious because JSON parsing wasn't touched in the deploy.
# Step 2: capture an off-CPU profile to see what's blocked
# (empty unless the service has called runtime.SetBlockProfileRate with a non-zero rate)
curl -o block.pb.gz \
  "http://prod-host:6060/debug/pprof/block?seconds=30"
go tool pprof -http=:8081 block.pb.gz
# → most off-CPU time is sync.Mutex.Lock under a request handler.
# Not the JSON parser — a shared map.
# Step 3: capture mutex contention specifically
# (empty unless the service has called runtime.SetMutexProfileFraction with a non-zero value)
curl -o mutex.pb.gz \
  "http://prod-host:6060/debug/pprof/mutex?seconds=30"
go tool pprof -http=:8082 mutex.pb.gz
# → 71% of mutex wait time on cache.shared.Get, which acquired
# a global lock that the new deploy added to track hit/miss counters.
# Step 4: diff against last week's profile (continuous profiling)
# → the cache.shared.Get function went from ~2% of CPU to ~28% post-deploy.
# Cause: a counter update under the new global lock on every cache lookup, added in the deploy.
# Fix: per-shard counters, aggregated periodically. P99 returns to baseline.
Three takeaways from the session. First, the on-CPU profile pointed at JSON, which was downstream of the actual cause. Second, the off-CPU profile revealed the lock contention that was the real bottleneck. Third, continuous profiling (or any historical baseline) is what made it possible to attribute the regression to the deploy quickly, without having to reproduce it.
Profiling allocation, locks, and GC
CPU profiling is the most common kind, but production performance questions often need others. The right profile type matches the question, and reaching for the wrong one wastes a debugging session.
Memory and allocation profiling deserves special attention because it
answers a question CPU profiling cannot. A heap profile is a snapshot: it
tells you what is alive right now and which call site allocated it, which is
the tool for a leak or a memory ceiling you keep hitting. An allocation
profile is a rate: it tells you which call sites are churning the most bytes
over a window, even if those bytes are short-lived and freed immediately.
The two diverge constantly. A function that allocates a million tiny slices
per second and frees them just as fast shows almost nothing in a heap
snapshot, because nothing survives to be counted, yet it can dominate an
allocation profile and quietly drive your garbage collector into the ground.
When CPU time mysteriously sits inside the runtime, in functions with names
like mallocgc or gcBgMarkWorker, the on-CPU flame
graph is telling you to go look at an allocation profile, because the real
cause is upstream of the collector: something is making too much garbage.
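The divergence between the two views can be shown with a small bookkeeping model. The sketch below replays a made-up stream of allocation and free events and derives both profiles from it. The site names and sizes are invented, and real profilers sample allocations rather than recording every one.

```python
from collections import Counter

def heap_and_alloc_profiles(events):
    """events: (op, object_id, site, size) tuples with op 'alloc' or 'free'.

    Returns (allocated_bytes_by_site, live_bytes_by_site): the first is the
    allocation profile over the window, the second the heap snapshot at its end.
    """
    allocated, live = Counter(), Counter()
    owner = {}                              # object_id -> (site, size)
    for op, obj, site, size in events:
        if op == "alloc":
            allocated[site] += size
            live[site] += size
            owner[obj] = (site, size)
        elif op == "free":
            freed_site, freed_size = owner.pop(obj)
            live[freed_site] -= freed_size
    return allocated, +live                 # unary plus drops sites with zero live bytes

events = []
# A hot loop that allocates 1,000 short-lived 64-byte slices and frees each at once.
for i in range(1000):
    events.append(("alloc", i, "makeTempSlice", 64))
    events.append(("free", i, "", 0))
# A cache that allocates ten 4 KiB entries and keeps them.
for i in range(1000, 1010):
    events.append(("alloc", i, "cache.fill", 4096))

allocated, live = heap_and_alloc_profiles(events)
print("allocation profile:", dict(allocated))
print("heap snapshot:     ", dict(live))
```

The churning call site leads the allocation profile and is absent from the heap snapshot, while the cache is the only thing the snapshot shows. Which view is useful depends on whether the question is about memory held or about garbage produced.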
Lock and mutex profiling is the off-CPU story made specific. Rather than "this thread was blocked," a mutex profile says "this thread was blocked waiting for this lock, acquired from these call sites, for this total time." That is usually enough to identify a single contended data structure, which is what most scaling cliffs come down to: a shared map, a global counter, a connection pool with one coarse lock. The fix is almost always to make the lock finer-grained or to remove the sharing, and the mutex profile is what tells you which lock is worth the effort.
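What a mutex profile records can be approximated by hand with a lock wrapper that times each acquisition. The sketch below is a toy, and `ProfiledLock` is a hypothetical class: it is itself instrumentation, so it adds cost to every acquire, whereas runtime-provided mutex profiles typically sample contention events to keep overhead low.

```python
import threading
import time
from collections import defaultdict

class ProfiledLock:
    """A lock that records how long callers waited to acquire it."""

    wait_seconds = defaultdict(float)   # lock name -> total wait across all threads
    _stats_guard = threading.Lock()     # protects wait_seconds

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        with ProfiledLock._stats_guard:
            ProfiledLock.wait_seconds[self.name] += waited
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False

cache_lock = ProfiledLock("cache.shared")

def lookup():
    with cache_lock:
        time.sleep(0.01)   # stand-in for work done while holding the lock

threads = [threading.Thread(target=lookup) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_wait = ProfiledLock.wait_seconds["cache.shared"]
print(f"cache.shared: {total_wait * 1000:.0f} ms of total wait across 5 threads")
```

Each thread holds the lock for about 10 ms, yet the accumulated wait is several times that, because every queued thread waits for all the holders ahead of it. That superlinear growth in wait time is the signature of a coarse lock under load.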
| Question | Profile type | Tools |
|---|---|---|
| Where is CPU going? | On-CPU sampling | perf, pprof, async-profiler |
| Where are we blocked? | Off-CPU sampling | async-profiler, bcc offcputime, pprof block |
| Where is memory being allocated? | Heap / allocation profile | pprof heap, async-profiler alloc, Java Flight Recorder |
| Why is GC pausing so long? | GC log + allocation profile | JFR, async-profiler alloc, language-specific GC logs |
| What lock is contended? | Lock / mutex profile | pprof mutex, async-profiler lock, bpftrace |
| What syscall is slow? | System-call latency histogram | bpftrace, bcc |
| What CPU pipeline stage is the issue? | Hardware-counter-based | perf with PMU events, top-down |
Production checklist
- Default to sampling, not instrumenting. Sampling profilers are safe in production; instrumenting profilers are not.
- Compile with frame pointers. -fno-omit-frame-pointer for C/C++, -XX:+PreserveFramePointer for JVM. Stack walking fails without them.
- Capture both on-CPU and off-CPU. The first tells you where CPU goes; the second tells you what's blocked. Most production puzzles need both.
- Read flame graphs for width, not height. Wide plateaus are bottlenecks. Tall stacks are just deep call chains.
- Run a continuous profiler if you ship to a fleet. Pyroscope, Parca, Polar Signals. The historical baseline is the difference between "we have a regression" and "this deploy caused this regression on this function."
- Match the profile type to the question. Allocation profiles for GC investigations, mutex profiles for contention, syscall latency for I/O issues.
- Always diff against a baseline. A single flame graph shows where time goes; a diff shows what changed. The latter is what you want during incident response.
Further reading
- Brendan Gregg — Flame Graphs. The reference page. Includes the original flamegraph.pl, links to diff flame graphs, and patterns for reading them.
- Brendan Gregg — Systems Performance, Chapters 6 and 13. The most thorough single treatment of Linux profiling tools.
- Andrei Pangin — async-profiler. The JVM profiler. The README is also one of the best tutorials on profiling in general.
- Brendan Gregg — Off-CPU Analysis. The case for off-CPU profiling and how to do it on Linux.
- Polar Signals / Parca / Pyroscope documentation. Continuous profiling — read at least one to understand the operational model.
- Brendan Gregg — Linux perf Examples. The standard reference for what perf can do.
- Adjacent: The USE method. USE tells you which resource is hot; profiling tells you what code is hitting it.
- Adjacent: Top-down microarchitecture. The microarchitectural companion to sampling. Top-down splits cycles; profiling attributes them to functions.