
Profiling in production

The other methods predict where the bottleneck is. Profiling confirms it on real production state, with real cache contents, real GC behaviour, real lock contention. The tools change every few years; the discipline of measuring without distorting what you measure is what carries across them.


Sampling vs instrumentation

There are two ways to attribute time to code, and the choice between them shapes everything else about a profiling session. Sampling interrupts the program many times per second and records what's currently running on the stack; the result is statistically representative with low overhead. Instrumentation inserts hooks (manual or via a compiler pass) at function entry and exit; the result is exact but adds per-call overhead and can change what you measure.

The mental model for sampling is a stroboscope. You do not watch the program continuously. You take a photograph of the call stack many times a second and assume that over thousands of photographs, the fraction of frames in which a function appears equals the fraction of time the program spent in it. If parseObject is on top of the stack in 280 of 1,000 samples, you conclude it took about 28% of the on-CPU time. The accuracy is statistical and improves with the sample count, the same way a poll gets tighter with more respondents. A 30-second sampling run at 100 Hz gives you roughly 3,000 samples per thread, which is plenty to find anything that matters and far too few to trust a function that shows up twice.
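The arithmetic behind that claim can be sketched in a few lines. This is an illustration only, not how any particular profiler reports results: the helper name `sample_share` is hypothetical, and the error bar is the ordinary binomial standard error, which assumes the samples are independent.

```python
import math

def sample_share(hits: int, total: int) -> tuple[float, float]:
    """Return (estimated share of time, standard error) for a function
    seen in `hits` out of `total` stack samples."""
    if total <= 0 or not 0 <= hits <= total:
        raise ValueError("need total > 0 and 0 <= hits <= total")
    share = hits / total
    # Binomial standard error: shrinks with the square root of the sample count.
    stderr = math.sqrt(share * (1 - share) / total)
    return share, stderr

# The parseObject example from the text: 280 of 1,000 samples.
share, err = sample_share(280, 1000)
print(f"parseObject: {share:.1%} +/- {err:.1%}")

# A function that shows up twice in 3,000 samples: the error bar is of the
# same order as the estimate, so the number says almost nothing.
rare_share, rare_err = sample_share(2, 3000)
print(f"rare function: {rare_share:.3%} +/- {rare_err:.3%}")
```

The wide frame gets a tight estimate and the rare one does not, which is the poll analogy in numbers.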

Instrumentation works differently. Instead of asking "what is running right now," it records "I entered function f at time t1, I left at t2," for every single call. That gives you exact call counts and exact per-call timing, which sampling can never give you. The price is that the recording itself runs inside the hot path. A function called ten million times that does almost no work will have its true cost swamped by the timestamp read and the bookkeeping the instrumentation adds on every entry and exit. The profile becomes a measurement of the profiler. This is the trap that catches people who reach for cProfile in Python or a tracing wrapper in JavaScript and conclude that some tiny accessor is the bottleneck: it is the most-called function, so it absorbs the most instrumentation overhead, so it rises to the top whether or not it was ever the problem.
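A rough way to see the effect is to wrap a trivial function with entry and exit hooks and compare timings. The sketch below assumes a hypothetical `instrument` decorator and a made-up `tiny_accessor`; absolute numbers will vary by machine, so treat the printed timings as indicative rather than as a benchmark.

```python
import time
from functools import wraps

def instrument(fn):
    """Wrap fn so every call records an entry and an exit timestamp."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()                        # entry hook
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.calls += 1                             # exact call count
            wrapper.total += time.perf_counter() - start   # exit hook
    wrapper.calls = 0
    wrapper.total = 0.0
    return wrapper

def tiny_accessor(obj):
    return obj["x"]                                        # almost no real work

def run(fn, n):
    """Wall-clock seconds for n calls of fn."""
    obj = {"x": 1}
    start = time.perf_counter()
    for _ in range(n):
        fn(obj)
    return time.perf_counter() - start

N = 200_000
plain = run(tiny_accessor, N)
traced_fn = instrument(tiny_accessor)
traced = run(traced_fn, N)

print(f"calls recorded: {traced_fn.calls}")
print(f"uninstrumented: {plain:.4f}s, instrumented: {traced:.4f}s")
```

The call count is exact, which sampling cannot offer, but the instrumented run typically takes several times longer than the plain one, and nearly all of that extra time is the bookkeeping itself.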

| | Sampling | Instrumentation |
|---|---|---|
| Overhead | 1–5%, controllable by sampling rate | 10–100%, depending on call rate |
| Accuracy | Statistical; needs enough samples | Exact within the instrumented scope |
| Best for | Production. Finding hot paths. | Dev/staging. Specific function timing. |
| Misses | Very short functions that fall between samples | Anything not instrumented |
| Tools | perf, async-profiler, pprof, eBPF | strace, DTrace probes, manual time.Now() |

Default to sampling in production. The overhead is low enough that you can run it continuously. Instrumentation is better when you've narrowed the search to one function and need exact per-call timing, but most of the value of production profiling is "where is the time going across the whole process," which is exactly what sampling answers.
[Figure: a running thread, time flowing left to right, executing parseObject, query, parseObject, log. Each tick is one timer interrupt that records the stack. Tally of where the stack pointed: parseObject 4 / 6 samples; query / log 2 / 6 samples.]
Sampling is a stroboscope. It never watches the program continuously; it counts how often each function is on the stack and turns counts into percentages.

On-CPU vs off-CPU

Sampling profilers split into two flavours that answer different questions. Both matter, and getting them confused is the most common mistake in profiling.

  • On-CPU profiling samples threads that are actively running on a CPU. Tells you where the process spends its computation. The flame graph everyone knows.
  • Off-CPU profiling measures threads that are blocked (on locks, on I/O, on conditions). Tells you where the process spends its waiting. The "I/O wait" flame graph; much less famous but often more useful.

A service whose CPU utilisation is 20% but whose throughput is half what it should be is almost certainly off-CPU bound. On-CPU profiling will show a small, evenly-distributed flame graph that doesn't explain the symptom. Off-CPU profiling will show a tall stack stuck in futex_wait or read or poll, which is the actual bottleneck.

The reason this split exists at all comes down to how the operating system runs threads. A thread is either on a CPU executing instructions, or it is off the CPU because the scheduler took it off, almost always because it asked to wait for something: a lock held by another thread, bytes from a socket, a page fault, a disk read, a condition variable. On-CPU profiling samples the first state and is blind to the second. If your wall-clock latency is 200 ms but only 8 ms of that is on-CPU, an on-CPU profile is explaining 4% of the problem and silently ignoring the rest. The other 192 ms is off-CPU, and you will not see it until you go looking for it specifically.
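The split is easy to observe from inside a process by comparing a wall clock with a CPU-time clock. The sketch below is a minimal illustration using Python's standard `time.perf_counter` and `time.process_time`; `handle_request` is a hypothetical stand-in, with `time.sleep` playing the part of a lock or socket wait.

```python
import time

def measure(work):
    """Return (wall_seconds, cpu_seconds) for one call of work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()    # CPU time only; excludes time spent asleep
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def handle_request():
    sum(i * i for i in range(50_000))  # a little computation (on-CPU)
    time.sleep(0.2)                    # stands in for a lock or socket wait (off-CPU)

wall, cpu = measure(handle_request)
off_cpu = wall - cpu
print(f"wall {wall * 1000:.0f} ms, on-CPU {cpu * 1000:.0f} ms, "
      f"off-CPU {off_cpu * 1000:.0f} ms ({off_cpu / wall:.0%} of the request)")
```

An on-CPU profiler would only ever see the generator expression; the sleep, which dominates the request, never appears in its samples.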

Off-CPU profiling is harder to capture, which is part of why it stayed obscure for so long. On-CPU sampling can use a simple timer interrupt that fires on whatever CPU the thread is using. Off-CPU work happens precisely when the thread is not scheduled, so there is no timer of yours running on it. The standard technique instead hooks the scheduler itself: record a timestamp when a thread goes off-CPU, record another when it comes back, attribute the difference to the stack that was running when it blocked. On Linux this is what the offcputime tool from bcc does with a kprobe on the scheduler's context-switch path. The output is a flame graph where width is blocked time rather than CPU time, and the leaves are the calls that put the thread to sleep.
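The bookkeeping described above can be sketched without any kernel involvement. The example below is a simplified model, not the offcputime implementation: the event tuples, the `off_cpu_by_stack` name, and the stack strings are all made up for illustration, and the timestamps are chosen to match the 200 ms request discussed earlier.

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """Aggregate blocked time per stack from scheduler switch events.

    Each event is (timestamp_ms, thread_id, kind, stack). kind is "off"
    when the thread leaves the CPU and "on" when it is scheduled again;
    stack is only meaningful on "off" events.
    """
    went_off = {}                  # thread_id -> (timestamp, stack)
    blocked = defaultdict(float)   # stack -> total blocked ms
    for ts, tid, kind, stack in sorted(events, key=lambda e: e[0]):
        if kind == "off":
            went_off[tid] = (ts, stack)
        elif kind == "on" and tid in went_off:
            start, blocked_stack = went_off.pop(tid)
            # Charge the whole wait to the stack that put the thread to sleep.
            blocked[blocked_stack] += ts - start
    return dict(blocked)

events = [
    (0.0,   1, "on",  None),
    (3.0,   1, "off", "handle;cache_get;futex_wait"),
    (120.0, 1, "on",  None),
    (125.0, 1, "off", "handle;read_body;read"),
    (200.0, 1, "on",  None),
]
result = off_cpu_by_stack(events)
for stack, ms in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{ms:7.1f} ms  {stack}")
```

In this toy trace the thread is on-CPU for 8 ms and blocked for 192 ms, and the output ranks the two blocking stacks by how long each kept the thread asleep.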

[Figure: one request, wall-clock 200 ms: 8 ms on-CPU, the rest blocked on a lock and on a socket read. The on-CPU profiler sees only the 8 ms; the off-CPU profiler sees the blocked time. What each profiler reports as "the hot spot": on-CPU: serialize() (wrong); off-CPU: futex_wait, read() (right).]
Wall-clock time is on-CPU plus off-CPU. When most of a request is spent waiting, the on-CPU flame graph points at the wrong function, and only off-CPU profiling explains the latency.

Flame graphs

Brendan Gregg's flame graph format took stack-trace data, the output every sampling profiler produces, and made it readable. Each box represents a function in a stack frame; the width is proportional to the percentage of samples that hit that frame; the y-axis is stack depth. Boxes stack on top of their callers, so a box sits directly above the function that called it. Reading from the bottom up, you are reading the call tree; reading the width of any box, you are reading the time spent in that function and everything it called.

The thing that makes the format work is that it collapses thousands of individual stack samples into one picture. A profiler captures stacks like main → handleRequest → decode → parseObject over and over. Many of those stacks share a prefix. The flame graph merges every sample that shares a prefix into the same box and makes that box as wide as the number of samples it represents. The x-axis is not time in the chronological sense; adjacent boxes are not "what ran next." The x-axis is the merged population of samples, sorted alphabetically so the picture is stable run to run. That is worth internalising, because the most common beginner mistake is to read a flame graph left to right like a timeline. It is a histogram of stacks, not a trace.
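The merge step is small enough to sketch. The code below folds raw stack samples into the semicolon-separated "folded" lines that flamegraph.pl takes as input, and computes the width each box would get. The function names `fold` and `inclusive_width` and the sample data are hypothetical; real tooling also handles symbolisation and kernel frames, which this ignores.

```python
from collections import Counter

def fold(samples):
    """Merge identical stacks into 'a;b;c count' lines, sorted alphabetically."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

def inclusive_width(samples):
    """Fraction of all samples in which each call path (stack prefix) appears."""
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[tuple(stack[:depth])] += 1
    total = len(samples)
    return {path: n / total for path, n in widths.items()}

samples = (
    [["main", "handleRequest", "decode", "parseObject"]] * 4
    + [["main", "handleRequest", "query"]] * 2
    + [["main", "writeLog"]] * 2
)

folded = fold(samples)
widths = inclusive_width(samples)
for line in folded:
    print(line)
for path, width in sorted(widths.items()):
    print(f"{width:5.0%}  {';'.join(path)}")
```

Reversing or shuffling `samples` produces exactly the same folded output, which is the concrete sense in which the x-axis carries no chronology.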

[Figure: example flame graph. Read bottom-up for the call tree, read width for time. main 100%; handleRequest 76%; writeLog 22%; decode 44%; query 23%; parseObject 41%; scanToken 39%. Annotation: tall narrow plateau = stuck here = the answer. Width is time, height is stack depth; the leaf at the top of the widest tower is doing the work.]
A flame graph reading: handleRequest dominates, and inside it the tower of decode → parseObject → scanToken is the hot path. The fix lives in scanToken, not in the wide-but-shallow handleRequest that merely contains it.

Reading a flame graph quickly is a skill that pays off across every language and every profiler:

  • Wide is hot. A frame that spans 40% of the chart's width appeared in 40% of samples. That's where time goes.
  • Tall isn't necessarily bad. Deep stacks just mean deep call chains. Width matters; depth doesn't.
  • Plateaus are bottlenecks. A wide flat top, a frame with nothing stacked above it across many samples, means the program was executing in that function itself when those samples were taken. That's usually the answer.
  • Look at the children of wide frames. If handleRequest is 80% wide, the answer is in its children — the actual leaf functions doing the work.
Differential flame graphs. Subtract a "before" flame graph from an "after" to see what changed. Boxes that grew (conventionally red) show regressions; boxes that shrank (blue in the original flamegraph.pl output, green in some other UIs) show wins. This is how you defend the impact of an optimisation without arguing about averages.
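Underneath the colours, a differential flame graph is a per-stack subtraction of sample shares. The sketch below shows that subtraction on two folded profiles; `parse_folded`, `diff_profiles`, and the sample counts are invented for illustration and do not come from a real capture.

```python
def parse_folded(lines):
    """Parse 'frame;frame;frame count' lines into {stack: count}."""
    profile = {}
    for line in lines:
        stack, _, count = line.rpartition(" ")
        profile[stack] = profile.get(stack, 0) + int(count)
    return profile

def diff_profiles(before, after):
    """Per-stack change in share of samples (after minus before),
    largest growth first."""
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    deltas = {}
    for stack in before.keys() | after.keys():
        share_before = before.get(stack, 0) / total_before
        share_after = after.get(stack, 0) / total_after
        deltas[stack] = share_after - share_before
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

before = parse_folded([
    "main;handle;cache_get 2",
    "main;handle;decode 58",
    "main;handle;query 40",
])
after = parse_folded([
    "main;handle;cache_get 28",
    "main;handle;decode 42",
    "main;handle;query 30",
])

changes = diff_profiles(before, after)
for stack, delta in changes:
    print(f"{delta:+.0%}  {stack}")
```

Comparing shares rather than raw counts matters because the two captures rarely contain the same number of samples.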

The tool landscape

The right profiler depends on the runtime and the platform. The list below covers the ones in active use in modern production environments. Two threads run through all of them and are worth understanding before the table, because they explain why one profiler works on your service and the next produces garbage.

The first is stack unwinding. To record a sample, a profiler has to walk the call stack from the current instruction back to main. There are a few ways to do that. The cheapest is to follow frame pointers, a register convention where each stack frame stores a pointer to its caller's frame, so walking the stack is just chasing a linked list. Many C and C++ compilers omit frame pointers in optimised builds to free up a register, which speeds up the code and breaks the profiler at the same time: without them, samples taken in optimised code cannot be unwound this way and show up as [unknown]. The alternative is DWARF-based unwinding, which reads debug metadata to reconstruct each frame; it is correct without frame pointers but far more expensive per sample, sometimes too expensive to run continuously. The modern compromise, used by some eBPF profilers, is to ship the unwind tables into the kernel and walk them there. Whichever path you are on, the rule stands: if your flame graph is full of broken or truncated stacks, the unwinder is the first thing to check.
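The "linked list" description of frame-pointer unwinding can be made concrete with a toy model. Everything below is simulated: `memory` is a dictionary standing in for the stack, the addresses are invented, `symbols` maps exact addresses to names (real symbolisation resolves address ranges), and the layout of saved frame pointer followed by return address 8 bytes above it follows the usual x86-64 convention.

```python
def unwind(memory, symbols, pc, fp, max_depth=64):
    """Walk a frame-pointer chain in a simulated stack.

    memory[fp]     holds the caller's saved frame pointer
    memory[fp + 8] holds the return address into the caller
    A saved frame pointer of 0 marks the outermost frame.
    """
    stack = [symbols.get(pc, "[unknown]")]
    for _ in range(max_depth):           # guard against a corrupt, cyclic chain
        if fp == 0:
            break
        return_addr = memory.get(fp + 8)
        if return_addr is None:          # chain is broken: give up, stack is truncated
            stack.append("[unknown]")
            break
        stack.append(symbols.get(return_addr, "[unknown]"))
        fp = memory.get(fp, 0)           # follow the link to the caller's frame
    return stack

symbols = {0x400300: "parseObject", 0x400200: "decode",
           0x400100: "main", 0x400000: "_start"}
memory = {
    0x7000: 0x7020, 0x7008: 0x400200,    # parseObject's frame, returns into decode
    0x7020: 0x7040, 0x7028: 0x400100,    # decode's frame, returns into main
    0x7040: 0x0,    0x7048: 0x400000,    # main's frame, returns into _start
}

good = unwind(memory, symbols, pc=0x400300, fp=0x7000)
# If the register was reused as a general-purpose register, it holds garbage.
broken = unwind(memory, symbols, pc=0x400300, fp=0xDEAD)
print(" <- ".join(good))
print(" <- ".join(broken))
```

The second call models code built without frame pointers: the walk stops after one frame, which is what a truncated stack in a flame graph looks like from the unwinder's side.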

The second thread is eBPF. The kernel can now run small, verified programs in response to events, including the timer interrupt that drives sampling and the scheduler events that drive off-CPU profiling. This is what lets a single low-overhead agent profile every process on a host, kernel and userspace alike, without modifying any of them. The bcc and bpftrace toolkits expose this directly for ad-hoc work, like a one-line off-CPU profile or a latency histogram of a specific syscall, and the continuous-profiling systems build on the same foundation under the hood. The learning curve is real, but eBPF is why production profiling stopped requiring a special build of your binary.

| Tool | Targets | Notes |
|---|---|---|
| perf | Linux, any language with frame pointers or DWARF | The kernel-provided baseline. Captures CPU samples, hardware counters, kernel and user stacks. perf record + perf script + flamegraph.pl is the canonical pipeline. |
| async-profiler | JVM (Java, Kotlin, Scala, Clojure) | The JVM profiler everyone uses. Uses AsyncGetCallTrace; safe in production; produces both on-CPU and off-CPU flame graphs. Built-in lock and allocation profiling. |
| pprof | Go, plus others via the format | Go's standard profiler, surfaced via net/http/pprof. The pprof file format is now lingua franca, read by many UIs. |
| py-spy / Austin | CPython | Sampling profilers that work without modifying the process. Critical because cProfile is instrumenting and skews results. |
| BPF / bcc / bpftrace | Linux kernel + userspace | Programmable kernel-level tracing. Great for off-CPU profiling, lock tracing, latency histograms. Steep learning curve; high payoff. |
| VTune / Linux perf annotate | x86, low-level | For when sampling stack traces isn't enough and you need cycle-level or microarchitectural detail. Pair with top-down. |
| Pyroscope / Parca / Polar Signals | Cross-language continuous profilers | Run all the time, store profiles centrally, query like metrics. The continuous-profiling category, increasingly the default in cloud-native shops. |

Continuous profiling

The newer wave of profilers runs all the time, at a low sampling rate (typically 19–100 Hz), shipping profiles to a central store. This changes how profiling fits into operations: instead of "page fires, capture a profile" you have "page fires, look at the profile from the moment it started". The historical view also lets you compare a regression to last week's baseline without having to reproduce the issue.

Practical considerations: overhead is about 1% at typical sampling rates, storage requires only a few MB per host per day with the pprof format's compression, and the query model is the same shape as metrics, with labels (service, version, instance) and time ranges. The trade-off is that you can't sample at arbitrarily high frequency without overhead growing; for one-off deep investigation, on-demand profiling at higher rates still beats continuous.

It helps to place continuous profiling next to the rest of your observability. Metrics tell you that a service is slow or hot. Traces tell you which span inside a request is slow and which downstream call it was waiting on. Neither tells you which lines of code inside that span are burning the CPU. Profiles fill exactly that gap: they are the fourth signal, the one that resolves a slow span down to a function and a call stack. The newest systems are starting to stitch these together, attaching a trace or span id as a label on profile samples so you can pivot from a slow trace straight into the flame graph for just the requests in that trace. The operational picture, when it works, is a straight line: an alert fires on a metric, you find the slow span in a trace, and you land on the exact function in a profile, all without reproducing anything or attaching a debugger to production.

Continuous profiling also changes what "profile" means socially on a team. When profiling is a special event you run during an incident, it is a skill only a few people exercise and the data is gone the moment the incident ends. When profiles are always on and queryable by anyone, they become part of normal review: you can look at the flame graph for a service the way you look at its latency chart, compare this week to last, and catch a slow creep in allocation rate before it becomes a page. The historical baseline is the whole point. Without it, every regression starts from "we think it got slower"; with it, every regression starts from "this function went from 2% to 28% of CPU on the deploy at 14:03."

The observer effect

Every measurement changes what's measured to some extent. Profilers are no exception, and the failure mode is subtle — the profile becomes internally consistent but misleading. Patterns to watch for:

  • The profiler shifts the hot path. Instrumentation that adds 1 µs per function call will make functions called millions of times look hot when they aren't. Sampling avoids this; instrumenting profilers don't.
  • The profiler hides the hot path. If the profiler can't unwind through optimised code (no frame pointers), samples that should attribute to a hot function show up as [unknown] at the top of the stack. Compile with -fno-omit-frame-pointer when profiling.
  • The profiler changes scheduling. Sampling at very high rates can perturb the scheduler enough to change which threads contend. Keep sample rates modest (≤ 100 Hz) for off-CPU and lock profiling.
  • The profiler doesn't see what you think. JIT'd code, foreign-function calls, kernel code, GPU work — each of these needs a specific profiler that knows how to unwind. A pure-userspace profiler will misattribute kernel time to whichever syscall returned to userland.

A worked profiling session

A real example: an HTTP service whose P99 latency degraded by 30% after a deploy. Metrics didn't point at any obvious cause.

# Step 1: capture a CPU profile (sampling, low overhead)
curl -o cpu.pb.gz \
  'http://prod-host:6060/debug/pprof/profile?seconds=30'

go tool pprof -http=:8080 cpu.pb.gz
# → flame graph shows a wide plateau on
#   "json.(*decodeState).object" — 28% of CPU samples.
# This is suspicious because JSON parsing wasn't touched in the deploy.

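# Step 2: capture an off-CPU (block) profile to see what's blocked
# Note: Go's block profile covers waits on mutexes, channels and other
# sync primitives, not network or disk I/O, and it is empty unless the
# service enabled it with runtime.SetBlockProfileRate.
curl -o block.pb.gz \
  'http://prod-host:6060/debug/pprof/block?seconds=30'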

go tool pprof -http=:8081 block.pb.gz
# → most off-CPU time is sync.Mutex.Lock under a request handler.
# Not the JSON parser — a shared map.

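# Step 3: capture mutex contention specifically
# Note: also off by default; the service must call
# runtime.SetMutexProfileFraction for this endpoint to return data.
curl -o mutex.pb.gz \
  'http://prod-host:6060/debug/pprof/mutex?seconds=30'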

go tool pprof -http=:8082 mutex.pb.gz
# → 71% of mutex wait time on cache.shared.Get, which acquired
#   a global lock that the new deploy added to track hit/miss counters.

# Step 4: diff against last week's profile (continuous profiling)
# → the cache.shared.Get function went from ~2% of CPU to ~28% post-deploy.
# Cause: hit/miss counter update under a global lock on every cache lookup, added in the deploy.

# Fix: per-shard counters, aggregated periodically. P99 returns to baseline.

Three takeaways from the session. First, the on-CPU profile pointed at JSON — which was downstream of the actual cause. Second, the off-CPU profile revealed the lock contention that was the real bottleneck. Third, continuous profiling (or any historical baseline) is what made it possible to attribute the regression to the deploy quickly, without having to reproduce.
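The shape of the fix, per-shard counters aggregated periodically, can be sketched as follows. This is an illustration of the idea rather than the service's actual code: `ShardedCounter` and its methods are hypothetical names, and in CPython the global interpreter lock means this shows the structure of the fix rather than a measurable speed-up.

```python
import threading

class ShardedCounter:
    """Hit counter split across shards so threads rarely share a lock."""

    def __init__(self, shards: int = 16):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards
        self._local = threading.local()
        self._assign_lock = threading.Lock()
        self._next_shard = 0

    def _shard(self) -> int:
        # Each thread is assigned a shard once, round-robin, and keeps it.
        idx = getattr(self._local, "idx", None)
        if idx is None:
            with self._assign_lock:
                idx = self._next_shard % len(self._locks)
                self._next_shard += 1
            self._local.idx = idx
        return idx

    def increment(self) -> None:
        i = self._shard()
        with self._locks[i]:           # contended only by threads on the same shard
            self._counts[i] += 1

    def total(self) -> int:
        # Aggregation touches every shard, so call it periodically
        # (for example on a metrics scrape), not on every lookup.
        result = 0
        for i, lock in enumerate(self._locks):
            with lock:
                result += self._counts[i]
        return result

counter = ShardedCounter(shards=4)

def worker():
    for _ in range(1000):
        counter.increment()

workers = [threading.Thread(target=worker) for _ in range(8)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(counter.total())
```

The hot path takes one of several locks instead of a single global one, and the cost of summing is moved to a reader that runs far less often than the cache lookups do.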

Profiling allocation, locks, and GC

CPU profiling is the most common kind, but production performance questions often need others. The right profile type matches the question, and reaching for the wrong one wastes a debugging session.

Memory and allocation profiling deserves special attention because it answers a question CPU profiling cannot. A heap profile is a snapshot: it tells you what is alive right now and which call site allocated it, which is the tool for a leak or a memory ceiling you keep hitting. An allocation profile is a rate: it tells you which call sites are churning the most bytes over a window, even if those bytes are short-lived and freed immediately. The two diverge constantly. A function that allocates a million tiny slices per second and frees them just as fast shows almost nothing in a heap snapshot, because nothing survives to be counted, yet it can dominate an allocation profile and quietly drive your garbage collector into the ground. When CPU time mysteriously sits inside the runtime, in functions with names like mallocgc or gcBgMarkWorker, the on-CPU flame graph is telling you to go look at an allocation profile, because the real cause is upstream of the collector: something is making too much garbage.
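The gap between the two views can be reproduced with Python's standard `tracemalloc` module, which tracks live memory. This is a sketch under stated assumptions: `churn` and `retain` are made-up workloads, and `tracemalloc` is an instrumenting tracker rather than a sampling heap profiler, so it stands in here only for the "what is alive right now" question.

```python
import tracemalloc

def churn(n):
    """Allocate n short-lived 1 KiB buffers; nothing survives the loop."""
    for _ in range(n):
        buf = bytearray(1024)
        buf[0] = 1

def retain(n):
    """Allocate n 1 KiB buffers and keep them all alive."""
    return [bytearray(1024) for _ in range(n)]

tracemalloc.start()
base, _ = tracemalloc.get_traced_memory()        # (current, peak) in bytes

churn(10_000)
after_churn, _ = tracemalloc.get_traced_memory()

kept = retain(1_000)
after_retain, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"live bytes added by churn:  {after_churn - base}")
print(f"live bytes added by retain: {after_retain - after_churn}")
print(f"bytes requested by churn:   {10_000 * 1024}")
```

The churn loop requests roughly ten times more memory than the retained list, yet it leaves almost nothing in the live view. Only a rate-based allocation profile would surface it.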

Lock and mutex profiling is the off-CPU story made specific. Rather than "this thread was blocked," a mutex profile says "this thread was blocked waiting for this lock, acquired from these call sites, for this total time." That is usually enough to identify a single contended data structure, which is what most scaling cliffs come down to: a shared map, a global counter, a connection pool with one coarse lock. The fix is almost always to make the lock finer-grained or to remove the sharing, and the mutex profile is what tells you which lock is worth the effort.
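What a mutex profile records can be imitated by hand with a lock wrapper that times each acquisition. The sketch below is a manual instrumentation, not how pprof or async-profiler collect the data: `ProfiledLock` and the call-site labels are hypothetical, and the wrapper adds its own overhead to every acquire.

```python
import threading
import time
from collections import defaultdict

class ProfiledLock:
    """A lock that records how long each named call site waited to acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats_lock = threading.Lock()
        self.wait_seconds = defaultdict(float)    # call site -> total wait

    def acquire(self, site: str) -> None:
        start = time.perf_counter()
        self._lock.acquire()                      # blocks while contended
        waited = time.perf_counter() - start
        with self._stats_lock:
            self.wait_seconds[site] += waited

    def release(self) -> None:
        self._lock.release()

lock = ProfiledLock()
holding = threading.Event()

def slow_writer():
    lock.acquire("cache.refresh")
    holding.set()                 # signal that the lock is now held
    time.sleep(0.1)               # hold the lock while "refreshing"
    lock.release()

def reader():
    holding.wait()                # start only once the writer holds the lock
    lock.acquire("cache.get")
    lock.release()

threads = [threading.Thread(target=slow_writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for site, seconds in sorted(lock.wait_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{seconds * 1000:6.1f} ms waited at {site}")
```

The wait lands on the reader's call site even though the writer is the one holding the lock for too long, which is why a mutex profile is read together with the code that holds the lock.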

| Question | Profile type | Tools |
|---|---|---|
| Where is CPU going? | On-CPU sampling | perf, pprof, async-profiler |
| Where are we blocked? | Off-CPU sampling | async-profiler, bcc offcputime, pprof block |
| Where is memory being allocated? | Heap / allocation profile | pprof heap, async-profiler alloc, Java Flight Recorder |
| Why is GC pausing so long? | GC log + allocation profile | JFR, async-profiler alloc, language-specific GC logs |
| What lock is contended? | Lock / mutex profile | pprof mutex, async-profiler lock, bpftrace |
| What syscall is slow? | System-call latency histogram | bpftrace, bcc |
| What CPU pipeline stage is the issue? | Hardware-counter-based | perf with PMU events, top-down |

Production checklist

  1. Default to sampling, not instrumenting. Sampling profilers are safe in production; instrumenting profilers are not.
  2. Compile with frame pointers. -fno-omit-frame-pointer for C/C++, -XX:+PreserveFramePointer for JVM. Stack walking fails without them.
  3. Capture both on-CPU and off-CPU. The first tells you where CPU goes; the second tells you what's blocked. Most production puzzles need both.
  4. Read flame graphs for width, not height. Wide plateaus are bottlenecks. Tall stacks are just deep call chains.
  5. Run a continuous profiler if you ship to a fleet. Pyroscope, Parca, Polar Signals. The historical baseline is the difference between "we have a regression" and "this deploy caused this regression on this function."
  6. Match the profile type to the question. Allocation profiles for GC investigations, mutex profiles for contention, syscall latency for I/O issues.
  7. Always diff against a baseline. A single flame graph shows where time goes; a diff shows what changed. The latter is what you want during incident response.

Further reading

  • Brendan Gregg — Flame Graphs. The reference page. Includes the original flamegraph.pl, links to diff flame graphs, and patterns for reading them.
  • Brendan Gregg — Systems Performance, Chapter 6 and 13. The most thorough single treatment of Linux profiling tools.
  • Andrei Pangin — async-profiler. The JVM profiler. The README is also one of the best tutorials on profiling in general.
  • Brendan Gregg — Off-CPU Analysis. The case for off-CPU profiling and how to do it on Linux.
  • Polar Signals / Parca / Pyroscope documentation. Continuous profiling — read at least one to understand the operational model.
  • Brendan Gregg — Linux perf Examples. The standard reference for what perf can do.
  • Adjacent: The USE method. USE tells you which resource is hot; profiling tells you what code is hitting it.
  • Adjacent: Top-down microarchitecture. The microarchitectural companion to sampling. Top-down splits cycles; profiling attributes them to functions.