The USE method
Brendan Gregg's checklist for finding a bottleneck in a single box. For every resource — CPU, disk, NIC, memory bus, lock, file descriptor — record three numbers: utilisation, saturation, errors. The first resource where any of the three is high is usually the bottleneck. The whole walk-through takes a few minutes; the value is that it's exhaustive, so the resource you would have skipped is the one you actually check.
Why a method at all
Most performance debugging starts the same wrong way. A box is slow, someone
opens top, sees a number that looks plausible, and starts forming
a theory. The theory drives where they look next, which means they only ever
check the resources the theory predicts. If the theory is wrong they burn an
hour confirming a non-problem, and the actual bottleneck — a leaked file
descriptor, a single pinned core, a disk quietly retrying reads — never gets
looked at because nothing pointed there.
The USE method, from Brendan Gregg, fixes this by removing the theory from the first pass. Instead of guessing where to look, you check everything in a fixed order and let the numbers point. You list every resource on the box, and for each one you record three values: utilisation, saturation, and errors. The first resource where any of the three is unusually high is your lead. You are not diagnosing the root cause yet — you are finding the place to start, and you are finding it without a hunch that could be wrong. The whole pass is mechanical, which is exactly the point: a tired engineer at 3am should be able to run it correctly.
USE is a checklist, not a tool. It does not collect anything itself; it tells
you what to collect and from where. The tools it sits on top of are the
standard ones every Linux box already has — top, mpstat,
iostat, sar, vmstat, ss,
dmesg. The skill USE adds is not reading any one of them; it is
the discipline of reading all of them, in order, before you decide what is
wrong.
The three metrics
Each cell in the checklist asks a question you can read off a standard tool. There's no thinking involved at this stage: collect, compare, find the first cell that's hot. The three metrics are deliberately different from each other, because they catch different failure modes, and a resource can be in trouble on one while looking fine on the other two.
| Metric | Definition | Read it from |
|---|---|---|
| Utilisation | Fraction of time the resource is busy doing work. A CPU at 80% utilisation is busy 80% of the time. | mpstat, vmstat, iostat -x |
| Saturation | Amount of work the resource cannot service right now, sitting in a queue. Run queue length, disk queue depth, NIC tx queue. | vmstat 1 (r column), sar -q (runq-sz), iostat -x (aqu-sz; avgqu-sz in older sysstat), ss -ti |
| Errors | Hardware or kernel-reported errors on the resource. Disk read errors, NIC packet drops, ECC events. | dmesg, ip -s link, smartctl, perf |
The difference is worth dwelling on, because it's the part people get wrong. Utilisation tops out at 100% — the resource cannot be more than fully busy. Once it's pegged there, the utilisation number stops telling you anything; it can't go higher to signal that the pressure is getting worse. Saturation is what keeps rising after that. A disk at 100% util with a queue of two is handling its load fine. The same disk at 100% util with a queue of two hundred is the bottleneck, and the only metric that distinguishes those two states is the queue depth. This is also why latency tracks saturation, not utilisation: each request now waits behind a long line of others before it gets serviced, so the tail latency climbs even though the busy-ness number hasn't moved.
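To put rough numbers on those two disks, here is a back-of-envelope sketch. It assumes requests are served one at a time and that each takes a fixed service time; the 0.5 ms figure is an assumption chosen for illustration, not a measurement. Under those assumptions a new arrival waits roughly queue depth times service time before its own service begins.

```python
def queue_wait_ms(queue_depth: int, service_ms: float) -> float:
    """Approximate wait before service starts for a request arriving
    behind `queue_depth` others, assuming one-at-a-time service."""
    return queue_depth * service_ms


SERVICE_MS = 0.5  # assumed per-request service time, illustration only

short_queue = queue_wait_ms(2, SERVICE_MS)
long_queue = queue_wait_ms(200, SERVICE_MS)

print(f"queue of 2:   ~{short_queue:.1f} ms wait")
print(f"queue of 200: ~{long_queue:.1f} ms wait")
```

Both disks report the same utilisation; only the queue depth separates a wait of about a millisecond from a wait of about a tenth of a second.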
Errors are the third axis because some failures never touch the first two. A NIC dropping one packet in a thousand can sit at modest utilisation and an empty queue while quietly destroying throughput, because every dropped packet triggers a retransmit and a congestion-control backoff up in TCP. A disk with rising reallocated-sector counts will reroute reads and keep working, slower, with no obvious busy-ness signal. If you only watch utilisation and saturation you will stare at a healthy-looking box and never understand why it's slow. The error counters are cheap to read and they catch the failure class the other two are blind to.
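One way to see how a small loss rate can cap throughput is the Mathis et al. approximation for steady-state TCP, which bounds per-connection throughput at roughly (MSS / RTT) * C / sqrt(p). This is a swapped-in rule of thumb for loss-based congestion control, not something the USE method itself prescribes, and the MSS and RTT values below are assumptions for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Rough upper bound on steady-state TCP throughput in bits/s,
    using the Mathis et al. approximation (MSS / RTT) * C / sqrt(p)."""
    c = math.sqrt(3 / 2)  # constant from the approximation, about 1.22
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss)


# Assumed values for illustration: 1460-byte MSS, 10 ms round trip.
for loss in (0.00001, 0.0001, 0.001):
    mbps = mathis_throughput_bps(1460, 0.010, loss) / 1e6
    print(f"loss {loss:.5f}: at most ~{mbps:.0f} Mbit/s per connection")
```

Under these assumptions a loss rate of one in a thousand limits a single connection to a few tens of Mbit/s, however fast the link is, which is why the interface can look lightly used while the service is slow.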
Building the resource list
The hardest part of USE for a newcomer is not reading the tools, it's knowing what counts as a resource. The rule is simple: a resource is any physical or logical thing that requests have to wait for. If two requests can contend for it, it belongs on the list. That covers the obvious hardware — CPUs, memory, disks, network interfaces — and it also covers things that are easy to forget, like the memory bus, interrupt handling, file descriptors, and locks.
Gregg's advice for building the list is to draw the functional block diagram of the system: CPUs connected to memory across a bus, I/O controllers hanging off that, disks and NICs behind the controllers. Every box and every arrow on that diagram is a resource, because the arrows — the interconnects — saturate too. A storage controller that can drive eight disks at full speed still has a single link back to the CPU, and that link is a resource you will not see if you only list the disks. The same goes for a network card that can saturate its PCIe lane before it saturates the wire. The interconnects are where the surprising bottlenecks live precisely because nobody lists them.
You do not need a perfect list to start. Build the standard one below, run a pass, and add anything specific to your workload as you learn the box. On a managed cloud instance, network and disk are usually the first to saturate because they're often throttled below the raw hardware limit; on a database box, locks and disk I/O; on a compute box, CPU and memory bandwidth. The list is a living thing, but the standard set covers the large majority of real incidents.
| Resource | What to check | Typical tool |
|---|---|---|
| CPU | Per-core %busy, run queue length, machine-check and thermal-throttle events | mpstat -P ALL 1, vmstat 1, dmesg |
| Memory capacity | Used vs free, swap activity, OOM kills | free -m, vmstat, dmesg |
| Memory bandwidth | Bytes/sec moving across the bus; saturation = stalls | perf stat -e mem-loads,mem-stores |
| Storage I/O | %util, avgqu-sz, errors per disk | iostat -xz 1 |
| Network I/O | RX/TX bytes, packet drops, retransmits | sar -n DEV 1, ss -s |
| File descriptors | Current open vs ulimit -n | lsof -p PID, /proc/PID/limits |
| Locks (kernel) | spinlock contention, mutex wait time | perf lock, eBPF tracing |
| Locks (application) | Wait time on app-level mutexes | language profilers (async-profiler, pprof) |
Per resource: what each number means, and how to read it
The three metrics are the same everywhere, but what counts as utilisation, saturation, and an error differs by resource. Going through them one at a time is the fastest way to build the mental model, because it shows you exactly which column of which tool to read for each cell of the worksheet.
CPU
Utilisation is per-core busy time, and the word "per-core" is doing real work
there. An aggregate "40% CPU" across eight cores can mean one core pinned at
100% while seven idle — the classic signature of a single-threaded hot path or
a contended lock — and the aggregate number hides it completely. Read it with
mpstat -P ALL 1, which prints every core on its own line.
Saturation for the CPU is the run queue: how many threads are runnable but
waiting for a core. vmstat 1 shows it in the r
column; a run queue that sits well above the core count means threads are
spending time waiting to run, not running. Errors are rarer here but real —
thermal throttling, correctable machine-check events — and show up in
dmesg and under /sys or via perf.
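The per-core point can be sketched directly from the counters the kernel exposes in /proc/stat. The snippet below is a minimal illustration, not how mpstat is implemented: it takes two snapshots of the file as text and computes busy percentage per core, counting idle and iowait as not busy. The sample snapshots are made up.

```python
def parse_cpu_times(proc_stat_text: str) -> dict:
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each 'cpuN' line."""
    out = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue  # skip the aggregate 'cpu' line and non-CPU lines
        # Columns: user nice system idle iowait irq softirq steal
        values = [int(v) for v in fields[1:9]]
        not_busy = values[3] + values[4]  # idle + iowait
        total = sum(values)
        out[fields[0]] = (total - not_busy, total)
    return out


def per_core_busy(before: str, after: str) -> dict:
    """Percent busy per core between two /proc/stat snapshots."""
    a, b = parse_cpu_times(before), parse_cpu_times(after)
    result = {}
    for cpu in b:
        busy = b[cpu][0] - a[cpu][0]
        total = b[cpu][1] - a[cpu][1]
        result[cpu] = 100.0 * busy / total if total else 0.0
    return result


# Hypothetical snapshots taken one interval apart.
BEFORE_STAT = """cpu  4000 0 800 15200 0 0 0 0 0 0
cpu0 1000 0 200 3800 0 0 0 0 0 0
cpu1 1000 0 200 3800 0 0 0 0 0 0
cpu2 1000 0 200 3800 0 0 0 0 0 0
cpu3 1000 0 200 3800 0 0 0 0 0 0
ctxt 123456
"""
AFTER_STAT = """cpu  4114 0 816 15470 0 0 0 0 0 0
cpu0 1090 0 210 3800 0 0 0 0 0 0
cpu1 1008 0 202 3890 0 0 0 0 0 0
cpu2 1008 0 202 3890 0 0 0 0 0 0
cpu3 1008 0 202 3890 0 0 0 0 0 0
ctxt 123999
"""

busy = per_core_busy(BEFORE_STAT, AFTER_STAT)
aggregate = sum(busy.values()) / len(busy)
for cpu, pct in busy.items():
    print(f"{cpu}: {pct:.0f}% busy")
print(f"aggregate: {aggregate:.1f}% busy")
```

In this made-up sample the aggregate sits near a third while cpu0 is fully busy, which is the pattern the aggregate number hides.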
Memory
Memory needs splitting into capacity and bandwidth. For capacity, utilisation
is used versus total, but the saturation signal is the one that matters:
swapping and the OOM killer. A box can sit at 95% memory used and be perfectly
healthy; the moment it starts paging anonymous memory out to swap, latency
falls off a cliff, and that's the saturation event. Watch the si
and so columns in vmstat 1 (swap in / swap out) and
scan dmesg for OOM kills. For bandwidth, utilisation is bytes per
second moving across the memory bus, which you read with
perf stat on memory events; its saturation shows up as stalled
CPU cycles waiting on memory, which makes a workload look CPU-bound when it's
really starved for bandwidth. Errors are ECC events, also in dmesg.
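The paging signal can also be read from the raw counters. A minimal sketch, assuming two text snapshots of /proc/vmstat taken a known interval apart: the pswpin and pswpout counters are cumulative pages swapped in and out, so their deltas are the saturation signal. Note the unit differs from vmstat, which reports si and so as a data rate rather than in pages. The sample values are invented.

```python
def swap_rates(before: str, after: str, interval_s: float) -> dict:
    """Pages swapped in/out per second between two /proc/vmstat snapshots."""
    def counters(text: str) -> dict:
        pairs = (line.split() for line in text.splitlines())
        return {p[0]: int(p[1]) for p in pairs if len(p) == 2}

    a, b = counters(before), counters(after)
    return {k: (b[k] - a[k]) / interval_s for k in ("pswpin", "pswpout")}


# Hypothetical snapshots one second apart.
BEFORE_VMSTAT = "nr_free_pages 104000\npswpin 1000\npswpout 5000\n"
AFTER_VMSTAT = "nr_free_pages 97000\npswpin 1220\npswpout 5480\n"

rates = swap_rates(BEFORE_VMSTAT, AFTER_VMSTAT, 1.0)
is_paging = any(rate > 0 for rate in rates.values())
print(rates, "paging" if is_paging else "not paging")
```

Any sustained non-zero rate here matters more than the used-versus-total figure, for the reason given above.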
Disk (storage I/O)
The whole picture for one disk is in a single tool: iostat -xz 1.
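Utilisation is the %util column. Saturation is the average queue
depth, aqu-sz (avgqu-sz in older sysstat releases). The await column is the
average time a request spends queued plus being serviced, so a climbing await
alongside a deep queue means requests are waiting in line rather than being
slow at the device itself; svctm is deprecated and has been dropped from
recent sysstat releases, so don't rely on it.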
Errors are not in iostat — you get them from dmesg,
from smartctl -a, and from the reallocated-sector and pending-sector
SMART counters. A disk at 100% util with a queue of one is fine; the same disk
with a queue of fifty and a climbing await is the bottleneck.
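To make the three disk columns less of a black box, here is a sketch that derives them from two snapshots of one device's line in /proc/diskstats. It is an approximation of what iostat reports, written for illustration rather than copied from its implementation, and the sample lines are invented. The field positions follow the kernel's documented layout: reads, writes, time spent on each, time the device was busy, and the weighted time in queue.

```python
def disk_use(before: str, after: str, interval_s: float) -> dict:
    """Approximate %util, average queue depth and await for one device
    from two /proc/diskstats lines taken `interval_s` seconds apart."""
    def parse(line: str) -> dict:
        f = line.split()
        # f[0:3] are major, minor, name; the counters start at f[3].
        v = [int(x) for x in f[3:14]]
        return {
            "ios": v[0] + v[4],          # reads + writes completed
            "io_wait_ms": v[3] + v[7],   # ms spent reading + writing
            "busy_ms": v[9],             # ms the device had I/O in flight
            "weighted_ms": v[10],        # ms weighted by requests in flight
        }

    a, b = parse(before), parse(after)
    d = {k: b[k] - a[k] for k in a}
    interval_ms = interval_s * 1000
    return {
        "util_pct": 100.0 * d["busy_ms"] / interval_ms,
        "aqu_sz": d["weighted_ms"] / interval_ms,
        "await_ms": d["io_wait_ms"] / d["ios"] if d["ios"] else 0.0,
    }


# Hypothetical lines for one device, one second apart.
BEFORE_DISK = "259 0 nvme0n1 1000 0 8000 2000 5000 0 40000 10000 0 9000 12000"
AFTER_DISK = "259 0 nvme0n1 1100 0 8800 12000 5400 0 43200 50000 3 10000 62000"

stats = disk_use(BEFORE_DISK, AFTER_DISK, 1.0)
print(stats)
```

The sample is internally consistent in the way real output should be: 500 requests per second at an average of 100 ms each implies about 50 requests in the system at any moment, which matches the queue depth.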
Network
Utilisation is throughput against the link's capacity, which sounds simple but
is the resource people most often mis-measure, because the real ceiling is
frequently the cloud provider's cap rather than the wire speed. Read RX/TX
with sar -n DEV 1 or nicstat. Saturation appears as
drops on the interface queues and, up at the TCP layer, as retransmits and a
growing send buffer; ss -ti shows per-socket retransmit counts and
ss -s gives the summary. Errors are interface errors and drops
from ip -s link. A NIC sitting at 30% throughput with a steady
drip of retransmits is saturated even though utilisation looks comfortable —
the kernel is choking on the queue, not the wire.
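A sketch of the same three readings for one interface, computed from two text snapshots of /proc/net/dev. The capacity passed in is an assumption you have to supply yourself (the provider's cap if there is one, otherwise the link speed), and the sample counters are invented. Utilisation is taken from the busier direction, since a full-duplex link carries each direction separately.

```python
def nic_use(before: str, after: str, iface: str,
            interval_s: float, cap_bits_per_s: float) -> dict:
    """Utilisation, drops and errors for one interface between two
    /proc/net/dev snapshots."""
    def parse(text: str) -> dict:
        for line in text.splitlines():
            name, sep, rest = line.partition(":")
            if sep and name.strip() == iface:
                f = [int(x) for x in rest.split()]
                # 8 receive fields (bytes packets errs drop ...),
                # then 8 transmit fields in the same order.
                return {"rx_bytes": f[0], "rx_errs": f[2], "rx_drop": f[3],
                        "tx_bytes": f[8], "tx_errs": f[10], "tx_drop": f[11]}
        raise KeyError(iface)

    a, b = parse(before), parse(after)
    d = {k: b[k] - a[k] for k in a}
    busiest_bps = max(d["rx_bytes"], d["tx_bytes"]) * 8 / interval_s
    return {
        "util_pct": 100.0 * busiest_bps / cap_bits_per_s,
        "drops": d["rx_drop"] + d["tx_drop"],
        "errors": d["rx_errs"] + d["tx_errs"],
    }


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Hypothetical counters one second apart.
BEFORE_NET = HEADER + "  eth0: 1000000 1000 0 0 0 0 0 0 500000 800 0 0 0 0 0 0\n"
AFTER_NET = HEADER + "  eth0: 15000000 11000 0 5 0 0 0 0 7500000 6000 0 0 0 0 0 0\n"

nic = nic_use(BEFORE_NET, AFTER_NET, "eth0", 1.0, 1e9)  # assumed 1 Gbit/s cap
print(nic)
```

With an assumed 1 Gbit/s ceiling the sample interface is barely used, yet it dropped packets during the interval, so the drops column is the one to follow up.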
Controllers and interconnects
These are the ones the standard list tends to skip. A storage HBA, a PCIe
lane, a NUMA interconnect — each has a ceiling, and each can saturate before
the devices behind it do. Utilisation here is throughput against the link's
rated bandwidth; saturation is queuing at the controller. The tooling is
thinner — perf, vendor counters, NUMA stats from
numastat — but the point of listing them is to make you ask the
question. The most expensive incidents are usually a resource nobody thought
to check, and the interconnect is the canonical example.
File descriptors and locks
These are logical resources, but they queue and they run out, so they're in
scope. For file descriptors, utilisation is open descriptors against the
ulimit -n ceiling; read it from /proc/PID/limits and
count what's in /proc/PID/fd. There's no real saturation — you
either have a free descriptor or you don't — and the error is the hard wall:
the service starts returning errors when it can't open a socket. For locks,
utilisation is the fraction of time the lock is held, saturation is wait time
for threads queued on it, and you read both from perf lock at the
kernel level or a language profiler (async-profiler, pprof) at the application
level. A contended lock is invisible to every hardware metric and only shows
up if you put locks on the list.
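The file-descriptor cell reduces to one division. The sketch below parses the text of /proc/PID/limits for the soft "Max open files" value and compares it with a descriptor count you supply; the sample text and the count are made up, and a limit reported as "unlimited" is not handled.

```python
def fd_utilisation(open_fds: int, limits_text: str) -> float:
    """Percent of the soft 'Max open files' limit currently in use."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: Max open files <soft> <hard> <units>
            soft = line.split()[3]
            return 100.0 * open_fds / int(soft)
    raise ValueError("no 'Max open files' line found")


# Hypothetical excerpt of /proc/PID/limits.
SAMPLE_LIMITS = """Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""

pct = fd_utilisation(768, SAMPLE_LIMITS)
print(f"file descriptors: {pct:.0f}% of soft limit")
```

On a live Linux box the count would come from the number of entries in /proc/PID/fd, for example `len(os.listdir(f"/proc/{pid}/fd"))`, and the text from reading /proc/PID/limits.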
The worksheet
A USE pass takes about ten minutes. You fill in a small grid — one row per resource, three columns for the metrics — and look for the first cell where a number is meaningfully above what you'd expect for the workload. That's the bottleneck, or at least the first thing to investigate.
# A worked USE pass on a service box under load.
# Goal: find the resource that's first to give.
# CPU
mpstat -P ALL 1 5
→ core 0: 98% busy (util high; runq stable at 4 — investigating)
→ core 1: 35% busy
→ core 2: 37% busy
→ core 3: 33% busy
→ conclusion: one core hot, others idle → probably a hot lock or
single-threaded code path
# Memory
free -m
→ used: 14 GB / 16 GB ; swap: 0
→ conclusion: fine
# Storage
iostat -xz 1 5
→ sda: %util 12 ; avgqu-sz 0.4 ; errors 0
→ conclusion: fine
# Network
sar -n DEV 1 5
→ eth0: rx 14 MB/s ; tx 7 MB/s ; drops 0
→ conclusion: fine
# Locks (app-level)
asprof -e lock -d 30 -f locks.html PID   # async-profiler's launcher; profiler.sh in older releases
→ 78% of contended-lock time on one synchronized map
→ conclusion: this is the bottleneck

The pass above took eight minutes. It found a contended lock pinning one core while
the other three had plenty of headroom. Without USE you might have stared at top's
CPU% column and concluded "the CPU is fine, only about 50%"; the per-core view is
what catches the single-threaded bottleneck.
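The worksheet itself is a small data structure, sketched here as an ordered list of rows with three cells each. A cell holds True when the reading is hot for the workload; deciding what counts as hot is a judgment you make while filling it in, and this sketch does not try to automate it. The names `METRICS` and `first_hot_cell` are illustrative, and the rows are filled in from the pass above.

```python
from typing import Optional

METRICS = ("utilisation", "saturation", "errors")


def first_hot_cell(worksheet: list) -> Optional[tuple]:
    """Walk the rows in order and return (resource, metric) for the
    first cell marked hot, or None if the whole grid is clean."""
    for resource, cells in worksheet:
        for metric in METRICS:
            if cells.get(metric, False):
                return (resource, metric)
    return None


# Filled in by hand from the worked pass.
worksheet = [
    ("CPU",         {"utilisation": True,  "saturation": False, "errors": False}),
    ("Memory",      {"utilisation": False, "saturation": False, "errors": False}),
    ("Storage",     {"utilisation": False, "saturation": False, "errors": False}),
    ("Network",     {"utilisation": False, "saturation": False, "errors": False}),
    ("Locks (app)", {"utilisation": False, "saturation": True,  "errors": False}),
]

lead = first_hot_cell(worksheet)
print("start here:", lead)
```

The first hot cell is the per-core CPU reading, and following that lead is what turns up the contended lock further down the grid.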
Worked example: a saturated disk
Here's a different shape of incident. A reporting service has gone slow. Tail latency on its queries has gone from 200ms to several seconds, but the error rate is flat and the request rate hasn't changed. Nothing is broken; everything is just slow. That last detail — slow but not failing, with no change in load — is the signature of a saturated resource, and USE finds which one.
# CPU
mpstat -P ALL 1 5
→ all cores 10–20% busy, run queue empty
→ conclusion: CPU is idle, not the problem
# Memory
free -m ; vmstat 1 5
→ used 9 GB / 32 GB ; si 0 so 0 ; no OOM in dmesg
→ conclusion: fine
# Disk
iostat -xz 1 5
Device %util aqu-sz await r/s w/s
nvme0n1 99.4 74.0 310 40 200
→ util pegged, queue 74 deep, await 310ms
→ conclusion: this disk is saturated — start here
# confirm errors
smartctl -a /dev/nvme0n1 | grep -i error
→ 0 errors ; SMART healthy
→ conclusion: not a failing disk, just overloaded

The disk is the lead: pegged utilisation, a 74-deep queue, and an
await of 310ms, which means the average request spends about a third of a
second queued and in service, most of it waiting in line. That queue is the saturation, and
it's what the user feels as multi-second latency. Notice what utilisation alone
would have told you — "the disk is 100% busy" — which sounds alarming but is
normal under load and gives you nothing to act on. The queue depth and
await are what say "overloaded," and the SMART check rules out the
third axis so you know you're chasing load, not a dying device.
Where it goes from there is no longer a USE question — USE has done its job by
naming the resource. The next step is to find what's driving the I/O
(iotop, biolatency from bcc) and whether it's a sudden
flood of writes, a missing index turning queries into full scans, or a noisy
neighbour on shared storage. The relationship between queue depth and latency
here is not a coincidence; it's exactly what
queueing theory
predicts once utilisation gets close to the knee, and that page explains why the
latency climbs so steeply at the end.
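The shape of that curve can be sketched with the simplest textbook model, an M/M/1 queue, where mean response time is the service time divided by (1 - utilisation). Real disks are not M/M/1 servers, so treat this as an illustration of the knee rather than a prediction; the 1 ms service time is an assumption.

```python
def mm1_response_ms(service_ms: float, utilisation: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_ms / (1.0 - utilisation)


# Assumed 1 ms service time, illustration only.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    ms = mm1_response_ms(1.0, rho)
    print(f"util {rho:.0%}: mean response ~{ms:.0f} ms")
```

Going from 50% to 90% busy multiplies the mean response time by five in this model, and the last few percent before 100% multiply it again, which is why a pegged device with a growing queue feels so much worse than a merely busy one.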
Worked example: a memory-saturated box
One more, because memory saturation has a distinctive and easily-misread signature. A service has become erratic: mostly fine, then suddenly a burst of very slow requests, then fine again. CPU graphs look spiky but never pinned. The instinct is to blame the CPU or a slow downstream call. USE points elsewhere.
# Memory capacity
free -m
→ used 30.4 GB / 32 GB ; available 600 MB ; swap used 1.8 GB
# Saturation: the tell is paging, not "used"
vmstat 1 10
procs memory swap io cpu
r b free cache si so bi bo us sy id wa
2 3 410M 1.1G 220 480 ... ... 30 12 18 40
2 4 380M 1.0G 310 690 ... ... 28 10 14 48
→ si/so non-zero and bursty ; wa (I/O wait) high
→ conclusion: the box is paging — memory is saturated
# Confirm the worst case
dmesg | grep -i 'killed process'
→ Out of memory: Killed process 4821 (report-worker)
→ conclusion: it has already OOM-killed a worker

The lead here is not the used number; 30 of 32 GB used is not by
itself a problem. The lead is the swap-in / swap-out activity (si
and so) in vmstat. Once the box runs low on memory the
kernel starts paging anonymous memory to disk, and every page fault that has to
read from swap turns a nanosecond memory access into a millisecond disk access.
That's the source of the bursty slowness and the high I/O wait — the CPU isn't
busy, it's blocked waiting for paged-out memory to come back. The OOM kill in
dmesg is the saturation event taken to its conclusion: the kernel
gave up and killed a process to reclaim memory.
This is a clean illustration of why utilisation and saturation are separate columns. A glance at "memory used" would have said the box was fine right up until it wasn't. The saturation signal — paging — is what was rising the whole time, and it's the metric that would have warned you before the OOM kill. The fix is a capacity question (more memory, a lower per-worker footprint, fewer workers), but USE's contribution is unambiguous: it told you the resource was memory and the mechanism was paging, in about three minutes, without a guess.
Where USE catches things other methods miss
USE's distinguishing trick is being exhaustive. Most performance investigations skip resources because they "feel fine"; USE asks you to check every one. A few patterns it catches that other methods leave alone:
- Per-resource imbalance. One CPU pinned, others idle. One disk at 100% util, others at 5%. The aggregate average hides it; the per-resource view doesn't.
- Saturation without high utilisation. A NIC with constant retransmits sits at maybe 30% throughput because the kernel is choking on the queue. Utilisation says "fine"; saturation says "investigate".
- Silent errors. A SATA disk with read errors will retry and remap the failing reads below the application; you'll see latency spikes and never see CPU pressure. The errors column catches it.
- Resources you forgot existed. File descriptors are the classic — a service that runs fine for six days then starts returning 503 because it leaked sockets. USE's exhaustive list keeps you from skipping it.
USE vs RED — different layers, both needed
USE inspects boxes. RED inspects services. They are not alternatives — they answer different questions, and a complete picture needs both. A service can be slow because the request rate has spiked (RED catches it), or because the box it runs on has a saturated disk (USE catches it). The first failure mode of any new oncall engineer is to use one method when the other one is needed.
| | USE | RED |
|---|---|---|
| Layer | Hardware resource | Service / endpoint |
| Question | Is this resource saturated? | Is this service healthy? |
| Metrics | Utilisation, saturation, errors | Rate, errors, duration |
| Best for | Single-host investigation | Service fleet, SLO tracking |
| Coverage | What the kernel can see | What the service emits |
The clean way to hold the two together: RED works top-down from the symptom, USE works bottom-up from the hardware. A real incident usually starts with a RED signal (an SLO burning, a latency alert, an error spike on an endpoint) because that's what's wired to alerting. RED tells you which service is unhappy. USE then tells you, on the boxes behind that service, which resource is under pressure. Run RED to localise the symptom to a service, then run USE on that service's hosts to find the saturated resource. The two methods meet in the middle, and the place they meet is where the root-cause work starts.
Both methods feed the same telemetry pipeline. The utilisation, saturation, and error numbers USE reads by hand on one box are exactly the kind of signals you want shipped as metrics, with logs and traces alongside, so that the manual pass becomes a dashboard and the next incident starts from a graph instead of a terminal. USE is the method you run by hand on the box in front of you; the same checklist, instrumented, is what your monitoring should cover for every box you can't log into in time.
Where USE falls short
USE is a single-host, resource-level method, and that scope is also its limit. It's worth being honest about the cases it handles badly, because a method applied outside its range produces false confidence.
- Distributed bottlenecks. When the slow part is the path between services — a slow downstream dependency, a chatty fan-out, a queue between two systems — every individual box can pass USE clean while the request is slow. USE checks resources on one host; it has no view of the request as it crosses hosts. That's a job for tracing and for RED across the fleet.
- Software bottlenecks that aren't resources. An O(n²) algorithm, a bad query plan, an unbounded retry loop — these burn a resource, so USE points at the resource, but the resource isn't the problem, the code is. USE tells you the CPU is hot; it can't tell you the CPU is hot because someone wrote a quadratic loop. You need a profiler for that, which is the natural next step after USE names the resource.
- Application-level locks and logical limits. USE can include these if you put them on the list, but they're invisible to the hardware tools, so they're easy to omit. A connection pool exhausted at 100 connections is a saturated resource the kernel cannot see at all.
- Intermittent and load-correlated faults. A USE pass is a snapshot. A bottleneck that only appears at peak, or only when a specific tenant runs a specific job, may not be present when you run the pass. USE is strongest on a box that is slow right now; for problems that come and go you need the same metrics recorded over time, not sampled by hand.
None of this makes USE less useful — it makes it a first pass, not the whole investigation. The honest workflow is: RED or a trace to find the unhappy service, USE to find the saturated resource on its hosts, then a profiler to find the code path hitting that resource. USE owns the middle step, and it owns it better than anything else, but it is a step, not the destination.
Production checklist
- Print the USE worksheet. One row per resource (CPU, memory, disk, network, FDs, locks); three columns (utilisation, saturation, errors). Keep it pinned somewhere reachable.
- Run the pass in order. CPU first, then memory, then disk, then network, then descriptors, then locks. Stop at the first cell that's hot.
- Always look per-resource, not aggregate. Per-core CPU, per-disk I/O, per-NIC throughput. Averages hide imbalances.
- Saturation matters as much as utilisation. A 100% busy resource with a short queue is normal; a saturated queue is not.
- Check error counters even when nothing looks broken. A silent disk error doesn't show up in CPU% but does show up as tail-latency spikes.
- Pair with RED. USE for the box, RED for the service. Use both, in that order if the symptom is on a single host; reverse the order if the symptom is fleet-wide.
- Promote anything chronic to a dashboard. If the same resource keeps appearing in USE passes, it belongs on a dashboard with an alert — not in oncall's head.
Further reading
- Brendan Gregg — "The USE Method". The original write-up. Worksheet templates for Linux, Solaris, and AWS EC2 included.
- Brendan Gregg — Systems Performance (2nd ed.). Chapter 2 covers USE in full, with worked examples and the exhaustive Linux tool reference.
- nicstat / sar / mpstat / iostat / vmstat manpages. The actual tools USE depends on. Worth reading the manpages once.
- Brendan Gregg — "Linux perf Examples". When the standard counters aren't enough, perf is the next stop.
- Adjacent: The RED method. The service-level companion. Use both for full coverage.
- Adjacent: Profiling in production. When USE has pointed at a hot resource, profiling tells you what's hitting it.