The USE method

Brendan Gregg's checklist for finding a bottleneck in a single box. For every resource — CPU, disk, NIC, memory bus, lock, file descriptor — record three numbers: utilisation, saturation, errors. The first resource where any of the three is high is usually the bottleneck. The whole walk-through takes a few minutes; the value is that it's exhaustive, so the resource you would have skipped is the one you actually check.

Why a method at all

Most performance debugging starts the same wrong way. A box is slow, someone opens top, sees a number that looks plausible, and starts forming a theory. The theory drives where they look next, which means they only ever check the resources the theory predicts. If the theory is wrong they burn an hour confirming a non-problem, and the actual bottleneck — a leaked file descriptor, a single pinned core, a disk quietly retrying reads — never gets looked at because nothing pointed there.

The USE method, from Brendan Gregg, fixes this by removing the theory from the first pass. Instead of guessing where to look, you check everything in a fixed order and let the numbers point. You list every resource on the box, and for each one you record three values: utilisation, saturation, and errors. The first resource where any of the three is unusually high is your lead. You are not diagnosing the root cause yet — you are finding the place to start, and you are finding it without a hunch that could be wrong. The whole pass is mechanical, which is exactly the point: a tired engineer at 3am should be able to run it correctly.

USE is a checklist, not a tool. It does not collect anything itself; it tells you what to collect and from where. The tools it sits on top of are the standard ones every Linux box already has — top, mpstat, iostat, sar, vmstat, ss, dmesg. The skill USE adds is not reading any one of them; it is the discipline of reading all of them, in order, before you decide what is wrong.

The three metrics

Each cell in the checklist asks a question you can read off a standard tool. There's no thinking involved at this stage: collect, compare, find the first cell that's hot. The three metrics are deliberately different from each other, because they catch different failure modes, and a resource can be in trouble on one while looking fine on the other two.

Metric	Definition	Read it from
Utilisation	Fraction of time the resource is busy doing work. A CPU at 80% utilisation is busy 80% of the time.	`mpstat`, `vmstat`, `iostat -x`
Saturation	Amount of work the resource cannot service right now, sitting in a queue. Run queue length, disk queue depth, NIC tx queue.	`vmstat 1` (r column), `sar -q` (runq-sz), `iostat -x` (aqu-sz; avgqu-sz on older sysstat), `ss -ti`
Errors	Hardware or kernel-reported errors on the resource. Disk read errors, NIC packet drops, ECC events.	`dmesg`, `ip -s link`, `smartctl`, `perf`

Utilisation is not saturation. A disk at 100% utilisation is busy continuously; that's normal under heavy load. A disk with a 200-deep queue is saturated; that's a problem. The two metrics are correlated but distinct, and the distinction is the whole reason USE has both.

The difference is worth dwelling on, because it's the part people get wrong. Utilisation tops out at 100% — the resource cannot be more than fully busy. Once it's pegged there, the utilisation number stops telling you anything; it can't go higher to signal that the pressure is getting worse. Saturation is what keeps rising after that. A disk at 100% util with a queue of two is handling its load fine. The same disk at 100% util with a queue of two hundred is the bottleneck, and the only metric that distinguishes those two states is the queue depth. This is also why latency tracks saturation, not utilisation: each request now waits behind a long line of others before it gets serviced, so the tail latency climbs even though the busy-ness number hasn't moved.

Once utilisation flattens at 100%, it stops being informative. Saturation is the metric that keeps rising — and it's what your users feel as latency.
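The two disk states described above can be told apart mechanically. A minimal sketch, assuming a hypothetical helper named classify_resource and an illustrative queue threshold of 8; neither is part of the USE method, and the threshold should be tuned per resource:

```shell
# classify_resource UTIL_PCT QUEUE_DEPTH [QUEUE_LIMIT]
# Hypothetical helper: QUEUE_LIMIT (default 8) is an illustrative threshold,
# not a value defined by the USE method.
classify_resource() {
    awk -v util="$1" -v queue="$2" -v limit="${3:-8}" 'BEGIN {
        if ((queue + 0) > (limit + 0))  print "saturated"   # work is piling up
        else if ((util + 0) >= 95)      print "busy"        # fully used, keeping up
        else                            print "ok"
    }'
}
```

With these assumptions, `classify_resource 100 2` prints `busy` and `classify_resource 100 200` prints `saturated`: utilisation is identical in both calls, and only the queue depth separates them.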

Errors are the third axis because some failures never touch the first two. A NIC dropping one packet in a thousand can sit at modest utilisation and an empty queue while quietly destroying throughput, because every dropped packet triggers a retransmit and a congestion-control backoff up in TCP. A disk with rising reallocated-sector counts will reroute reads and keep working, slower, with no obvious busy-ness signal. If you only watch utilisation and saturation you will stare at a healthy-looking box and never understand why it's slow. The error counters are cheap to read and they catch the failure class the other two are blind to.

Building the resource list

The hardest part of USE for a newcomer is not reading the tools, it's knowing what counts as a resource. The rule is simple: a resource is any physical or logical thing that requests have to wait for. If two requests can contend for it, it belongs on the list. That covers the obvious hardware — CPUs, memory, disks, network interfaces — and it also covers things that are easy to forget, like the memory bus, interrupt handling, file descriptors, and locks.

Gregg's advice for building the list is to draw the functional block diagram of the system: CPUs connected to memory across a bus, I/O controllers hanging off that, disks and NICs behind the controllers. Every box and every arrow on that diagram is a resource, because the arrows — the interconnects — saturate too. A storage controller that can drive eight disks at full speed still has a single link back to the CPU, and that link is a resource you will not see if you only list the disks. The same goes for a network card that can saturate its PCIe lane before it saturates the wire. The interconnects are where the surprising bottlenecks live precisely because nobody lists them.

You do not need a perfect list to start. Build the standard one below, run a pass, and add anything specific to your workload as you learn the box. On a managed cloud instance, network and disk are usually the first to saturate because they're often throttled below the raw hardware limit; on a database box, locks and disk I/O; on a compute box, CPU and memory bandwidth. The list is a living thing, but the standard set covers the large majority of real incidents.

Resource	What to check	Typical tool
CPU	Per-core %busy, run queue length, scheduler errors	`mpstat -P ALL 1`
Memory capacity	Used vs free, swap activity, OOM kills	`free -m`, `vmstat`, `dmesg`
Memory bandwidth	Bytes/sec moving across the bus; saturation = stalls	`perf stat -e mem-loads,mem-stores`
Storage I/O	%util, avgqu-sz, errors per disk	`iostat -xz 1`
Network I/O	RX/TX bytes, packet drops, retransmits	`sar -n DEV 1`, `ss -s`
File descriptors	Current open vs `ulimit -n`	`lsof -p PID`, `/proc/PID/limits`
Locks (kernel)	spinlock contention, mutex wait time	`perf lock`, eBPF tracing
Locks (application)	Wait time on app-level mutexes	language profilers (async-profiler, pprof)

Per resource: what each number means, and how to read it

The three metrics are the same everywhere, but what counts as utilisation, saturation, and an error differs by resource. Going through them one at a time is the fastest way to build the mental model, because it shows you exactly which column of which tool to read for each cell of the worksheet.

CPU

Utilisation is per-core busy time, and the word "per-core" is doing real work there. An aggregate "13% CPU" across eight cores can mean one core pinned at 100% while seven sit idle (100 / 8 = 12.5), the classic signature of a single-threaded hot path or a contended lock, and the aggregate number hides it completely. Read it with mpstat -P ALL 1, which prints every core on its own line. Saturation for the CPU is the run queue: how many threads are runnable but waiting for a core. vmstat 1 shows it in the r column; a run queue that sits well above the core count means threads are spending time waiting to run, not running. Errors are rarer here but real (thermal throttling, correctable machine-check events) and show up in dmesg, under /sys, or via perf.
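The gap between the aggregate and the per-core view is easy to compute. A small sketch, assuming a hypothetical helper core_imbalance that reads lines of "core busy_pct" on stdin; how you produce those lines depends on your mpstat version and locale:

```shell
# core_imbalance: read "CORE BUSY_PCT" lines on stdin and print the mean,
# the maximum, and which core is hottest. Hypothetical helper for illustration.
core_imbalance() {
    awk 'BEGIN { max = -1 }
         NF >= 2 { sum += $2; n++
                   if ($2 + 0 > max) { max = $2 + 0; hot = $1 } }
         END { if (n == 0) exit 1
               printf "mean=%.2f max=%.2f hot=%s\n", sum / n, max, hot }'
}
```

One possible way to feed it, assuming the sysstat layout where the summary lines start with `Average:` and `%idle` is the last column: `mpstat -P ALL 1 1 | awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ { print $2, 100 - $NF }' | core_imbalance`. A large distance between mean and max is the imbalance the aggregate number hides.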

Memory

Memory needs splitting into capacity and bandwidth. For capacity, utilisation is used versus total, but the saturation signal is the one that matters: swapping and the OOM killer. A box can sit at 95% memory used and be perfectly healthy; the moment it starts paging anonymous memory out to swap, latency falls off a cliff, and that's the saturation event. Watch the si and so columns in vmstat 1 (swap in / swap out) and scan dmesg for OOM kills. For bandwidth, utilisation is bytes per second moving across the memory bus, which you read with perf stat on memory events; its saturation shows up as stalled CPU cycles waiting on memory, which makes a workload look CPU-bound when it's really starved for bandwidth. Errors are ECC events, also in dmesg.
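The paging check can be scripted against vmstat output. A sketch, assuming the default procps vmstat column layout (si is the 7th field, so the 8th) and a hypothetical helper name paging_samples; other layouts, such as `vmstat -w` or `-a`, may need different field numbers:

```shell
# paging_samples: read vmstat output on stdin and count the samples in which
# swap-in or swap-out was non-zero. Assumes the default column order
# "r b swpd free buff cache si so ...", so si is field 7 and so is field 8.
paging_samples() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 8 {        # data rows only; header rows are skipped
             total++
             if ($7 + $8 > 0) paging++
         }
         END { printf "%d of %d samples paging\n", paging, total }'
}
```

Usage would look like `vmstat 1 10 | paging_samples`. Note that the first data row vmstat prints is an average since boot, so a single non-zero sample at the top is weaker evidence than sustained non-zero rows.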

Disk (storage I/O)

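The whole picture for one disk is in a single tool: iostat -xz 1. Utilisation is the %util column. Saturation is the average queue depth, aqu-sz (avgqu-sz in older sysstat versions). The await column (split into r_await and w_await) is the average time from submission to completion, so it includes time spent queued as well as time at the device; svctm, where your iostat still prints it, was meant to isolate device service time, but it is deprecated and unreliable. A growing queue together with a climbing await means requests are waiting in line rather than being slow at the device. Errors are not in iostat; you get them from dmesg, from smartctl -a, and from the reallocated-sector and pending-sector SMART counters. A disk at 100% util with a queue of one is fine; the same disk with a queue of fifty and a climbing await is the bottleneck.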
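Queue depth, request rate, and await are tied together by Little's law (average number in the system = arrival rate × average time in the system), which gives a quick sanity check on any iostat row. A sketch with a hypothetical helper expected_queue; the figures in the usage note are illustrative only:

```shell
# expected_queue IOPS AWAIT_MS
# Little's law: average queue depth = request rate (per second) x average
# time per request (seconds). Hypothetical helper for sanity-checking iostat.
expected_queue() {
    awk -v iops="$1" -v await_ms="$2" 'BEGIN {
        printf "%.1f\n", iops * await_ms / 1000
    }'
}
```

For example, `expected_queue 240 310` prints `74.4`: a device completing 240 requests per second at 310 ms each must be holding about 74 requests in flight. If the aqu-sz you read differs wildly from r/s plus w/s multiplied by await, suspect that you are mixing columns from different intervals or devices.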

Network

Utilisation is throughput against the link's capacity, which sounds simple but is the resource people most often mis-measure, because the real ceiling is frequently the cloud provider's cap rather than the wire speed. Read RX/TX with sar -n DEV 1 or nicstat. Saturation appears as drops on the interface queues and, up at the TCP layer, as retransmits and a growing send buffer; ss -ti shows per-socket retransmit counts and ss -s gives the summary. Errors are interface errors and drops from ip -s link. A NIC sitting at 30% throughput with a steady drip of retransmits is saturated even though utilisation looks comfortable — the kernel is choking on the queue, not the wire.
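Alongside ss, the kernel's cumulative TCP counters give a host-wide retransmit ratio. A sketch, assuming the Linux /proc/net/snmp format (a header line of field names followed by a line of values, both prefixed with `Tcp:`) and a hypothetical helper name tcp_retrans_pct:

```shell
# tcp_retrans_pct: read /proc/net/snmp-style text on stdin and print
# RetransSegs as a percentage of OutSegs. Columns are located by name from
# the header line rather than by fixed position.
tcp_retrans_pct() {
    awk '$1 == "Tcp:" && !seen {                 # first Tcp: line holds the names
             for (i = 2; i <= NF; i++) col[$i] = i
             seen = 1; next
         }
         $1 == "Tcp:" && ("OutSegs" in col) && ("RetransSegs" in col) {
             out = $(col["OutSegs"]) + 0
             re  = $(col["RetransSegs"]) + 0
             if (out > 0) printf "%.2f\n", 100 * re / out
             else         print "0.00"
         }'
}
```

Usage would be `tcp_retrans_pct < /proc/net/snmp`. The counters are totals since boot, so the result is a lifetime average; to see what is happening now, take two readings a few seconds apart and compare the deltas.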

Controllers and interconnects

These are the ones the standard list tends to skip. A storage HBA, a PCIe lane, a NUMA interconnect — each has a ceiling, and each can saturate before the devices behind it do. Utilisation here is throughput against the link's rated bandwidth; saturation is queuing at the controller. The tooling is thinner — perf, vendor counters, NUMA stats from numastat — but the point of listing them is to make you ask the question. The most expensive incidents are usually a resource nobody thought to check, and the interconnect is the canonical example.

File descriptors and locks

These are logical resources, but they queue and they run out, so they're in scope. For file descriptors, utilisation is open descriptors against the ulimit -n ceiling; read it from /proc/PID/limits and count what's in /proc/PID/fd. There's no real saturation — you either have a free descriptor or you don't — and the error is the hard wall: the service starts returning errors when it can't open a socket. For locks, utilisation is the fraction of time the lock is held, saturation is wait time for threads queued on it, and you read both from perf lock at the kernel level or a language profiler (async-profiler, pprof) at the application level. A contended lock is invisible to every hardware metric and only shows up if you put locks on the list.
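The descriptor check described above is two reads from /proc. A sketch, assuming Linux procfs and a hypothetical helper fd_usage; the optional second argument exists only so the function can be pointed at a different root for testing:

```shell
# fd_usage PID [PROC_ROOT]
# Print "OPEN SOFT_LIMIT" for a process: the number of entries in its fd
# directory and the soft "Max open files" limit. Hypothetical helper.
fd_usage() (
    pid="$1"
    root="${2:-/proc}"
    open=$(ls "$root/$pid/fd" | wc -l | tr -d ' ')
    # Line looks like: "Max open files   1024   1048576   files"; field 4 is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "$root/$pid/limits")
    echo "$open $limit"
)
```

`fd_usage 1234` would print something like an open count followed by the limit. Reading another user's /proc/PID/fd usually needs root. A count that climbs steadily towards the limit over hours is the leak signature, well before the service hits the wall.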

The worksheet

A USE pass takes about ten minutes. You fill in a small grid — one row per resource, three columns for the metrics — and look for the first cell where a number is meaningfully above what you'd expect for the workload. That's the bottleneck, or at least the first thing to investigate.

# A worked USE pass on a service box under load.
# Goal: find the resource that's first to give.

# CPU
mpstat -P ALL 1 5
  → core 0:  98% busy   (util high; runq stable at 4 — investigating)
  → core 1:  35% busy
  → core 2:  37% busy
  → core 3:  33% busy
  → conclusion: one core hot, others lightly loaded → probably a hot lock or
    single-threaded code path

# Memory
free -m
  → used: 14 GB / 16 GB ; swap: 0
  → conclusion: fine

# Storage
iostat -xz 1 5
  → sda: %util 12 ; aqu-sz 0.4 ; errors 0 (from dmesg / smartctl, not iostat)
  → conclusion: fine

# Network
sar -n DEV 1 5
  → eth0: rx 14 MB/s ; tx 7 MB/s ; drops 0
  → conclusion: fine

# Locks (app-level)
asprof -e lock -d 30 -f locks.html PID   # async-profiler's launcher; older releases ship profiler.sh
  → 78% of contended-lock time on one synchronized map
  → conclusion: this is the bottleneck

The pass above took eight minutes. It found one core pinned by a contended lock while the other three were lightly loaded. Without USE you might have stared at top's aggregate CPU% and concluded "the CPU is fine, only about 50%"; the per-core view is what catches the single-threaded bottleneck.

Worked example: a saturated disk

Here's a different shape of incident. A reporting service has gone slow. Tail latency on its queries has gone from 200ms to several seconds, but the error rate is flat and the request rate hasn't changed. Nothing is broken; everything is just slow. That last detail — slow but not failing, with no change in load — is the signature of a saturated resource, and USE finds which one.

# CPU
mpstat -P ALL 1 5
  → all cores 10–20% busy, run queue empty
  → conclusion: CPU is idle, not the problem

# Memory
free -m  ;  vmstat 1 5
  → used 9 GB / 32 GB ; si 0 so 0 ; no OOM in dmesg
  → conclusion: fine

# Disk
iostat -xz 1 5
  Device   %util   aqu-sz   await   r/s    w/s
  nvme0n1   99.4    74.0     310     40     200
  → util pegged, queue 74 deep, await 310ms
  → conclusion: this disk is saturated — start here

# confirm errors
smartctl -a /dev/nvme0n1 | grep -i error
  → 0 errors ; SMART healthy
  → conclusion: not a failing disk, just overloaded

The disk is the lead: pegged utilisation, a 74-deep queue, and an await of 310ms mean the average request takes about a third of a second from submission to completion, most of it spent waiting in line behind other requests. That queue is the saturation, and it's what the user feels as multi-second latency. Notice what utilisation alone would have told you: "the disk is 100% busy", which sounds alarming but is normal under load and gives you nothing to act on. The queue depth and await are what say "overloaded," and the SMART check rules out the third axis so you know you're chasing load, not a dying device.

Where it goes from there is no longer a USE question; USE has done its job by naming the resource. The next step is to find what's driving the I/O (iotop, biolatency from bcc) and whether it's a sudden flood of writes, a missing index turning queries into full scans, or a noisy neighbour on shared storage. The relationship between queue depth and latency here is not a coincidence; it's exactly what queueing theory predicts once utilisation gets close to the knee, where each extra point of utilisation adds more waiting time than the last.
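The shape of that knee can be sketched with the simplest queueing model. This assumes an M/M/1 queue (one server, random arrivals, exponentially distributed service times), which real disks with internal parallelism only approximate; under that assumption mean response time is service time divided by (1 − utilisation). The helper name mm1_slowdown is hypothetical:

```shell
# mm1_slowdown RHO
# For an M/M/1 queue, mean response time = service time / (1 - rho).
# Prints the multiplier 1 / (1 - rho) for a utilisation rho in [0, 1).
mm1_slowdown() {
    awk -v rho="$1" 'BEGIN {
        rho += 0
        if (rho < 0 || rho >= 1) { print "undefined"; exit }
        printf "%.1f\n", 1 / (1 - rho)
    }'
}
```

Under this model `mm1_slowdown 0.5` prints `2.0`, `mm1_slowdown 0.9` prints `10.0`, and `mm1_slowdown 0.99` prints `100.0`. The last nine points of utilisation cost ten times more latency than the first ninety, which is why a pegged resource feels so much worse than a merely busy one.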

Worked example: a memory-saturated box

One more, because memory saturation has a distinctive and easily-misread signature. A service has become erratic: mostly fine, then suddenly a burst of very slow requests, then fine again. CPU graphs look spiky but never pinned. The instinct is to blame the CPU or a slow downstream call. USE points elsewhere.

# Memory capacity
free -m
  → used 30.4 GB / 32 GB ; available 600 MB ; swap used 1.8 GB

# Saturation: the tell is paging, not "used"
vmstat 1 10
  procs   memory        swap      io       cpu
  r  b    free   cache   si  so   bi  bo   us sy id wa
  2  3    410M   1.1G    220 480  ...  ... 30 12 18 40
  2  4    380M   1.0G    310 690  ...  ... 28 10 14 48
  → si/so non-zero and bursty ; wa (I/O wait) high
  → conclusion: the box is paging — memory is saturated

# Confirm the worst case
dmesg | grep -i 'killed process'
  → Out of memory: Killed process 4821 (report-worker)
  → conclusion: it has already OOM-killed a worker

The lead here is not the used number — 30 of 32 GB used is not by itself a problem. The lead is the swap-in / swap-out activity (si and so) in vmstat. Once the box runs low on memory the kernel starts paging anonymous memory to disk, and every page fault that has to read from swap turns a nanosecond memory access into a millisecond disk access. That's the source of the bursty slowness and the high I/O wait — the CPU isn't busy, it's blocked waiting for paged-out memory to come back. The OOM kill in dmesg is the saturation event taken to its conclusion: the kernel gave up and killed a process to reclaim memory.

This is a clean illustration of why utilisation and saturation are separate columns. A glance at "memory used" would have said the box was fine right up until it wasn't. The saturation signal — paging — is what was rising the whole time, and it's the metric that would have warned you before the OOM kill. The fix is a capacity question (more memory, a lower per-worker footprint, fewer workers), but USE's contribution is unambiguous: it told you the resource was memory and the mechanism was paging, in about three minutes, without a guess.

Where USE catches things other methods miss

USE's distinguishing trick is being exhaustive. Most performance investigations skip resources because they "feel fine"; USE asks you to check every one. A few patterns it catches that other methods leave alone:

Per-resource imbalance. One CPU pinned, others idle. One disk at 100% util, others at 5%. The aggregate average hides it; the per-resource view doesn't.
Saturation without high utilisation. A NIC with constant retransmits sits at maybe 30% throughput because the kernel is choking on the queue. Utilisation says "fine"; saturation says "investigate".
Silent errors. A SATA disk with read errors will retry reads and remap bad sectors internally; you'll see latency spikes and never see CPU pressure. The errors column catches it.
Resources you forgot existed. File descriptors are the classic — a service that runs fine for six days then starts returning 503 because it leaked sockets. USE's exhaustive list keeps you from skipping it.

USE vs RED — different layers, both needed

USE inspects boxes. RED inspects services. They are not alternatives — they answer different questions, and a complete picture needs both. A service can be slow because the request rate has spiked (RED catches it), or because the box it runs on has a saturated disk (USE catches it). The first failure mode of any new oncall engineer is to use one method when the other one is needed.

	USE	RED
Layer	Hardware resource	Service / endpoint
Question	Is this resource saturated?	Is this service healthy?
Metrics	Utilisation, saturation, errors	Rate, errors, duration
Best for	Single-host investigation	Service fleet, SLO tracking
Coverage	What the kernel can see	What the service emits

The clean way to hold the two together: RED works top-down from the symptom, USE works bottom-up from the hardware. A real incident usually starts with a RED signal — an SLO burning, a latency alert, an error spike on an endpoint — because that's what's wired to alerting. RED tells you which service is unhappy. USE then tells you, on the boxes behind that service, which resource is the cause. Run RED to localise the symptom to a service, then run USE on that service's hosts to find the saturated resource. The two methods meet in the middle, and the place they meet is your root cause.

RED top-down from the unhappy service, USE bottom-up from the hardware. The root cause is where they meet.

Both methods feed the same telemetry pipeline. The utilisation, saturation, and error numbers USE reads by hand on one box are exactly the kind of signals you want shipped as metrics, with logs and traces alongside, so that the manual pass becomes a dashboard and the next incident starts from a graph instead of a terminal. USE is the method you run by hand on the box in front of you; the same checklist, instrumented, is what your monitoring should cover for every box you can't log into in time.

Where USE falls short

USE is a single-host, resource-level method, and that scope is also its limit. It's worth being honest about the cases it handles badly, because a method applied outside its range produces false confidence.

Distributed bottlenecks. When the slow part is the path between services — a slow downstream dependency, a chatty fan-out, a queue between two systems — every individual box can pass USE clean while the request is slow. USE checks resources on one host; it has no view of the request as it crosses hosts. That's a job for tracing and for RED across the fleet.
Software bottlenecks that aren't resources. An O(n²) algorithm, a bad query plan, an unbounded retry loop — these burn a resource, so USE points at the resource, but the resource isn't the problem, the code is. USE tells you the CPU is hot; it can't tell you the CPU is hot because someone wrote a quadratic loop. You need a profiler for that, which is the natural next step after USE names the resource.
Application-level locks and logical limits. USE can include these if you put them on the list, but they're invisible to the hardware tools, so they're easy to omit. A connection pool exhausted at 100 connections is a saturated resource the kernel cannot see at all.
Intermittent and load-correlated faults. A USE pass is a snapshot. A bottleneck that only appears at peak, or only when a specific tenant runs a specific job, may not be present when you run the pass. USE is strongest on a box that is slow right now; for problems that come and go you need the same metrics recorded over time, not sampled by hand.
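For the intermittent case, even a crude recorder is better than a snapshot. A sketch of one way to keep a timestamped history of any of the tools above, assuming a hypothetical helper named stamp and a `date` that supports `+%s` (GNU and BSD both do); a proper metrics pipeline or sar's own history is the longer-term answer:

```shell
# stamp: prefix every line read on stdin with the current Unix time, so the
# output of a sampling tool can be appended to a log and searched later.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s)" "$line"
    done
}
```

Usage might look like `vmstat 5 | stamp >> /var/tmp/vmstat.log`, where the log path is only an example. When the slowness comes back, the rows around that timestamp are the USE pass you could not run by hand.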

None of this makes USE less useful — it makes it a first pass, not the whole investigation. The honest workflow is: RED or a trace to find the unhappy service, USE to find the saturated resource on its hosts, then a profiler to find the code path hitting that resource. USE owns the middle step, and it owns it better than anything else, but it is a step, not the destination.

Production checklist

Print the USE worksheet. One row per resource (CPU, memory, disk, network, FDs, locks); three columns (utilisation, saturation, errors). Keep it pinned somewhere reachable.
Run the pass in order. CPU first, then memory, then disk, then network, then descriptors, then locks. Stop at the first cell that's hot.
Always look per-resource, not aggregate. Per-core CPU, per-disk I/O, per-NIC throughput. Averages hide imbalances.
Saturation matters as much as utilisation. A 100% busy resource with a short queue is normal; a saturated queue is not.
Check error counters even when nothing looks broken. A silent disk error doesn't show up in CPU% but does show up as tail-latency spikes.
Pair with RED. USE for the box, RED for the service. Use both, in that order if the symptom is on a single host; reverse the order if the symptom is fleet-wide.
Promote anything chronic to a dashboard. If the same resource keeps appearing in USE passes, it belongs on a dashboard with an alert — not in oncall's head.
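The worksheet from the first checklist item can be generated rather than drawn. A minimal sketch, assuming a hypothetical helper use_worksheet that takes the resource names as arguments; the column widths are arbitrary:

```shell
# use_worksheet RESOURCE...
# Print an empty USE grid: one row per resource, three blank cells to fill in.
use_worksheet() (
    fmt='%-18s %-12s %-12s %-12s\n'
    printf "$fmt" Resource Utilisation Saturation Errors
    for r in "$@"; do
        printf "$fmt" "$r" '____' '____' '____'
    done
)
```

`use_worksheet CPU Memory Disk Network FDs Locks` prints a header and six rows in the order the pass is run. Add rows for anything specific to the box, such as a connection pool or a storage controller, so they are not skipped.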

Further reading

The RED method