Performance engineering

Performance is a methodology, not a vibe. The methods that work in production — USE for resources, RED for services, top-down for CPU work, roofline for compute kernels, queueing theory for sized systems, latency budgets for distributed requests — are decades old, well-defined, and rarely taught together. This path puts them in one place, with the named tools, the real numbers, and the operational details that turn theory into outage triage.

Methods deep dive

The performance method playbook

Eight methods, end-to-end. Latency budgets & the percentile composition trap; the USE method (utilisation/saturation/errors); RED for services; top-down microarchitecture analysis; the roofline model; queueing theory for engineers; profiling in production; load testing without coordinated omission. Each chapter has the specific tools, the worked numbers, and the production gotchas that aren't in the textbooks.

Twelve mental models

The intuitions every method below relies on. Tagged Day-zero (start here), Practitioner (you tune real systems), Operator (you run them at 3 a.m.).

01 · Day-zero
Latency is a distribution

A single "average" number is a lie. Real systems have P50, P95, P99, P99.9; the tail is where users churn and where SLOs are written.
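
To make the point concrete, the sketch below compares the mean with nearest-rank percentiles on a made-up sample. The sample values and the `percentile` helper are assumptions for illustration, not measurements from a real system.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, min(len(ordered), math.ceil(p * len(ordered) / 100)))
    return ordered[rank - 1]

# Hypothetical sample: 1000 requests, 980 fast (10 ms) and 20 slow (500 ms).
latencies_ms = [10] * 980 + [500] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99, 99.9)}

print(f"mean={mean_ms:.1f} ms", summary)
```

In this sample the mean lands near 20 ms, a latency no request actually experienced, while P99 shows the 500 ms tail that P50 and P95 miss.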
02 · Day-zero
Little's law sizes the system

Concurrency = arrival rate × latency. The thread pool, the queue, the connection pool — all sized by this one equation.
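
As a rough sizing sketch, the equation can be applied directly. The function name and the example traffic figures are assumptions; real pools usually add headroom on top of the result.

```python
import math

def required_concurrency(arrival_rate_per_s, latency_ms):
    """Little's law: in-flight requests = arrival rate x time in system."""
    return math.ceil(arrival_rate_per_s * latency_ms / 1000)

# Hypothetical service: 2000 requests/s, each holding a worker for 50 ms.
pool_size = required_concurrency(2000, 50)
print(pool_size)
```

Use the latency you expect under load, not the idle-system latency: if latency doubles during an incident, the concurrency needed to hold the same arrival rate doubles too.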
03 · Day-zero
Throughput vs latency

Adding capacity reduces latency at high load and is invisible at low load. The relationship is non-linear; the knee is the load you should plan for.
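
One way to see the knee is the textbook M/M/1 queue, where mean response time is service time divided by (1 - utilisation). That model is a simplifying assumption (one server, Poisson arrivals, exponential service times), so read the output as a shape rather than a prediction.

```python
def mm1_response_time(service_time, utilisation):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return service_time / (1 - utilisation)

# Response time as a multiple of the bare service time at rising utilisation.
multipliers = {rho: round(mm1_response_time(1.0, rho), 1)
               for rho in (0.5, 0.8, 0.9, 0.95)}
print(multipliers)  # roughly 2x, 5x, 10x, 20x the service time
```

Going from 50% to 80% utilisation costs about as much latency as going from 90% to 95%, which is why capacity added below the knee is hard to notice and capacity added above it is dramatic.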
04 · Practitioner
USE for resources

Utilisation, saturation, errors — for every CPU, disk, NIC, lock. Brendan Gregg's checklist that catches the box-side problems.
05 · Practitioner
RED for services

Rate, errors, duration — per service. The orthogonal-to-USE checklist that catches the request-side problems.
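
A minimal sketch of the three RED numbers computed from raw request records for one time window. The `RequestRecord` shape and the "status >= 500 is an error" rule are assumptions; in practice a metrics system such as Prometheus would compute these from counters and histograms.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    duration_ms: float
    status: int  # HTTP-style status code (assumed convention)

def red_summary(records, window_s):
    """Rate (req/s), error ratio, and nearest-rank P99 duration for one window."""
    if not records:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": None}
    durations = sorted(r.duration_ms for r in records)
    n = len(durations)
    rank = -(-99 * n // 100)  # ceil(0.99 * n) using integer arithmetic
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n,
        "p99_ms": durations[rank - 1],
    }

# Hypothetical 10-second window: 195 fast successes, 5 slow server errors.
sample = [RequestRecord(20.0, 200)] * 195 + [RequestRecord(900.0, 503)] * 5
print(red_summary(sample, window_s=10))
```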
06 · Practitioner
Top-down for CPU work

Front-end stalls, back-end stalls, bad speculation, retiring. Find which of those four buckets is dominating before you tune anything.
07 · Practitioner
Roofline for compute kernels

Operational intensity vs peak compute and bandwidth. Tells you whether you're memory-bound or compute-bound — and which tuning is worth the time.
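
The roofline bound itself is one line: attainable throughput is the smaller of peak compute and bandwidth times operational intensity. The machine figures below are hypothetical round numbers, not a real part.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Roofline bound: min(compute roof, memory roof at this intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

def ridge_point(peak_gflops, bandwidth_gbs):
    """Intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gbs

# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BW = 1000.0, 100.0

for intensity in (0.25, 4.0, 32.0):
    bound = attainable_gflops(PEAK, BW, intensity)
    regime = "memory-bound" if intensity < ridge_point(PEAK, BW) else "compute-bound"
    print(f"{intensity:>5} FLOP/byte -> {bound:.0f} GFLOP/s ({regime})")
```

A kernel at 0.25 FLOP/byte on this machine tops out at 25 GFLOP/s no matter how well it vectorises, so the useful work is reducing bytes moved, not instructions.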
08 · Practitioner
Latency budgets propagate

A request's deadline is the only one that matters. Budgets cascade down through services; per-hop timeouts without budget propagation are a bug.
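
A sketch of what budget propagation could look like inside one process. The `Deadline` class and its method names are hypothetical; frameworks such as gRPC carry an equivalent deadline across service boundaries for you.

```python
import time

class Deadline:
    """Tracks one request's overall deadline so each hop gets only what is left."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_hop(self, hop_cap_s):
        """Timeout for the next call: the per-hop cap, clipped to the remaining budget."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded before the call was made")
        return min(hop_cap_s, remaining)

# Usage: a 1-second request budget with a 500 ms per-hop cap.
deadline = Deadline(budget_s=1.0)
first_hop_timeout = deadline.timeout_for_hop(0.5)
```

Without the clipping step, three sequential hops with a 500 ms timeout each can run for 1.5 s against a 1 s request deadline, doing work nobody is waiting for.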
09 · Operator
Coordinated omission breaks load tests

Closed-loop tests under-report tail latency dramatically. wrk2, vegeta, k6 are the tools that fix this; closed-loop locust is the tool that hides it.
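
The effect can be reproduced with a small simulation. It assumes a single-connection load generator that intends to send one request every `interval_ms` but waits for each response before sending the next, and a made-up service with one long stall; all names and numbers are illustrative.

```python
def measure(service_times_ms, interval_ms):
    """Return (timed_from_actual_send, timed_from_intended_send) latency lists."""
    from_actual, from_intended = [], []
    free_at = 0.0  # time at which the connection is next available
    for i, service in enumerate(service_times_ms):
        intended = i * interval_ms
        start = max(intended, free_at)  # blocked behind the previous response
        finish = start + service
        from_actual.append(finish - start)       # what a closed-loop tool records
        from_intended.append(finish - intended)  # what a caller on schedule would see
        free_at = finish
    return from_actual, from_intended

# Hypothetical run: 100 requests at 1 ms each, except one 1000 ms stall.
service = [1.0] * 100
service[10] = 1000.0

closed, corrected = measure(service, interval_ms=10.0)
slow_closed = sum(1 for x in closed if x > 100)
slow_corrected = sum(1 for x in corrected if x > 100)
print(slow_closed, slow_corrected)
```

Timed from the actual send, the stall shows up as a single slow sample; timed from the intended send, every request queued behind it is slow as well, which is the view an open-loop tool reports.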
10 · Operator
P99-of-N is not P99

Ten independent calls, each slow 1% of the time, give a ~10% chance that at least one is slow. A request's tail is set by its slowest dependency, and it gets worse as fan-out grows.
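
The arithmetic behind this, assuming the calls are independent (real dependencies often are not, which usually makes things worse): the chance that at least one of N calls is slow is 1 - (1 - p)^N. The function names are illustrative.

```python
def p_any_slow(p_slow, fan_out):
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1 - (1 - p_slow) ** fan_out

def per_call_tail_for(target_p_slow, fan_out):
    """Per-call slow probability needed to hit a request-level target."""
    return 1 - (1 - target_p_slow) ** (1 / fan_out)

for n in (1, 10, 100):
    print(n, round(p_any_slow(0.01, n), 3))

needed = per_call_tail_for(0.01, 10)
print(f"per-call slow probability needed: {needed:.5f}")
```

With a fan-out of 10, keeping the whole request at P99 requires each dependency to meet the same threshold at roughly P99.9.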
11 · Operator
SLO + error budget = guardrails

The SLO sets what "good" means; the error budget sets how much risk you can spend. Together they are the contract between product, eng, and oncall.
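
A short sketch of the budget arithmetic. The 30-day window, the availability target, and the function names are assumptions chosen for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability a time-based SLO allows in the window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures

print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days at 99.9%")
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```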
12 · Operator
Profile in production

Synthetic benchmarks miss real cache state, real GC, real config. Continuous profiling — pprof, async-profiler, Pyroscope — catches what local benches never will.

Latency you should keep in your head

Norvig's "latency numbers every programmer should know," updated for 2026 silicon and modern infrastructure. The single most useful flashcard in performance engineering — memorise these and the magnitude of every choice becomes obvious.

Operation	Time	Cycles @ 3 GHz
L1 cache hit	~1 ns	~3
L2 cache hit	~3–5 ns	~9–15
L3 cache hit	~12–15 ns	~36–45
DRAM (same socket)	~80 ns	~250
DRAM (remote NUMA)	~140 ns	~420
NVMe Gen5 random read 4 KB	~10 µs	~30K
Same-DC RTT	~0.5 ms	~1.5M
Same-region RTT	~1–2 ms	~3–6M
HDD seek	~5 ms	~15M
TLS handshake (1-RTT)	~30 ms	~90M
Cross-region RTT (US east↔west)	~70 ms	~210M
Intercontinental RTT	~150 ms	~450M

Eight orders of magnitude separate L1 from intercontinental RTT. The single most-impactful tuning decision in any system is "where does the data live" — the rest is rounding error.
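
The cycle column and the "eight orders of magnitude" figure both follow from simple arithmetic, sketched here with the table's 3 GHz clock. The helper name is illustrative.

```python
import math

def cycles(time_ns, clock_ghz=3.0):
    """Clock cycles elapsed in `time_ns` nanoseconds (1 GHz = 1 cycle per ns)."""
    return time_ns * clock_ghz

dram_cycles = cycles(80)             # same-socket DRAM access
nvme_cycles = cycles(10_000)         # 10 microseconds
rtt_cycles = cycles(150_000_000)     # 150 ms intercontinental round trip

# Ratio between the slowest and fastest rows, in powers of ten.
orders_of_magnitude = math.log10(150_000_000 / 1)
print(dram_cycles, nvme_cycles, rtt_cycles, round(orders_of_magnitude, 1))
```

One intercontinental round trip costs about as many cycles as 150 million L1 hits, which is why data placement dominates most other tuning choices.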

Methods, in one sentence each

Method	Use it for	Tools
USE	Resource health — every CPU, disk, NIC, lock	`top`, `vmstat`, `iostat`, `sar`
RED	Service health — rate, errors, duration per endpoint	Prometheus + dashboards, OpenTelemetry, Grafana
Top-down	Where CPU cycles are actually going	`perf`, `toplev.py`, Intel VTune, AMD uProf
Roofline	Memory-bound vs compute-bound — for compute kernels	Intel Advisor, ERT, custom benchmarks
Latency budgets	Allocating P99 across a request chain	Distributed tracing — Jaeger, Tempo, Honeycomb
Queueing theory	Sizing thread pools, queues, connection pools	Little's law, M/M/1, M/M/c — pen and paper
Profiling	Where time/memory/locks go in real production code	`pprof`, `async-profiler`, Pyroscope, Parca
Load testing	Pre-prod capacity validation; finding the knee	`k6`, `vegeta`, `wrk2` — open-loop only

Books, courses, papers, talks

Brendan Gregg — Systems Performance (2nd ed). The textbook. Chapter 2 (methodologies) and chapter 6 (CPUs) are the core; the rest is the per-resource reference.
Brendan Gregg — BPF Performance Tools. The eBPF era's manual. Chapter-per-tool, chapter-per-domain. Pair with the running fleet of bcc and bpftrace recipes.
Beyer et al — Site Reliability Engineering & The Site Reliability Workbook. Free from Google. Chapters on SLOs, error budgets, and overload are required.
Henderson — Building Scalable Web Sites. Older but the chapters on capacity planning are still the cleanest writeup of the math.
Yasin — "A Top-Down Method for Performance Analysis and Counters Architecture" (ISPASS 2014, Intel). The source paper for top-down. Read once before using toplev.py.
Williams, Waterman, Patterson — "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (CACM 2009). The roofline paper.
Tilkov & Vinoski — "Node.js: Using JavaScript to Build High-Performance Network Programs" (IEEE Internet Computing). Tangential but the section on the event loop is widely cited.
Talks: Brendan Gregg — "Linux Performance Tools" (USENIX); Bryan Cantrill — "Surge — Welcome to the Jungle"; Gil Tene — "How NOT to Measure Latency" (Strange Loop), the coordinated-omission talk behind wrk2.

Hands-on tools

What to install and what to point it at. The tooling list is short; the depth is in knowing which to reach for and what to read in the output.

System-level: perf, strace, ltrace, vmstat, iostat, sar, numastat, mpstat, pidstat.
eBPF / BCC / bpftrace: biolatency, execsnoop, opensnoop, tcpconnect, profile, stackcount. Each one a focused single-purpose tool.
Profilers: pprof (Go), async-profiler (JVM), py-spy (Python), rbspy (Ruby), perf record (any). Continuous profilers — Pyroscope, Parca, Datadog Continuous — for production fleets.
Tracing: OpenTelemetry SDK + Jaeger, Tempo, Honeycomb, or Datadog APM. Trace one request from edge to database before declaring the design "done".
Load: k6 (with a constant-arrival-rate executor), wrk2, vegeta for open-loop load. locust if you need a Python harness with custom scenarios.
Visualisation: Brendan Gregg's flame graphs (flamegraph.pl), Grafana dashboards, trace waterfall views (for example in Jaeger or Tempo) for trace timelines.

Eight common mistakes

Reasoning from averages. Mean latency tells you almost nothing about user experience, because it hides the tail that users actually hit. Track P50, P95, P99, P99.9 separately.
Trusting closed-loop load tests. Locust without coordinated-omission correction can report tail latencies 10–100× lower than reality, because requests that should have been sent during a stall are never issued or timed. Use wrk2, vegeta, or k6 with constant arrival rate.
Skipping the resource baseline. "It's slow" with no USE/RED dashboard is unanswerable. Establish baselines first; tune later.
Tuning without measuring. The classic — adding indexes, caches, or threads without a profile. Profile first; the bottleneck is rarely where you guess.
Confusing throughput and latency. Adding capacity at low load doesn't reduce latency. The two trade off only above the knee.
Ignoring the deadline. Per-hop timeouts add up to more than the request's deadline, so work continues after the user gave up. Propagate deadlines explicitly.
Hot path on the SQL planner. A query plan that picks a bad index in production silently kills P99. Check the plan with EXPLAIN; plan stability matters, and so do up-to-date planner statistics.
Optimising before the system runs at scale. Real cache state, real GC behaviour, real contention only show up in production. Local benchmarks often optimise the wrong thing.

Adjacent paths

Computer architecture. The silicon under all of this. Caches, TLB, branch prediction, NUMA — the layer you're tuning against.
Operating systems. The scheduler, the page cache, the syscall barrier, eBPF — the OS-level tools listed above.
Back-pressure, retries, hedging, deadlines. The reliability primitives that depend on the methods here.
System design. Where capacity math meets architecture; performance methods are how you defend the design under questioning.

Continue

Open the methods directory

Eight deep dives covering USE, RED, top-down, roofline, latency budgets, queueing theory, profiling, and load testing.
