Performance engineering
Performance is a methodology, not a vibe. The methods that work in production — USE for resources, RED for services, top-down for CPU work, roofline for compute kernels, queueing theory for sized systems, latency budgets for distributed requests — are decades old, well-defined, and rarely taught together. This path puts them in one place, with the named tools, the real numbers, and the operational details that turn theory into outage triage.
Twelve mental models
The intuitions every method below relies on. Tagged Day-zero (start here), Practitioner (you tune real systems), Operator (you run them at 3 a.m.).
- 01 · Day-zero
Latency is a distribution
A single "average" number is a lie. Real systems have P50, P95, P99, P99.9; the tail is where users churn and where SLOs are written.
- 02 · Day-zero
Little's law sizes the system
Concurrency = arrival rate × latency. The thread pool, the queue, the connection pool — all sized by this one equation.
- 03 · Day-zero
Throughput vs latency
Adding capacity reduces latency at high load and is invisible at low load. The relationship is non-linear: latency stays nearly flat until utilisation approaches a knee, then climbs steeply. Find the knee and plan capacity around it.
- 04 · Practitioner
USE for resources
Utilisation, saturation, errors — for every CPU, disk, NIC, lock. Brendan Gregg's checklist that catches the box-side problems.
- 05 · Practitioner
RED for services
Rate, errors, duration — per service. The orthogonal-to-USE checklist that catches the request-side problems.
- 06 · Practitioner
Top-down for CPU work
Front-end stalls, back-end stalls, bad speculation, retiring. Find which of those four buckets is dominating before you tune anything.
- 07 · Practitioner
Roofline for compute kernels
Operational intensity vs peak compute and bandwidth. Tells you whether you're memory-bound or compute-bound — and which tuning is worth the time.
- 08 · Practitioner
Latency budgets propagate
A request's deadline is the only one that matters. Budgets cascade down through services; per-hop timeouts without budget propagation are a bug.
- 09 · Operator
Coordinated omission breaks load tests
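Closed-loop tests under-report tail latency dramatically, because a client that waits for each response sends fewer requests exactly when the server is slow. wrk2, vegeta, and k6 (with a constant-arrival-rate scenario) are the tools that fix this; closed-loop locust is the tool that hides it.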
- 10 · Operator
P99-of-N is not P99
Ten independent calls each at 1% slow → ~10% chance any is slow. The tail of the system is the tail of its slowest dependency, fan-out-aware.
- 11 · Operator
SLO + error budget = guardrails
The SLO sets what "good" means; the error budget sets how much risk you can spend. Together they are the contract between product, eng, and oncall.
- 12 · Operator
Profile in production
Synthetic benchmarks miss real cache state, real GC, real config. Continuous profiling — pprof, async-profiler, Pyroscope — catches what local benches never will.
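Three of the models above (latency as a distribution, Little's law, and P99-of-N) reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names and the sample numbers (98 fast requests and 2 slow ones, 2,000 req/s at 50 ms, ten calls at 1% slow) are assumptions made up for the example, not measurements.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def required_concurrency(arrival_rate_per_s, latency_s):
    """Little's law: mean requests in flight = arrival rate x mean latency."""
    return arrival_rate_per_s * latency_s


def p_any_slow(p_slow_per_call, fan_out):
    """Probability that at least one of `fan_out` independent calls is slow."""
    return 1 - (1 - p_slow_per_call) ** fan_out


# Hypothetical sample: 98 requests at 10 ms, 2 requests at 1000 ms.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
in_flight = required_concurrency(2000, 0.050)
tail_risk = p_any_slow(0.01, 10)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
print(f"mean in flight at 2000 req/s and 50 ms: {in_flight:.0f}")
print(f"chance at least one of 10 calls is slow: {tail_risk:.1%}")
```

In this sample the mean (about 30 ms) matches no request that actually happened, while P99 lands on the slow ones. The Little's law result is a mean, so a real pool sized from it still needs headroom for bursts.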
Latency you should keep in your head
The "latency numbers every programmer should know" list, popularised by Jeff Dean and Peter Norvig, updated for 2026 silicon and modern infrastructure. The single most useful flashcard in performance engineering: memorise these and the magnitude of every choice becomes obvious.
| Operation | Time | Cycles @ 3 GHz |
|---|---|---|
| L1 cache hit | ~1 ns | 3 |
| L2 cache hit | ~3–5 ns | ~9–15 |
| L3 cache hit | ~12–15 ns | ~36–45 |
| DRAM (same socket) | ~80 ns | ~250 |
| DRAM (remote NUMA) | ~140 ns | ~420 |
| NVMe Gen5 random read 4 KB | ~10 µs | ~30K |
| Same-DC RTT | ~0.5 ms | ~1.5M |
| Same-region RTT | ~1–2 ms | ~3–6M |
| HDD seek | ~5 ms | ~15M |
| TLS handshake (1-RTT) | ~30 ms | ~90M |
| Cross-region RTT (US east↔west) | ~70 ms | ~210M |
| Intercontinental RTT | ~150 ms | ~450M |
Eight orders of magnitude separate L1 from intercontinental RTT. The single most-impactful tuning decision in any system is "where does the data live" — the rest is rounding error.
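The cycles column is time multiplied by clock rate, and the orders-of-magnitude claim is the base-10 logarithm of a ratio. A minimal sketch, assuming the 3 GHz clock from the table header; the function and constant names are made up for illustration.

```python
import math

CLOCK_HZ = 3e9  # assumed clock rate, taken from the table header


def cycles(seconds, clock_hz=CLOCK_HZ):
    """Convert a latency in seconds to CPU cycles at the given clock rate."""
    return seconds * clock_hz


L1_HIT_S = 1e-9                  # ~1 ns, from the table
DRAM_LOCAL_S = 80e-9             # ~80 ns, from the table
INTERCONTINENTAL_RTT_S = 150e-3  # ~150 ms, from the table

dram_cycles = cycles(DRAM_LOCAL_S)
orders_of_magnitude = math.log10(INTERCONTINENTAL_RTT_S / L1_HIT_S)

print(f"local DRAM access: about {dram_cycles:.0f} cycles")
print(f"L1 hit to intercontinental RTT: about {orders_of_magnitude:.1f} orders of magnitude")
```

Put differently, one local DRAM miss costs roughly as much as a few hundred cycles of useful work, and one intercontinental round trip costs roughly as much as 150 million L1 hits.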
Methods, in one sentence each
| Method | Use it for | Tools |
|---|---|---|
| USE | Resource health — every CPU, disk, NIC, lock | top, vmstat, iostat, sar |
| RED | Service health — rate, errors, duration per endpoint | Prometheus + dashboards, OpenTelemetry, Grafana |
| Top-down | Where CPU cycles are actually going | perf, toplev.py, Intel VTune, AMD uProf |
| Roofline | Memory-bound vs compute-bound — for compute kernels | Intel Advisor, ERT, custom benchmarks |
| Latency budgets | Allocating P99 across a request chain | Distributed tracing — Jaeger, Tempo, Honeycomb |
| Queueing theory | Sizing thread pools, queues, connection pools | Little's law, M/M/1, M/M/c — pen and paper |
| Profiling | Where time/memory/locks go in real production code | pprof, async-profiler, Pyroscope, Parca |
| Load testing | Pre-prod capacity validation; finding the knee | k6, vegeta, wrk2 — open-loop only |
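For the queueing-theory row, the M/M/1 mean response time formula, W = S / (1 − ρ), shows where the knee comes from. This is a rough sketch under textbook assumptions (a single server, Poisson arrivals, exponential service times) that real services rarely match exactly; the 10 ms service time and the function name are made-up values for illustration.

```python
def mm1_response_time(service_time_s, utilisation):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1); at 1 the queue grows without bound")
    return service_time_s / (1 - utilisation)


SERVICE_TIME_S = 0.010  # hypothetical 10 ms mean service time

for rho in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    w_ms = mm1_response_time(SERVICE_TIME_S, rho) * 1000
    print(f"utilisation {rho:.0%}: mean response time {w_ms:.0f} ms")
```

Under this model, mean response time doubles at 50% utilisation, is ten times the service time at 90%, and a hundred times at 99%. That steep bend is why capacity plans usually target a utilisation well below 100%.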
Books, courses, papers, talks
- Brendan Gregg — Systems Performance (2nd ed). The textbook. Chapter 2 (methodologies) and chapter 6 (CPUs) are the core; the rest is the per-resource reference.
- Brendan Gregg — BPF Performance Tools. The eBPF era's manual. Chapter-per-tool, chapter-per-domain. Pair with the running fleet of bcc and bpftrace recipes.
- Beyer et al — Site Reliability Engineering & The Site Reliability Workbook. Free from Google. Chapters on SLOs, error budgets, and overload are required.
- Henderson — Building Scalable Web Sites. Older but the chapters on capacity planning are still the cleanest writeup of the math.
- Yasin — A Top-Down method for performance analysis (Intel paper). The source paper for top-down. Read once before using toplev.py.
- Williams, Waterman, Patterson — "Roofline: An Insightful Visual Performance Model" (CACM 2009). The roofline paper.
- Tilkov & Vinoski — "Node.js: Using JavaScript to Build High-Performance Network Programs" (IEEE Internet Computing). Tangential but the section on the event loop is widely cited.
- Talks: Brendan Gregg — "Linux Performance Tools" (USENIX); Bryan Cantrill — "Surge — Welcome to the Jungle"; Gil Tene — "wrk2" (Strange Loop).
Hands-on tools
What to install and what to point it at. The tooling list is short; the depth is in knowing which to reach for and what to read in the output.
- System-level: perf, strace, ltrace, vmstat, iostat, sar, numastat, mpstat, pidstat.
- eBPF / BCC / bpftrace: biolatency, execsnoop, opensnoop, tcpconnect, profile, stackcount. Each one a focused single-purpose tool.
- Profilers: pprof (Go), async-profiler (JVM), py-spy (Python), rbspy (Ruby), perf record (any). Continuous profilers (Pyroscope, Parca, Datadog Continuous Profiler) for production fleets.
- Tracing: OpenTelemetry SDK + Jaeger, Tempo, Honeycomb, or Datadog APM. Trace one request from edge to database before declaring the design "done".
- Load: k6, wrk2, vegeta for open-loop. locust if you need a Python harness with custom scenarios.
- Visualisation: Brendan Gregg's flame graphs (flamegraph.pl), Grafana dashboards, waterfall views in the tracing UI for trace timelines.
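The open-loop versus closed-loop distinction in the load tools above is easiest to see in a toy simulation. The sketch below is not how any of the named tools is implemented. It assumes a single blocking client, a single server, integer microsecond timestamps, and a made-up one-second server stall. It records each latency twice: from the actual send time (what a naive closed-loop client sees) and from the intended send time (the coordinated-omission-corrected view).

```python
import math

INTERVAL_US = 10_000        # intended schedule: one request every 10 ms
BASE_SERVICE_US = 1_000     # normal service time: 1 ms
STALL_START_US = 1_000_000  # hypothetical server stall from t = 1 s ...
STALL_END_US = 2_000_000    # ... until t = 2 s
N_REQUESTS = 300


def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def simulate():
    """Return (naive, corrected) latency lists in microseconds."""
    naive, corrected = [], []
    free_at = 0  # earliest time the blocking client can send again
    for i in range(N_REQUESTS):
        intended = i * INTERVAL_US
        sent = max(intended, free_at)  # a blocked client sends late
        if STALL_START_US <= sent < STALL_END_US:
            done = STALL_END_US + BASE_SERVICE_US  # request waits out the stall
        else:
            done = sent + BASE_SERVICE_US
        naive.append(done - sent)          # clock starts at the actual send
        corrected.append(done - intended)  # clock starts at the intended send
        free_at = done
    return naive, corrected


naive, corrected = simulate()
naive_p99_ms = percentile(naive, 99) / 1000
corrected_p99_ms = percentile(corrected, 99) / 1000

print(f"naive closed-loop P99: {naive_p99_ms} ms")
print(f"corrected P99:         {corrected_p99_ms} ms")
```

With these made-up numbers the naive P99 is 1.0 ms and the corrected P99 is 974.0 ms. The naive client logs a single slow sample for the stall, while the requests that should have been sent during the stall never appear as slow.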
Eight common mistakes
- Reasoning from averages. Mean latency tells you almost nothing about user experience. Track P50, P95, P99, P99.9 separately. Mean is the worst summary.
- Trusting closed-loop load tests. Locust in its default closed-loop mode, without coordinated-omission correction, can report tail latencies that are 10–100× lower than reality. Use wrk2, vegeta, or k6 with constant arrival rate.
- Skipping the resource baseline. "It's slow" with no USE/RED dashboard is unanswerable. Establish baselines first; tune later.
- Tuning without measuring. The classic — adding indexes, caches, or threads without a profile. Profile first; the bottleneck is rarely where you guess.
- Confusing throughput and latency. Adding capacity at low load doesn't reduce latency. The two trade off only above the knee.
- Ignoring the deadline. Per-hop timeouts can add up to more than the request's deadline, so work continues after the user has given up. Propagate deadlines explicitly.
- Hot path on the SQL planner. A query planner that picks a bad index in production silently kills P99; EXPLAIN is how you see the plan it chose. Plan stability matters; planner statistics matter.
- Optimising before the system runs at scale. Real cache state, real GC behaviour, real contention only show up in production. Local benchmarks often optimise the wrong thing.
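The deadline mistake above can be made concrete with a small sketch of budget propagation. Everything here is hypothetical: `Deadline`, `call_chain`, the hop names, and the fake clock are invented for illustration. Real RPC frameworks carry the deadline in request metadata instead of passing an object around.

```python
import time


class Deadline:
    """An absolute deadline shared by every hop of one request (illustrative helper)."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left in the request's budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0.0

    def hop_timeout(self, per_hop_cap_s):
        """Timeout for the next hop: its own cap or what is left, whichever is smaller."""
        return min(per_hop_cap_s, self.remaining())


def call_chain(deadline, hops):
    """Walk (name, per_hop_cap_s, do_call) tuples, skipping work once the budget is spent."""
    results = []
    for name, cap_s, do_call in hops:
        if deadline.expired():
            results.append((name, "skipped: deadline exceeded"))
            continue
        results.append((name, do_call(deadline.hop_timeout(cap_s))))
    return results


# Demo with a fake clock so the example is deterministic.
now = [0.0]
deadline = Deadline(1.0, clock=lambda: now[0])


def slow_call(timeout_s):
    now[0] += timeout_s  # pretend the call used its whole timeout
    return f"timed out after {timeout_s:.1f}s"


results = call_chain(
    deadline,
    [("auth", 0.5, slow_call), ("db", 0.8, slow_call), ("cache", 0.5, slow_call)],
)
print(results)
```

The three per-hop caps sum to 1.8 s against a 1.0 s budget. With propagation, the second hop is cut to the 0.5 s that remains and the third is never started, so no work continues after the caller has given up.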
Adjacent paths
- Computer architecture. The silicon under all of this. Caches, TLB, branch prediction, NUMA — the layer you're tuning against.
- Operating systems. The scheduler, the page cache, the syscall barrier, eBPF — the OS-level tools listed above.
- Back-pressure, retries, hedging, deadlines. The reliability primitives that depend on the methods here.
- System design. Where capacity math meets architecture; performance methods are how you defend the design under questioning.