Methods · diagnostics · SLO
Study path / 10

Performance engineering

Performance is a methodology, not a vibe. The methods that work in production — USE for resources, RED for services, top-down for CPU work, roofline for compute kernels, queueing theory for sized systems, latency budgets for distributed requests — are decades old, well-defined, and rarely taught together. This path puts them in one place, with the named tools, the real numbers, and the operational details that turn theory into outage triage.


Twelve mental models

The intuitions every method below relies on. Tagged Day-zero (start here), Practitioner (you tune real systems), Operator (you run them at 3 a.m.).

  1. 01 · Day-zero

    Latency is a distribution

    A single "average" number is a lie. Real systems have P50, P95, P99, P99.9; the tail is where users churn and where SLOs are written.

  2. 02 · Day-zero

    Little's law sizes the system

    Concurrency = arrival rate × latency. The thread pool, the queue, the connection pool — all sized by this one equation.

  3. 03 · Day-zero

    Throughput vs latency

    Adding capacity reduces latency at high load and is invisible at low load. The relationship is non-linear: latency stays nearly flat until utilisation approaches a knee, then climbs steeply. Plan capacity so that peak load stays below that knee.

  4. 04 · Practitioner

    USE for resources

    Utilisation, saturation, errors — for every CPU, disk, NIC, lock. Brendan Gregg's checklist that catches the box-side problems.

  5. 05 · Practitioner

    RED for services

    Rate, errors, duration — per service. The orthogonal-to-USE checklist that catches the request-side problems.

  6. 06 · Practitioner

    Top-down for CPU work

    Front-end stalls, back-end stalls, bad speculation, retiring. Find which of those four buckets is dominating before you tune anything.

  7. 07 · Practitioner

    Roofline for compute kernels

    Operational intensity vs peak compute and bandwidth. Tells you whether you're memory-bound or compute-bound — and which tuning is worth the time.

  8. 08 · Practitioner

    Latency budgets propagate

    A request's deadline is the only one that matters. Budgets cascade down through services; per-hop timeouts without budget propagation are a bug.

  9. 09 · Operator

    Coordinated omission breaks load tests

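    Closed-loop tests under-report tail latency dramatically: the client waits for each response before sending the next, so a stall suppresses exactly the requests that would have measured it. wrk2, vegeta, and k6 (with a constant arrival rate) fix this; closed-loop locust hides it.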

  10. 10 · Operator

    P99-of-N is not P99

    Ten independent calls, each slow 1% of the time, give a 1 − 0.99^10 ≈ 9.6% chance that at least one is slow. The tail of the system is set by its slowest dependency and amplified by fan-out.

  11. 11 · Operator

    SLO + error budget = guardrails

    The SLO sets what "good" means; the error budget sets how much risk you can spend. Together they are the contract between product, eng, and oncall.

  12. 12 · Operator

    Profile in production

    Synthetic benchmarks miss real cache state, real GC, real config. Continuous profiling — pprof, async-profiler, Pyroscope — catches what local benches never will.
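
Three of the models above (latency as a distribution, Little's law, and P99-of-N) reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names, the nearest-rank percentile definition, and the sample numbers are assumptions made for the example, not measurements from a real system.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def required_concurrency(arrival_rate_per_s, mean_latency_s):
    """Little's law: requests in flight = arrival rate x mean latency."""
    return arrival_rate_per_s * mean_latency_s


def prob_any_slow(per_call_slow_prob, fan_out):
    """Chance that at least one of fan_out independent calls is slow."""
    return 1.0 - (1.0 - per_call_slow_prob) ** fan_out


if __name__ == "__main__":
    # Made-up latencies in ms: 98 fast requests and 2 slow ones.
    latencies_ms = [10.0] * 98 + [500.0] * 2
    print("mean:", statistics.mean(latencies_ms))
    print("P50 :", percentile(latencies_ms, 50))
    print("P99 :", percentile(latencies_ms, 99))

    # Hypothetical service: 2,000 req/s at 50 ms mean latency.
    print("in flight:", round(required_concurrency(2000, 0.050)))

    # Ten independent calls, each slow 1% of the time.
    print(f"P(any slow): {prob_any_slow(0.01, 10):.3f}")
```

In this toy data set the mean sits close to the fast requests while P99 lands on the slow ones, which is why the percentiles are tracked separately. The Little's law result is a lower bound for sizing a pool at that load; real pools need headroom on top of it.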

Latency you should keep in your head

The "latency numbers every programmer should know" list, in the lineage of Peter Norvig's and Jeff Dean's tables, updated for 2026 silicon and modern infrastructure. It is the single most useful flashcard in performance engineering: memorise these and the magnitude of every choice becomes obvious.

| Operation | Time | Cycles @ 3 GHz |
| --- | --- | --- |
| L1 cache hit | ~1 ns | ~3 |
| L2 cache hit | ~3–5 ns | ~9–15 |
| L3 cache hit | ~12–15 ns | ~36–45 |
| DRAM (same socket) | ~80 ns | ~240 |
| DRAM (remote NUMA) | ~140 ns | ~420 |
| NVMe Gen5 random read 4 KB | ~10 µs | ~30K |
| Same-DC RTT | ~0.5 ms | ~1.5M |
| Same-region RTT | ~1–2 ms | ~3–6M |
| HDD seek | ~5 ms | ~15M |
| TLS handshake (1-RTT) | ~30 ms | ~90M |
| Cross-region RTT (US east↔west) | ~70 ms | ~210M |
| Intercontinental RTT | ~150 ms | ~450M |

Eight orders of magnitude separate L1 from intercontinental RTT. The single most-impactful tuning decision in any system is "where does the data live" — the rest is rounding error.
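
The cycles column is just the time multiplied by the 3 GHz reference clock, and the "eight orders of magnitude" figure is a base-10 logarithm of a ratio. A minimal sketch, using a handful of rows from the table; the dictionary and function names are made up for this example.

```python
import math

CLOCK_HZ = 3.0e9  # the 3 GHz reference clock from the table header

# A few rows from the table above, converted to seconds.
LATENCY_S = {
    "L1 cache hit": 1e-9,
    "DRAM (same socket)": 80e-9,
    "NVMe Gen5 random read 4 KB": 10e-6,
    "Same-DC RTT": 0.5e-3,
    "Intercontinental RTT": 150e-3,
}


def cycles(seconds, clock_hz=CLOCK_HZ):
    """CPU cycles that elapse while waiting `seconds` at `clock_hz`."""
    return seconds * clock_hz


def orders_of_magnitude(fast_s, slow_s):
    """Base-10 gap between two latencies."""
    return math.log10(slow_s / fast_s)


if __name__ == "__main__":
    for name, seconds in LATENCY_S.items():
        print(f"{name:30s} {cycles(seconds):>15,.0f} cycles")
    gap = orders_of_magnitude(
        LATENCY_S["L1 cache hit"], LATENCY_S["Intercontinental RTT"]
    )
    print(f"L1 -> intercontinental: {gap:.1f} orders of magnitude")
```

Thinking in cycles rather than seconds makes the cost of a remote call concrete: it is the amount of local work the core could have done while waiting.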

Methods, in one sentence each

| Method | Use it for | Tools |
| --- | --- | --- |
| USE | Resource health: every CPU, disk, NIC, lock | top, vmstat, iostat, sar |
| RED | Service health: rate, errors, duration per endpoint | Prometheus + dashboards, OpenTelemetry, Grafana |
| Top-down | Where CPU cycles are actually going | perf, toplev.py, Intel VTune, AMD uProf |
| Roofline | Memory-bound vs compute-bound, for compute kernels | Intel Advisor, ERT, custom benchmarks |
| Latency budgets | Allocating P99 across a request chain | Distributed tracing: Jaeger, Tempo, Honeycomb |
| Queueing theory | Sizing thread pools, queues, connection pools | Little's law, M/M/1, M/M/c (pen and paper) |
| Profiling | Where time/memory/locks go in real production code | pprof, async-profiler, Pyroscope, Parca |
| Load testing | Pre-prod capacity validation; finding the knee | k6, vegeta, wrk2 (open-loop only) |
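
Two of the pen-and-paper rows in the table come down to closed-form arithmetic. The sketch below uses the textbook M/M/1 mean time in system, W = S / (1 − ρ), to show where the knee comes from, and the roofline bound, min(peak compute, bandwidth × operational intensity). The function names and the machine figures are hypothetical, and M/M/1 itself assumes a single server with Poisson arrivals and exponential service times, which real services only approximate.

```python
def mm1_mean_latency(service_time_s, utilisation):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time_s / (1.0 - utilisation)


def roofline_attainable(peak_flops, peak_bytes_per_s, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, peak_bytes_per_s * flops_per_byte)


if __name__ == "__main__":
    service_time_s = 0.010  # hypothetical 10 ms mean service time
    for rho in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
        latency_ms = mm1_mean_latency(service_time_s, rho) * 1000
        print(f"utilisation {rho:.2f} -> mean latency {latency_ms:7.1f} ms")

    # Hypothetical machine: 1 TFLOP/s peak compute, 100 GB/s memory bandwidth.
    peak_flops = 1.0e12
    peak_bytes_per_s = 100.0e9
    for intensity in (0.25, 10.0, 50.0):
        bound = roofline_attainable(peak_flops, peak_bytes_per_s, intensity)
        kind = "memory-bound" if bound < peak_flops else "compute-bound"
        print(f"{intensity:5.2f} FLOP/byte -> {bound / 1e9:7.1f} GFLOP/s ({kind})")
```

Under these assumptions, latency doubles at 50% utilisation and grows tenfold at 90%, which is the non-linearity behind the knee. On the roofline side, a kernel below the ridge point (10 FLOP/byte on this made-up machine) gains nothing from more compute and needs less memory traffic instead.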

Books, courses, papers, talks

  • Brendan Gregg — Systems Performance (2nd ed). The textbook. Chapter 2 (methodologies) and chapter 6 (CPUs) are the core; the rest is the per-resource reference.
  • Brendan Gregg — BPF Performance Tools. The eBPF era's manual. Chapter-per-tool, chapter-per-domain. Pair it with the tool collections that ship with bcc and bpftrace.
  • Beyer et al — Site Reliability Engineering & The Site Reliability Workbook. Free from Google. Chapters on SLOs, error budgets, and overload are required.
  • Henderson — Building Scalable Web Sites. Older but the chapters on capacity planning are still the cleanest writeup of the math.
  • Yasin — "A Top-Down Method for Performance Analysis and Counters Architecture" (Intel paper, ISPASS 2014). The source paper for top-down. Read once before using toplev.py.
  • Williams, Waterman, Patterson — "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (CACM 2009). The roofline paper.
  • Tilkov & Vinoski — "Node.js: Using JavaScript to Build High-Performance Network Programs" (IEEE Internet Computing). Tangential but the section on the event loop is widely cited.
  • Talks: Brendan Gregg — "Linux Performance Tools" (USENIX); Bryan Cantrill — "Surge — Welcome to the Jungle"; Gil Tene — "How NOT to Measure Latency" (Strange Loop), the coordinated-omission talk by the author of wrk2.

Hands-on tools

What to install and what to point it at. The tooling list is short; the depth is in knowing which to reach for and what to read in the output.

  • System-level: perf, strace, ltrace, vmstat, iostat, sar, numastat, mpstat, pidstat.
  • eBPF / BCC / bpftrace: biolatency, execsnoop, opensnoop, tcpconnect, profile, stackcount. Each one a focused single-purpose tool.
  • Profilers: pprof (Go), async-profiler (JVM), py-spy (Python), rbspy (Ruby), perf record (any). Continuous profilers — Pyroscope, Parca, Datadog Continuous — for production fleets.
  • Tracing: OpenTelemetry SDK + Jaeger, Tempo, Honeycomb, or Datadog APM. Trace one request from edge to database before declaring the design "done".
  • Load: k6, wrk2, vegeta for open-loop. locust if you need a Python harness with custom scenarios, keeping in mind that its default user model is closed-loop.
  • Visualisation: Brendan Gregg's flame graphs (flamegraph.pl), Grafana dashboards, trace waterfall views for request timelines.

Eight common mistakes

  • Reasoning from averages. Mean latency tells you almost nothing about user experience because it hides the tail. Track P50, P95, P99, and P99.9 separately.
  • Trusting closed-loop load tests. Locust without coordinated-omission correction can report tail latencies 10–100× lower than reality. Use wrk2, vegeta, or k6 with a constant arrival rate.
  • Skipping the resource baseline. "It's slow" with no USE/RED dashboard is unanswerable. Establish baselines first; tune later.
  • Tuning without measuring. The classic — adding indexes, caches, or threads without a profile. Profile first; the bottleneck is rarely where you guess.
  • Confusing throughput and latency. Adding capacity at low load doesn't reduce latency. The two trade off only above the knee.
  • Ignoring the deadline. Per-hop timeouts add up to more than the request's deadline, so work continues after the user gave up. Propagate deadlines explicitly.
  • Hot path at the mercy of the SQL planner. A planner that switches to a bad index in production silently kills P99. Check plans with EXPLAIN against production-like data; plan stability and fresh planner statistics both matter.
  • Optimising before the system runs at scale. Real cache state, real GC behaviour, real contention only show up in production. Local benchmarks often optimise the wrong thing.
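
The "ignoring the deadline" mistake above can be avoided by carrying one absolute deadline with the request and deriving each hop's timeout from what is left. This is a minimal sketch under stated assumptions: the `Deadline` class, `DeadlineExceeded`, and `timeout_for_hop` are hypothetical names invented for illustration, not the API of any particular RPC framework, and the clock is injected so the demo runs deterministically.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's overall deadline has already passed."""


class Deadline:
    """One absolute deadline per request, shared by every hop."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left before the request deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_hop(self, per_hop_cap_s):
        """Timeout for the next call: the hop's cap or the time left, whichever is smaller."""
        left = self.remaining()
        if left <= 0.0:
            raise DeadlineExceeded("request deadline already passed")
        return min(per_hop_cap_s, left)


if __name__ == "__main__":
    # A hand-driven clock keeps the demo deterministic.
    now = [0.0]
    deadline = Deadline(budget_s=1.0, clock=lambda: now[0])

    print(deadline.timeout_for_hop(0.4))  # full per-hop cap is available
    now[0] = 0.8                          # 800 ms already spent upstream
    print(deadline.timeout_for_hop(0.4))  # capped to the remaining budget
    now[0] = 1.2                          # the caller has given up
    try:
        deadline.timeout_for_hop(0.4)
    except DeadlineExceeded as exc:
        print("skipped:", exc)
```

With three hops each allowed a fixed 400 ms, the chain could run for 1.2 s against a 1 s request deadline. Deriving each timeout from the shared deadline caps the last hop and skips work once the caller has already given up.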

Adjacent paths

  • Computer architecture. The silicon under all of this. Caches, TLB, branch prediction, NUMA — the layer you're tuning against.
  • Operating systems. The scheduler, the page cache, the syscall boundary, eBPF: the layer that the OS-level tools listed above observe.
  • Back-pressure, retries, hedging, deadlines. The reliability primitives that depend on the methods here.
  • System design. Where capacity math meets architecture; performance methods are how you defend the design under questioning.