Load testing without lying

A load test answers three questions: where does this system stop scaling, can it carry the load you expect, and did the last change make any of that worse. The trouble is that the most common tools answer those questions wrong. They use a load model that pauses whenever the system stalls, so the tail latency they report is better than anything a real user will ever see. This page is about running the test honestly: open-loop arrivals, an awareness of coordinated omission, reading the throughput curve for its knee, and picking the test type that matches the question you actually have.

Three things a load test is for

Before any tooling argument, be clear about what you want from the run. Load tests get muddy when one test is asked to answer three different questions at once, so it helps to name them apart.

The first goal is finding the knee. Every system has a point where adding more load stops buying more work and starts buying latency instead. Below that point throughput rises roughly in step with the load you offer and latency stays flat. Above it, throughput plateaus or falls while latency climbs steeply. That bend is the single most useful number a load test produces, because it tells you the real ceiling, not the marketing one. The whole job of a saturation run is to locate it.

The second goal is validating capacity. You have a target — say 5,000 requests a second for a launch, with a P99 under 200 ms — and you want to know whether the system meets it with margin to spare. This is a pass/fail question against a known load, and the honest version of it holds the arrival rate fixed and reads the latency distribution. A run that quietly fails to reach the target rate has answered a different question than the one you asked.

The third goal is catching regressions. Once you know a system's shape, you re-run the same test on every release and watch for drift. A change that pushes the knee from 8,000 to 6,000 req/sec, or lifts the P99 at your target rate from 140 ms to 190 ms, is a regression even if every unit test still passes. This is the load test that earns its keep over time, and it only works if the test is repeatable: same workload, same warm-up handling, same generator settings, run by CI rather than by a person who half-remembers last quarter's numbers.

These three goals want different runs, and the rest of this page keeps coming back to them. But all three share one requirement that most teams get wrong on the first try: the load has to arrive the way real traffic arrives. That is the open-loop versus closed-loop distinction, and it is worth getting right before anything else.

Open-loop vs closed-loop in one paragraph

A closed-loop load generator works like this: send a request, wait for the response, send the next request. The number of concurrent virtual users is the knob you turn; the arrival rate depends on how fast the system responds. An open-loop generator works differently: send a request every 1/λ seconds regardless of whether prior requests have returned. The arrival rate is the knob; concurrency depends on how the system handles them.

Real user populations behave open-loop. Each user clicks a button when they want something; nobody waits for another user's response to come back before clicking. Closed-loop generators don't model that; they model an imaginary user who slows down whenever the system does. Reports generated this way understate the tail because the worst latencies that would have occurred under real arrival patterns silently never happen.

	Closed-loop	Open-loop
Knob	Number of virtual users (concurrency)	Arrival rate (req/sec)
Behaviour during stall	Stops sending — waits for response	Keeps sending — backlog grows
What's modelled	A user who paces themselves to system speed	A population whose request rate is independent of system speed
Tail latency	Under-reported	Accurately reflected
Tools	JMeter (default), Locust (default), ab, wrk	wrk2, vegeta, k6 in constant-arrival mode, Gatling open injection
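The two knobs are linked by Little's law (requests in flight = arrival rate × time in system), which is one way to see why each generator type fixes one quantity and lets the other float. The sketch below is illustrative only: the function names are made up for this page, and the closed-loop case assumes virtual users with no think time.

```javascript
// Little's law: inFlight = arrivalRate * timeInSystem.

// Open-loop: you set the arrival rate; in-flight requests follow from latency.
function inFlightAtRate(ratePerSec, meanLatencyMs) {
  return ratePerSec * (meanLatencyMs / 1000);
}

// Closed-loop (no think time): you set the number of virtual users;
// the achieved rate follows from latency.
function rateAtConcurrency(virtualUsers, meanLatencyMs) {
  return virtualUsers / (meanLatencyMs / 1000);
}

// Open-loop at 1,000 req/sec: a slowdown from 10 ms to 100 ms raises concurrency.
console.log(inFlightAtRate(1000, 10), inFlightAtRate(1000, 100));

// Closed-loop with 10 users: the same slowdown cuts the offered rate instead.
console.log(rateAtConcurrency(10, 10), rateAtConcurrency(10, 100));
```

In the open-loop case the slowdown shows up as ten times more requests in flight. In the closed-loop case it shows up as ten times less offered load, which is exactly the pause that hides the tail.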

Coordinated omission

Coordinated omission is the technical name for the under-reporting closed-loop tests produce. Gil Tene of Azul Systems coined the term and popularised it in his "How NOT to Measure Latency" talks, and it has since become standard SRE vocabulary. The mechanism is straightforward:

# Closed-loop, single virtual user, send-receive-send pattern.
# Steady state: requests every 10 ms (the system's normal latency).

t = 0:    send req 1
t = 10:   recv req 1   →  latency = 10 ms  ✓ recorded
t = 10:   send req 2
t = 20:   recv req 2   →  latency = 10 ms  ✓ recorded
t = 20:   send req 3

# Now the system stalls for 1 second (GC, leader election, whatever).

t = 1020: recv req 3   →  latency = 1000 ms  ✓ recorded
t = 1020: send req 4
t = 1030: recv req 4   →  latency = 10 ms  ✓ recorded

# Reported P99 across these 4 samples: 1000 ms.
# Reality: between t=20 and t=1020, ~100 requests SHOULD have arrived
# (at the steady-state 10 ms cadence). Those requests would have queued.
# Their experienced latency would have been ~990, 980, 970, ..., 10 ms.
# The closed-loop generator never sent them. Reported P99 ignores them.

# Open-loop generator at 100 req/sec (= 1 req every 10 ms):
t = 0:    send req 1     → recv at  t=10  → 10 ms
t = 10:   send req 2     → recv at  t=20  → 10 ms
t = 20:   send req 3     → recv at t=1020 → 1000 ms  (stalled)
t = 30:   send req 4     → recv at t=1020 → 990 ms   (queued behind 3)
t = 40:   send req 5     → recv at t=1020 → 980 ms
...
t = 1010: send req 102   → recv at t=1020 → 10 ms

# Reported P99 across 102 samples: ~990 ms.
# That's what real users would experience.

The asymmetry. In a longer run, the closed-loop generator with one virtual user records the stall as a single 1-second sample buried among thousands of 10 ms ones, so the stall never reaches the P99. The open-loop generator at the same intended arrival rate records about 100 samples spread from 10 ms to 1000 ms, so the entire stall shows up in the latency distribution. Same system, same stall; very different report.
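The trace above can be reproduced with a few lines of simulation. This is a toy model under stated assumptions: a 10 ms service time, a stall from t = 20 ms to t = 1020 ms during which nothing completes, and both generators run until t = 1030 ms (so the open-loop run has one more sample than the trace). All names are hypothetical.

```javascript
const SERVICE_MS = 10;
const STALL = { start: 20, end: 1020 };

// Completion time for a request sent at `sendMs`: normally send + service time,
// but anything sent during the stall cannot finish before the stall ends.
function completionTime(sendMs) {
  const inStall = sendMs >= STALL.start && sendMs < STALL.end;
  return inStall ? Math.max(sendMs + SERVICE_MS, STALL.end) : sendMs + SERVICE_MS;
}

// Closed-loop: one virtual user; the next send waits for the previous response.
function closedLoopLatencies(endMs) {
  const latencies = [];
  let t = 0;
  while (t < endMs) {
    const done = completionTime(t);
    latencies.push(done - t);
    t = done;
  }
  return latencies;
}

// Open-loop: one send every `intervalMs`, regardless of outstanding requests.
function openLoopLatencies(endMs, intervalMs) {
  const latencies = [];
  for (let t = 0; t < endMs; t += intervalMs) {
    latencies.push(completionTime(t) - t);
  }
  return latencies;
}

const closed = closedLoopLatencies(1030);
const open = openLoopLatencies(1030, 10);
const slow = (xs) => xs.filter((ms) => ms > 100).length;
console.log(`closed-loop: ${closed.length} samples, ${slow(closed)} slower than 100 ms`);
console.log(`open-loop:   ${open.length} samples, ${slow(open)} slower than 100 ms`);
```

The closed-loop run sees the stall once; the open-loop run sees it in most of its samples, because every request that should have been sent during the stall is actually sent and measured.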

Tools that get it right (and wrong)

The most important question to ask of any load-testing tool is: "what variable do I set — arrival rate or concurrency?" If the answer is "users" or "concurrent connections", it's closed-loop by default and probably under-reporting the tail. The right tools accept arrival rate directly.

Tool	Default mode	Coordinated omission corrected?	Notes
wrk2	Open-loop, constant rate	Yes (Gil Tene's correction)	The reference. Specify `-R 10000` and it sends 10,000 req/sec regardless of system response. Lua scripting for request bodies.
vegeta	Open-loop	Yes	Go-based. Reads request descriptions from stdin; `vegeta attack -rate=10000` is the basic invocation.
k6	Closed-loop by default, open-loop in constant-arrival scenarios	Yes in open-loop mode	Modern, JavaScript scripting. Use `executor: 'constant-arrival-rate'` for open-loop; the default `constant-vus` is closed-loop.
Gatling	Open injection by default in modern versions	Yes in open injection	Scala DSL. `injectOpen(constantUsersPerSec(...))` is open-loop; the older injectClosed is closed-loop.
wrk (not wrk2)	Closed-loop	No	Faster than wrk2 but under-reports tail. Use only when peak throughput matters more than tail accuracy.
Locust	Closed-loop	No (requires custom code)	Python-based, very popular. The default user model is closed-loop; open-loop requires careful scripting.
JMeter	Closed-loop	No (Constant Throughput Timer is partial)	The legacy default. Adding the Constant Throughput Timer can partially correct, but it pauses overflow rather than queueing — not the same thing.
ab (ApacheBench)	Closed-loop	No	Sufficient for smoke tests, never for tail-latency claims.
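When the data has already been collected closed-loop, one common repair is to back-fill the samples that were never sent, in the style of the expected-interval recording that HdrHistogram offers. The function below is a simplified sketch of that idea, not any tool's actual implementation; its name and the example numbers are assumptions made for illustration.

```javascript
// Back-fill the samples a closed-loop generator never sent.
// `expectedIntervalMs` is the gap between requests the test intended to keep.
function correctForCoordinatedOmission(latenciesMs, expectedIntervalMs) {
  const corrected = [];
  for (const latency of latenciesMs) {
    corrected.push(latency);
    // Each full interval spent waiting stands for one request that should
    // have been sent, and that would have waited one interval less.
    for (
      let missing = latency - expectedIntervalMs;
      missing >= expectedIntervalMs;
      missing -= expectedIntervalMs
    ) {
      corrected.push(missing);
    }
  }
  return corrected;
}

// The four closed-loop samples from the coordinated-omission trace.
const raw = [10, 10, 1000, 10];
const fixed = correctForCoordinatedOmission(raw, 10);
console.log(`${raw.length} raw samples -> ${fixed.length} corrected samples`);
```

Back-filling is an estimate of what the missing requests would have seen. A generator that schedules by arrival rate avoids the problem at the source, which is why it remains the better choice when you control the test.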

Designing a realistic load test

Picking the right tool is the easy part. Designing a test that resembles production load — and that exercises the code paths that matter — is harder. A useful checklist before running any number:

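Match the arrival distribution. Real traffic isn't constant; it has bursts. Constant-rate arrival is a useful baseline; Poisson (exponentially spaced) arrivals are closer to reality. Many generators pace at a fixed rate by default, so check whether yours can produce Poisson arrivals or whether you need to script them.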
Mix request types proportionally. If 80% of production traffic is GETs and 20% is POSTs, the test should match. A pure-GET test exercises a different code path than the real mix.
Use realistic payload sizes. A 200-byte JSON request behaves differently from a 50-KB one. The CPU cost, the parsing path, and the network behaviour all differ.
Vary the keys. A test that hits the same URL repeatedly keeps one cache entry permanently hot and tests nothing real. Generate a key space that approximates production cardinality.
Don't forget the ramp. Cold caches, cold JIT, cold connection pools — all warm up over the first minute. Discard the warm-up window; report from steady state.
Measure from outside the generator. The system's own metrics should agree with the generator's. If they don't, one of them is wrong — most often the closed-loop generator is under-reporting.
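Several items on this checklist (arrival distribution, request mix, key variety) can be prototyped before touching a real generator. The sketch below builds a send schedule with Poisson arrivals and a weighted request mix. The 80/20 mix, the key-space size, and every function name are placeholders chosen for illustration; take the real figures from production logs.

```javascript
// Exponentially distributed gaps give Poisson arrivals at `ratePerSec`.
function poissonSendTimes(ratePerSec, durationSec, rng = Math.random) {
  const times = [];
  let t = 0;
  while (true) {
    t += -Math.log(1 - rng()) / ratePerSec; // 1 - rng() avoids log(0)
    if (t >= durationSec) return times;
    times.push(t);
  }
}

// Pick a request type in proportion to its share of production traffic.
function pickRequest(mix, rng = Math.random) {
  const total = mix.reduce((sum, m) => sum + m.weight, 0);
  let r = rng() * total;
  for (const m of mix) {
    r -= m.weight;
    if (r < 0) return m;
  }
  return mix[mix.length - 1]; // guard against floating-point edge cases
}

// Hypothetical mix and key space.
const MIX = [
  { method: 'GET', weight: 80 },
  { method: 'POST', weight: 20 },
];
const KEY_SPACE = 10000;

// Ten seconds of traffic at an average of 100 req/sec.
const schedule = poissonSendTimes(100, 10).map((atSec) => ({
  atSec,
  method: pickRequest(MIX).method,
  key: Math.floor(Math.random() * KEY_SPACE),
}));
console.log(schedule.length, schedule.slice(0, 3));
```

The schedule averages the requested rate but arrives in clumps and gaps, which is what exercises queues in a way evenly spaced requests do not.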

Reading the throughput and latency curve

A saturation run produces two curves plotted against offered load: throughput delivered and latency experienced. Read together they tell you everything about where the system breaks. The shape is the same across almost every system you will test, which is what makes it worth learning to read once.

Below the knee, throughput tracks offered load and latency is flat. At the knee, throughput plateaus and latency turns up. Run capacity and SLO tests in the linear region, at 70–80% of the knee.

The left of the chart is the region you want to live in. Offered load and delivered throughput move together, and latency barely changes because every request gets served promptly and queues stay short. This is the system doing useful work with headroom.

The knee is the bend where the throughput line flattens and the latency line kicks up. It is not a single sharp point on real hardware — it is a short transition — but it behaves like one for planning. The reason both curves turn at the same place is queueing: once arrivals approach the service rate, the queue stops draining between requests and starts growing, so each new request waits behind a longer line. Latency is the queue depth made visible. The maths behind why a queue blows up well before utilisation reaches 100% is covered in queueing theory for engineers; the short version is that latency rises in proportion to 1/(1−utilisation), so the last 10% of capacity costs far more latency than the first 10%.
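To put numbers on the 1/(1−utilisation) relationship, here is a small sketch using the textbook single-server (M/M/1) approximation. It assumes Poisson arrivals and exponentially distributed service times, which real systems only loosely follow, so read the output as a shape rather than a prediction. The function name and the 10 ms service time are illustrative.

```javascript
// M/M/1 approximation: mean time in system = serviceTime / (1 - utilisation).
function meanLatencyMs(serviceTimeMs, utilisation) {
  if (utilisation < 0 || utilisation >= 1) {
    throw new RangeError('utilisation must be in [0, 1)');
  }
  return serviceTimeMs / (1 - utilisation);
}

for (const u of [0.1, 0.5, 0.8, 0.9, 0.99]) {
  const busy = (u * 100).toFixed(0);
  console.log(`${busy}% busy -> ${meanLatencyMs(10, u).toFixed(1)} ms`);
}
```

With a 10 ms service time, moving from 10% to 50% utilisation adds about 9 ms of mean latency, while moving from 80% to 90% adds about 50 ms and from 90% to 99% adds about 900 ms.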

Past the knee, the throughput line often does not just flatten — it droops. Delivered work actually falls as offered load rises. That counter-intuitive shape comes from the system spending its cycles on overhead instead of work: connection churn, lock contention, retries amplifying the load, garbage collection triggered by deep queues, threads thrashing. This is congestion collapse, and a system that exhibits it needs admission control (shed load, return 429s, trip a circuit) so that a brief overload does not turn into a long outage. The rate-limiter simulator is a good place to feel how a token bucket holds the offered load below the knee rather than letting it run off the cliff.

One practical warning about reading these curves: only the open-loop run produces an honest version. A closed-loop generator cannot push past the knee, because the moment latency rises its virtual users slow down and stop offering more load. Its throughput curve quietly bends into a smooth ceiling that never shows you the collapse, and its latency curve never climbs the way production's will. You measure the cliff by walking off it on purpose, which a closed-loop generator refuses to do.
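Locating the knee from an open-loop ramp can be done mechanically. The sketch below walks the ramp steps and returns the last one where delivered throughput still tracks offered load and P99 has not blown up relative to the first step. The thresholds (95% delivered, 2× baseline P99), the function name, and the sample data are all invented for illustration; tune them to your own system.

```javascript
// Each step of an open-loop ramp: offered rate, delivered rate, observed P99.
// Returns the last step where the system still kept up with the offered load.
function findKnee(steps, { minDeliveredRatio = 0.95, maxP99Growth = 2 } = {}) {
  const baselineP99 = steps[0].p99Ms;
  let knee = null;
  for (const step of steps) {
    const keepsUp = step.deliveredRps >= minDeliveredRatio * step.offeredRps;
    const latencyFlat = step.p99Ms <= maxP99Growth * baselineP99;
    if (!keepsUp || !latencyFlat) break;
    knee = step;
  }
  return knee;
}

// Made-up ramp results, for illustration only.
const ramp = [
  { offeredRps: 2000, deliveredRps: 2000, p99Ms: 40 },
  { offeredRps: 4000, deliveredRps: 3990, p99Ms: 45 },
  { offeredRps: 6000, deliveredRps: 5980, p99Ms: 60 },
  { offeredRps: 8000, deliveredRps: 7900, p99Ms: 75 },
  { offeredRps: 10000, deliveredRps: 8100, p99Ms: 900 },
  { offeredRps: 12000, deliveredRps: 7400, p99Ms: 2500 },
];

const knee = findKnee(ramp);
const low = Math.round(0.7 * knee.offeredRps);
const high = Math.round(0.8 * knee.offeredRps);
console.log(`knee near ${knee.offeredRps} req/sec`);
console.log(`run latency tests at ${low}-${high} req/sec`);
```

Note the last two rows of the sample data: delivered throughput falls as offered load rises, which is the droop described above and something only an open-loop ramp will record.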

Saturation tests vs latency tests

Two distinct test goals, often confused. Both are useful; running them as separate tests gives clearer results than trying to extract both from one run.

	Saturation test	Latency test
Goal	Find max throughput	Find latency under a known load
Method	Ramp arrival rate until system breaks	Hold arrival rate constant at a target; measure
What you report	Knee of the throughput curve, max stable RPS	P50/P95/P99 at the target rate
When to use	Capacity planning; pre-launch headroom check	SLO validation; regression testing
Generator preference	Open-loop ramp	Open-loop constant rate

The mistake to avoid: using a saturation test to claim a latency number. The latency at the peak of a saturation test is the latency right before the system melted; it's not the latency a user would experience at that rate sustainably. Run latency tests at 70–80% of the saturation knee, matching the queueing theory rule of thumb.

The four test shapes

Saturation and latency are the two goals; the shape of the load over time is a separate choice, and it changes what the test finds. Four shapes cover almost everything teams need, and each one surfaces a different class of bug.

Four profiles, same axes: arrival rate over time. Load holds steady at a target, stress ramps until something breaks, soak holds for a long time, spike jumps suddenly and drops.

A load test holds the arrival rate steady at a realistic target and watches the latency distribution settle. This is the everyday test, the one that validates an SLO and the one you re-run to catch regressions. It tells you how the system behaves at the load you expect, under conditions you can reason about.

A stress test ramps the arrival rate up and keeps going past the point where the system is comfortable, all the way to failure. The goal is not the latency number — it is the failure mode. Does the system shed load gracefully and recover, or does it fall over and stay down? Does it return clean errors or does it corrupt state? A stress test is how you find out whether your overload handling is real or aspirational before a traffic surge finds out for you.

A soak test, also called an endurance test, holds a moderate load for a long time — hours, sometimes days. It is built to catch the slow failures that a five-minute run never sees: memory leaks that take an hour to matter, file descriptors that never get released, log files that fill a disk, caches that grow without bound, connection pools that slowly poison themselves. A system can pass every short test and still die at 3 a.m. on day three; the soak test is the only one that catches that.

A spike test jumps the arrival rate suddenly — a flat baseline, then an instant jump to several times that, then back down. Real traffic does this: a marketing email goes out, a cache expires across the fleet at once, an upstream retries a backlog. The question is whether the system absorbs the jump without a cascade. Spikes expose problems that gradual ramps hide, because autoscaling has no time to react, cold paths get hit all at once, and thundering-herd effects pile retries on top of an already-loaded system.

These shapes compose. A capacity validation is usually a load test built on top of a saturation ramp you ran earlier to find the knee. A pre-launch sign-off might run a soak test at expected load and a spike test at the worst burst you can imagine, both in the same week. The point is to choose the shape deliberately rather than running one steady test and assuming it covers every failure mode.
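The four shapes are easy to express as arrival rate as a function of time, which is also the form most open-loop generators want as input. The sketch below is generator-agnostic; the shape parameters and names are hypothetical, and the numbers in the usage lines are placeholders rather than recommendations.

```javascript
// Arrival rate (req/sec) at time tSec for each of the four shapes.
const SHAPES = {
  // Hold a realistic target for the whole run.
  load: (tSec, { targetRps }) => targetRps,
  // Ramp linearly from zero to a rate well past the expected knee.
  stress: (tSec, { peakRps, durationSec }) => peakRps * (tSec / durationSec),
  // Moderate load held for hours; the shape is flat, the duration is the test.
  soak: (tSec, { targetRps }) => targetRps,
  // Flat baseline with a sudden jump between spikeStartSec and spikeEndSec.
  spike: (tSec, { baseRps, peakRps, spikeStartSec, spikeEndSec }) =>
    tSec >= spikeStartSec && tSec < spikeEndSec ? peakRps : baseRps,
};

function rateAt(shape, tSec, params) {
  const profile = SHAPES[shape];
  if (!profile) throw new Error(`unknown shape: ${shape}`);
  return profile(tSec, params);
}

const spikeParams = { baseRps: 1000, peakRps: 5000, spikeStartSec: 120, spikeEndSec: 150 };
console.log(rateAt('stress', 300, { peakRps: 12000, durationSec: 600 })); // halfway up the ramp
console.log(rateAt('spike', 100, spikeParams)); // before the jump
console.log(rateAt('spike', 125, spikeParams)); // during the jump
```

Load and soak share a formula on purpose: what separates them is how long the run lasts and what you watch (latency distribution versus slow resource growth), not the rate curve.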

Testing in prod vs staging

Where you run the test changes how much you can trust the result. Staging is safe and prod is honest, and most of the hard tradeoffs in load testing come from that tension.

A staging environment lets you ramp to failure without paging anyone or refunding a customer. That freedom is real and worth having. The catch is that staging is almost never a faithful copy of production. It runs fewer instances, smaller machines, a smaller and often synthetic dataset, a cache that was warmed by your test rather than by months of real traffic, and none of the noisy-neighbour effects of a shared production fleet. A knee you find at 4,000 req/sec on a half-size staging cluster tells you little about the real ceiling. Staging numbers are most trustworthy as relative measures — this release versus last release on the same rig — and least trustworthy as absolute capacity claims.

Testing in production removes the fidelity problem because it is, by definition, the real thing: real data sizes, real cache state, real hardware, real dependencies. The risk is obvious. A few patterns make it safe enough to do routinely. Run against a small, drained slice of the fleet rather than the whole thing, so a bad run hurts a controlled fraction. Use shadow traffic so the load is real but the responses are discarded. Cap the test with the same admission control that protects real users, so the generator cannot push the live system past the cliff. And always have a kill switch that stops the test in one action. Done with those guardrails, a production load test is the only one whose absolute numbers you can fully believe.
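The guardrails above amount to a check that runs continuously while the test is in flight. A minimal sketch of such a check follows; the field names, limits, and the idea of a single monitoring sample are assumptions made for illustration, and a real setup would wire them to its own metrics and test controller.

```javascript
// Decide whether to stop a production load test, given one monitoring sample.
// Returns the list of reasons to abort; an empty list means keep going.
function abortReasons(sample, limits) {
  const reasons = [];
  if (sample.killSwitch) reasons.push('kill switch set');
  if (sample.liveErrorRate > limits.maxLiveErrorRate) {
    reasons.push('live error rate over limit');
  }
  if (sample.liveP99Ms > limits.maxLiveP99Ms) {
    reasons.push('live P99 over limit');
  }
  if (sample.fleetFractionUnderTest > limits.maxFleetFraction) {
    reasons.push('test slice larger than allowed');
  }
  return reasons;
}

// Hypothetical limits and sample.
const limits = { maxLiveErrorRate: 0.001, maxLiveP99Ms: 200, maxFleetFraction: 0.05 };
const sample = { killSwitch: false, liveErrorRate: 0.004, liveP99Ms: 150, fleetFractionUnderTest: 0.02 };
console.log(abortReasons(sample, limits));
```

The check reads the latency and error rate of live user traffic, not of the synthetic load, because the point of the guardrail is to protect the users sharing the system with the test.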

The usual division of labour: staging for fast, cheap regression checks on every release, and a small number of carefully bounded production tests for the absolute capacity numbers that matter for planning. Treat a staging knee as a ratio, not a fact, and confirm the real ceiling in prod before you bet a launch on it.

Shadowing and traffic replay

Synthetic load is never quite real. Two patterns address this in production-adjacent environments without affecting users:

Shadow traffic. A copy of every production request is mirrored to a parallel test environment. The mirror's responses are discarded. Real traffic, real distributions, real customers — without the customer ever seeing the test environment. Envoy, NGINX, and most service meshes support this natively.
Traffic replay. Record a window of real production requests, then replay them against the test environment at the desired rate. Same realistic mix and distribution as shadow traffic, with the advantage that you can amplify rate (replay at 2× original to test for headroom) or change other variables.

Both have caveats. Shadow traffic requires that the test environment can handle the production rate, or that you sample. Replay requires storing anonymised request data, which is sometimes a compliance constraint. Where they're feasible, they produce results that synthetic generators cannot match.
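Rate amplification in a replay comes down to rescaling the recorded timestamps. The sketch below assumes a hypothetical record format (`tsMs`, `method`, `path`) sorted by timestamp; real capture formats differ, and this only computes the schedule, it does not send anything.

```javascript
// Rescale recorded request timestamps so the replay runs `speedup` times faster.
// `records` is assumed to be sorted by `tsMs`.
function replaySchedule(records, speedup) {
  if (records.length === 0) return [];
  const t0 = records[0].tsMs;
  return records.map((r) => ({
    ...r,
    sendAtMs: (r.tsMs - t0) / speedup, // offset from the start of the replay
  }));
}

// Hypothetical recorded window.
const recorded = [
  { tsMs: 1000, method: 'GET', path: '/items/1' },
  { tsMs: 1040, method: 'GET', path: '/items/7' },
  { tsMs: 1100, method: 'POST', path: '/items' },
];
const doubled = replaySchedule(recorded, 2).map((r) => r.sendAtMs);
console.log(doubled); // [ 0, 20, 50 ]
```

Compressing the gaps keeps the recorded mix and ordering intact while doubling the rate. It also compresses any natural pauses between dependent requests, so flows that rely on think time may behave differently than they did in production.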

A worked example

Validating an SLO of "P99 < 200 ms at 5,000 req/sec sustained" for an HTTP service, using k6.

// k6 script — constant-arrival-rate (open-loop)
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    sustained: {
      executor: 'constant-arrival-rate',
      rate: 5000,                  // 5000 iterations per timeUnit
      timeUnit: '1s',              // → 5000 req/sec
      duration: '5m',              // run for 5 minutes (warm-up + sample)
      preAllocatedVUs: 200,        // initial VU pool; k6 adds more, up to maxVUs, if needed
      maxVUs: 1000,                // safety cap
    },
  },
  thresholds: {
    http_req_duration: ['p(99)<200'],   // fail if P99 > 200 ms
    http_req_failed: ['rate<0.001'],    // fail if error rate > 0.1%
  },
};

const KEYS = [...Array(10000).keys()]; // realistic key cardinality

export default function () {
  const key = KEYS[Math.floor(Math.random() * KEYS.length)];
  const res = http.get(`https://api.example.com/items/${key}`);
  check(res, { '200 OK': (r) => r.status === 200 });
}

# Run it
k6 run --out json=results.json sustain.js

# Output (excerpt)
#   scenarios: (100.00%) 1 scenario, 1000 max VUs, 5m0s max duration ...
#   http_req_duration..............: avg=42ms   min=8ms  med=35ms
#     max=2.1s  p(95)=85ms  p(99)=178ms
#   http_reqs......................: 1499873  4999.58/s
#   ✓ p(99)<200
#   ✓ rate<0.001

# Discard first 60 seconds (warm-up) and re-run summary:
# Steady-state P99: 142 ms.   SLO met with headroom.

# Compare to a closed-loop run for the same target rate:
k6 run --vus 50 --duration 5m sustain.js   # closed-loop
#   http_req_duration..............: avg=8ms  p(99)=22ms
#   http_reqs......................: 305127  1017.09/s     ← much lower
#                                                          actual rate

# The closed-loop run reports a P99 of 22 ms — beautifully low.
# It also achieves only 1,017 req/sec, because each of the 50 VUs sat
# idle every time the system was slow. The "low P99" describes 50 users
# politely waiting, not 5,000 req/sec under load.

The contrast is the whole point. The open-loop run achieves the target arrival rate and reports the actual P99 a population of users would see. The closed-loop run, with concurrency as its only knob, never reaches the target rate and produces a comforting but irrelevant tail number. The SLO claim only makes sense from the open-loop number.
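The "discard the warm-up and re-run the summary" step in the example can be done with a short post-processing script. The sketch below works on samples already reduced to `{ tSec, ms }` pairs (seconds since test start, latency in milliseconds); extracting those from the generator's raw output is left out, and that record shape is an assumption made for illustration.

```javascript
// Nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Drop samples taken during warm-up, then compute the percentile.
function steadyStatePercentile(samples, warmupSec, p) {
  const steady = samples.filter((s) => s.tSec >= warmupSec).map((s) => s.ms);
  return percentile(steady, p);
}

// Synthetic samples: a slow first minute, then a steady 40 ms.
const samples = [];
for (let t = 0; t < 300; t++) {
  samples.push({ tSec: t, ms: t < 60 ? 400 : 40 });
}
console.log('P99 with warm-up:   ', steadyStatePercentile(samples, 0, 99));
console.log('P99 without warm-up:', steadyStatePercentile(samples, 60, 99));
```

Whichever window you choose, state it alongside the number. A P99 with no stated warm-up handling cannot be compared from one release to the next.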

Production checklist

Use an open-loop generator with a fixed arrival rate. wrk2, vegeta, or k6 in constant-arrival mode. The choice of tool matters less than the load model it runs.
Set rate, not concurrency. "100,000 req/sec" is meaningful; "1,000 virtual users" is not.
Discard the warm-up window. First 30–60 seconds usually include cold caches, cold JIT, cold connections. Report from steady state.
Match production's arrival distribution and request mix. A pure GET test exercises one path; production exercises many.
Vary the keys / inputs. Hammering one URL produces a hot-cache result that says nothing about real load.
Validate latency at 70–80% of saturation. The latency at the peak of a saturation test is meaningless for SLOs.
Use shadow traffic or replay when available. Real distributions beat any synthetic generator. Worth the operational effort for high-stakes launches.
Cross-check with the service's own metrics. If the generator's reported P99 disagrees with Prometheus's P99 from the same time window, one is lying — usually the generator.
