The RED method

Three metrics per service: request rate, error rate, and the duration distribution. Tom Wilkie's pattern from his time at Weaveworks; now the default for the first dashboard you build on any new service. RED tells you whether a service is healthy from the request side; USE tells you whether the box is healthy from the resource side. Most production investigations need both.

Why three, and why these three

RED is small on purpose. The point is that three numbers per service, viewed together, cover almost every operational question someone might ask in a degraded-service moment. Building anything more elaborate first is usually premature, and a fleet where every service exposes a different ad-hoc set of metrics is a fleet nobody can reason about at three in the morning.

The three signals are not an arbitrary pick. They fall out of a simple model of what a service is: a thing that takes requests in, fails on some of them, and spends time on the rest. Rate measures how much work is arriving. Errors measures how much of that work is going wrong. Duration measures how long the work that succeeds is taking. There is no fourth thing a request can do — arrive, fail, or take time are the only three states a unit of work passes through from the caller's point of view. That is why the set feels complete in practice even though it has only three members: it covers the entire surface a client can observe.

Notice what RED deliberately leaves out. It says nothing about CPU, memory, disk, file descriptors, or any other resource inside the box. That omission is the whole design. RED is a request-side method; it describes the contract between a service and its callers, not the machinery that fulfils the contract. When a request is slow, RED tells you it is slow but not why; the "why" is a resource question, and resource questions belong to USE. Keeping the two views separate is what makes each one sharp. The moment you start cramming garbage-collection pauses and thread-pool depth into your RED dashboard, it stops being a clean statement of service health and becomes a grab-bag.

Metric	Definition	Implementation
Rate	Requests per second handled by the service. Aggregate and broken out by endpoint, by status code class, by caller.	Prometheus `rate(http_requests_total[1m])`
Errors	Number (or fraction) of requests that errored. Usually HTTP 5xx, gRPC non-OK, app-level error codes.	`rate(http_requests_total{status=~"5.."}[1m])`
Duration	Latency distribution: at minimum P50, P95, P99. Histogram, not average.	`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`

Distribution, not average. Duration in particular must be a histogram — averages bury the tail and the tail is where the user pain lives. Same reasoning as the latency budgets page: track P50, P95, P99 separately and the gap between them is the tail factor.

Duration is a distribution, not a number

Of the three signals, duration is the one teams get wrong most often, and the mistake is almost always the same: they track the average. The average response time is one of the most misleading numbers in operations. It hides the shape of the distribution, and the shape is the whole story. A service whose average is a comfortable 40 ms can still be timing out one request in five hundred, because the average happily absorbs a handful of ten-second responses among thousands of fast ones: at 20 ms for the typical request, one ten-second response in every five hundred adds only another 20 ms to the mean. The user who hit the ten-second response does not feel an average. They feel ten seconds.

Percentiles fix this by describing the distribution at named points. P50, the median, is the experience of a typical request. P95 marks where the slowest one in twenty begins. P99 marks where the slowest one in a hundred begins. These are not academic distinctions. On a service handling a thousand requests a second, ten requests every second are at or beyond the P99: a steady drip of unhappy users that an average will never show you. And because real production latency is almost never symmetric, the median and the tail can live in completely different worlds: a median of 30 ms sitting under a P99 of 1.2 seconds is an entirely normal and entirely alarming shape.

Real latency is right-skewed: a dense bunch of fast requests and a long tail of slow ones. The average lands in the empty middle, describing almost no actual request.
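A minimal sketch of the gap between the mean and the percentiles. The 1,000 latencies are synthetic values chosen for illustration, not measurements, and the percentile helper uses the simple nearest-rank definition rather than any particular monitoring system's estimator.

```python
# Synthetic, right-skewed latencies in milliseconds (illustrative values only).
latencies_ms = sorted([30] * 940 + [300] * 49 + [1200] * 11)

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.1f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
# mean=56.1 ms  p50=30 ms  p95=300 ms  p99=1200 ms
```

The mean of 56.1 ms matches no request in the sample: it is nearly double the median and more than twenty times smaller than the P99.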

This is also why duration must be collected as a histogram rather than a pre-computed percentile. If each instance of a service exports its own P99, you cannot average those P99s together to get the fleet P99. Percentiles do not add up that way, and averaging them produces a number that is wrong in a direction you cannot predict. A histogram, by contrast, is a set of bucket counts, and bucket counts do add. You sum the histograms across every instance and compute the quantile once, at query time, on the combined distribution. That is what histogram_quantile does when it is wrapped around a sum by (le) over the _bucket series, as in the dashboard queries on this page, and it is the technical reason RED dashboards are built on _bucket series rather than gauges that report a percentile directly.
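The difference between averaging percentiles and merging histograms can be sketched with two instances. The bucket layout and counts are made up for illustration. The buckets here hold per-bucket counts for readability (Prometheus stores cumulative counts), and the quantile is reported as the upper bound of the bucket it falls in, with no interpolation.

```python
BOUNDS_MS = [50, 100, 250, 500, 1000]  # bucket upper bounds; hypothetical layout

# Per-bucket (non-cumulative) request counts for two instances; made-up numbers.
instance_a = [990, 10, 0, 0, 0]  # 1,000 requests, almost all fast
instance_b = [50, 20, 20, 5, 5]  # 100 requests, struggling

def bucket_quantile(counts, p):
    """Upper bound of the bucket that contains the p-th percentile request."""
    rank = (p * sum(counts) + 99) // 100  # integer ceil(p * n / 100)
    seen = 0
    for count, upper in zip(counts, BOUNDS_MS):
        seen += count
        if seen >= rank:
            return upper
    return BOUNDS_MS[-1]

# Wrong: average the per-instance P99s.
averaged_p99 = (bucket_quantile(instance_a, 99) + bucket_quantile(instance_b, 99)) / 2

# Right: add the bucket counts, then take the quantile once.
fleet = [a + b for a, b in zip(instance_a, instance_b)]
fleet_p99 = bucket_quantile(fleet, 99)

print(averaged_p99, fleet_p99)  # 525.0 250
```

The averaged figure gives the struggling instance the same weight as the healthy one even though it serves a tenth of the traffic; the merged histogram weights every request equally.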

The cost of this is that a histogram's accuracy is bounded by its bucket layout. A percentile is interpolated within whichever bucket it falls into, so if your buckets jump from 100 ms straight to 1 second, every P99 between those two values reports as roughly the same blurry number. The fix is to put bucket boundaries where you actually care about precision — usually clustered around your SLO threshold — and accept coarse resolution out in the far tail where the exact value matters less than the fact that it is large.
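The effect of bucket layout can be sketched with a simplified stand-in for the linear interpolation that histogram_quantile performs. The two workloads and both bucket layouts are hypothetical, and the sketch assumes every sample is at or below the last bucket bound.

```python
from bisect import bisect_left

def build_histogram(samples_ms, bounds_ms):
    """Count samples per bucket; a sample lands in the first bucket whose bound is >= it."""
    counts = [0] * len(bounds_ms)
    for value in samples_ms:
        counts[bisect_left(bounds_ms, value)] += 1
    return counts

def estimate_quantile(counts, bounds_ms, q):
    """Linear interpolation inside the bucket that holds the q-quantile rank."""
    rank = q * sum(counts)
    seen, lower = 0, 0.0
    for count, upper in zip(counts, bounds_ms):
        if count and seen + count >= rank:
            return lower + (upper - lower) * (rank - seen) / count
        seen, lower = seen + count, upper
    return float(bounds_ms[-1])

# Two hypothetical workloads: same fast majority, very different tails.
mild = [50] * 900 + [150] * 100    # true P99 is 150 ms
severe = [50] * 900 + [800] * 100  # true P99 is 800 ms

coarse = [100, 1000]
fine = [100, 200, 400, 1000]

for name, bounds in (("coarse", coarse), ("fine", fine)):
    estimates = [estimate_quantile(build_histogram(w, bounds), bounds, 0.99)
                 for w in (mild, severe)]
    print(name, [round(e) for e in estimates])
# coarse [910, 910]
# fine [190, 940]
```

With the coarse layout the two workloads produce identical histograms, so no query can tell them apart. The finer layout separates them, though each estimate is still only as precise as the bucket it lands in.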

RED and the four golden signals

Google's SRE book uses a slightly different list — the "four golden signals": latency, traffic, errors, saturation. Saturation is the extra one; it's roughly USE's saturation, brought up to the service level. In practice the choice doesn't matter much: start with RED, add saturation if your service has a clear queue (a thread pool, a connection pool, a buffer) where depth is a leading indicator.

	RED (Wilkie)	Four Golden Signals (Google SRE)
Rate	✓	Traffic
Errors	✓	Errors
Duration	✓	Latency
Saturation	—	✓

Both methods make the same point: a handful of consistent metrics across every service is more useful than a deep custom set on a single service. The second any new service ships, it should expose those metrics; that's how a fleet stays comprehensible.

The honest way to think about the relationship is that RED is the three golden signals you can always measure from outside the service, and saturation is the fourth one you can only measure from inside it. Rate, errors, and duration are all visible at the edge — a load balancer, a sidecar, or the caller itself can record every one of them without knowing anything about the service's internals. Saturation is different; it asks "how full is the most constrained resource," which requires knowing what the resources are. That is why saturation tends to migrate over to the USE side of the wall in practice, and why RED stops at three. Add saturation back into a RED dashboard only when the service has one obvious queue whose depth predicts trouble before errors or latency move — a thread pool, a connection pool, a message backlog. If there is no single dominant queue, saturation is better left to a USE view of the host.

RED versus USE: two views of the same incident

RED and USE are often presented as competitors. They are not. They answer different questions about the same system, and a real investigation usually crosses from one to the other. RED is the request-side view: it watches the contract between a service and its callers and tells you whether that contract is being honoured. USE is the resource-side view: it watches the CPU, memory, disk, and network of the box and tells you whether any resource is the bottleneck. RED notices that requests are slow; USE explains that the run queue is deep because the CPU is pegged. One is the symptom, the other is the cause.

Same incident, two vantage points. RED sees the requests crossing the boundary; USE sees the resources being consumed to serve them. You start with RED and cross to USE when the cause is inside the box.

The practical workflow follows the arrow. When a page fires, you open the service's RED dashboard first, because the request side is where user impact lives and where the symptom is unambiguous. RED narrows the problem: is more work arriving, is more of it failing, or is the same work taking longer? If the answer is "the same traffic is suddenly slower," that is your cue to cross the wall and bring up USE on the host — now you are looking for the saturated resource that explains the extra latency. If the answer is "more requests are failing at flat traffic," you skip USE entirely and go to the dependency that the errors point at. RED tells you which door to walk through; USE is what is behind one of the doors.

	RED	USE
Vantage point	Request side (the caller's view)	Resource side (the box's view)
Unit of analysis	A service or endpoint	A resource: CPU, memory, disk, NIC
Question it answers	Is the service honouring its contract?	Is any resource the bottleneck?
Signals	Rate, errors, duration	Utilisation, saturation, errors
Best for	Detecting user-facing impact	Explaining where the impact comes from

Building the dashboard

A RED dashboard for a single service is six panels: three aggregate views (rate, errors, latency percentiles) and the same three broken out by route. It takes about an hour to build and pays for itself the first time something goes wrong.

# Six standard panels for a Prometheus-backed RED dashboard.

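# 1. Request rate by status class
#    (assumes a status_class label such as "2xx" or "5xx" is exported;
#    if only the raw status label exists, group by status instead)
sum by (status_class) (rate(http_requests_total[1m]))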

# 2. Error rate
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))

# 3. Latency P50, P95, P99
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# 4. Rate by route (top 5)
topk(5, sum by (route) (rate(http_requests_total[1m])))

# 5. Errors by route (top 5)
topk(5, sum by (route) (rate(http_requests_total{status=~"5.."}[1m])))

# 6. Latency by route (P99 only, top 5)
topk(5, histogram_quantile(0.99,
       sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))))

One thing worth being deliberate about: the histogram bucket layout. Wide buckets (powers of two from 1 ms to 30 s) give cheap storage and acceptable percentile accuracy. Narrow buckets give better accuracy but higher cardinality. The Prometheus default — 0.005, 0.01, 0.025, ... 10 — is a reasonable starting point for human-facing APIs; service-to-service workloads with sub-millisecond latency need a custom set.

A few layout choices repay the effort of getting them right early. Put each route-broken-out panel next to its aggregate counterpart so the eye can drop from "the service is slow" to "this one route is slow" without scrolling. Use topk on the per-route panels so a service with two hundred endpoints does not draw two hundred lines; the five busiest, or the five worst, are what you want during an incident, and a long tail of flat lines just adds noise. Keep the rate() window on the rate panels short, since a one-minute window reacts quickly enough to show a traffic spike as it happens. The duration panels can use a longer five-minute window, because percentile estimates over a histogram are noisy and a longer window smooths them without hiding a real shift.

Resist the urge to add panels. The discipline of RED is that the dashboard stays small enough to read in one glance under stress. Every extra panel is one more thing the oncall has to scan past at the worst possible moment. If a service needs a deeper view — a cache hit ratio, a queue depth, a downstream call breakdown — put it on a second, linked dashboard, and keep the RED page as the front door that everyone opens first.

Interpreting a RED dashboard

Four patterns cover most degraded-service moments. Recognising them on sight is most of what makes oncall fast.

Pattern	What you see	Usual cause
Rate up, errors up, duration up	Traffic spike. Errors and tail latency follow.	Real traffic increase or retry storm. Check upstream first; back-pressure may be needed.
Rate flat, errors up, duration flat	Same volume of requests, suddenly more failing.	Downstream dependency degraded or returning bad data. Usually database or auth service.
Rate flat, errors flat, duration up	Same traffic, same success rate, just slower.	Resource saturation on the service host (CPU, GC, lock contention) or on a hidden dependency. Switch to USE for the host.
Rate down, errors down or up, duration flat	Less traffic reaching the service.	Something further upstream is dropping requests — load balancer, gateway, DNS. Check from the outside.

The 30-second triage. When the page fires, the first thing to look at is the service's RED dashboard. The pattern usually narrows the problem down to one of the four rows above within thirty seconds. Then you drill: USE on the box, profile on the process, or call the dependency team.
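The table above can be read as a lookup from the direction of each signal to a first place to look. A minimal sketch follows; the function name and the three-valued trend encoding are hypothetical, and deciding what counts as "up" or "flat" is left to whoever reads the panels.

```python
from typing import Literal

Trend = Literal["up", "flat", "down"]

def triage(rate: Trend, errors: Trend, duration: Trend) -> str:
    """Map the direction of the three RED signals to a first place to look."""
    if rate == "up" and errors == "up" and duration == "up":
        return "traffic spike or retry storm: check upstream, consider back-pressure"
    if rate == "flat" and errors == "up" and duration == "flat":
        return "degraded dependency: check the database or auth service"
    if rate == "flat" and errors == "flat" and duration == "up":
        return "resource saturation: switch to USE on the host"
    if rate == "down" and duration == "flat":
        return "upstream is dropping requests: check load balancer, gateway, DNS"
    return "no standard pattern"

print(triage("flat", "flat", "up"))
# resource saturation: switch to USE on the host
```

Combinations outside the four rows fall through to "no standard pattern" rather than being forced into the nearest match.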

RED in a service mesh

One of the practical reasons service meshes (Istio, Linkerd, Cilium) have gained traction is that they emit RED metrics automatically for every service, without the service team writing instrumentation. Every sidecar records rate, errors, and duration for inbound and outbound calls; the mesh control plane exposes the result as Prometheus metrics with consistent labels.

The trade-off is that the metrics come from the sidecar's perspective, not the application's. Latency measured at the sidecar covers the application's processing time plus the proxy hop and any queuing in front of the process, and it says nothing about how that time was spent inside the application; for most operational use that's fine, but when comparing to an SLO it's worth knowing whether you're measuring service-as-the-network-sees-it or service-as-the-process-sees-it.

RED for non-request workloads

RED is shaped around request/response services, but the same three columns work for other workload types with small adaptations:

Workload	Rate	Errors	Duration
HTTP/gRPC service	Requests/sec	Failed responses	Per-request latency
Queue consumer	Messages consumed/sec	Failed processing	Processing time + queue wait
Batch job	Items processed/sec	Items failed	End-to-end batch duration
Streaming pipeline	Events/sec	Drops + dead-letter sends	End-to-end event latency
Database query path	Queries/sec	Query failures (timeouts, syntax)	Query duration histogram

The pattern works because every system has work coming in, some of that work failing, and the rest taking some amount of time. Once you have those three views, most operational questions resolve to "which one moved, and when?"
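As a sketch of how the same three columns wrap around any unit of work, the recorder below counts attempts, counts failures, and buckets durations for a queue consumer. The class, its method names, and the message handler are hypothetical; this is an in-process illustration using only the standard library, not the API of a metrics client. The bucket bounds are the Prometheus defaults, in seconds.

```python
import time
from bisect import bisect_left
from contextlib import contextmanager

class RedRecorder:
    """Hypothetical recorder: a work counter, a failure counter, a duration histogram."""

    def __init__(self, bounds_s=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
                 clock=time.monotonic):
        self.bounds_s = list(bounds_s)
        self.buckets = [0] * (len(self.bounds_s) + 1)  # final slot is the +Inf bucket
        self.total = 0   # rate is the per-second increase of this counter
        self.failed = 0
        self._clock = clock

    @contextmanager
    def track(self):
        start = self._clock()
        try:
            yield
        except Exception:
            self.failed += 1
            raise
        finally:
            self.total += 1
            self.buckets[bisect_left(self.bounds_s, self._clock() - start)] += 1

def handle(message):
    if message == "poison":
        raise ValueError("cannot process message")

recorder = RedRecorder()
for message in ["a", "b", "poison", "c"]:
    try:
        with recorder.track():
            handle(message)
    except ValueError:
        pass  # a real consumer would dead-letter the message here

print(recorder.total, recorder.failed, sum(recorder.buckets))  # 4 1 4
```

Failed work is still timed, so the duration histogram covers every attempt and the failure counter can be divided by the total to get an error ratio.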

Where RED goes wrong

RED is simple, which makes it easy to implement badly. Most of the failures are not in the idea but in the instrumentation, and they share a theme: a metric that looks fine on the dashboard but answers the wrong question or cannot be aggregated. Three traps account for nearly all of them.

Averaging the duration. This is the one that bites hardest. A team exports a mean response time, or worse, exports per-instance percentiles and then averages those across instances. The mean hides the tail, as the histogram section above showed; averaged percentiles are mathematically meaningless because percentiles do not combine by averaging. The symptom is a dashboard that stays green while users complain, because the number on the panel describes a request that almost nobody actually experienced. The fix is structural, not cosmetic: collect duration as a histogram, aggregate the bucket counts, and compute the quantile once over the combined distribution.

No error taxonomy. Counting "errors" as a single number treats a client sending a malformed request the same as the database being down. Those are completely different incidents — one is the caller's fault and needs no action, the other is yours and needs a page — and a single error counter folds them together. A useful error signal is broken out by class: 4xx (the caller did something wrong) kept separate from 5xx (the service did), and ideally tagged with an application-level reason. The classic false alarm is a deploy that ships a stricter validation rule, 4xx rate jumps, the error panel goes red, and the oncall burns twenty minutes before realising nothing is actually broken. Without a taxonomy, every error spike looks like an outage.
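A minimal sketch of the false alarm described above, assuming a hypothetical batch of responses observed just after a deploy tightened request validation. The counts are invented for illustration.

```python
from collections import Counter

def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class label, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"

# Hypothetical responses: mostly fine, a burst of rejected requests, a few real failures.
responses = [200] * 900 + [400] * 95 + [503] * 5

by_class = Counter(status_class(code) for code in responses)
total = sum(by_class.values())

undifferentiated = (by_class["4xx"] + by_class["5xx"]) / total
server_side = by_class["5xx"] / total

print(f"all errors: {undifferentiated:.1%}, server errors only: {server_side:.1%}")
# all errors: 10.0%, server errors only: 0.5%
```

The single counter reports a 10% error rate; split by class, the part that belongs to the service is 0.5%.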

Cardinality. The labels that make RED metrics useful — route, status, method, caller — are also what can blow up your metrics store. Every distinct combination of label values is a separate time series, and a histogram multiplies that by its bucket count. Put a high-cardinality value in a label — a raw URL path with embedded IDs, a user identifier, a request ID — and you create a near-infinite number of series, each carrying a full set of histogram buckets. The store slows down, queries time out, and the bill climbs. The discipline is to label only with bounded, low-cardinality dimensions: a templated route (/users/:id, never /users/8412), a status class, a method. Anything unbounded belongs in a trace or a log, not a metric label. This is one of the clean dividing lines in the logs, metrics, and traces model: metrics are for bounded aggregate signals, traces and logs are for the high-cardinality detail you reach for once a metric has told you where to look.

The cardinality rule of thumb. Before adding a metric label, ask how many distinct values it can take. If the answer is "more than a few hundred" or "unbounded," it does not belong in a label. Route templates, status classes, and methods are safe; raw paths, IDs, and free-form strings are not.
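The arithmetic behind the rule of thumb can be sketched as follows. The label cardinalities are hypothetical, the bucket count assumes the eleven default Prometheus bounds plus +Inf, and the route templating rule (replace purely numeric path segments) is deliberately simplistic; a real router usually already knows the template.

```python
import re
from math import prod

def histogram_series(label_cardinalities, bucket_series=12):
    """Series for one histogram: per label combination, one series per bucket
    (12 here: 11 finite bounds plus +Inf) plus the _sum and _count series."""
    return prod(label_cardinalities.values()) * (bucket_series + 2)

def template_route(path):
    """Replace purely numeric path segments with a placeholder (simplistic rule)."""
    return re.sub(r"/\d+(?=/|$)", "/:id", path)

bounded = {"route": 20, "status_class": 5, "method": 4}
unbounded = {"path": 100_000, "status_class": 5, "method": 4}

print(histogram_series(bounded))    # 5600
print(histogram_series(unbounded))  # 28000000
print(template_route("/users/8412/orders/77"))  # /users/:id/orders/:id
```

Swapping the templated route for a raw path multiplies the series count by the number of distinct paths, here from 5,600 to 28 million for a single histogram.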

From RED to SLOs and budgets

RED is also the raw material for service-level objectives. An SLO is just a target drawn on top of the RED signals: "99.9% of requests succeed" is an error-rate target, and "95% of requests complete under 200 ms" is a duration target read straight off the histogram. Because the duration signal is already a distribution, you can express latency objectives the honest way — as a percentile under a threshold — instead of the misleading "average under X" that an average-based metric forces on you. The error budget that falls out of an SLO is then a direct function of the error and duration panels you are already watching, which is why a well-built RED dashboard is most of the work of running SLOs. The threshold-setting, the burn-rate alerting, and the way percentile budgets compose across a request path are covered on the latency budgets and percentiles page; RED is where the numbers those budgets are built on come from.
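A sketch of how the two example objectives read off the RED signals. The window totals are hypothetical, the latency buckets are shaped like cumulative _bucket counts keyed by their upper bound, and the threshold is assumed to coincide with a bucket bound so no interpolation is needed.

```python
from fractions import Fraction

def budget_consumed(total, failed, slo):
    """Fraction of the error budget used: observed failure ratio over allowed failure ratio."""
    allowed = 1 - Fraction(slo)
    return Fraction(failed, total) / allowed

def fraction_within(cumulative_buckets, threshold_s):
    """Share of requests at or under a threshold that is an exact bucket bound."""
    return Fraction(cumulative_buckets[threshold_s], cumulative_buckets[float("inf")])

# Hypothetical window totals, shaped like cumulative _bucket counts keyed by `le`.
latency_buckets = {0.1: 900_000, 0.2: 960_000, 0.5: 990_000, float("inf"): 1_000_000}

print(float(budget_consumed(total=1_000_000, failed=400, slo="0.999")))  # 0.4
print(fraction_within(latency_buckets, 0.2) >= Fraction("0.95"))         # True
```

With a 99.9% success target, 400 failures in a million requests use 40% of the budget, and 96% of requests under 200 ms meets the "95% under 200 ms" objective. Placing a bucket bound exactly on the SLO threshold is what makes the second calculation exact.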

Production checklist

Every service exposes rate, errors, duration. Same metric names across the fleet (http_requests_total, http_request_duration_seconds_bucket). Consistency is what makes the fleet legible.
Duration is always a histogram. Never a mean. P50, P95, P99 at minimum.
Six-panel RED dashboard for each service. Three aggregate (rate, errors, P50/P95/P99), three broken out by route (rate, errors, P99).
Alert on errors and duration, not rate. Rate spikes are normal; sustained error-rate or latency breaches are not.
Pair with USE when you suspect the host. RED diagnoses the service; USE diagnoses the box underneath.
Add saturation when there's an obvious queue. Thread-pool depth, connection-pool wait, message-queue backlog. Worth tracking when one exists.
Histogram bucket layout fits the workload. Default Prometheus buckets work for human-facing APIs; service-to-service or sub-millisecond workloads need custom buckets.
Mesh-emitted metrics are convenient but not authoritative. They measure at the sidecar; for SLO-grade tracking, instrument inside the application too.
