Circuit Breaker Simulator: when the downstream is sick, stop calling.
A circuit breaker stops calls to a failing dependency so one slow service can't cascade into an outage. It has three states (closed, open, half-open): trip on N failures, sleep through a cooldown, probe with a few trial requests, then recover.
The three boxes are the breaker's states — CLOSED, OPEN, HALF_OPEN — and the lit one is where the breaker is right now. CLOSED counts consecutive failures toward the threshold; OPEN shows the seconds left on its cooldown; HALF_OPEN tallies trial probes. Set the downstream health (how often a call succeeds) and the failure threshold, then drive traffic with Call or let Auto-load run a steady stream. The state readout, pass rate, and the recent OK/FAIL/REJECT log all update per call.
Set health to Sick 50% with the threshold at 5 and start calling. Watch the failure counter climb and, on the fifth consecutive failure, the breaker snap to OPEN. From that moment calls come back REJECT in microseconds instead of attempting the sick downstream. After the cooldown it slips into HALF_OPEN and sends a few probes: three clean successes close it, one failure throws it straight back to OPEN. What should surprise you is how a low threshold trips on a brief unlucky streak even while the downstream is mostly fine.
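To put a number on that surprise, here is a minimal sketch (not the simulator's own code) that computes the chance of a consecutive-failure breaker tripping within a given number of calls. It assumes calls are independent with a fixed success rate, which real traffic rarely is; the function name `streak_trip_probability` is hypothetical.

```python
def streak_trip_probability(success_rate: float, threshold: int, calls: int) -> float:
    """Probability that `calls` independent calls contain a run of
    `threshold` consecutive failures, i.e. that the breaker trips."""
    fail = 1.0 - success_rate
    # state[k] = probability of sitting on a run of k failures without having tripped
    state = [1.0] + [0.0] * (threshold - 1)
    tripped = 0.0
    for _ in range(calls):
        nxt = [0.0] * threshold
        for k, p in enumerate(state):
            nxt[0] += p * success_rate      # a success resets the run
            if k + 1 == threshold:
                tripped += p * fail         # the run reaches the threshold: trip
            else:
                nxt[k + 1] += p * fail      # the run grows by one
        state = nxt
    return tripped


if __name__ == "__main__":
    # Compare a low and a higher threshold against a mostly healthy downstream.
    for threshold in (2, 5):
        print(threshold, streak_trip_probability(0.9, threshold, calls=1000))
```

Running it for a few thresholds shows how quickly the false-trip probability falls as the threshold rises, which is the trade-off the simulator lets you feel by hand.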
What is the circuit breaker pattern?
Why one sick service takes down a healthy one.
The circuit breaker pattern is a resilience pattern that wraps a remote call so it fails fast — and stops failing fast once the downstream recovers. Michael Nygard popularised it in Release It! (2007). A breaker has three states: closed (calls flow normally), open (calls fail immediately, no remote call attempted), and half-open (a small probe tests recovery). It is the single most useful pattern for stopping cascading failure in microservices.
Picture a checkout flow. The user clicks "Pay"; your service receives the request; you call the payments service; the payments service returns success; you write the order to the database; you return 200 to the user. End-to-end the call takes 80 milliseconds. Three hundred requests per second flow through cleanly.
Now the payments service has an outage. Its responses are timing out at 30 seconds instead of returning in 50 milliseconds. Your code, which trusted the payments service to be fast, was written without urgency around timeout handling — the default HTTP client waits the full 30 seconds before giving up. So every checkout request now takes 30 seconds. Your service still has the same 100 worker threads it always had. At 300 requests per second, those workers fill up in a third of a second; after that every new request waits in a queue, which itself fills up; after another second your load balancer marks your service as unhealthy and starts routing traffic away. From the user's perspective the entire site is down, even though the only thing that has actually broken is the payments service. The healthy parts of your stack have been dragged down with the sick part because they kept calling it.
If your code retries failed calls — the well-meaning instinct of every junior engineer — it gets worse. Each user-facing request now generates three or four payment-service calls, multiplying the load on a downstream that was already buckling. The downstream falls further behind. The cascade accelerates. This is the retry storm. It is how a small, contained failure in one service becomes a customer-visible outage across many.
The circuit breaker is the standard fix. It is a small wrapper that sits between your service and the one you depend on, and it watches the recent error rate. As long as calls are succeeding, the breaker is "closed" — calls flow through normally. If too many calls fail in too short a time, the breaker "trips" to the "open" state and starts rejecting new calls instantly without ever contacting the downstream. The rejection takes microseconds instead of waiting out the timeout. After a cooldown the breaker probes the downstream with a few trial calls; if those succeed, it closes again. If they fail, it goes back to open and waits longer.
The whole machine is three states and four transitions (closed to open, open to half-open, and half-open back to either closed or open), and it is the closest thing to a universally agreed-upon resilience pattern in distributed systems. The simulator above shows the state machine in action; toggle the downstream's health and watch the breaker fire and recover.
The three states — closed, open, half-open
Closed, open, half-open.
A circuit breaker is a small state machine with three states. In the closed state the breaker passes calls through to the downstream and counts failures. If failures cross a threshold within a measurement window the breaker transitions to the open state. In open the breaker rejects calls immediately without contacting the downstream; the rejection is fast and cheap, and it gives the downstream room to recover. After a configurable cooldown the breaker moves to half-open, where it admits a small number of probe calls. If the probes succeed the breaker closes; if any probe fails, the breaker re-opens and the cooldown timer restarts.
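The state machine described above can be sketched in a few dozen lines. This is an illustrative sketch, not any library's implementation: it trips on consecutive failures (as the simulator does) rather than on a windowed ratio, it is not thread-safe, and it does not cap concurrent probes. The names `CircuitBreaker`, `BreakerOpenError`, and the constructor parameters are all hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0,
                 probes_to_close=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.probes_to_close = probes_to_close
        self.clock = clock              # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0               # consecutive failures while CLOSED
        self.probe_successes = 0        # successful probes while HALF_OPEN
        self.opened_at = 0.0

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise BreakerOpenError("open: rejecting without calling downstream")
            self.state = State.HALF_OPEN    # cooldown elapsed: start probing
            self.probe_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.probes_to_close:
                self.state = State.CLOSED   # enough clean probes: close
                self.failures = 0
        else:
            self.failures = 0               # a success resets the consecutive count

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()                    # any failed probe re-opens
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()       # restarts the cooldown timer
```

Wrapping a call looks like `breaker.call(lambda: charge_card(order))`, where `charge_card` stands in for whatever remote call you are protecting. Rejections raise `BreakerOpenError` before the wrapped function runs, so they never feed the failure count.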
The half-open state is the most subtle of the three. Without it, the breaker has no principled way to know when to re-engage; with it, the breaker can validate downstream health using a small, controlled fraction of traffic before fully restoring service. The number of allowed probes is a knob: too few and a single random failure flips the breaker back open; too many and the half-open state itself becomes a load source on a still-recovering downstream. Production implementations typically allow between three and ten probes, often with a maximum-concurrency cap so that the probes do not all fire simultaneously.
The trip condition needs care. The simplest scheme counts consecutive failures and trips at a fixed count — three or five is common — but that misclassifies a healthy service that produces an occasional error. A better scheme uses a sliding window: count failures and successes over the last fixed time interval (typically ten to sixty seconds) or over the last fixed number of calls (typically a hundred to a thousand), and trip when the failure ratio crosses a threshold. The window must be large enough to be statistically meaningful and small enough that the breaker reacts before the cascade is irreversible.
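A count-based sliding window is short to sketch. The version below keeps the last `size` outcomes and trips on the failure ratio; the class name `CountWindow` and its defaults are hypothetical, and the `min_calls` guard is an assumption added so that a handful of early calls cannot trip the breaker on their own.

```python
from collections import deque


class CountWindow:
    """Failure ratio over the last `size` recorded calls."""

    def __init__(self, size=100, min_calls=20, failure_ratio=0.5):
        self.outcomes = deque(maxlen=size)  # True = failure; oldest entries fall off
        self.min_calls = min_calls
        self.failure_ratio = failure_ratio

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def should_trip(self) -> bool:
        if len(self.outcomes) < self.min_calls:
            return False                    # too few samples to judge
        return sum(self.outcomes) / len(self.outcomes) >= self.failure_ratio
```

A time-based window follows the same shape but stores a timestamp with each outcome and evicts entries older than the interval before computing the ratio.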
What counts as a "failure" is an underrated design decision. Network errors and timeouts almost always count. HTTP 5xx responses usually count. HTTP 4xx responses usually do not, because a 404 from a downstream is a perfectly correct response to a request for something that does not exist, and a 400 is a correct response to a malformed query; neither should trip the breaker. Timeouts, however, are often misconfigured: a per-request timeout longer than the breaker's window means the breaker can trip on the caller side before the downstream's response arrives, leaving the breaker convinced of failures that the downstream did not actually produce. Aligning the timeout, the window, and the threshold so they make sense together is the work of tuning a breaker for a real service.
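The classification rule above fits in one small function. This is a sketch of the default rule only (network errors, timeouts, and 5xx count; 4xx and other exceptions do not); the function name `counts_as_failure` is hypothetical, and a real service will likely need per-route exceptions to it.

```python
from typing import Optional


def counts_as_failure(status: Optional[int] = None,
                      error: Optional[BaseException] = None) -> bool:
    """Decide whether one call outcome should feed the breaker's failure count."""
    if error is not None:
        # Network errors and timeouts: the downstream could not answer at all.
        # Any other exception type is treated as the caller's own problem.
        return isinstance(error, (ConnectionError, TimeoutError))
    if status is None:
        return True                 # no response and no exception: assume failure
    return 500 <= status <= 599     # 5xx counts; 2xx, 3xx and 4xx do not
```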
Origins — Nygard, Release It!, and the failure cascade
Nygard, Release It!, and the failure cascade.
The circuit-breaker pattern entered software engineering through Michael Nygard's Release It! Design and Deploy Production-Ready Software, first published in 2007 by The Pragmatic Bookshelf and substantially revised in a second edition in 2018. Nygard's thesis was that production failures rarely come from the kinds of bugs caught by unit tests; they come from emergent interactions between components under load. The book catalogued anti-patterns — integration points without timeouts, unbounded result sets, blocked threads, attacks of self-denial — and named the corresponding stability patterns. The circuit breaker was the pattern that gave the book its lasting influence.
The metaphor is borrowed verbatim from residential electrical wiring. A breaker is a switch that trips automatically when current exceeds a threshold; it stays open until a person resets it; the protected circuit suffers no damage during the fault. Translated to a service call, the same shape: a wrapper that detects when downstream calls are failing and stops issuing them, holding the line open until conditions improve. The word "breaker" rather than "fuse" matters — a fuse is single-use, a breaker is resettable, and the resilience pattern wants the latter behaviour.
The pattern took hold because it solved a specific failure mode that distributed systems kept producing. Without a breaker, the standard reaction to a failing downstream is for callers to retry. Retries multiply the load on the already-failing downstream. The downstream, now under more load than it had when it started failing, falls further behind. Latency rises. Caller threads pile up waiting on responses. Caller thread pools saturate. The caller now fails to its callers, who retry, and so on up the dependency graph. This is the retry storm, and it converts a partial failure into a system-wide outage in minutes.
Nygard's framing supplied the vocabulary that engineering teams now use to discuss the problem. Bulkheading isolates failures inside one connection pool from another. Steady state means leaving systems in a configuration that does not require periodic restart. Fail fast means rejecting a request immediately when the downstream is known to be sick, rather than holding the request until the timeout. The breaker is the operationalisation of fail-fast: it knows the downstream is sick because it has been counting recent failures, and it short-circuits new calls until a probe shows the downstream is healthy again.
Tuning circuit breaker thresholds — window, cooldown, probes
Threshold, window, cooldown, probes.
Every circuit-breaker implementation exposes a similar handful of knobs, and the choice of values is what separates a useful breaker from a misconfigured one. The failure threshold is the count or ratio that trips the breaker. Too low and ordinary noise — the occasional slow response from a perfectly healthy downstream — flips the breaker constantly, which produces false rejections and confuses operators. Too high and the breaker takes too long to react when a real outage starts, defeating the purpose of fail-fast. A reasonable starting point on a service that handles a few hundred requests per second is a fifty-percent failure ratio over the last ten seconds, with a minimum of twenty calls before the ratio is even computed.
The cooldown or reset timeout determines how long the breaker stays open before probing. Too short and the probes hit a downstream that has not actually recovered, restarting the cycle. Too long and the breaker keeps rejecting valid traffic after the downstream has stabilised. Common values run from a few seconds for fast-recovering downstreams to a minute or two for downstreams that need to drain queues or rebuild caches. Adaptive variants double the cooldown after each failed probe, capped at a ceiling — exponential back-off applied at the breaker layer rather than the request layer.
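The adaptive variant is a one-line formula. The helper below is a sketch with a hypothetical name (`next_cooldown`); it assumes the breaker tracks how many probe rounds have failed in a row and resets that count to zero once the breaker closes.

```python
def next_cooldown(base_s: float, failed_probes: int, ceiling_s: float) -> float:
    """Cooldown after `failed_probes` consecutive failed half-open attempts:
    the base doubles each time and is capped at the ceiling."""
    return min(ceiling_s, base_s * (2 ** failed_probes))


if __name__ == "__main__":
    print([next_cooldown(5.0, n, 120.0) for n in range(6)])
    # prints: [5.0, 10.0, 20.0, 40.0, 80.0, 120.0]
```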
The probe count in half-open governs how confident the breaker has to be before it fully re-engages. A single successful probe is rarely enough; three to ten with a small concurrency cap is typical. Some implementations allow only one probe at a time and only count a probe as successful if it returned within a stricter latency budget than usual. The latency budget is its own knob — a downstream that is responding but very slowly may still be unable to handle full traffic, and a strict probe latency catches that case.
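A probe with its own latency budget might look like the sketch below. The function name `probe_succeeded` and its parameters are hypothetical; the point is only that a probe which returns without error but too slowly is still counted as a failed probe.

```python
import time


def probe_succeeded(fn, latency_budget_s: float, clock=time.monotonic) -> bool:
    """Run one probe; it passes only if it returns without raising
    and finishes within the latency budget."""
    start = clock()
    try:
        fn()
    except Exception:
        return False                            # an error is a failed probe
    return clock() - start <= latency_budget_s  # too slow is also a failed probe
```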
Two further knobs matter at scale. The first is whether the breaker is per-instance or shared across the caller fleet. Per-instance breakers fail independently and can produce a partial outage where some callers reject and others succeed; shared breakers, backed by a coordinator like Redis or ZooKeeper, give uniform behaviour at the cost of an extra dependency. The second is whether the breaker is applied per-route or globally. A breaker that trips on the entire downstream service when only one endpoint is failing rejects too much traffic; per-route breakers keep the failure scope tight but multiply the configuration burden.
| State | Calls go through? | Failure handling | Exits to |
|---|---|---|---|
| CLOSED | all | count toward window | OPEN if ratio > threshold |
| OPEN | none — reject fast | no calls issued | HALF-OPEN after cooldown |
| HALF-OPEN | N probes only | single fail re-opens | CLOSED on N successes; OPEN on any fail |
Circuit breakers in production — from Hystrix to Envoy and the data plane
From a Netflix library to the data plane.
The first widely deployed circuit-breaker library was Hystrix, open-sourced by Netflix in 2012. Hystrix bundled a circuit breaker with thread-pool bulkheading, request collapsing, and a metrics stream consumable by a real-time dashboard called Hystrix Dashboard. The combination was novel for its time; many Java services adopted Hystrix as the default protection wrapper around all outbound calls. Hystrix's design notes — particularly the section explaining why thread-pool isolation matters — became required reading for engineers thinking about resilience patterns.
Hystrix entered maintenance mode in 2018. The Netflix team replaced it internally with a lighter library called Concurrency Limits, which uses adaptive concurrency control rather than fixed-threshold breaking. The community successor for the broader Java ecosystem is Resilience4j, a functional library that supplies circuit breaker, rate limiter, retry, bulkhead, and timeout primitives as composable decorators. Resilience4j's circuit breaker uses a sliding window and configurable failure-classification predicates, and it integrates with Micrometer for metrics export.
The .NET ecosystem's equivalent is Polly, which provides similar primitives plus a nicely composable policy-builder API. Sentinel, open-sourced by Alibaba, is the dominant choice in the Chinese cloud-native community and ships with a centralised dashboard for managing breaker rules across a microservice fleet. Each of these libraries differs in which knobs are exposed and how they default, but the underlying state machine — closed, open, half-open — is the same in all of them.
The contemporary trend is to push circuit breaking out of application code and into the service-mesh data plane. Envoy's outlier detection ejects backend instances from the load-balancing pool when their consecutive 5xx count exceeds a threshold or their success rate falls below one; the ejection is per-instance rather than per-call, and the ejected instance is periodically re-introduced for a probe. Istio exposes this through DestinationRule custom resources; Linkerd implements similar logic inside its proxy. The benefit is that circuit-breaking policy becomes uniform across services regardless of the language they are written in; the cost is that the breaker no longer has access to application-level signals like business-error response codes.
```yaml
# resilience4j: circuit breaker config (application.yml)
resilience4j.circuitbreaker:
  instances:
    paymentService:
      slidingWindowType: COUNT_BASED
      slidingWindowSize: 100
      minimumNumberOfCalls: 20
      failureRateThreshold: 50  # %
      waitDurationInOpenState: 30s
      permittedNumberOfCallsInHalfOpenState: 5
      automaticTransitionFromOpenToHalfOpenEnabled: true
      recordExceptions:
        - java.io.IOException
        - java.util.concurrent.TimeoutException
      ignoreExceptions:
        - com.example.BusinessException
```
Half-open thundering herds and other circuit-breaker surprises
Half-open thundering herds, and other surprises.
The classic failure mode of a circuit breaker is the half-open thundering herd. The breaker has been open for a minute; the cooldown elapses; the breaker transitions to half-open at the same instant on every caller; every caller probes simultaneously; the still-fragile downstream receives a thundering herd of probes and falls over again. Mitigations include limiting half-open concurrency to a small fixed value, jittering the cooldown across callers so they do not transition in lockstep, and using a token-based admission scheme where probe tokens are refilled at a controlled rate.
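Of the mitigations listed, jittering the cooldown is the cheapest to sketch. The helper below is an illustration with a hypothetical name and a hypothetical default jitter of 20%; each caller draws its own cooldown, so a fleet of breakers that opened together does not move to half-open in the same instant.

```python
import random


def jittered_cooldown(base_s: float, jitter_fraction: float = 0.2, rng=random) -> float:
    """Pick this caller's cooldown uniformly from base_s * (1 +/- jitter_fraction)."""
    low = base_s * (1.0 - jitter_fraction)
    high = base_s * (1.0 + jitter_fraction)
    return rng.uniform(low, high)


if __name__ == "__main__":
    # Five callers with a nominal 60-second cooldown each wait a different time.
    print([round(jittered_cooldown(60.0), 1) for _ in range(5)])
```

Jitter spreads the probes out in time but does not bound them, so it is usually combined with a cap on half-open concurrency.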
A second pitfall is the misclassified failure. The downstream returns a 401 because the caller's auth token has expired; the breaker counts the 401 as a failure and trips; legitimate traffic is now blocked because of an authentication issue, not a downstream-health issue. The fix is to decide carefully what counts as a failure. The default rule "5xx counts, 4xx does not" is a good starting point, but a per-route override may be needed when the downstream uses 4xx semantics in ways that overlap with health.
A third is breaker oscillation. The downstream is at the edge of its capacity. The breaker trips, traffic drops, the downstream recovers, the breaker closes, traffic returns, the downstream overloads, the breaker trips again. This pattern manifests as a saw-tooth latency curve and is a sign that the downstream needs to scale rather than that the breaker needs more aggressive tuning. Adaptive concurrency control (the approach Netflix's Concurrency Limits takes) replaces the binary trip with a continuous limit on in-flight requests, smoothing the saw-tooth into a steady rate match.
A fourth is the interaction with retries. A retry policy that retries on every failure feeds the breaker's failure counter once per attempt, so a single original error is counted several times. The fix is to wrap the retry inside the breaker rather than the breaker inside the retry, so the breaker sees one outcome per logical call and the retry never fires against an open breaker. Closely related is the choice of timeout: the per-request timeout must be short enough that the call returns before the breaker's window closes; otherwise the breaker classifies in-flight calls as failures even though they may eventually succeed. Connection pool saturation deserves the same attention: a breaker that opens because the connection pool is exhausted is reporting load, not downstream sickness, and the right response may be back-pressure rather than rejection.
If the retry sits outside the breaker, every retry is a fresh call as far as the breaker is concerned, and a single bad response feeds the failure counter several times. Put the breaker on the outside: the breaker rejects fast, the retry never fires when the breaker is open, and the failure counter sees one event per logical call.
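The difference in ordering is easy to see with a stand-in breaker that does nothing except count the failures it observes. Everything here is hypothetical scaffolding (`CountingBreaker`, `with_retry`, `always_fails`), and the retry helper deliberately omits the backoff and jitter a real one needs.

```python
class CountingBreaker:
    """Stand-in for a real breaker: it only counts the failures it observes."""

    def __init__(self):
        self.failures_seen = 0

    def call(self, fn):
        try:
            return fn()
        except Exception:
            self.failures_seen += 1
            raise


def with_retry(fn, attempts=3):
    """Call fn up to `attempts` times (attempts >= 1), re-raising the last error."""
    last = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last = exc
    raise last


def always_fails():
    raise ConnectionError("downstream is sick")


# Retry OUTSIDE the breaker: the breaker sees every attempt.
outside = CountingBreaker()
try:
    with_retry(lambda: outside.call(always_fails))
except ConnectionError:
    pass

# Retry INSIDE the breaker: the breaker sees one outcome per logical call.
inside = CountingBreaker()
try:
    inside.call(lambda: with_retry(always_fails))
except ConnectionError:
    pass

print(outside.failures_seen, inside.failures_seen)  # prints: 3 1
```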
Why fail-fast is a queueing argument — Little's law, head-of-line
Why fail-fast is also a queueing argument.
The deeper reason a circuit breaker helps is queueing. Little's Law says that for any stable queueing system, the number of items in flight equals the arrival rate times the average time in the system. When a downstream is sick its average time in the system grows; if the arrival rate stays the same, the in-flight count grows in proportion. The in-flight count is bounded by the size of the connection pool or the thread pool; once that bound is hit, callers block, and the queueing system is no longer stable. Fail-fast keeps the in-flight count under the bound by reducing the effective arrival rate when latency starts to climb.
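Plugging the checkout example's numbers into Little's Law makes the argument concrete. The figures (300 requests per second, 80 ms healthy latency, a 30-second timeout, 100 worker threads) come from the scenario earlier in this article; the function name `in_flight` is hypothetical.

```python
def in_flight(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law: items in flight = arrival rate * average time in system."""
    return arrival_rate_per_s * avg_time_in_system_s


WORKERS = 100

healthy = in_flight(300, 0.080)   # about 24 requests in flight: well under the bound
sick = in_flight(300, 30.0)       # 9000 would be needed: 90 times the worker pool

if __name__ == "__main__":
    print(f"healthy: {healthy:.0f} in flight, bound is {WORKERS}")
    print(f"sick: {sick:.0f} in flight needed, bound is {WORKERS}")
    print(f"seconds until every worker is busy: {WORKERS / 300:.2f}")
```

The healthy system uses roughly a quarter of its workers; the sick one would need ninety times as many as it has, which is why the pool saturates in about a third of a second.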
This perspective gives the bulkhead pattern its name. A ship's bulkheads partition the hull into sealed compartments so that a hole in one section does not flood the whole vessel. In a service, a bulkhead partitions the resource pool — connection pool, thread pool, semaphore count — so that traffic to one downstream cannot exhaust the resources needed for another. Bulkheading and circuit breaking are complementary: bulkheading limits how much one downstream's failure costs, while circuit breaking detects the failure and stops feeding it.
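A semaphore-based bulkhead is one common way to partition resources. The sketch below is an assumption about how you might structure it, not a specific library's API: each downstream gets its own `Bulkhead` with a fixed number of slots, and a call that finds no free slot is rejected immediately instead of queueing.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a downstream's compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Non-blocking acquire: reject at once rather than wait for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this downstream")
        try:
            return fn()
        finally:
            self._slots.release()   # always return the slot, even on error
```

With one bulkhead per downstream, for example `payments = Bulkhead(20)` and `search = Bulkhead(20)`, a stalled payments service can occupy at most its own twenty slots and leaves the search slots untouched.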
AWS's adaptive throttling, deployed in DynamoDB's request router, is an extension of the same idea. Rather than a binary breaker, the router maintains a per-partition admission rate that adapts to observed downstream throughput; when a partition is hot, the rate decreases; when it is cold, the rate climbs. The result is a smoother control surface than a tripping breaker, and it interacts more gracefully with autoscaling. Google's SRE Book chapter on overload describes a similar approach used in Google's load balancers, where each backend reports its current capacity to the balancer and the balancer admits requests in proportion to capacity.
The lesson generalises. A circuit breaker is the simplest control loop that keeps a caller from helping its downstream into a worse state. More sophisticated control loops — adaptive concurrency, token-bucket admission control, response-time-based throttling — are refinements of the same idea. The choice between them is largely a question of how much tuning effort and operational complexity the operator is willing to take on. A well-tuned binary breaker is far better than a sophisticated adaptive scheme that nobody understands; a sophisticated adaptive scheme operated by a team that does understand it is far better still.
When NOT to use a circuit breaker
When the pattern is the wrong tool.
A circuit breaker is a fail-fast mechanism. It works when the right behaviour during a downstream outage is to reject quickly and let the downstream recover. There are workloads where that response is wrong. A queue-backed asynchronous job pipeline does not benefit from rejecting work; the work just sits in the queue and gets retried later. An idempotent batch process that needs to complete eventually is better served by an exponential back-off retry than by a breaker that rejects the work permanently. A user-facing API that has a meaningful fallback — return cached data, return a degraded response, redirect to a static page — benefits from a breaker plus the fallback, but a user-facing API with no fallback may be better served by surfacing the original error.
Breakers are also the wrong tool when the downstream's sickness is best diagnosed with a different signal. Health checks, deep liveness probes, and operator-driven traffic shifts can all do work that a breaker is poorly suited for. A breaker reacts to symptoms; a health check inspects state directly. The two are complementary, not redundant: a service with a working health-check endpoint and a load balancer that respects it will see fewer breaker trips because failed instances are removed from the rotation before their requests start failing.
The pattern competes with several adjacent ones. Rate limiting caps incoming load to protect a service from upstream traffic spikes; the difference is direction — a breaker protects the caller from a sick callee, a rate limiter protects the callee from an aggressive caller. Bulkheading partitions resources so failures stay scoped. Hedged requests issue duplicate calls to multiple replicas and accept the first response; this can mask tail latency without ever needing a breaker. Adaptive concurrency replaces the binary trip with a continuous in-flight limit. The right combination depends on the workload, and most production systems use several at once.
It helps to be precise about the four basic primitives, because they compose rather than substitute. A timeout bounds how long a single call may hold a thread. A retry absorbs transient blips. A circuit breaker stops calling a downstream that retries cannot save. A bulkhead caps what one downstream's failure can cost the caller. They nest in that order: timeout inside retry, retry inside breaker, the whole stack inside a bulkhead. None of them removes the need for the others, and the retry layer carries its own discipline — bounded attempts, backoff, jitter — covered in back-pressure and retries.
| Primitive | What it bounds | Mis-used, it | Layer |
|---|---|---|---|
| Timeout | time one call holds a thread | too long: threads pile up; too short: healthy calls counted as failures | innermost, on every remote call |
| Retry | transient blips | unbounded retries amplify an outage into a retry storm | wraps the timeout |
| Circuit breaker | calls to a downstream retries cannot save | trips on noise, or reacts after the cascade has started | wraps the retry |
| Bulkhead | what one downstream's failure can cost | pool sized too small starves healthy traffic | outermost, partitions the resources |
A final note. The circuit-breaker state needs to be observable. A breaker that has tripped and stayed open without any signal to operators is a silent outage; teams discover it only when users complain. Useful instrumentation includes the current state of every breaker, the trip count over time, the rejection count, and the failure ratio inside the closed-state window. Most production breaker libraries export these as Prometheus counters or to a metrics dashboard; keep that data in front of operators, and the pattern earns its complexity. Skip the instrumentation, and you have added a hidden failure mode rather than removing one.
It is also worth saying out loud that breakers can be tested. Game-day exercises in which the team intentionally degrades a downstream and watches whether the breaker trips on the right signal, rejects the expected fraction of traffic, and re-engages cleanly are the only way to know that the configuration values you chose at design time still match the system you have today. The principle is the same as fire drills: the time to discover that the breaker's threshold was set to a value that no longer matches reality is during a planned exercise, not during an incident.
Further reading on the circuit breaker pattern
Primary sources, in order.
- Nygard · 2018. Release It! Design and Deploy Production-Ready Software (2e). The book that named the pattern. Stability and capacity anti-patterns, plus the corresponding stability patterns.
- Martin Fowler. CircuitBreaker. The canonical short writeup. A useful reference when explaining the pattern to a colleague.
- Netflix · 2012. Hystrix: how it works. Now in maintenance mode, but the design doc remains the most detailed treatment of bulkheading-plus-breaker for Java services.
- Resilience4j docs. CircuitBreaker reference. The functional successor to Hystrix in the JVM ecosystem; sliding-window failure detection with configurable predicates.
- Envoy docs. Outlier detection. Service-mesh-native breakers in the data plane. Per-instance ejection rather than per-call rejection.
- Amazon Builders' Library. Avoiding fallback in distributed systems. Why static stability and shedding load can outperform clever fallback logic.
- Google SRE Book. Handling overload. Adaptive throttling and the relationship between client- and server-side load shedding.
- SREcon · 2018. Stop Rate Limiting! Capacity Management Done Right. A pragmatic talk on the spectrum between binary breakers, rate limits, and adaptive concurrency control.
- Aphyr / Jepsen. The Network Is Reliable. A field guide to partial failures, the kind of failure modes the breaker exists to mitigate. The reading that motivates the pattern.
- Microsoft Azure docs. Circuit Breaker pattern. Vendor-neutral overview with diagrams; useful as a student-friendly entry point and to see how cloud platforms expose the primitive.
- InfoQ · 2017. Adrian Cockcroft: Failure Modes and Continuous Resilience. Forty-five minutes from one of the architects who deployed Hystrix-style breakers across Netflix's microservices fleet.
- Semicolony simulator. Rate limiter. The complementary primitive that protects callees from aggressive callers.
- Semicolony guide. Load balancing. Where the breaker fits next to outlier detection, health checks, and traffic shifting.