How do I start with chaos engineering safely?

Start in staging, not production. Pick one failure mode (kill one pod, or add 100ms latency to one dependency). Define the blast radius explicitly — how many users can be affected, for how long, with what circuit breaker. Run during business hours when oncall is awake. Document the hypothesis and outcome. Only graduate to production after you have a runbook for what to do when something breaks.

Chaos Playground Simulator: pull the levers, break the system.

Q: What is chaos engineering?

Chaos engineering is the discipline of experimenting on a system in order to build confidence in its capability to withstand turbulent conditions in production. The four-step loop: form a hypothesis about steady state, vary real-world events (terminations, latency, partitions), run the experiment in production with controlled blast radius, automate the experiment to run continuously.

Chaos engineering injects controlled faults to find where a system breaks before a production incident finds it for you. Add latency, packet loss, and node failure independently and watch the pass rate drop. Then see whether your retries and timeouts actually hold.

What you're looking at

Three sliders are your fault injectors: added latency in milliseconds, packet-loss percentage, and node-failure percentage. Send fires one request through them; burst 10 fires a batch. Each request either makes it (logged OK with its round-trip time), gets dropped on the wire, or hits a dead node — the log names which. The pass rate up top is the running fraction of requests that survived, colour-shifting from green toward plum as more of them die.

Leave every slider at zero and send a burst: 100% pass, every request OK. Now drag packet loss to 30% and burst again. Roughly a third vanish before they arrive, and the pass rate hovers around 70% rather than landing exactly on it, because loss is rolled per request. Push both loss and node failure up together and watch them compound: two independent 30% failure modes leave you at about 49% success (0.7 × 0.7), not 70%. That compounding is the lesson: each dependency you add multiplies its own survival odds into the whole, so overall success can only fall.
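The compounding can be checked with a few lines of arithmetic. The sketch below assumes the two faults are independent and that a request survives only if it passes both; the function names are illustrative and are not the simulator's actual code.

```python
import random

def expected_pass_rate(*failure_probs: float) -> float:
    """Probability that a request survives every independent failure mode."""
    rate = 1.0
    for p in failure_probs:
        rate *= 1.0 - p
    return rate

def simulate(n: int, loss: float, node_failure: float, seed: int = 0) -> float:
    """Roll each fault per request and return the observed pass rate."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n):
        if rng.random() < loss:          # dropped on the wire
            continue
        if rng.random() < node_failure:  # hit a dead node
            continue
        passed += 1
    return passed / n

print(round(expected_pass_rate(0.30), 2))        # 0.7
print(round(expected_pass_rate(0.30, 0.30), 2))  # 0.49
print(simulate(10, 0.30, 0.30))       # a single burst of 10 is noisy
print(simulate(100_000, 0.30, 0.30))  # a long run converges toward 0.49
```

A burst of ten requests can land well away from the expected value, which is why the pass rate wanders before it settles.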

What is chaos engineering?

Find the weakness on purpose.

Chaos engineering is the practice of deliberately injecting failure into production systems to find weaknesses before they find you. Netflix popularised it with Chaos Monkey (built in 2010, described publicly in 2011) and the Principles of Chaos Engineering manifesto (2014). Modern teams run controlled-blast-radius experiments using tools like Gremlin, Chaos Mesh, and Litmus.

Imagine a payments service that has run for two years without a serious outage. The team trusts their dashboards, their retries, their replicated database. Then one Tuesday a single networking switch in one availability zone develops a subtle bug — packets between two specific subnets are corrupted at random. Every individual component reports healthy; the load balancer keeps sending traffic to the bad zone; the queues build up; somewhere downstream a circuit breaker that everyone forgot existed trips, and the whole product stops accepting payments for forty-three minutes. The postmortem is twenty pages long. None of the dashboards showed anything until the very end.

The problem chaos engineering exists to address is precisely this: production failures hide in the gaps between components that nobody routinely exercises together. A unit test catches a bug in one function. An integration test catches a misuse of an API. A load test catches a throughput limit. None of those find what happens when DNS gets slow, when a dependency returns 500s for thirty seconds, when an EC2 host disappears at 3 AM, when a clock jumps two seconds backwards because of an NTP correction. The only way to know your system survives those is to make those things happen, deliberately, while you're watching.

The technique is straightforward. Form a hypothesis about how your system should behave under a specific failure (“if a database replica becomes unreachable, the application should fail over to the secondary within five seconds with no observable user impact”). Pick a small blast radius — one host, one canary instance, one percent of traffic. Inject the failure: kill the process, drop the network, slow the disk, stuff garbage into a header. Watch your steady-state metrics — error rate, p99 latency, queue depth — and compare to the baseline. If the metrics stay inside their normal envelope, the hypothesis was correct and the system passed. If they don't, you have just discovered a weakness on a Tuesday afternoon instead of at three on a Saturday morning.
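The comparison against the baseline can be made mechanical. The following is a minimal sketch, assuming each steady-state metric has an acceptable range taken from normal operation; the `Envelope` type, the metric names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Envelope:
    """Acceptable range for one steady-state metric, taken from the baseline."""
    metric: str
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high

def evaluate(envelopes: List[Envelope], observed: Dict[str, float]) -> List[str]:
    """Return the metrics that left their envelope; an empty list means the hypothesis held."""
    violations = []
    for env in envelopes:
        value = observed.get(env.metric)
        # A metric that cannot be read counts as a violation: no data, no confidence.
        if value is None or not env.contains(value):
            violations.append(env.metric)
    return violations

# Hypothetical baseline and one reading taken during the experiment.
baseline = [
    Envelope("error_rate", 0.0, 0.01),
    Envelope("p99_latency_ms", 0.0, 250.0),
    Envelope("queue_depth", 0.0, 100.0),
]
during_fault = {"error_rate": 0.004, "p99_latency_ms": 310.0, "queue_depth": 40.0}

print(evaluate(baseline, during_fault))  # ['p99_latency_ms']
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an experiment whose steady-state signal cannot be read has not shown that the system passed.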

Some grounding numbers. Netflix reports that their continuous chaos exercises catch on the order of 1–2 production-impacting issues per quarter that traditional testing missed. AWS's pre-launch GameDays for new services routinely surface failure modes the design review didn't anticipate. The simulator above lets you twist three knobs (added latency, packet loss, and node failure) and watch the pass rate collapse or recover. It's the smallest possible chaos lab: pull the levers, watch the pass rate, learn which numbers move first when the system is starting to topple. The rest of this article walks through the practice in five tiers from tabletop drills to full production experiments, the seven failure modes worth injecting, and the postmortem culture that turns chaos exercises into permanent learning rather than expensive theatre.

Origins — Chaos Monkey at Netflix and what came after

Chaos Monkey, and what came after.

The discipline now called chaos engineering was born inside Netflix in 2010 as a single shell script. The migration from data-centre Oracle to Amazon Web Services had concentrated Netflix's exposure to AWS instance failures: every EC2 host died occasionally, and a fleet built on the assumption of perfectly running hosts was a fleet built on a lie. The Netflix engineers' answer was deliberate sabotage: a tool, named Chaos Monkey, that randomly terminated EC2 instances during business hours so that engineers had no choice but to build resilient services. Yury Izrailevsky and Ariel Tseitlin published the first public description in the Netflix tech blog in July 2011 (“The Netflix Simian Army”); Cory Bennett and Tseitlin announced the open-source release in July 2012.

The Simian Army grew. Chaos Gorilla (2012) terminated entire Availability Zones; Chaos Kong (2014) terminated entire AWS regions; Latency Monkey introduced artificial network delays; Conformity Monkey killed instances that violated configuration policy; Doctor Monkey killed unhealthy ones; Janitor Monkey reaped unattached resources. By late 2014 the original ad hoc collection had become an internal platform called Failure Injection Testing (FIT), described by Kolton Andrus, Naresh Gopalani, and Ben Schmaus in the Netflix blog in October 2014. FIT exposed an HTTP API for declaring failures by request attribute (a header value, a percentage of traffic, a specific user cohort), which made the chaos targetable and, more importantly, automatically revertible.

The pattern transferred. Casey Rosenthal, who led the Chaos team at Netflix from 2014 to 2017, co-authored the O'Reilly report Chaos Engineering (2017) with Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, and later expanded it with Jones into the book Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020). The book builds on the Principles of Chaos (build a hypothesis around steady-state behaviour, vary real-world events, run experiments in production, automate experiments to run continuously, minimise blast radius) that the website principlesofchaos.org lists as the discipline's manifesto. Rosenthal later co-founded Verica to commercialise the practice; Kolton Andrus, an early Netflix Chaos engineer, founded Gremlin in 2016. The vocabulary that started as private Netflix slang became industry terminology in under a decade.

The deeper lineage goes further back. Jesse Robbins ran Amazon's first GameDay exercises starting in 2004 — planned, large-scale resilience tests with the explicit goal of finding what didn't work. Robbins drew on his volunteer fire-service training: live drills under controlled conditions outperform any amount of paperwork. The same instinct shaped Google's DiRT (Disaster Recovery Testing) program (Kripa Krishnan's 2012 USENIX LISA talk “Weathering the unexpected”), Microsoft's monthly resilience exercises, and the practice every major cloud provider now runs internally. Chaos engineering's contribution was less the idea than the toolchain: making it cheap and routine to break things on purpose.

Five colours of chaos — latency, error, resource, state, dependency

Five colours of chaos.

Chaos engineering is not one practice but a spectrum, from the safest paper exercise to the boldest bare-knuckle production fire drill. Pick your point on the spectrum based on how much you trust your runbooks, your alerting, and your blast-radius controls. Most teams underestimate how far the lower tiers will carry them.

Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, and Casey Rosenthal published the academic framing in “Chaos engineering” (IEEE Software, May 2016). The paper distinguishes fault injection (a single deliberate failure to test a known failure mode) from chaos engineering proper (continuous experimentation against the assumption of steady-state behaviour). The maturity model has been refined repeatedly since — Adrian Cockcroft's five-level chaos-engineering capability matrix at AWS, Gremlin's documented Chaos Engineering Maturity Model, John Allspaw and Richard Cook's resilience-engineering framings — but the core observation holds: the value of each tier compounds the value of the tier below it.

Tier	Name	Where	Risk	Yield
1	Game days (paper)	A meeting room.	Zero.	Reveals process gaps without touching production.
2	Staging chaos	Pre-prod environment.	Low.	Catches concrete bugs but misses traffic-shaped issues.
3	Shadow / dark chaos	Prod traffic mirror.	Low.	Real shapes; no user impact.
4	Canary chaos	1% of prod.	Bounded.	First true failure-mode validation.
5	Full-prod chaos	100% of prod.	Real.	The Netflix tier. Trust earned over years.

Most teams should live at tier 2 or 3. Tier 5 is where Netflix, Stripe, AWS internal teams, and parts of Google operate, and where a runbook that says “the system handles this; observed by SLO” is more valuable than the chaos itself. The mistake to avoid is leaping to tier 5 before the lower tiers have closed their gaps; the discipline reveals failure modes much faster than it teaches teams to handle them.

Seven failure modes worth injecting

The seven failure modes to inject.

Every layer has its own characteristic failures. Pick the one that matches a real risk in your architecture. Random chaos is theatre; calibrated chaos is engineering. The catalogue below comes from the union of the Netflix FIT taxonomy, the AWS Fault Injection Service action library (released 2021), and Chaos Mesh's FaultKind enumeration (CNCF, version 2.6 onward). Seven categories cover roughly ninety percent of what production systems do wrong.

NETWORK
Latency injection (50–500 ms), packet loss (0.1–10%), bandwidth caps, partition between AZs, DNS failure. Toxiproxy is the canonical tool.
HOST
CPU saturation, memory pressure, disk-fill, clock skew, kernel panic. stress-ng for synthetics; Chaos Mesh for k8s pods.
DEPENDENCY
Database slow query, cache miss storm, downstream service 500s, full queue. Most informative — your retry/circuit-breaker behaviour shows up here.
APPLICATION
Exception injection, thread-pool exhaustion, GC pauses, leaked file handles. Less common; usually surfaced by load tests instead.
REGIONAL
Whole-AZ failure, full-region outage, traffic shift between regions. The hardest to test, the most expensive to ignore. AWS FIS exposes aws:ec2:stop-instances scoped by tag; AWS Resilience Hub validates the recovery time.
CLOCK
Clock skew between hosts, NTP failure, sudden time jumps backward. Distributed-systems bugs surface here with surprising frequency. Chaos Mesh's TimeChaos shifts the system clock for a target pod.
CONFIG
Bad-config rollouts, feature-flag flips, secret rotation. Most public outages reduce to this category. The Cloudflare 2020-07-17 outage was a router-config push; the Facebook 2021-10-04 outage was a BGP withdrawal triggered by a maintenance command.

The seven categories above are intentionally orthogonal. Most production incidents are multi-category — a network partition triggers a host-resource exhaustion that triggers a dependency timeout that triggers a config push to mitigate that triggers a regional cascade. The discipline is to test each category in isolation first, then test the combinations once the singles pass. The IEEE Software paper recommends specifically against testing complex multi-failure scenarios first, on the grounds that the system has not yet earned the right to be that broken.
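The singles-first ordering can be written down as a small planning rule. This sketch assumes a pair is only eligible once both of its categories have passed in isolation; the `experiment_plan` helper is hypothetical and covers pairs only.

```python
from itertools import combinations

CATEGORIES = ["network", "host", "dependency", "application",
              "regional", "clock", "config"]

def experiment_plan(passed_singles):
    """List single-category experiments first, then pairs whose members both passed alone."""
    plan = [(category,) for category in CATEGORIES]
    ready = [c for c in CATEGORIES if c in passed_singles]
    plan.extend(combinations(ready, 2))
    return plan

# Before anything has passed, only the seven singles are on the plan.
print(len(experiment_plan(set())))                          # 7
# Once three singles pass, their three pairwise combinations unlock.
print(len(experiment_plan({"network", "host", "clock"})))   # 10
```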

One detail worth highlighting: the application category is the least-tested in most organisations because thread-pool exhaustion and GC pauses are hard to inject without crashing the process. Netflix's FailFast annotation (Hystrix-era, 2013) is the canonical answer: declarative exception injection at the method level, gated on a runtime feature flag. Chaos Toolkit and Gremlin both expose JVM-level versions; the eBPF-based kchaos tool from Aqua Security can inject Go runtime panics by patching syscall return values, without modifying application code.
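The idea of method-level, flag-gated exception injection can be illustrated with a Python decorator. This is an analogue of the pattern only, not Netflix's annotation or any vendor's API; `FAULT_FLAGS`, `inject_fault`, and `charge_card` are hypothetical names, and the in-process dictionary stands in for a real feature-flag service.

```python
import functools
import random
from typing import Dict, Optional

# Hypothetical flag store: fault name -> probability of injecting a failure.
FAULT_FLAGS: Dict[str, float] = {}

class InjectedFault(RuntimeError):
    """Raised in place of the real call when a fault is injected."""

def inject_fault(name: str, rng: Optional[random.Random] = None):
    """Decorator that raises InjectedFault with the probability stored under `name`."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The flag is read on every call, so it can be flipped at runtime.
            probability = FAULT_FLAGS.get(name, 0.0)  # absent flag means injection is off
            if rng.random() < probability:
                raise InjectedFault(f"injected fault: {name}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault("charge-card")
def charge_card(amount_cents: int) -> str:
    return f"charged {amount_cents}"

print(charge_card(500))               # flag unset: the real call runs
FAULT_FLAGS["charge-card"] = 1.0      # turn the fault fully on
try:
    charge_card(500)
except InjectedFault as exc:
    print(exc)
FAULT_FLAGS.clear()                   # revert: injection is off again
```

Because the default is "off" and the revert is a single flag change, the fault stays inside whatever blast radius the flag system can express.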

Blast radius — chaos, controlled

A blast radius, controlled.

The Principles of Chaos Engineering, as listed at principlesofchaos.org, are: build a hypothesis around steady-state behaviour; vary real-world events; run experiments in production; automate experiments to run continuously; and minimise blast radius. The last one is explicit in every practitioner's day-to-day, and it is what separates an experiment from an incident.

Steady-state is the operative noun. The hypothesis must be expressed in terms of an observable metric the system already exposes: not “does the database survive” but “does p99 checkout latency stay below 250 milliseconds while the secondary database fails over”. The metric is your tripwire; if the experiment crosses it, the experiment aborts and reverts. Honeycomb's production excellence framing (Charity Majors, Liz Fong-Jones, George Miranda; Observability Engineering, O'Reilly 2022) ties chaos engineering tightly to good observability: you cannot validate that a hypothesis held if you cannot read the metric that defines steady-state.

Blast radius control is technical, not aspirational. AWS FIS exposes a stop conditions field on every experiment template: a CloudWatch alarm that aborts the experiment automatically. Chaos Mesh exposes a Workflow resource with explicit pause and resume points. Litmus exposes a probe mechanism that gates each step on a steady-state check. The discipline of declaring the abort condition before running the experiment is the most important habit in the practice; the alternative — deciding to abort while you're watching the dashboard — is how an experiment becomes an incident.
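Declaring the abort condition up front can be as simple as passing it to the code that runs the experiment. The sketch below is a generic guard loop, not the AWS FIS, Chaos Mesh, or Litmus API; `run_with_abort` and its parameters are hypothetical, and the metric reader is whatever callable returns your steady-state number.

```python
import time
from typing import Callable

def run_with_abort(inject: Callable[[], None],
                   revert: Callable[[], None],
                   read_metric: Callable[[], float],
                   threshold: float,
                   duration_s: float,
                   poll_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Inject a fault, poll the steady-state metric, and always revert.

    Returns "aborted" if the metric crossed the threshold, else "completed".
    """
    inject()
    try:
        elapsed = 0.0
        while elapsed < duration_s:
            if read_metric() > threshold:
                return "aborted"   # tripwire crossed: stop immediately
            sleep(poll_s)
            elapsed += poll_s
        return "completed"
    finally:
        revert()  # runs on abort, on completion, and on unexpected exceptions

# Illustrative run with fake readings of an error rate; no real fault is injected.
events = []
readings = iter([0.01, 0.02, 0.09])
outcome = run_with_abort(
    inject=lambda: events.append("inject"),
    revert=lambda: events.append("revert"),
    read_metric=lambda: next(readings),
    threshold=0.05,
    duration_s=10,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(outcome, events)  # aborted ['inject', 'revert']
```

The threshold is an argument, so it has to be chosen before the experiment starts rather than negotiated in front of a dashboard.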

The mature pattern is continuous chaos: small experiments running every day in production, scoped to single instances or shadow traffic, with auto-revert. The result is not bravery; it is an architecture that no longer surprises you under load. Stripe runs production fault injection on every deploy through a system internally called Game Day Continuous (described in their 2017 strategy talk by Marc Hedlund); Slack runs “Disasterpiece Theater” (Richard Crowley, 2018 SREcon talk) where pre-announced exercises run with full audience and post-mortem; Google's DiRT program (Krishnan, USENIX LISA 2012) runs continent-scale exercises annually, with whole-region failovers and intentional capacity reductions.

The vary real-world events principle deserves emphasis. Synthetic faults (perfectly clean kill signals, ideal partitions, neat 100-percent packet loss) rarely match how production actually fails. Real partitions are partial and asymmetric; real network slowdowns are bursty and correlated; real host failures take whole racks at a time because they share a power supply or a top-of-rack switch. Bryan Cantrill's 2019 talk “Fault tolerance through optimism” goes deep on this; the corresponding chaos engineering principle is to draw your fault distributions from your incident history, not from a uniform sampler.
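Drawing faults from incident history rather than uniformly amounts to weighted sampling. This is a minimal sketch assuming the history is available as a list of category labels, one per past incident; the `fault_sampler` helper and the sample history are hypothetical.

```python
import random
from collections import Counter

def fault_sampler(incident_history, seed=None):
    """Return a function that picks the next fault category in proportion to past incidents."""
    counts = Counter(incident_history)
    kinds = sorted(counts)                 # stable order for reproducible seeding
    weights = [counts[kind] for kind in kinds]
    rng = random.Random(seed)

    def sample():
        return rng.choices(kinds, weights=weights, k=1)[0]

    return sample

# Hypothetical history: six config incidents, three network, one clock.
history = ["config"] * 6 + ["network"] * 3 + ["clock"]
next_fault = fault_sampler(history, seed=1)
draws = Counter(next_fault() for _ in range(10_000))
print(draws.most_common())  # config is drawn most often, clock least
```

A uniform sampler would spend a third of its experiments on clock faults here, even though the history says config changes are where this system actually breaks.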

Chaos engineering tools — the zoo

The chaos tool zoo.

Chaos Monkey (Netflix, 2010, open-sourced 2012) is the namesake. The current Spinnaker-integrated version is the canonical demo tool but has been replaced internally at Netflix by FIT and its successors. Chaos Mesh (PingCAP, 2019, accepted into the CNCF in 2020, incubating since 2022) is the dominant Kubernetes-native chaos platform: pod-, network-, IO-, time-, kernel-, and stress-faults declared as Custom Resources, GitOps-friendly, with a workflow engine for chaining experiments. Litmus (MayaData / ChaosNative, donated to CNCF 2020, incubating) is the other Kubernetes-native option, packaging experiments as ChaosHub reusable artifacts.

Gremlin (San Jose, founded by Kolton Andrus and Matthew Fornaciari in 2016) is the longest-running commercial offering, with an agent-based model that works across Kubernetes, ECS, bare metal, and Windows. AWS Fault Injection Service (FIS, GA March 2021) is the cloud-native option: experiment templates expressed in JSON, executed against EC2, ECS, EKS, RDS, and a growing list of services, billed per action-minute (USD 0.10 at launch). Azure Chaos Studio (GA November 2023) is the Microsoft equivalent.

Chaos Toolkit (Russ Miles, ChaosIQ, 2017) is an open-source, vendor-neutral CLI that orchestrates experiments declared in JSON or YAML. Toxiproxy (Shopify, 2014) is the network-fault Swiss army knife: a TCP-level proxy that applies toxics (latency, bandwidth limits, slicer, timeouts, full disconnect) to specific upstreams. tc and netem, native Linux qdiscs, are the underlying primitives most network-chaos tools wrap. stress-ng is the host-level synthetic load tool; fio and blkdiscard stress storage; Jepsen (Kyle Kingsbury) injects partitions and clock skew specifically against distributed databases.

The Kubernetes ecosystem has converged: most production chaos in 2024–2025 runs through Chaos Mesh or Litmus, with AWS FIS for cloud-native AWS shops and Gremlin for organisations that want a managed UI and SOC-2-shaped vendor. Steadybit (Berlin, 2021) and Harness (which acquired ChaosNative in 2022) are newer commercial entrants that focus on tighter integration with deployment pipelines.

Tool	Origin	Year	Scope
Chaos Monkey	Netflix	2010	EC2 instance kill
Chaos Kong	Netflix	2014	whole AWS region
Chaos Gorilla	Netflix	2012	Availability Zone
FIT	Netflix	2014	request-targeted faults
Chaos Mesh	PingCAP / CNCF	2019	k8s CRDs · incubating
AWS FIS	AWS	2021	action templates · stop conditions

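# Chaos Mesh 2.x · Schedule wrapping NetworkChaos · 200 ms of latency to checkout for 30 s, hourly
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: checkout-latency-spike
  namespace: payments
spec:
  schedule: "0 * * * *"        # hourly; recurrence lives on the Schedule, not on the chaos object
  type: NetworkChaos
  concurrencyPolicy: Forbid    # never overlap two runs
  historyLimit: 5
  networkChaos:
    action: delay
    mode: all                  # every matching pod; use "one" for a smaller blast radius
    selector:
      namespaces: [payments]
      labelSelectors:
        app: checkout
    delay:
      latency: "200ms"
      correlation: "25"
      jitter: "50ms"
    duration: "30s"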

A two-hour game day, in steps

The first chaos exercise most teams run is a tabletop game day — no production traffic touched. Everyone gathers in a room, the runbook below is filled in, and the GM (Game Master) walks through scenarios. Two hours, no risk, large payoff.

00:00
Set the scenario.
"Database primary in eu-west-1 just disappeared. What happens?" The GM doesn't tell anyone the answer.
00:05
Simulate detection.
"Pretend you got the page. What do you check first? What dashboard? What runbook?" People look it up live.
00:20
Walk the runbook.
Follow the recovery steps. Where does the runbook fail? Out-of-date commands? Missing permissions? Steps that depend on a wiki page that's been moved?
00:50
Inject the curveball.
"While you're recovering, the on-call SRE for the platform team is unreachable." Now what?
01:30
Document the gaps.
Every missing piece becomes a ticket. Out-of-date runbook step? Ticket. Missing dashboard? Ticket. Permissions you didn't have? Ticket.
02:00
Wrap.
Schedule the next game day in 4 weeks. Track gap-closure between sessions.

The point isn't the chaos — it's the gaps.

A team's first three game days find more gaps than the next ten. Most teams never get to the “live chaos in production” tier — and that's fine, because the tabletop tier alone catches the most expensive defects.

The format above is borrowed from Robbins' original Amazon GameDay protocol and refined by Adrian Hornsby's public AWS Solutions Architecture writeups (2017–2020). The two-hour cap is deliberate: attention and actionable findings fall off after roughly ninety minutes of live scenario work, which leaves the final half hour for documenting gaps. The four-week cadence between game days is short enough that gaps are still salient and long enough that closing tickets between exercises is realistic.

The role distinction matters. The Game Master is the only person who knows the scenario in advance; everyone else is a player walking through the response in real time. A note-taker captures the timeline, every command run, every dashboard checked, every miscommunication. The post-exercise document is the deliverable, not the chaos itself. Pat Cable and Casey Rosenthal's chapter in the O'Reilly book has a sample template; Heidi Waterhouse's 2018 Velocity talk “Five questions to ask before your next game day” is the right pre-flight checklist.

Once the tabletop tier feels routine, the next step is a controlled live exercise: real production environment, single-node blast radius, pre-announced window, every senior engineer on call, every dashboard up. AWS GameDays at re:Invent run this protocol publicly. Whatever you call it — Disasterpiece Theater at Slack, Wheel of Misfortune at Google, Bad Day Drill at Stripe — the rhythm is the same: announce, hypothesise, inject, observe, abort or complete, debrief, ticket the gaps.

When chaos writes the runbook — the cultural payoff

When chaos writes the runbook.

The deepest reason chaos engineering matters is the inversion it forces in the postmortem culture. Pre-chaos teams treat outages as anomalies to be explained; post-chaos teams treat them as predicted events to be tested for. The discipline only works if the public failure record is rich enough to learn from. The published postmortems below are the textbooks of the field.

Knight Capital, August 1 2012. A botched deploy left old code on one of eight servers and re-purposed an unused configuration flag. Forty-five minutes of trading lost the firm $440 million; the SEC litigation report (October 2013) is the canonical primary source. The lesson: deploy-config drift between hosts is a chaos category in its own right, and an automated comparison between the expected binaries and the ones actually on disk is cheap. Modern blue-green deploys and image-based immutability are direct descendants.
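The automated comparison is small enough to sketch. The example assumes each host can report a SHA-256 digest of its deployed artifact; the host names, byte strings, and the `find_drift` helper are hypothetical, and collecting the digests from real hosts is left out.

```python
import hashlib
from typing import Dict, List

def sha256_of(data: bytes) -> str:
    """Digest of an artifact's bytes, as the release pipeline would record it."""
    return hashlib.sha256(data).hexdigest()

def find_drift(expected_digest: str, host_digests: Dict[str, str]) -> List[str]:
    """Return the hosts whose deployed artifact differs from the expected release."""
    return sorted(host for host, digest in host_digests.items()
                  if digest != expected_digest)

# Illustrative fleet of eight hosts where one still carries the old build.
new_build = sha256_of(b"release-2")
old_build = sha256_of(b"release-1")
fleet = {f"host-{i}": new_build for i in range(1, 9)}
fleet["host-8"] = old_build

print(find_drift(new_build, fleet))  # ['host-8']
```

Run as a post-deploy gate, a non-empty result would block traffic to the fleet until every host matches.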

Cloudflare, July 2 2019. A bad regular expression in a WAF rule caused CPU exhaustion across the global edge fleet and a 27-minute global outage. (The July 17 2020 outage was a separate incident, caused by a backbone router-configuration error.) John Graham-Cumming's postmortem (blog.cloudflare.com, July 2019) is unusually clear about the role of test infrastructure: the rule had passed its tests but had never been measured for CPU cost against production-shaped traffic. The lesson: chaos must include performance pathology, not just functional failure.

GitHub, October 21–22 2018. A 43-second loss of connectivity between GitHub's US East Coast network hub and its primary US East Coast data centre triggered Orchestrator's automatic MySQL failover to the US West Coast, splitting the cluster topology in a way the recovery automation could not unwind. The roughly 24 hours of degraded service that followed were documented in detail (github.blog, 2018-10-30). The lesson: even very brief partitions are a real failure mode, and the recovery path itself needs explicit chaos testing.

AWS Kinesis, November 25 2020. A capacity addition to the Kinesis front-end fleet in us-east-1 pushed every front-end server past its operating-system thread limit, and the failure cascaded into a region-wide Kinesis outage that took down dependent services including CloudWatch alarming. The postmortem (aws.amazon.com/message, December 2020) explained the dependency loop in painful detail. The lesson: monitoring and alarming that depend on the system being measured are the canonical antipattern, and chaos engineering against the alarm system itself is non-negotiable.

What ties these incidents together is that each was, in retrospect, an experiment that reality ran on the system without permission. The Basiri et al. 2016 paper makes the point most directly: organisations that adopt chaos engineering early get to choose which experiments they run. Organisations that don't adopt it don't even get to choose when.