Volume X — opening pages

When the lights went out.

Postmortems are the literature of operations. Read with care: the lessons are general; the specifics rarely repeat. Each entry has the same structure — what broke, why, when, and what changed afterwards.

Config & deploy.

A typo, a flag, a bad migration — the loudest outages.

2 July 2019 · 27 minutes global · Cloudflare

Cloudflare's regex of doom

A new WAF rule with catastrophic backtracking spiked CPU to 100% on every edge worldwide.

Impact

Most of Cloudflare's edge returned 502s for 27 minutes, and every site behind it went down with it: a large share of the modern web.

Trigger

A routine WAF rule update intended to block a class of XSS attacks.

Root cause

A regex containing adjacent greedy wildcards, .*(?:.*=.*), was deployed to the WAF. Against pathological input it backtracked catastrophically, so the matching work grew far faster than the input length. The deploy went to every edge simultaneously.
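The backtracking behaviour can be illustrated with a toy matcher. This is a minimal sketch, not Cloudflare's engine or its fix: match_with_budget, StepBudgetExceeded, and PATTERN are hypothetical names, and the matcher only understands ".*" and single literal characters. It counts match attempts and gives up once a caller-supplied budget is spent, which is the general idea behind a regex execution limit.

```python
class StepBudgetExceeded(Exception):
    """Raised when matching uses more steps than the caller allows."""


def match_with_budget(tokens, text, budget):
    """Backtracking match of `tokens` against the whole of `text`.

    Each token is either ".*" (any run of characters, greedy) or a single
    literal character. Returns (matched, steps). Raises StepBudgetExceeded
    once more than `budget` match attempts have been made.
    """
    steps = 0

    def attempt(ti, si):
        nonlocal steps
        steps += 1
        if steps > budget:
            raise StepBudgetExceeded(f"gave up after {budget} steps")
        if ti == len(tokens):
            return si == len(text)
        if tokens[ti] == ".*":
            # Greedy: try the longest run first, then back off one character at a time.
            for end in range(len(text), si - 1, -1):
                if attempt(ti + 1, end):
                    return True
            return False
        return si < len(text) and text[si] == tokens[ti] and attempt(ti + 1, si + 1)

    return attempt(0, 0), steps


# Same shape as the .*(?:.*=.*) fragment: two wildcards before a literal "=".
PATTERN = [".*", ".*", "=", ".*"]
```

Feeding this pattern a run of characters with no "=" in it forces every split between the two wildcards to be tried before the match fails, so doubling the input more than doubles the step count. With a budget in place, the same input raises StepBudgetExceeded instead of burning CPU.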

Timeline

13:42 UTC Rule deployed globally.
13:42 Edge CPUs spike to 100% across all data centers.
13:43 Pages start returning 502 errors at scale.
13:45 Internal alerting fires; engineers identify it's the new WAF rule.
14:09 Global kill switch flipped on the WAF; CPU drops, traffic recovers.

What changed

Reverted the rule. Added: regex complexity limits, staged deployment (canary regions before global), Lua-based execution timeouts, and "kill switches" for every WAF rule type.

Lessons

Global simultaneous deploy of any code path is a tail risk. Always canary.
Regex engines without timeouts are a denial-of-service waiting to happen.
Test on adversarial inputs, not just the inputs you expect.
Have a kill switch for every individually-shippable thing.
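The "always canary" lesson can be sketched as a staged rollout loop. This is an illustration under assumptions, not Cloudflare's deployment system: staged_rollout and its deploy, rollback, and healthy callbacks are hypothetical, and a real health check would watch metrics over a soak period rather than return immediately.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Deploy stage by stage; undo everything if any deployed region is unhealthy.

    Returns True when every stage was deployed and stayed healthy.
    """
    deployed: List[str] = []
    for stage in stages:
        for region in stage:
            deploy(region)
            deployed.append(region)
        # Check every region touched so far before widening the blast radius.
        if not all(healthy(region) for region in deployed):
            for region in reversed(deployed):
                rollback(region)
            return False
    return True
```

The first stage would typically be a single canary region, so a bad rule burns one location rather than all of them, and later stages are never touched once a check fails.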

→ official postmortem

8 June 2021 · ~1 hour · Fastly

Fastly's undefined behaviour

A customer's valid configuration triggered a latent bug in Fastly's software, taking down the CDN globally.

Impact

Reddit, the New York Times, Twitch, GitHub, Stack Overflow, Hulu, and a substantial fraction of the public web returned 503s.

Trigger

A single customer applying a routine configuration change.

Root cause

A bug introduced in a 12 May Fastly release allowed a specific (valid) customer config to trigger undefined behaviour. The path lay dormant until a customer's self-service config change activated it.

Timeline

09:47 UTC Customer applies a valid config change.
09:47 Bug triggers: 85% of the Fastly network starts returning errors.
10:00 Fastly engineers identify the affected config; disable it.
10:36 Vast majority of services restored.
12:35 Underlying bug fixed and deploy completes.

What changed

Disabled the customer's config to recover; pushed a code fix later that day; ran a full audit on similar latent paths.

Lessons

Latent code paths that need a specific configuration to activate are hard to find with normal testing.
A customer self-service change should not be able to trigger a global outage. Blast-radius isolation is required.
A fast detection-to-recovery window matters more than preventing all bugs.

→ official postmortem

28 February 2017 · ~4 hours · AWS S3 (US-EAST-1)

A typo in a debugging command

An engineer ran a debugging command that took down a larger fraction of S3 than intended.

Impact

A large slice of S3 in US-EAST-1 unavailable for ~4 hours. Took down Trello, Slack, Quora, GitHub, Coursera, Docker Hub, and a long list of others. Even AWS's own status page (which used S3 for its icons) was broken.

Trigger

A planned debugging operation in the billing subsystem.

Root cause

An engineer running a runbook command to remove some S3 servers from a billing subsystem made a typo in the parameters. The command removed more capacity than intended, including subsystems that the rest of S3 depended on. Restart of those subsystems took hours because they hadn't been restarted at scale in years.

Timeline

17:37 UTC Engineer runs the command. Larger-than-expected fraction of S3 capacity removed.
17:37 S3 services in US-EAST-1 begin failing.
~18:00 Restart of the index/placement subsystems begins; takes much longer than expected.
~21:30 Object retrievals largely restored.

What changed

Added validation to the runbook tooling that prevents removing more than a small percentage of capacity at once. Started regular fire drills of the index subsystem restart.
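A guard of this kind might look like the following. It is a sketch only: check_removal, RemovalRejected, and the 5% default are assumptions made for illustration, not the limits AWS uses.

```python
class RemovalRejected(Exception):
    """Raised when a capacity-removal request exceeds the safety limits."""


def check_removal(total_hosts, hosts_to_remove, max_fraction=0.05, min_remaining=1):
    """Validate a removal request and return the host count left afterwards.

    Rejects requests that remove more than `max_fraction` of the fleet in one
    step, or that would leave fewer than `min_remaining` hosts in service.
    """
    if hosts_to_remove <= 0:
        raise ValueError("hosts_to_remove must be positive")
    if hosts_to_remove > total_hosts * max_fraction:
        raise RemovalRejected(
            f"refusing to remove {hosts_to_remove} of {total_hosts} hosts in one step"
        )
    remaining = total_hosts - hosts_to_remove
    if remaining < min_remaining:
        raise RemovalRejected(
            f"only {remaining} hosts would remain; minimum is {min_remaining}"
        )
    return remaining
```

The point of putting the check in the tool rather than the runbook is that a mistyped parameter fails loudly before any capacity is touched.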

Lessons

Tooling that lets a single command take down a service is a foot-gun. Build the rate-limit into the tool.
If you haven't restarted a subsystem at scale in years, you don't know how long it will take. Practice.
Status pages should not depend on the system they're reporting on.

→ official postmortem

Data & databases.

When the source of truth lies, or disappears.

31 January 2017 · ~6 hours, ~6 hours of data lost · GitLab.com

rm -rf on the wrong server

A tired engineer working on a replication issue ran rm -rf on what they thought was the secondary; it was the primary.

Impact

GitLab.com offline for ~6 hours. ~5,000 projects, 5,000 comments, and 700 new accounts permanently lost — the gap between the last successful backup and the rm.

Trigger

Replication lag investigation that escalated through several hours of debugging.

Root cause

A long debugging session at the end of a long day. Multiple terminals open. A command meant for the secondary database ran on the primary. Worse: of the five backup mechanisms in place, four were silently failing. The S3 bucket meant to store backups didn't exist.

Timeline

21:00 Engineer notices replication lag, starts investigating.
~22:00 Various recovery actions tried. Tired, multiple SSH sessions.
23:00 rm -rf intended for secondary executed on primary db.
23:01 Engineer realises within seconds. Stops the rm. ~300 GB of data already deleted.
23:30 Discover that 4 of 5 backup mechanisms are broken; only an LVM snapshot from 6 hours earlier survives.
next day Restore from the 6-hour-old snapshot. ~6 hours of customer data lost.

What changed

Live-streamed the recovery to a YouTube audience for transparency. Implemented: read-only replicas labelled in the prompt, multi-terminal awareness training, working backup verification ("if you haven't tested a restore, you don't have a backup"), and PostgreSQL barman-based PITR.

Lessons

Backups that aren't tested aren't backups.
Tired engineers should not run destructive operations. Apply a "two-key rule" or pause until rested.
The shell prompt should make production-vs-staging unmistakable.
Public, transparent postmortems build trust. GitLab's YouTube live-stream is a model for incident communication.
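"Backups that aren't tested aren't backups" can be turned into a scheduled check. The sketch below uses SQLite from the Python standard library as a stand-in, since it needs no server. GitLab runs PostgreSQL, so this shows the shape of the check rather than their tooling, and restore_check is a hypothetical name.

```python
import os
import sqlite3


def restore_check(backup_path, table, min_rows):
    """Open a backup file and confirm it holds usable data.

    Returns False if the file is missing or empty, fails SQLite's integrity
    check, lacks `table`, or has fewer than `min_rows` rows in it.
    `table` is assumed to be a trusted identifier, not user input.
    """
    if not os.path.exists(backup_path) or os.path.getsize(backup_path) == 0:
        return False
    conn = sqlite3.connect(backup_path)
    try:
        if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            return False
        rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return rows >= min_rows
    except sqlite3.Error:
        # Not a database, or the expected table is absent.
        return False
    finally:
        conn.close()
```

A check like this catches the silent failure mode described above: a backup job that exits cleanly but leaves behind an empty or unreadable artifact.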

→ official postmortem

21 October 2018 · ~24 hours degraded · GitHub

A 43-second network partition

A 43-second connectivity blip between GitHub's US East Coast and US West Coast sites caused MySQL automatic failover to elect a new primary on the West Coast, with stale data.

Impact

~24 hours of degraded service while engineers reconciled the conflicting MySQL clusters by hand. Some webhook deliveries lost.

Trigger

A scheduled network maintenance window that triggered a 43-second connectivity blip.

Root cause

During the 43-second partition, both data centers thought they were primary. The Orchestrator failover system promoted the West coast to primary. When connectivity returned, both sides had accepted writes. GitHub had to choose which to keep and replay the diff.

Timeline

22:52 UTC Network maintenance starts.
22:52 43-second partition between US East and US West.
22:52 Orchestrator promotes West coast to primary.
22:53 Network heals; both sides have accepted writes.
~23:30 Engineers identify the split and start manual reconciliation.
next day Reconciliation complete. Service fully restored.

What changed

Rewrote failover to require a longer partition window before promoting (avoid promotions on momentary blips). Cross-region writes routed via the primary always. Better tooling for manual reconciliation when split-brain happens.
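Requiring a sustained outage before promotion amounts to debouncing the health signal. The sketch below is not Orchestrator's logic: PromotionGate and the 120-second threshold used in the usage note are assumptions. It also does nothing to fence the old primary, which a real failover system still needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromotionGate:
    """Allow promotion only after the primary has been down continuously."""

    min_outage_seconds: float
    _down_since: Optional[float] = None

    def observe(self, primary_reachable: bool, now: float) -> bool:
        """Record one health probe; return True if promotion is now allowed."""
        if primary_reachable:
            # Any successful probe resets the clock.
            self._down_since = None
            return False
        if self._down_since is None:
            self._down_since = now
        return now - self._down_since >= self.min_outage_seconds
```

With PromotionGate(min_outage_seconds=120), a 43-second blip followed by a successful probe never reaches the threshold, so no promotion happens and no second writer appears.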

Lessons

Aggressive automatic failover during transient network problems can be worse than the original problem.
A globally distributed primary is hard. Regional primaries with explicit failover are simpler.
When you must reconcile split-brain by hand, having tooling that makes the conflict obvious is gold.

→ official postmortem

April 2022 · 14 days · Atlassian

883 customer sites deleted by a script

A migration script meant to disable a deprecated app instead deleted 883 customer sites from Confluence, Jira, and other Atlassian Cloud products.

Impact

883 customer sites totally gone. Restore took up to 14 days for some customers. Some customers' entire SaaS work history vanished for two weeks.

Trigger

Routine deprecation of a legacy product feature.

Root cause

A script meant to disable an app passed the wrong identifier downstream — the site IDs of those customers, not the app instance IDs. The receiving service trusted the input and deleted the sites.

Timeline

Day 0 Migration script runs. 883 sites deleted across 3 products.
Day 0–1 Customers report missing sites. Atlassian engineers identify the bug.
Day 1 Restore process begins. Atlassian discovers their bulk-restore tooling can only restore one site at a time, with manual steps.
Day 1–14 Sites restored sequentially. Last customer's site restored on Day 14.

What changed

Rebuilt the bulk-restore tooling to handle hundreds of sites in parallel. Removed cross-service trust on identifiers; everything now revalidates. Multiple "are you sure?" gates added.
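Revalidating identifiers at the boundary can be sketched with distinct ID types. AppId, SiteId, and DeletionService are hypothetical names for illustration and say nothing about how Atlassian's services are built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppId:
    value: str


@dataclass(frozen=True)
class SiteId:
    value: str


class DeletionService:
    """Receiving service that checks what kind of thing it was asked to delete."""

    def __init__(self, apps, sites):
        self.apps = set(apps)
        self.sites = set(sites)

    def delete_app(self, target):
        # Validate the type of identifier, not just that it is a string.
        if not isinstance(target, AppId):
            raise TypeError(f"delete_app expects an AppId, got {type(target).__name__}")
        # Validate that the identifier names a known app before acting on it.
        if target.value not in self.apps:
            raise LookupError(f"no such app: {target.value}")
        self.apps.remove(target.value)
```

Passing a SiteId to delete_app fails before anything is removed, and so does an AppId wrapping a value that is really a site ID, because the lookup is against apps only.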

Lessons

When one service trusts another's input, it inherits that service's bugs. Validate at the boundary.
Disaster recovery tooling needs to be tested on disasters of the size you actually face.
Communicating "you have lost your data for 14 days" is a hard public-relations problem. Atlassian's communication was widely criticised — a reminder that the technical fix isn't the whole story.

→ official postmortem

III

Protocol & routing.

When the network itself betrays you.

4 October 2021 · ~6 hours · Facebook / Meta

BGP withdrawn — Facebook off the internet

A misconfigured backbone command withdrew Facebook's BGP routes from the internet. The internal tools needed to fix it depended on the same network.

Impact

Facebook, Instagram, WhatsApp, Oculus all unreachable for ~6 hours. ~3.5 billion users affected. Engineers couldn't SSH into Facebook data centers because DNS was down. They couldn't physically badge into the data centers because the badge readers used the same auth system.

Trigger

A scheduled audit of backbone routing capacity.

Root cause

A backbone audit command meant to assess capacity instead withdrew Facebook's BGP advertisements. The internal recovery tools depended on the network they were trying to recover. The DNS servers, noticing they couldn't reach the rest of the network, withdrew their own advertisements as a safety mechanism — making the recovery worse.

Timeline

15:39 UTC Audit command runs. BGP advertisements withdrawn.
15:40 Facebook DNS servers can't reach the network; withdraw their own advertisements.
15:40 Facebook is fully off the internet from outside.
~17:30 Engineers physically arrive at data centers; bypass badge readers manually.
~21:00 BGP advertisements restored; DNS recovers; services come back.

What changed

Audit commands now run in a "dry run" mode by default. Out-of-band management network for emergency access, independent of the production network. DNS no longer withdraws its advertisements when it cannot see the rest of the network; fail-stop proved too aggressive.
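Dry-run-by-default is mostly an API design choice: the safe path takes no extra flag, and the dangerous one has to be asked for. A minimal sketch, with run_changes and apply_change as hypothetical names rather than anything from Facebook's tooling:

```python
from typing import Callable, List, Sequence


def run_changes(
    changes: Sequence[str],
    apply_change: Callable[[str], None],
    dry_run: bool = True,
) -> List[str]:
    """Report what would change; apply only when dry_run is explicitly False."""
    report = []
    for change in changes:
        if dry_run:
            report.append(f"DRY RUN, not applied: {change}")
        else:
            apply_change(change)
            report.append(f"applied: {change}")
    return report
```

An operator who forgets the flag gets a report instead of a change, which is the failure mode you want from an audit command.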

Lessons

Recovery tools must not depend on the system they recover.
Out-of-band management is non-optional for an org running its own backbone.
Fail-stop is dangerous. Sometimes "keep serving stale data" is better than "go silent".
When everything depends on one service (DNS in this case), that service is the most important thing in the world.

→ official postmortem

24 June 2019 · ~3 hours · Verizon / global routing

A small ISP's BGP leak takes out half the internet

A Pennsylvania ISP's BGP optimiser leaked re-prioritised routes globally; its upstream, Verizon, accepted them, redirecting traffic for Cloudflare, Amazon, and Facebook through the small ISP's network.

Impact

Cloudflare, Amazon, Apple, Facebook, Twitch, Coinbase services degraded or unavailable for ~3 hours. Traffic was being routed through tiny networks that couldn't handle it.

Trigger

Routine BGP optimisation at the small ISP.

Root cause

A BGP route optimiser at a small ISP altered route paths to look more attractive. Without proper RPKI / route filtering, the ISP's upstream provider (Verizon) accepted the alterations and propagated them globally. Major destinations' traffic suddenly preferred a path through a small ISP that could not handle the load.

Timeline

10:30 UTC BGP optimiser at small ISP rewrites paths.
10:30 Verizon's router accepts the rewritten paths and propagates.
10:31 Cloudflare, AWS, others see traffic redirect through Verizon → small ISP.
~13:00 Verizon withdraws the bad routes after 2.5+ hours of pressure.

What changed

Cloudflare and others have aggressively pushed for RPKI adoption since. ISPs increasingly filter received BGP announcements before propagating. The Mutually Agreed Norms for Routing Security (MANRS) initiative promotes the same filtering practices.
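RPKI origin validation classifies each announcement as valid, invalid, or not-found against a set of Route Origin Authorizations (ROAs). The sketch below follows that three-way rule in simplified form. Roa and validate_origin are hypothetical names, the prefixes and AS numbers come from the ranges reserved for documentation, and a real validator works from signed ROA data rather than an in-memory list.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    prefix: ipaddress.IPv4Network
    max_length: int
    origin_asn: int


def validate_origin(prefix, origin_asn, roas):
    """Classify an announced prefix as "valid", "invalid", or "not-found"."""
    net = ipaddress.ip_network(prefix)
    covering = [
        roa for roa in roas
        if roa.prefix.version == net.version and net.subnet_of(roa.prefix)
    ]
    if not covering:
        return "not-found"
    for roa in covering:
        # Valid needs the right origin AS and a prefix no longer than max_length.
        if roa.origin_asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid"
```

A router that drops "invalid" announcements would reject a more-specific prefix that exceeds the ROA's maximum length, even when the origin AS is the legitimate one.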

Lessons

BGP is too trusting by default. RPKI + route filters are the minimum bar.
Your service's availability depends on every transit provider in your AS path. You cannot fully verify them.
A small misconfiguration at a small ISP can cascade to everyone. Routing is a shared-fate system.

→ official postmortem

22 February 2022 · ~2.5 hours · Slack

A DNS migration that didn't roll back

A migration to a new DNS architecture failed in production; the rollback procedure also failed because of an unrelated issue with the previous architecture.

Impact

Slack login and connection establishment broken for ~2.5 hours. Most users with established sessions stayed online; new connections couldn't resolve Slack hostnames.

Trigger

A scheduled DNS migration window.

Root cause

A new DNS provider was being rolled in. The rollback was meant to flip back to the old provider — but the old provider's configuration had drifted in the days before, and rollback brought up an inconsistent state.

Timeline

~15:00 UTC Migration to new DNS provider begins. Some queries start failing.
~15:30 Decision to roll back. Old config is restored — but the config has drifted; restored state is inconsistent.
~17:30 After several rounds of debugging and partial fixes, full DNS resolution restored.

What changed

Treat rollback configurations as production. Test rollback regularly to detect drift. Stage migrations to a percentage of traffic first.
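One cheap way to detect drift is to fingerprint the rollback configuration when it is last known good and compare before relying on it. This is a sketch under assumptions: the config is taken to be a JSON-serialisable dict, and fingerprint and rollback_has_drifted are hypothetical helpers. It detects that something changed, not whether the rollback would actually work, so it complements a real rollback test rather than replacing one.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable SHA-256 over a canonical JSON rendering of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rollback_has_drifted(recorded_fingerprint: str, current_config: dict) -> bool:
    """True if the rollback config no longer matches what was last verified."""
    return fingerprint(current_config) != recorded_fingerprint
```

Sorting keys before hashing means a reordered but otherwise identical config does not raise a false alarm.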

Lessons

A rollback path you haven't tested in N days has drifted. Test it.
When migration AND rollback both fail, you don't have an outage — you have an incident.
A DNS provider switch is one of the highest-stakes changes a service can make. Treat it that way.

→ official postmortem

22 March 2016 · a few hours · npm registry

left-pad — 11 lines unpublished, the JavaScript ecosystem broke

A maintainer un-published the "left-pad" package after a naming dispute with npm. Thousands of build pipelines that transitively depended on it suddenly failed to install.

Impact

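Builds that depended on left-pad, directly or transitively, failed at install time until npm restored the package a few hours later.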

Trigger

A package author exercising their right to un-publish their own work.

Root cause

An 11-line npm package (left-pad: pad a string with leading characters) was depended upon by Babel, React, and many large frameworks transitively. The package author un-published all of his ~250 npm packages in protest after npm forced a name transfer of an unrelated package. Within minutes, builds across the ecosystem started failing as left-pad couldn't be installed.

Timeline

14:00 UTC ~250 packages including left-pad un-published.
15:00 Build pipelines start failing globally. CI logs full of "ENOENT left-pad".
17:30 npm restores left-pad from the registry, citing public-interest concerns. The community debate about the precedent rages for weeks.

What changed

npm changed its un-publish policy: packages older than 72 hours can no longer be removed by the author alone. Caching and lockfile use (npm shrinkwrap, then package-lock.json) became standard practice. Many large projects adopted private mirrors (Verdaccio, Artifactory) for build determinism.
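A lockfile pins each dependency to an exact version and records an integrity string for it; package-lock.json uses the "sha512-" plus base64 form. A mirror or cache can therefore be checked without contacting the registry. The sketch below shows that comparison. sri_sha512 and verify_cached_tarball are hypothetical helper names, not part of npm.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Integrity string in the "sha512-<base64 digest>" form."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def verify_cached_tarball(data: bytes, expected_integrity: str) -> bool:
    """True if cached package bytes match the integrity value from the lockfile."""
    algorithm, _, _ = expected_integrity.partition("-")
    if algorithm != "sha512":
        raise ValueError(f"unsupported integrity algorithm: {algorithm}")
    return sri_sha512(data) == expected_integrity
```

With the bytes cached locally and the hash pinned in the lockfile, a build neither needs the upstream registry to still have the package nor has to trust that it is serving the same bytes.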

Lessons

An 11-line dependency is a dependency. Ecosystems built on cheap dependencies inherit cheap-dependency risk.
Build determinism requires lockfiles plus cached or vendored package mirrors. Live registry resolution at every CI build is a single point of failure.
The dispute was, fundamentally, about whether a maintainer's right to remove their work outweighs the ecosystem's reliance on it. The compromise (72-hour window) is the de-facto standard now.

→ official postmortem

Adjacent

Read the failures, build the fixes.

Every postmortem here points to a category of system Semicolony explains in depth. After reading the incident, read what was meant to prevent it.

