GitLab, the rm -rf incident.
On the night of January 31 to February 1 2017, a GitLab engineer was troubleshooting a Postgres replica that had fallen too far behind to catch up. They meant to wipe the data directory on the secondary host. The terminal they typed into was connected to the primary. The rm -rf destroyed nearly all of around 300 GB of production data. Then the team discovered that the database had five documented backup methods, and every single one had been failing silently.
TL;DR
A combination of unusual write load and escalating replication lag led an on-call engineer to attempt a manual replica resynchronisation late at night. They opened two terminals to two hosts and worked across them, then ran the destructive command in the wrong one. The primary's data directory was deleted. The team then discovered that the LVM snapshots they relied on for fast restore weren't being taken, the regular pg_dump backups had been producing unusable output (a client/server version mismatch nobody had noticed), the Azure disk snapshots were enabled on the wrong volume, the S3 uploads weren't configured for production, and WAL archiving was failing silently. The only working backup was a staging-environment LVM snapshot taken six hours earlier. They restored from it. Six hours of database changes (issues, merge requests, comments and similar records) were permanently lost; Git repositories and wikis are stored outside the database and were not affected.
Timeline
| Time | Event |
|---|---|
| 22:00 UTC, 31 Jan | Production database under unusual write load. A misbehaving spam-fighting feature is creating thousands of database rows per second. |
| 23:30 UTC | Replication to the secondary falls hours behind. Standard recovery is to re-bootstrap the secondary from a fresh base backup. |
| 00:00 UTC, 1 Feb | Engineer opens terminals to db1 (primary) and db2 (secondary). Intends to clear db2's data dir and re-bootstrap. |
| 00:02 UTC | rm -rf /var/opt/gitlab/postgresql/data is run on the db1 terminal by accident. The mistake is realised within seconds, but by the time the command is stopped only about 4.5 GB of the ~300 GB remain. |
| 00:10 UTC | Team assembles to attempt recovery. The first option (LVM snapshot) doesn't exist. Each subsequent backup checked is found to be silently broken. |
| 02:30 UTC | The only working backup is found: a 6-hour-old staging-environment LVM snapshot. |
| 03:00 UTC – 17:00 UTC | Restore process. GitLab live-streams the whole thing on YouTube. The broadcast attracts thousands of viewers and becomes a famous moment of operational transparency. |
| 17:00 UTC | Service restored. 6 hours of user data permanently lost. |
| Days following | Integrity audits, user notifications, the public postmortem (Feb 10), policy and tooling changes. |
The five broken backups
This is the part of the postmortem that became required reading for every operations team:
| Backup | Why it failed |
|---|---|
| 1. LVM snapshots | The cron job that took them had been disabled during an unrelated maintenance window months earlier and never re-enabled. |
| 2. Regular pg_dump backups | Running. Producing tiny output (a few KB). The Postgres client version on the backup host was too old to talk to the upgraded server; the dumps were generating an error message that was being uploaded as the "backup". Nobody noticed because the upload itself succeeded. |
| 3. Disk snapshots in Azure | Were enabled for the wrong NFS volume after a recent migration. |
| 4. S3 backup uploads | Not configured for the production volume — only for a now-decommissioned staging instance. |
| 5. WAL archive shipping | The archive command was set but the destination bucket had been removed; the archive_command was returning success because the calling script wasn't checking its exit code. |
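Rows 2 and 5 share a failure shape: a step reported success while producing nothing usable. A minimal sketch of two guards follows, assuming a backup job that can call Python before uploading. The function names and the size floor are hypothetical, not GitLab's tooling; the header check relies on pg_dump custom-format archives starting with the bytes PGDMP and plain SQL dumps starting with a "PostgreSQL database dump" comment banner.

```python
import subprocess
import sys
from pathlib import Path

MIN_DUMP_BYTES = 1_000_000  # assumed floor; tune to the real database size


def dump_looks_plausible(path: Path, min_bytes: int = MIN_DUMP_BYTES) -> bool:
    """Reject dumps that are missing, too small, or lack a pg_dump header."""
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    with path.open("rb") as fh:
        head = fh.read(64)
    # Custom-format archives begin with the magic bytes "PGDMP"; plain SQL
    # dumps begin with a comment banner naming "PostgreSQL database dump".
    return head.startswith(b"PGDMP") or b"PostgreSQL database dump" in head


def run_and_propagate(cmd: list[str]) -> int:
    """Run one archive or upload step and return its real exit code."""
    return subprocess.run(cmd).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit non-zero when the wrapped command fails, so the caller sees it.
    sys.exit(run_and_propagate(sys.argv[1:]))
```

Postgres treats a zero exit status from archive_command as "segment archived", so a wrapper that swallows the inner exit code turns every failure into a reported success. Neither guard replaces a restore test; they only catch the cheapest failures early.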
Why the live-stream worked
GitLab's most consequential decision was to make the recovery public in real time. They opened a YouTube live-stream within hours, with the engineers narrating their work. The broadcast peaked at around five thousand simultaneous viewers; the recording is still up. Then they published the full postmortem ten days later, including the embarrassing details about backup failures, the operator-error sequence, and the policy changes that followed.
The transparency turned what would normally be a credibility-destroying event into a credibility-building one. The contemporary discussion on Hacker News was largely positive: engineers identifying with the operator, customers thanking GitLab for honesty, security teams sharing their own near-misses. Recruiting metrics went up for months after. Several CTOs cited the postmortem when explaining why they chose GitLab over competitors.
What changed afterwards
GitLab rebuilt their backup infrastructure with first-class restore verification — every backup is automatically restored to a staging instance and a smoke test is run; if the smoke test fails, an alert fires. SSH access policies were tightened, including environment-distinct command prompts so the engineer's terminal explicitly says (production-primary) in red. Destructive operations on production now require a typed confirmation matching the hostname. On-call training was restructured around runbook-driven recovery rather than ad-hoc improvisation under pressure.
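The typed-confirmation idea can be sketched in a few lines. This is an illustration of the pattern, not GitLab's implementation: confirm_destructive and its parameters are hypothetical names, and the ask parameter exists only so the prompt can be replaced in tests.

```python
import socket
from typing import Callable, Optional


def confirm_destructive(
    action: str,
    hostname: Optional[str] = None,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the operator types the exact hostname."""
    # Default to the machine this process is actually running on, so the
    # operator has to name the host they are really connected to.
    host = hostname or socket.gethostname()
    typed = ask(f"About to {action} on {host}. Type the hostname to proceed: ")
    return typed.strip() == host
```

A wrapper around a destructive script would call this first and exit if it returns False. The point of the design is that the operator has to read the hostname off the screen, which is exactly the check that was skipped when the command went into the db1 terminal.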
Beyond GitLab, the incident accelerated industry adoption of two practices: regular restore drills (most large infrastructure teams now have a documented monthly or quarterly "delete production and restore it" exercise on a clone), and chaos-engineering-style backup verification (continuously deleting and restoring test rows to confirm the end-to-end path works).
Lessons
A backup is what you successfully restored, not what you successfully created. Every check in this incident verified that the backup process completed; none verified that the output was a usable backup. The single discipline that would have caught every failure here is automated restore verification.
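A restore check can be as small as a canary row that must survive the round trip. The sketch below uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; in a real Postgres setup the backup and restore steps would be the production tooling and a scratch host, which are not shown. The table and function names are hypothetical.

```python
import sqlite3
import uuid


def backup_restores_with_canary(live: sqlite3.Connection) -> bool:
    """Write a canary row, back up, restore elsewhere, look for the canary."""
    token = uuid.uuid4().hex
    live.execute("CREATE TABLE IF NOT EXISTS backup_canary (token TEXT)")
    live.execute("INSERT INTO backup_canary (token) VALUES (?)", (token,))
    live.commit()

    restored = sqlite3.connect(":memory:")  # stand-in for a scratch restore host
    try:
        live.backup(restored)  # stand-in for the real backup and restore path
        row = restored.execute(
            "SELECT 1 FROM backup_canary WHERE token = ?", (token,)
        ).fetchone()
        return row is not None
    finally:
        restored.close()
```

The check passes only if data written just before the backup is readable after the restore, so it fails for an empty dump, a stale snapshot, or a restore path that no longer works. An alert would fire whenever it returns False.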
SSH-into-the-wrong-host is a category, not an accident. An on-call engineer at 2am, switching between two terminals, will at some non-zero rate type the wrong command into the wrong window. The system has to make this hard or impossible — coloured prompts, hostname banners, destructive-operation confirmations, separate IAM roles for primary vs secondary. The mistake is the engineer's; the system that permitted the mistake is the team's.
Tooling assumptions decay silently. Every one of GitLab's five backup mechanisms worked when it was set up. Each was broken by an unrelated change weeks or months later: a disabled cron job, a server upgrade, a volume migration, a deleted bucket. Backup configuration is the most common silently-decaying tool in the operations toolbox. Treat it as something that needs continuous testing rather than one-time setup.
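One cheap form of continuous testing is a freshness check that runs independently of the backup job itself, so a disabled cron job shows up as a stale directory rather than as silence. A minimal sketch, assuming backups land as files in a single directory; the function names and the 26-hour default are illustrative choices.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_seconds(
    backup_dir: Path, now: Optional[float] = None
) -> Optional[float]:
    """Age in seconds of the most recently modified file, or None if empty."""
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    current = time.time() if now is None else now
    return current - max(mtimes)


def backups_are_fresh(
    backup_dir: Path,
    max_age_seconds: float = 26 * 3600,  # assumed: daily job plus slack
    now: Optional[float] = None,
) -> bool:
    """False when there are no backups or the newest one is too old."""
    age = newest_backup_age_seconds(backup_dir, now)
    return age is not None and age <= max_age_seconds
```

Freshness says nothing about whether the file restores, so this complements a restore test rather than replacing it.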
Transparency compounds. Live-streaming a recovery was unprecedented. The willingness to be wrong in public turned a near-extinction event into a hiring asset. Most teams treat postmortems as PR risk; the GitLab incident is the strongest counterexample.
Further reading
- GitLab official postmortem — February 10 2017 — the canonical write-up. Detailed timeline, each broken backup mechanism, the people decisions.
- GitLab recovery live-stream — the full YouTube broadcast. Worth skimming once for the operational vibe alone.
- Hacker News discussion — contemporary engineering reactions. Several near-miss stories in the comments.
- Write-ahead logging — the mechanism that makes Postgres restore possible at all, and the mechanism whose archive_command was misconfigured here.
- B-tree internals — what was sitting in the data directory that got deleted.
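- ARIES: the classic write-ahead-logging recovery algorithm. Postgres does not implement ARIES as such, but its redo-based WAL replay rests on the same write-ahead principle, which is what makes point-in-time restore possible when WAL is being shipped.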