2017 · 6h data lost · 5 broken backups
Postmortem · database · operator error

GitLab, the rm -rf incident.

Late on the night of January 31 2017, a GitLab engineer was troubleshooting a Postgres replica that had fallen too far behind to catch up. They SSH'd into what they thought was the secondary host. It was the primary. They ran rm -rf on the data directory, destroying around 300 GB of production data. Then they discovered that the database had five documented backup methods, and every single one had been failing silently for weeks.

Date: 31 Jan – 1 Feb 2017
Data lost: ~6 hours of writes
Failed backups: 5 of 5

TL;DR

A combination of unusual write load and escalating replication lag led an on-call engineer to attempt a manual replica resynchronisation late at night. They opened terminals to two hosts, worked across both, and ran the destructive command in the wrong one. The primary's data directory was deleted. They then discovered that the LVM snapshots they relied on for fast restore weren't being taken, the regular pg_dump backups had been uploading an error message instead of a dump (the client on the backup host was older than the upgraded server), the Azure disk snapshots were enabled on the wrong volume, the S3 uploads were configured only for a decommissioned staging instance, and WAL archiving was shipping to a bucket that no longer existed. The only working backup was a staging-environment LVM snapshot taken six hours earlier. They restored from it. Six hours of issues, merge requests, comments, and other database records were permanently lost; Git repositories themselves live outside the database and were not affected.

Timeline

22:00 UTC, 31 Jan: Production database under unusual write load. A misbehaving spam-fighting feature is creating thousands of database rows per second.
23:30 UTC: Replication to the secondary falls hours behind. Standard recovery is to re-bootstrap the secondary from a fresh base backup.
00:00 UTC, 1 Feb: Engineer opens terminals to db1 (primary) and db2 (secondary). Intends to clear db2's data dir and re-bootstrap.
00:02 UTC: rm -rf /var/opt/gitlab/postgresql/data is run on the db1 terminal by accident. Realised within seconds; by the time the command is stopped, only about 4.5 GB of the ~300 GB remain.
00:10 UTC: Team assembles to attempt recovery. The first option (LVM snapshot) doesn't exist. Each subsequent backup checked is found to be silently broken.
02:30 UTC: The only working backup is found: a 6-hour-old staging-environment LVM snapshot.
03:00 UTC – 17:00 UTC: Restore process. GitLab live-streams the whole thing on YouTube. The broadcast attracts thousands of viewers and becomes a famous moment of operational transparency.
17:00 UTC: Service restored. 6 hours of user data permanently lost.
Days following: Integrity audits, user notifications, the public postmortem (Feb 10), policy and tooling changes.

The five broken backups

This is the part of the postmortem that became required reading for every operations team:

Backup: Why it failed
1. LVM snapshots: The cron job that took them had been disabled during an unrelated maintenance window months earlier and never re-enabled.
2. Regular pg_dump backups: Running, but producing tiny output (a few KB). The Postgres client version on the backup host was too old to talk to the upgraded server; each run generated an error message that was then uploaded as the "backup". Nobody noticed because the upload itself succeeded.
3. Disk snapshots in Azure: Enabled for the wrong NFS volume after a recent migration.
4. S3 backup uploads: Not configured for the production volume, only for a now-decommissioned staging instance.
5. WAL archive shipping: archive_command was set, but the destination bucket had been removed. The calling script never checked the upload's exit code, so archive_command kept reporting success.

Backup is the easy part. Restore is the part that matters.

Every one of these failure modes is the same shape: the backup job runs without error, produces an output of some kind, and is never tested by actually restoring from it. The only backup that worked on February 1 was an LVM snapshot of the staging environment that happened to be on the same physical infrastructure as production, and was taken every six hours for the convenience of a developer who liked to test against fresh data. Without that accidental sixth backup, GitLab would have lost months of customer data.
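
The fifth failure in the list above (a wrapper script that swallowed a failed upload) can be illustrated with a minimal sketch. This is not GitLab's script; run_step and archive_segment are hypothetical names, and the upload command is whatever the caller supplies. The only point is that the wrapper's own exit status is the upload's exit status, so a missing destination shows up as a failure rather than a silent success.

```python
import subprocess
import sys
from typing import Sequence


def run_step(cmd: Sequence[str]) -> int:
    """Run one backup step and return its real exit code."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure loudly instead of swallowing it.
        print(f"backup step failed (exit {result.returncode}): {' '.join(cmd)}",
              file=sys.stderr)
        if result.stderr:
            print(result.stderr.strip(), file=sys.stderr)
    return result.returncode


def archive_segment(upload_cmd: Sequence[str]) -> int:
    """Hypothetical archive wrapper: returns 0 only if the upload returned 0."""
    return run_step(upload_cmd)
```

A script built on this would end with sys.exit(archive_segment([...])). Postgres treats a non-zero exit from archive_command as a failed archive attempt and retries the segment, so propagating the code is what lets the database itself notice the problem.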

Why the live-stream worked

GitLab's most consequential decision was to make the recovery public in real time. They opened a YouTube live-stream within hours, with the engineers narrating their work. The broadcast peaked at around five thousand simultaneous viewers; the recording is still up. Then they published the full postmortem ten days later, including the embarrassing details about backup failures, the operator-error sequence, and the policy changes that followed.

The transparency turned what would normally be a credibility-destroying event into a credibility-building one. The contemporary discussion on Hacker News was largely positive: engineers identifying with the operator, customers thanking GitLab for honesty, security teams sharing their own near-misses. Recruiting metrics went up for months after. Several CTOs cited the postmortem when explaining why they chose GitLab over competitors.

What changed afterwards

GitLab rebuilt their backup infrastructure with first-class restore verification — every backup is automatically restored to a staging instance and a smoke test is run; if the smoke test fails, an alert fires. SSH access policies were tightened, including environment-distinct command prompts so the engineer's terminal explicitly says (production-primary) in red. Destructive operations on production now require a typed confirmation matching the hostname. On-call training was restructured around runbook-driven recovery rather than ad-hoc improvisation under pressure.
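
A typed confirmation of the kind described above might look roughly like the following sketch. The function names are hypothetical and this is not GitLab's tooling; it only shows the idea that a destructive action runs if and only if the operator types the name of the host they are actually on.

```python
import socket
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def confirmation_matches(typed: str, hostname: Optional[str] = None) -> bool:
    """Return True only if the operator typed this machine's hostname exactly."""
    actual = hostname if hostname is not None else socket.gethostname()
    return typed.strip() == actual


def guarded(action: Callable[[], T], typed: str,
            hostname: Optional[str] = None) -> T:
    """Run `action` only after a matching typed confirmation."""
    if not confirmation_matches(typed, hostname):
        raise PermissionError("typed hostname does not match this host; refusing")
    return action()
```

An engineer who believes they are on db2 and types db2 while the session is really on db1 gets a refusal instead of a deleted data directory.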

Beyond GitLab, the incident accelerated industry adoption of two practices: regular restore drills (most large infrastructure teams now have a documented monthly or quarterly "delete production and restore it" exercise on a clone), and chaos-engineering-style backup verification (continuously deleting and restoring test rows to confirm the end-to-end path works).
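
The canary-row style of verification can be sketched end to end with SQLite standing in for the real database. Everything here is an assumption made for illustration: the canary table, the function names, and the use of in-memory databases in place of a production server and its backup store. What it shows is the shape of the check: write a unique row, back up, restore somewhere else, and confirm the row came back.

```python
import sqlite3
import uuid


def write_canary(conn: sqlite3.Connection) -> str:
    """Insert a unique marker row into the live database and return its token."""
    conn.execute("CREATE TABLE IF NOT EXISTS canary (token TEXT PRIMARY KEY)")
    token = uuid.uuid4().hex
    conn.execute("INSERT INTO canary (token) VALUES (?)", (token,))
    conn.commit()
    return token


def take_backup(conn: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the live database into a separate one (stand-in for a real backup)."""
    backup = sqlite3.connect(":memory:")
    conn.backup(backup)
    return backup


def restore_contains(backup: sqlite3.Connection, token: str) -> bool:
    """Restore the backup into a fresh database and look for the canary."""
    restored = sqlite3.connect(":memory:")
    backup.backup(restored)
    try:
        row = restored.execute(
            "SELECT token FROM canary WHERE token = ?", (token,)
        ).fetchone()
    except sqlite3.Error:
        return False  # e.g. the table is missing: the backup is not usable
    finally:
        restored.close()
    return row is not None
```

A scheduled job would call these in sequence and alert whenever restore_contains returns False, which covers both an empty backup and a backup that predates the canary.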

Lessons

A backup is what you successfully restored, not what you successfully created. Every check in this incident verified that the backup process completed; none verified that the output was a usable backup. The single discipline that would have caught every failure in this incident is automated restore verification.
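
As a cheap first line of defence short of a full restore, the output itself can be inspected. The sketch below assumes plain-format pg_dump output, which begins with a "PostgreSQL database dump" comment and ends with a "PostgreSQL database dump complete" comment; the size threshold and the function name are hypothetical and would need tuning per database. It would have flagged a few-KB file containing only an error message.

```python
from pathlib import Path
from typing import List

HEADER = "-- PostgreSQL database dump"
TRAILER = "-- PostgreSQL database dump complete"
MIN_BYTES = 1_000_000  # hypothetical floor; a real one depends on the database


def dump_problems(path: Path, min_bytes: int = MIN_BYTES) -> List[str]:
    """Return a list of reasons the file does not look like a complete dump."""
    problems = []
    size = path.stat().st_size
    if size < min_bytes:
        problems.append(f"only {size} bytes, expected at least {min_bytes}")
    with path.open("rb") as fh:
        head = fh.read(4096).decode("utf-8", errors="replace")
        fh.seek(max(0, size - 4096))
        tail = fh.read().decode("utf-8", errors="replace")
    if HEADER not in head:
        problems.append("missing pg_dump header comment")
    if TRAILER not in tail:
        problems.append("missing pg_dump completion comment")
    return problems
```

This is a sanity check and not a substitute for restoring: a dump can pass all three tests and still fail to load. Binary formats would need a different check, for example listing the archive contents with pg_restore.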

SSH-into-the-wrong-host is a category, not an accident. An on-call engineer at 2am, switching between two terminals, will at some non-zero rate type the wrong command into the wrong window. The system has to make this hard or impossible — coloured prompts, hostname banners, destructive-operation confirmations, separate IAM roles for primary vs secondary. The mistake is the engineer's; the system that permitted the mistake is the team's.
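
A coloured, role-labelled prompt is simple to generate. The sketch below is illustrative only: the host-to-role mapping is hypothetical (it reuses the db1 and db2 names from the timeline), and a real shell prompt would also need the shell's own markers around the non-printing escape codes.

```python
RED = "\033[1;31m"
GREEN = "\033[32m"
RESET = "\033[0m"

# Hypothetical mapping; in practice this would come from inventory or config.
ROLES = {
    "db1": "production-primary",
    "db2": "production-secondary",
}


def prompt_for(hostname: str) -> str:
    """Build a prompt string that states the host's role in an obvious colour."""
    role = ROLES.get(hostname, "unknown")
    # Anything that is not a known secondary is shown in red, so an
    # unrecognised host fails loud rather than looking safe.
    colour = GREEN if role == "production-secondary" else RED
    return f"{colour}({role}){RESET} {hostname}$ "
```

The design choice worth noting is the default: an unknown host is treated like the primary, because the dangerous mistake is assuming a machine is safe to destroy.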

Tooling assumptions decay silently. Every one of GitLab's five backup mechanisms worked when it was set up. Each was broken by an unrelated change weeks or months later: a cron job disabled during maintenance, a server upgrade, a volume migration, a decommissioned instance, a deleted bucket. Backup configuration is the most common silently-decaying tool in the operations toolbox. Treat it as something that needs continuous testing rather than one-time setup.
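
One form of continuous testing is a freshness check that runs independently of the backup job, so that a disabled cron entry produces an alert instead of silence. The following is a minimal sketch under the assumption that backups land as files in a single directory; the function names and the age threshold are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age(directory: Path, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the newest file in `directory` was modified, or None if empty."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in directory.iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def is_stale(directory: Path, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """True if there is no backup at all, or the newest one is too old."""
    age = newest_backup_age(directory, now)
    return age is None or age > max_age_seconds
```

The check deliberately treats an empty directory as stale. It says nothing about whether the newest file is restorable, so it complements restore verification and does not replace it.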

Transparency compounds. Live-streaming a recovery was unprecedented. The willingness to be wrong in public turned a near-extinction event into a hiring asset. Most teams treat postmortems as PR risk; the GitLab incident is the strongest counterexample.

Further reading

  • GitLab official postmortem (February 10 2017): the canonical write-up. Detailed timeline, each broken backup mechanism, the people decisions.
  • GitLab recovery live-stream: the full YouTube broadcast. Worth skimming once for the operational vibe alone.
  • Hacker News discussion: contemporary engineering reactions. Several near-miss stories in the comments.
  • Write-ahead logging: the mechanism that makes Postgres restore possible at all, and the mechanism whose archive_command was misconfigured here.
  • B-tree internals: the index structures that were sitting in the data directory that got deleted.
  • ARIES: the classic WAL-based recovery algorithm. Postgres's redo-based recovery follows the same write-ahead principle, which is what makes point-in-time restore from WAL possible when WAL is being shipped.