Facebook, the day it disappeared.

On October 4, 2021, a routine command meant to audit backbone capacity at Meta accidentally withdrew the company's IP prefixes from the global routing table. Within about a minute, facebook.com, Instagram, WhatsApp, Messenger, Oculus, and Workplace were unreachable worldwide. The internal tools needed to fix it, including the badge readers on datacenter doors, depended on the same DNS that had just disappeared. Recovery required sending engineers to a datacenter in person, reportedly with cutting tools.

Date: 4 Oct 2021

Duration: ~6 hours

Services: Facebook · Instagram · WhatsApp · Messenger · Workplace · Oculus

Meta engineering writeup →

TL;DR

Meta runs an internal command-line tool to audit the available capacity of its backbone network. The tool was supposed to query backbone routers non-destructively. A bug caused it to issue a command that withdrew Meta's BGP advertisements, the announcements that tell the rest of the internet which autonomous system (AS) originates routes for Meta's IP space. A guardrail meant to catch exactly this kind of mistake had a bug of its own and didn't fire. Within ~60 seconds, every Meta IP prefix vanished from the global routing table. DNS resolvers stopped getting answers for facebook.com because Meta's authoritative DNS servers were now unreachable too. The internal tools needed to roll back the change ran in the same DNS namespace. The on-call engineer's badge wouldn't unlock the datacenter cage because the badge reader's authentication path also depended on internal DNS. Recovery required physical access to a datacenter, hardware tools to bypass the digital locks, and an out-of-band path to manually re-advertise the BGP routes.

Timeline

15:39 UTC	The audit command runs. BGP withdrawals propagate globally.
15:40 UTC	External monitoring (Cloudflare, ThousandEyes, downdetector.com) shows Meta IP space disappearing from the global routing table.
15:50 UTC	Meta engineers attempt remote rollback. Cannot connect — internal tools depend on the same DNS that's down.
16:30 UTC	On-site team dispatched to Santa Clara datacenter to manually re-advertise routes from the routers themselves.
17:00 UTC	Physical access attempts. The badge readers on the datacenter cage doors don't authenticate (no DNS to reach the auth server). Eventually bypassed with hardware.
21:05 UTC	BGP routes re-advertised. Routes propagate over ~30 minutes; services come back online progressively as DNS caches refresh.
22:00 UTC	All Meta services fully restored.
Following days	Public postmortem published by Meta engineering. Industry-wide reflection on out-of-band access dependencies.

What went wrong technically

The audit command was designed to query backbone routers and report on their capacity. It used a router command that, in the version of the routing software Meta ran, could have a side effect of changing the BGP session state if invoked in a particular order. The audit tool invoked it that way. Meta's safeguards — including a guardrail tool that detects "this change is going to take the production network down" — failed to catch the change because of an unrelated bug in the guardrail. So the destructive change went through cleanly.

The BGP withdrawal itself was straightforward: Meta's backbone routers stopped announcing the prefixes that host facebook.com, instagram.com, whatsapp.com, and the other Meta domains. Within seconds upstream routers processed the withdrawals and removed the routes from their tables (a BGP withdrawal is an explicit update message, so nothing has to age out). Within minutes the entire internet had no route to Meta. Authoritative DNS for the Meta domains lived on those same withdrawn prefixes, so DNS queries for facebook.com started returning SERVFAIL globally. This is the cascade: BGP withdrawal → DNS failure → failure of every service that depends on DNS (which is everything).
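The cascade can be made concrete with a toy model. The sketch below is illustrative only: the `RoutingTable` class, the `resolve` helper, the documentation prefixes, and the example.com names are all assumptions made for this example, not a description of Meta's systems or of real resolver behaviour.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the global routing table: a set of announced prefixes."""

    def __init__(self):
        self._prefixes = set()

    def announce(self, prefix):
        self._prefixes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self._prefixes.discard(ipaddress.ip_network(prefix))

    def has_route(self, address):
        ip = ipaddress.ip_address(address)
        return any(ip in net for net in self._prefixes)


def resolve(name, nameservers, zone, table):
    """Answer from the zone if any authoritative server is routable, else SERVFAIL."""
    for ns_ip in nameservers:
        if table.has_route(ns_ip):
            return zone[name]
    return "SERVFAIL"


# Hypothetical setup using documentation address space (RFC 5737).
table = RoutingTable()
table.announce("192.0.2.0/24")     # hosts the authoritative nameservers
table.announce("198.51.100.0/24")  # hosts the web front ends
nameservers = ["192.0.2.53", "192.0.2.54"]
zone = {"www.example.com": "198.51.100.10"}

before = resolve("www.example.com", nameservers, zone, table)

# The faulty change withdraws every prefix, including the one the nameservers live on.
table.withdraw("192.0.2.0/24")
table.withdraw("198.51.100.0/24")

after = resolve("www.example.com", nameservers, zone, table)
print(before, after)
```

In this model the web servers themselves never change state. The name simply stops resolving once no route to any authoritative server remains, which is the shape of the failure described above.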

The compound failure. The technical incident was a BGP misconfiguration — bad, but recoverable in minutes by a remote rollback. What made it a 6-hour outage was that the tools needed to roll back ran in the same failure domain. Internal SSH access used corporate DNS. The dashboard that showed router status was a service-mesh API behind corporate DNS. The pager that summoned on-call used Meta-hosted SMS routing. The badge readers on the datacenter doors authenticated through internal infrastructure. Every layer of recovery tooling required the layer below to be working — and the layer below was off.
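One way to find this kind of shared fate before an incident is to walk the dependency graph of the recovery tooling. The following is a minimal sketch: the component names and the `DEPENDS_ON` map are hypothetical, loosely modelled on the paragraph above rather than taken from Meta's actual architecture.

```python
def transitive_deps(graph, node):
    """Return every component that `node` depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_fate(graph, recovery_tools, failed_component):
    """List recovery tools that stop working when `failed_component` is down."""
    return sorted(
        tool
        for tool in recovery_tools
        if tool == failed_component
        or failed_component in transitive_deps(graph, tool)
    )


# Hypothetical dependency map: each key depends on the components in its list.
DEPENDS_ON = {
    "ssh_access": ["corporate_dns"],
    "router_dashboard": ["service_mesh"],
    "service_mesh": ["corporate_dns"],
    "badge_readers": ["auth_server"],
    "auth_server": ["corporate_dns"],
    "corporate_dns": ["backbone_bgp"],
    "serial_console": [],
}
RECOVERY_TOOLS = ["ssh_access", "router_dashboard", "badge_readers", "serial_console"]

stranded = shared_fate(DEPENDS_ON, RECOVERY_TOOLS, "backbone_bgp")
print(stranded)  # ['badge_readers', 'router_dashboard', 'ssh_access']
```

Only the serial console survives in this map, because it is the only tool with no path back to the backbone. A check like this is only as good as the dependency map it is given, so the map itself has to be kept honest.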

The physical recovery

Once it became clear that remote recovery wasn't possible, Meta sent engineers to the Santa Clara datacenter. They arrived to find that the badge access system wasn't responding. Press reports at the time, which Meta's published writeup did not confirm in detail, describe engineers using physical tools to open the cage housing the routers. From there, they could connect to the routers via local serial console (out-of-band, not dependent on any network) and manually re-issue the BGP announcements.

Once the announcements were back, BGP convergence took on the order of half an hour to propagate globally. DNS caches then had to expire and refresh; the user-visible recovery was gradual rather than instant. The full service restoration was roughly 90 minutes after the routers were physically reached.

Side effects nobody predicted

Meta's outage rippled to services that didn't obviously depend on Meta. Sign-in-with-Facebook broke on third-party sites: sites that let users log in via Facebook OAuth couldn't reach Facebook's auth endpoints, and many sites also depended on those endpoints to load the Facebook SDK and broke entirely instead of degrading gracefully. The increased DNS lookup load, caused by every user's device retrying facebook.com over and over for hours, significantly stressed upstream DNS providers, particularly Cloudflare's 1.1.1.1 and Google's 8.8.8.8. CDN edges saw retries fan out from devices stuck in retry loops, and some CDN providers reported elevated load of their own from handling them. Downdetector itself went down briefly under the load of every user checking whether Facebook was down. WhatsApp users moved to Signal and Telegram in numbers that briefly saturated those services' sign-up flows.

Lessons

Out-of-band access must actually be out of band. "Out-of-band" historically meant a separate phone line for a console server. As infrastructure moved to corporate-network-based access tools and centralised authentication, many organisations lost that property without noticing. The Meta incident made this a top-of-mind concern industry-wide. Production infrastructure operators now treat out-of-band access — including badge access to datacenters — as a first-class architectural problem.

Test guardrails the same way you test code. Meta had a guardrail meant to prevent exactly this kind of outage. The guardrail had a bug. The lesson isn't "build a guardrail"; the lesson is "exercise the guardrail in the same drill where you'd exercise a recovery procedure". Many teams have written safety-net systems that have never been triggered in production-equivalent conditions.
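A drill for a guardrail can be as simple as feeding it changes whose correct verdict is already known. The sketch below assumes a hypothetical guardrail that rejects any change withdrawing more than a set fraction of announced prefixes. The threshold, function names, and test cases are invented for illustration and say nothing about how Meta's guardrail works.

```python
def guardrail_allows(current_prefixes, proposed_prefixes, max_withdrawn_fraction=0.1):
    """Reject a change that would withdraw more than the allowed share of prefixes."""
    current = set(current_prefixes)
    if not current:
        return True  # nothing is announced, so nothing can be withdrawn
    withdrawn = current - set(proposed_prefixes)
    return len(withdrawn) / len(current) <= max_withdrawn_fraction


def run_guardrail_drill(guardrail):
    """Feed the guardrail known-good and known-bad changes; return the failed cases."""
    live = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}
    cases = [
        # (case name, current state, proposed state, should the change be allowed?)
        ("no-op change", live, live, True),
        ("withdraw everything", live, set(), False),
        ("withdraw two of three", live, {"192.0.2.0/24"}, False),
    ]
    failures = []
    for name, current, proposed, expected in cases:
        if guardrail(current, proposed) != expected:
            failures.append(name)
    return failures


def broken_guardrail(current_prefixes, proposed_prefixes):
    """A deliberately buggy guardrail that approves everything."""
    return True


print(run_guardrail_drill(guardrail_allows))  # []
print(run_guardrail_drill(broken_guardrail))  # ['withdraw everything', 'withdraw two of three']
```

The second call is the point of the drill: a guardrail that silently approves everything looks healthy until a known-bad change is pushed through it on purpose.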

BGP audit tools should be read-only by construction. The audit tool was supposed to be non-destructive but used an API that wasn't. Network management tooling should distinguish destructive from non-destructive operations at the API boundary, not in code that has to remember to invoke the right one.
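As a rough illustration of separating operations at the API boundary, the sketch below hands the audit code a wrapper that has no mutating methods at all. Every class and method name here is hypothetical. Python cannot truly hide the wrapped object, so a real system would more likely enforce the split with separate credentials or a separate RPC surface.

```python
from typing import Protocol


class RouterReader(Protocol):
    """The only surface an audit tool is allowed to see."""

    def link_capacity_gbps(self) -> dict[str, float]: ...


class Router:
    """Hypothetical full router handle with both read and write operations."""

    def __init__(self, links):
        self._links = dict(links)
        self.announcing = True

    def link_capacity_gbps(self):
        return dict(self._links)  # return a copy so callers cannot mutate state

    def withdraw_all_routes(self):
        self.announcing = False


class ReadOnlyRouter:
    """Wrapper that forwards queries and simply has no mutating methods."""

    def __init__(self, router):
        self._router = router

    def link_capacity_gbps(self):
        return self._router.link_capacity_gbps()


def audit_capacity(routers: list[RouterReader]) -> float:
    """Sum capacity across routers using only the read-only interface."""
    return sum(sum(r.link_capacity_gbps().values()) for r in routers)


core = Router({"link-a": 400.0, "link-b": 100.0})
view = ReadOnlyRouter(core)
total = audit_capacity([view])
print(total)                                 # 500.0
print(hasattr(view, "withdraw_all_routes"))  # False
```

With this shape, a bug in the audit code can produce a wrong number but has no method to call that changes routing state, which is the property the paragraph above argues for.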

Configuration dependencies create the largest blast radius. Because Meta's authoritative DNS sat inside the same routing announcements as its user-facing services, DNS and the services failed together. Distributing critical services across diverse failure domains (different ASNs, different DNS authorities, different physical infrastructure) bounds the blast radius. See the BGP deep dive for the protocol mechanics.
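A simple audit for this property groups a zone's nameservers by the AS that originates their routes. The sketch below is an assumption-laden illustration: the prefix-to-ASN table is hand-written with documentation prefixes and documentation ASNs, whereas a real check would pull that data from a routing feed.

```python
import ipaddress


def origin_asn(address, announcements):
    """Return the ASN of the most specific announced prefix covering `address`, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for prefix, asn in announcements.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None


def nameserver_failure_domains(nameservers, announcements):
    """Group nameserver addresses by the ASN that originates their route."""
    domains = {}
    for ns in nameservers:
        domains.setdefault(origin_asn(ns, announcements), []).append(ns)
    return domains


def single_point_of_failure(nameservers, announcements):
    """True when every nameserver sits behind one origin ASN."""
    return len(nameserver_failure_domains(nameservers, announcements)) <= 1


# Hypothetical announcements: prefix -> origin ASN (documentation ranges only).
ANNOUNCEMENTS = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64500,
    "203.0.113.0/24": 64501,
}

shared = single_point_of_failure(["192.0.2.53", "198.51.100.53"], ANNOUNCEMENTS)
diverse = single_point_of_failure(["192.0.2.53", "203.0.113.53"], ANNOUNCEMENTS)
print(shared, diverse)  # True False
```

The first zone has two nameservers on two prefixes, which looks redundant, yet both prefixes share one origin AS, so a single withdrawal takes out both.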

What Meta changed

From the published postmortem and reporting since: BGP audit tooling was rewritten to use read-only APIs; guardrail tooling was rewritten and is now part of routine drills; datacenter physical access systems were reworked to have an authentication path independent of internal DNS; additional out-of-band consoles were deployed to critical routers; and on-call procedures were updated to include "if this happens again, here's the explicit physical-access escalation". Several specific architectural changes are not public, but the general direction was clear: reduce the number of failure modes where recovery tooling shares fate with what it's recovering.