24 June 2019 · ~3h · BGP

A typo in Pennsylvania, felt everywhere.

On 24 June 2019, a small Pittsburgh-area ISP (DQE Communications, AS33154) leaked roughly 20,000 BGP routes through Verizon Business (AS701). Verizon happily re-advertised the leak to the rest of the internet. For about three hours, a meaningful fraction of global traffic to Cloudflare, AWS, Google, Facebook and others was funnelled through a small network in western Pennsylvania that could not possibly carry it.

Start 10:30 UTC · Restored 13:30 UTC · blog.cloudflare.com


TL;DR

DQE Communications (AS33154), a small Pittsburgh ISP, was running a BGP optimiser that decomposed its customer Allegheny Technologies' inbound routes into more specific prefixes for better path selection inside its own network. Those more-specific optimisations were never meant to leave AS33154. A misconfiguration leaked them upstream to Verizon Business (AS701), which had no prefix filters or max-prefix limits on the session and happily re-announced roughly 20,000 routes — including more-specifics of Cloudflare, AWS, Google, Facebook and many others — to the rest of the internet's tier-1 fabric. More-specific routes always win in BGP, so global traffic to those prefixes converged on a small Pennsylvania network. The outage lasted about three hours, from roughly 10:30 UTC to 13:30 UTC.

Timeline

UTC    Event
10:30 DQE Communications (AS33154) begins announcing roughly 20,000 more-specific prefixes — produced by an internal BGP optimiser — over its session with Verizon Business (AS701).
10:33 Verizon accepts the announcements without prefix filtering or a max-prefix limit and re-advertises them to its peers and customers across the global default-free zone.
10:35 Cloudflare's monitoring picks up a sudden traffic drop on numerous anycast prefixes. Service latency spikes; many edges show partial reachability.
10:40 External BGP monitoring and route-collector services (BGPMon, RIPE RIS) confirm a route leak. Cloudflare, AWS, Linode, Google, and Facebook prefixes appear with AS33154 in the AS_PATH.
10:45 Cloudflare's NOC begins paging Verizon. Status page updated; engineers start tracing upstream.
11:00 – 12:30 Sustained outage. Cloudflare's blog post — published while the incident is still ongoing — names Verizon publicly. Industry monitoring sites (ThousandEyes, Catchpoint) report errors across hundreds of services.
12:39 DQE Communications stops re-announcing the optimised prefixes upstream after operator contact.
13:00 Verizon begins withdrawing the leaked routes from its peers.
13:30 BGP convergence completes across most of the internet. Cloudflare traffic returns to normal levels.

Total time from leak onset to recovery: roughly three hours. The gap between Verizon starting to withdraw the routes (13:00 UTC) and convergence (13:30 UTC) is about 30 minutes, which is normal for BGP: even after the offending announcements stop, every router along the path has to re-run best-path selection and propagate the change.

What went wrong, technically

Three failures stacked. None of them alone would have caused the outage; the combination took the internet's edges down.

The optimiser fired more-specifics into eBGP. DQE was running a Noction IRP-style optimiser that ingests transit routes, breaks them into more-specific prefixes along the path it prefers, and re-injects them into the local routing table for hot-potato path selection inside the network. These optimised routes were tagged for internal use only. A configuration change — the public reports suggest a session was reconfigured without the no-export community in place — caused the optimiser's more-specifics to start flowing out via eBGP to Verizon.
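
As a rough illustration of the mechanism described above (not DQE's actual configuration, which is not public), the sketch below splits a covering prefix into two more-specifics, tags them with the well-known NO_EXPORT community, and applies an eBGP export check that keeps tagged routes inside the AS. The `Route` type and the `optimise` and `export_to_ebgp` helpers are hypothetical names invented for this example.

```python
import ipaddress
from dataclasses import dataclass, field

# Well-known NO_EXPORT community (RFC 1997): 65535:65281.
NO_EXPORT = (65535, 65281)


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv4Network
    as_path: tuple
    communities: frozenset = field(default_factory=frozenset)


def optimise(route, tag_no_export=True):
    """Split a route into its two halves, as an optimiser might (hypothetical)."""
    extra = {NO_EXPORT} if tag_no_export else set()
    tags = frozenset(route.communities | extra)
    return [
        Route(sub, route.as_path, tags)
        for sub in route.prefix.subnets(prefixlen_diff=1)
    ]


def export_to_ebgp(routes):
    """Export policy: never send NO_EXPORT-tagged routes to an external peer."""
    return [r for r in routes if NO_EXPORT not in r.communities]


learned = Route(ipaddress.ip_network("1.1.1.0/24"), (174, 13335))

tagged = optimise(learned)                         # internal-only more-specifics
untagged = optimise(learned, tag_no_export=False)  # the failure mode

print([str(r.prefix) for r in export_to_ebgp(tagged)])    # nothing leaves the AS
print([str(r.prefix) for r in export_to_ebgp(untagged)])  # both /25s leak
```

The only difference between the safe and the leaking case is the missing tag, which is why a single session reconfiguration is enough to cause this class of failure.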

Verizon had no prefix filter or max-prefix limit on the session. Tier-1 BGP hygiene calls for two safeguards on every customer session: a prefix-list that restricts what the customer can announce to the prefixes they actually own, and a max-prefix limit that tears down the session if the customer announces orders of magnitude more routes than agreed. Verizon's session with AS33154 had neither configured. When the leak began, 20,000 unauthorised prefixes simply walked in.
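
The first of those two safeguards, the per-customer prefix-list, can be sketched in a few lines. This is a minimal model, not Verizon's tooling: the allow-list uses a documentation prefix rather than a real DQE allocation, and the /24 length cap is an assumed policy.

```python
import ipaddress

# Hypothetical customer allow-list; 203.0.113.0/24 is a documentation prefix,
# not a real DQE allocation.
CUSTOMER_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]
MAX_LENGTH = 24  # assumed policy: accept nothing longer than a /24


def accept(announced):
    """Accept an announcement only if it falls inside the customer's own space."""
    net = ipaddress.ip_network(announced)
    if net.prefixlen > MAX_LENGTH:
        return False
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in CUSTOMER_PREFIXES
    )


for prefix in ["203.0.113.0/24", "1.1.1.0/24", "1.1.1.0/25", "1.1.1.128/25"]:
    print(prefix, "accept" if accept(prefix) else "reject")
```

With a filter of this shape on the session, every leaked Cloudflare, AWS, or Google prefix is rejected at the border regardless of how many arrive.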

BGP always prefers more-specific routes. Strictly, this is longest-prefix matching in the forwarding table, and it applies before any BGP attribute is compared: a router with both 1.1.1.0/24 and 1.1.1.128/25 in its table will send traffic for 1.1.1.200 to whoever advertised the /25, regardless of AS_PATH length or local preference. The optimiser's whole point was producing more-specifics. Once those more-specifics escaped into the global table, every router that accepted them preferred them, and traffic for chunks of Cloudflare, AWS, and others started flowing through DQE.
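
The 1.1.1.200 example can be reproduced with a toy forwarding table. This is a minimal sketch of longest-prefix matching using Python's standard `ipaddress` module; the next-hop labels are illustrative strings, not real router output.

```python
import ipaddress

# Toy forwarding table: prefix -> next-hop label (labels are illustrative).
table = {
    ipaddress.ip_network("1.1.1.0/24"): "via 174 13335 (legitimate /24)",
    ipaddress.ip_network("1.1.1.128/25"): "via 701 33154 13335 (leaked /25)",
}


def lookup(destination):
    """Return the next hop for the longest prefix that contains the address."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]


print(lookup("1.1.1.200"))  # covered by both entries; the /25 wins
print(lookup("1.1.1.10"))   # only the /24 covers it
```

Note that the AS_PATH never enters the comparison: the leaked route wins on prefix length alone, even though its path is longer.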

Route leak vs route hijack: the difference. A hijack is when a network originates prefixes it does not own, usually maliciously (YouTube/Pakistan 2008, Amazon Route 53/MyEtherWallet 2018). A leak is when a network announces routes it learned from one neighbour to another neighbour against policy, typically advertising provider routes to a peer or another provider, or peer routes to another peer. The 2019 incident was a leak: DQE had learned the underlying routes legitimately, but had no business propagating them upstream as if it were providing transit. RFC 7908 formalises six leak types; this was a Type 1, the "hairpin turn" (a customer passing routes from one provider to another).
# A normal BGP route announcement for Cloudflare's 1.1.1.0/24
*> 1.1.1.0/24    AS_PATH: 174 13335   (Cogent → Cloudflare)
*> 1.1.1.0/24    AS_PATH: 3356 13335  (Lumen → Cloudflare)

# What appeared on 2019-06-24 around 10:35 UTC
*> 1.1.1.0/25    AS_PATH: 701 33154 13335   (Verizon → DQE → Cloudflare)
*> 1.1.1.128/25  AS_PATH: 701 33154 13335   (Verizon → DQE → Cloudflare)

# More-specifics beat /24, so all global traffic for 1.1.1.x
# now flows: <wherever you are> → Verizon → DQE → Cloudflare
# DQE has roughly a 10 Gbps uplink. Cloudflare has terabits.
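
The leak definition above reduces to a small export rule, often called valley-free routing: routes learned from a customer may be sent to anyone, while routes learned from a provider or a peer may be sent only to customers. The sketch below encodes just that rule. The relationship labels and the `is_leak` name are invented for illustration, and it does not attempt the full RFC 7908 taxonomy.

```python
ROLES = {"customer", "peer", "provider"}


def is_leak(learned_from, exported_to):
    """True if exporting the route violates the valley-free rule.

    Both arguments describe the neighbour's role relative to the network
    doing the export: "customer", "peer" or "provider".
    """
    if learned_from not in ROLES or exported_to not in ROLES:
        raise ValueError("unknown neighbour relationship")
    if learned_from == "customer":
        return False                   # customer routes may go to anyone
    return exported_to != "customer"   # provider/peer routes go to customers only


print(is_leak("provider", "provider"))  # transit routes re-sent to an upstream
print(is_leak("customer", "provider"))  # ordinary customer announcement
print(is_leak("peer", "customer"))      # normal: peers' routes offered to customers
```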

Why Cloudflare was disproportionately affected

Cloudflare runs an anycast network: the same IP prefix (for example 1.1.1.0/24) is announced from dozens of points of presence around the world, and BGP delivers each client's traffic to the nearest PoP by AS_PATH and policy. This is normally a feature — capacity and latency both win — but it depends on Cloudflare's announcements being the ones routers see.

The DQE leak inserted a more-specific announcement for Cloudflare prefixes into the global table. More-specifics override the anycast announcement entirely, regardless of how many PoPs Cloudflare was advertising from. Suddenly every router in reach of the Verizon-propagated route was sending Cloudflare-bound traffic to AS33154 instead of to the nearest Cloudflare PoP. The anycast topology that usually distributes load across hundreds of edges collapsed onto a single regional ISP.

Why a small ISP's leak hits everyone. BGP is a hop-by-hop path-vector protocol with no global view. Each router picks the best path it knows; its idea of "best" is shaped almost entirely by what its neighbours tell it. When a tier-1 like Verizon implicitly vouches for a route by re-announcing it, every AS that hears the route from Verizon inherits the same belief. There is no authority that says "Cloudflare's prefix can only be originated by AS13335", unless you have deployed RPKI Route Origin Validation, which most networks in 2019 had not.

AWS and Google were hit too, but their announcement footprints are different — fewer prefixes, more uniform PoP layout — so the proportional impact on Cloudflare's anycast-heavy design was larger. Cloudflare's then-CTO John Graham-Cumming wrote the incident up the same afternoon.

The fix during the incident

There was nothing Cloudflare could do at the protocol level. They did not own AS33154, they did not have a BGP session to Verizon they could deprioritise, and they could not stop a more-specific announcement someone else was injecting. The only fix was to make the announcement stop.

Cloudflare's NOC paged Verizon repeatedly starting around 10:45 UTC. The public reporting suggests that escalation took a long time because the NOC contacts were either not staffed or did not have authority to tear down the offending session. DQE Communications eventually stopped sending the optimised announcements upstream around 12:39 UTC. Verizon then began withdrawing the routes it had propagated, around 13:00 UTC. Normal BGP convergence (every router in the default-free zone re-running best-path selection and propagating withdrawals) took about another 30 minutes from that point, putting full recovery at roughly 13:30 UTC.

The post-incident finger-pointing focused on Verizon. A tier-1 transit provider running eBGP sessions to small customers without a prefix-list and without a max-prefix limit is, by current operational consensus, an unforced error.

Lessons that propagated industry-wide

The 2019 leak became a forcing function for several BGP-security initiatives that had been moving slowly for years.

RPKI Route Origin Validation (ROV) adoption accelerated. RPKI lets a prefix owner publish a cryptographically signed Route Origin Authorisation (ROA) saying "AS13335 is allowed to originate 1.1.1.0/24", along with a maximum prefix length. A router doing ROV would have looked at 1.1.1.128/25 with AS_PATH 701 33154 13335, found a covering ROA for 1.1.1.0/24 and, provided that ROA capped the length at /24, marked the /25 as invalid and dropped it. The origin AS in the leaked path is still AS13335, so it is the maximum-length check, not the origin check, that catches this kind of leak. Cloudflare turned ROV on for its own peers shortly after the incident. Major transits, including AT&T, NTT, and eventually Verizon, followed over the next 18 months.
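
The validation step can be sketched as a small function following the valid / invalid / not-found states defined in RFC 6811. This is a simplified model, not a real validator: the ROA list is hard-coded, and the maximum length of 24 is an assumption made for the example.

```python
import ipaddress
from collections import namedtuple

ROA = namedtuple("ROA", "prefix max_length asn")

# Illustrative ROA set; the max_length of 24 is an assumption for this example.
ROAS = [ROA(ipaddress.ip_network("1.1.1.0/24"), 24, 13335)]


def validate(prefix, origin_asn):
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    net = ipaddress.ip_network(prefix)
    covering = [
        r for r in ROAS
        if net.version == r.prefix.version and net.subnet_of(r.prefix)
    ]
    if not covering:
        return "not-found"
    for roa in covering:
        if roa.asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid"


print(validate("1.1.1.0/24", 13335))    # valid
print(validate("1.1.1.128/25", 13335))  # invalid: longer than max_length
print(validate("1.1.1.0/24", 64500))    # invalid: wrong origin
print(validate("192.0.2.0/24", 64500))  # not-found: no covering ROA
```

The second case is the one that matters here: the origin is correct, and only the length cap makes the leaked /25 invalid.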

MANRS (Mutually Agreed Norms for Routing Security) gained signatures. MANRS commits a network to four practices: prefix filtering, anti-spoofing, coordination contact information, and global validation (publishing ROAs and routing policy). The number of network operators publicly committing to MANRS roughly doubled in the 18 months after June 2019.

Max-prefix limits became table stakes. The defence that would have caught the leak at Verizon's border ("this customer normally advertises 40 prefixes, tear the session down if they suddenly advertise 4,000") is a one-line directive on most platforms: maximum-prefix on Cisco IOS, prefix-limit on Junos. It is now standard in tier-1 customer onboarding checklists. The 2019 incident is one of the canonical references operators cite when arguing for the configuration.
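
The behaviour of a max-prefix limit is simple enough to model directly. The sketch below is a toy session object, not router code: the `Session` class is hypothetical, the limit of 100 is an assumed value, and placeholder strings stand in for real prefixes.

```python
class MaxPrefixExceeded(Exception):
    """Raised when a neighbour announces more prefixes than agreed."""


class Session:
    def __init__(self, neighbour_asn, max_prefixes):
        self.neighbour_asn = neighbour_asn
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.established = True

    def receive(self, prefix):
        """Record an announcement; tear the session down if the limit is passed."""
        if not self.established:
            raise MaxPrefixExceeded("session is down")
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            self.established = False
            self.prefixes.clear()  # routes from a torn-down session are withdrawn
            raise MaxPrefixExceeded(
                f"AS{self.neighbour_asn} exceeded {self.max_prefixes} prefixes"
            )


session = Session(neighbour_asn=33154, max_prefixes=100)  # limit is an assumed value
try:
    for i in range(20_000):
        session.receive(f"leak-{i}")  # placeholder strings standing in for prefixes
except MaxPrefixExceeded as exc:
    print("torn down:", exc)
print("routes still accepted:", len(session.prefixes))
```

The limit is blunt: it drops the customer's legitimate routes along with the leaked ones. That is the intended trade-off, since one customer losing connectivity is far cheaper than re-announcing a leak to the global table.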

What Cloudflare actually changed

Cloudflare's own follow-up work focused on detection and response rather than prevention (since the prevention happens at other people's routers).

Internal monitoring for prefix hijacks and leaks of Cloudflare-owned prefixes was expanded. The team built tooling that watches global BGP feeds (RIPE RIS, RouteViews) for any announcement of a Cloudflare prefix whose origin AS is not AS13335 or whose AS_PATH contains unexpected hops. Detection latency for a leak of this shape dropped from "a customer tells us" to "alert fires within a couple of minutes".
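
A detector of the kind described above might apply checks like the following to each announcement seen on a public feed. This is a hedged sketch, not Cloudflare's tooling: the owned-prefix list, the set of expected upstream ASNs, and the `check` function are all assumptions made for illustration, and feed parsing is left out entirely.

```python
import ipaddress

OWNED = [ipaddress.ip_network("1.1.1.0/24")]  # illustrative subset of owned space
EXPECTED_ORIGIN = 13335
EXPECTED_UPSTREAMS = {174, 3356}              # assumed set of transit ASNs


def check(prefix, as_path):
    """Return a list of alert strings for one observed announcement."""
    net = ipaddress.ip_network(prefix)
    covering = [
        own for own in OWNED
        if net.version == own.version and net.subnet_of(own)
    ]
    if not covering or not as_path:
        return []                             # not our space, or nothing to inspect
    alerts = []
    if as_path[-1] != EXPECTED_ORIGIN:
        alerts.append(f"unexpected origin AS{as_path[-1]}")
    if any(net.prefixlen > own.prefixlen for own in covering):
        alerts.append(f"more-specific {net} of an owned prefix")
    # The AS adjacent to the origin should be a known upstream.
    if len(as_path) >= 2 and as_path[-2] not in EXPECTED_UPSTREAMS:
        alerts.append(f"unexpected upstream AS{as_path[-2]}")
    return alerts


print(check("1.1.1.0/24", [174, 13335]))           # clean announcement
print(check("1.1.1.128/25", [701, 33154, 13335]))  # shape of the 2019 leak
```

A leak of the 2019 shape trips two checks at once (an unexpected more-specific and an unexpected AS next to the origin) even though the origin AS itself is correct.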

Automated NOC escalation was hardened: pre-shared contact information for major transit providers, scripted incident creation, and a public commitment in the blog post that "we will name the upstream causing the leak in real time" — which became a notable cultural shift in how outages were communicated, and not just at Cloudflare. The 2019 post itself was published during the incident, naming Verizon, with traffic graphs showing the drop. That was unusual at the time.

Cloudflare also became one of the loudest public advocates for RPKI, publishing isbgpsafeyet.com to track which major ISPs do ROV. It is, in part, a campaigning tool — name and shame as a fix for an industry collective-action problem.

The broader lesson

BGP is a trust-based protocol designed in 1989 when the operators ran the network as a small club. A single typo, or a single optimiser misconfiguration, at a small ISP can ripple through to a meaningful fraction of the internet's edges in minutes — because the larger networks in the path treated the announcement as authoritative and re-told the lie at scale. The 2019 incident is one of the cleanest demonstrations of that structural weakness.

RPKI is the long-term cryptographic fix: prefix owners sign ROAs, routers verify them, invalid routes get dropped. The rollout has been steady but not fast. As of late 2024, ROAs cover roughly 50% of internet routes by some measures. That figure counts signed prefixes, not networks that enforce ROV, so it does not by itself say how much of the default-free zone drops invalid announcements automatically. Everything outside that coverage is still relying on operational hygiene, prefix-lists, and luck.

For protocol depth — how BGP actually picks paths, what attributes propagate, why withdrawal takes minutes — see the BGP deep dive in the networking stack.

Field              Value
Start              10:30 UTC, 24 June 2019
Peak impact        ~11:00 – 12:30 UTC
Restored           ~13:30 UTC
Total downtime     ~3 hours from leak onset to global convergence
Services affected  Cloudflare, AWS, Linode, Google, Facebook, many others reachable via Verizon
Root cause         BGP optimiser at DQE (AS33154) leaked ~20,000 more-specific prefixes to Verizon (AS701); Verizon had no prefix-list or max-prefix limit on the session
Fix                DQE withdrew the optimised announcements upstream around 12:39 UTC; Verizon withdrew the propagated routes; normal BGP convergence over the next 30 minutes
