BGP

BGP is the routing protocol between independent networks. Every ISP, every cloud provider, every CDN, every large enterprise speaks BGP to its peers and decides — based on policy and economics, not shortest path — which routes to accept and which to advertise. The internet is what emerges from tens of thousands of these networks running BGP at each other. It works remarkably well. When it doesn't, large parts of the internet stop.

Autonomous systems

An autonomous system (AS) is a network with a single coherent routing policy under one administrative entity. Comcast is an AS. Cloudflare is an AS. Your university is probably an AS. Each gets a 16- or 32-bit number from a regional registry; the number is the global identity used in BGP. There are roughly 80,000 ASes with at least one active route in the global table, and the graph that connects them — who hands traffic to whom — is the internet's actual shape. IP gives every host an address; BGP is what makes one AS able to reach an address that lives inside another.

Inside an AS, the operator runs an interior routing protocol such as OSPF or IS-IS, which answers "what is the cheapest way across my network" by shortest path. Between ASes, operators run BGP by policy. BGP doesn't pick "the shortest path"; it picks "the path my policy prefers". That single difference (interior routing optimises distance, exterior routing optimises money and control) is the reason BGP exists as a separate protocol.

BGP is a path-vector protocol. A distance-vector protocol tells neighbours "I can reach X at cost 5" and trusts the number. BGP instead carries the whole list of ASes a route passed through — the AS-path. Carrying the path, not just a metric, is what lets an AS detect loops (if it sees its own number in the path, it drops the route) and lets every operator apply policy to the actual chain of networks a packet would traverse rather than to an opaque cost.

Transit links (solid) run vertically — the lower AS pays the upper one. The peering link (dashed) is horizontal and free. This shape repeats at every level of the internet.

Route propagation

A BGP speaker advertises prefixes to its neighbours: "I can reach 8.8.8.0/24, and the AS-path to get there is [AS15169]". Each neighbour, if it accepts the announcement under its policy, prepends its own AS to the path and forwards the announcement onward.

AS15169 (Google) advertises 8.8.8.0/24                  → path: [15169]
AS3356 (Lumen) accepts, prepends:    8.8.8.0/24         → path: [3356, 15169]
AS6939 (HE.net) accepts from Lumen, prepends:           → path: [6939, 3356, 15169]
AS64500 (your edge) accepts from HE.net, prepends:      → path: [64500, 6939, 3356, 15169]

Notice what the path is doing: it grows by exactly one AS at each hop, and it always grows on the left. The leftmost AS is whoever told you about the route; the rightmost AS is the origin that announced the prefix in the first place. Read right to left, the path is the itinerary the announcement followed; read left to right, it is the chain of networks a packet headed for the prefix would cross. This is why a long AS-path is a soft signal of a less direct route, and why prepending your own AS several times is the standard trick for making a route look worse so neighbours prefer a different entry point.
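The prepend-and-forward behaviour and the loop check can be sketched in a few lines of Python. This is an illustrative model under simplifying assumptions, not how any router implements it; the function names `receive` and `advertise` are hypothetical.

```python
from typing import List, Optional


def receive(local_as: int, as_path: List[int]) -> Optional[List[int]]:
    """Loop check on receipt: drop the route if our own AS is already in the path."""
    if local_as in as_path:
        return None  # route dropped
    return list(as_path)


def advertise(local_as: int, as_path: List[int], prepend_count: int = 1) -> List[int]:
    """Path as sent to an external neighbour: our AS is added on the left."""
    return [local_as] * prepend_count + list(as_path)


# Replay the 8.8.8.0/24 example from the text.
from_google = advertise(15169, [])
from_lumen = advertise(3356, from_google)
from_he = advertise(6939, from_lumen)
print(from_he)  # [6939, 3356, 15169]

# Lumen hearing the route back from HE.net sees its own AS and drops it.
print(receive(3356, from_he))  # None

# Prepending three times makes a route look two hops longer than it is.
padded = advertise(64500, [], prepend_count=3)
print(padded, len(padded))  # [64500, 64500, 64500] 3
```

Because the loop check only needs the list of AS numbers, no AS has to trust a neighbour's metric to stay loop-free.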

The same advertisement, propagating right to left. The AS-path grows by one entry at every hop, recording the exact chain of networks the route came through.

When a router has multiple paths to the same prefix, it runs the BGP best-path algorithm: highest local preference, shortest AS-path, lowest origin code, lowest MED, eBGP over iBGP, lowest IGP cost to the next hop, lowest router ID. Each step is a tie-breaker. Most of the action is in the first two: local preference (set by your policy, often based on commercial agreements) and AS-path length.

eBGP and iBGP — the same protocol, two jobs

BGP runs in two modes, and conflating them is one of the more common reasons a freshly configured network half-works. eBGP (external BGP) runs between two ASes. It is what every diagram on this page is really showing: announcements crossing an administrative boundary, the AS-path growing, policy being applied. iBGP (internal BGP) runs between routers inside the same AS, carrying the externally learned routes from the edge routers that heard them to every other router that needs them.

The behaviour differs in two ways that matter. First, eBGP prepends the local AS to the path on each hop; iBGP does not, because the route has not left the AS. Second, and this trips people up: a route learned over iBGP is not re-advertised to another iBGP peer. The rule prevents loops inside the AS, but it means a naive iBGP deployment has to be a full mesh — every router peered with every other router — so that everyone hears every route directly. A full mesh of N routers needs N(N-1)/2 sessions, which is fine at five routers and unmanageable at fifty.

Two mechanisms break the full-mesh requirement. A route reflector is a router allowed to re-advertise iBGP routes to its clients, so the mesh collapses into a hub-and-spoke shape. Confederations split one AS into several sub-ASes that speak eBGP to each other internally while presenting a single AS number to the outside. Most networks pick route reflectors; they are simpler to reason about. Either way, the externally visible behaviour is identical — the rest of the internet sees one AS with one policy.
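To make the scaling argument concrete, here is a small sketch comparing session counts. It assumes a simple design with two route reflectors that peer with each other and with every client; the function names and the choice of two reflectors are illustrative assumptions, not a recommendation.

```python
def full_mesh_sessions(n: int) -> int:
    """iBGP sessions needed when every router peers with every other router."""
    return n * (n - 1) // 2


def route_reflector_sessions(n: int, reflectors: int = 2) -> int:
    """Sessions when each client peers with every reflector (assumed design);
    the reflectors form a full mesh among themselves."""
    clients = n - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for n in (5, 50):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n))
# 5 10 7
# 50 1225 97
```

The full mesh grows quadratically while the reflector design grows linearly with the number of routers, which is the whole case for reflectors.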

The economics — transit, peering, customer

Operators classify their BGP peers into three categories, and the categories drive policy more than any technical detail does:

Relationship	Money	What you advertise
Customer	They pay you	All routes you have, including transit-learned
Peer	No money (settlement-free)	Your customer routes only
Transit	You pay them	Your customer routes only

The "valley-free" property: traffic should never cross your network from one transit provider to another (transit → you → transit, where you would be paying both upstreams to carry traffic that is not yours) or from one peer to another (peer → you → peer, where you would be carrying third-party traffic for free). The rules emerge from the policy each AS sets, not from the protocol.
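The export rules in the table above reduce to a one-line predicate. The sketch below is a simplified model: the `Rel` enum and `should_export` function are hypothetical names, and routes you originate yourself are treated as if they were customer routes.

```python
from enum import Enum


class Rel(Enum):
    CUSTOMER = "customer"
    PEER = "peer"
    TRANSIT = "transit"


def should_export(learned_from: Rel, export_to: Rel) -> bool:
    """Customer routes go to everyone; customers receive everything.
    Anything else (peer or transit routes sent to a peer or transit) is a leak."""
    return learned_from is Rel.CUSTOMER or export_to is Rel.CUSTOMER


# Print the full export matrix.
for src in Rel:
    for dst in Rel:
        print(f"learned from {src.value:8} -> export to {dst.value:8}: {should_export(src, dst)}")
```

Every combination that returns `False` corresponds to a valley: traffic would enter from one non-paying neighbour and leave through another.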

Tier-1 networks. A small set of large ASes pay no one for transit; they reach the rest of the internet purely through peering. Commonly cited examples are Lumen, Cogent, AT&T, Verizon Business, NTT, Telia (now Arelion), and GTT. Below them are tier-2 networks (big regional ISPs that buy some transit and peer where they can) and tier-3 networks (everyone else, paying for transit).

Communities

A BGP community is a 32-bit (or 64-bit extended, or 96-bit large) tag attached to an advertisement. Each operator defines what their communities mean; communities don't affect routing directly, but operators' policies act on them.

Common patterns:

Geo tags. "Originated in this region" — useful for local-pref policies.
Type tags. "Customer route" vs "peer route" vs "transit route" — drives the valley-free policy.
Action tags. "Don't export to AS X" or "prepend my ASN once before exporting". A customer attaches these to its own announcements to steer how the operator handles them, without having to ask.

Operators publish their community catalogues in their NOC documentation. Anyone whose action communities the operator honours can use them to shape how their routes are exported, which in turn steers their own inbound traffic.
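A standard community is conventionally written as `ASN:value`, with the upper 16 bits holding the AS number of the operator that defined it and the lower 16 bits holding an operator-chosen value. The helpers below are a minimal sketch of that encoding; the function names are hypothetical, and `64500:100` is a made-up example value with no defined meaning.

```python
def encode_community(text: str) -> int:
    """Pack 'ASN:value' into the 32-bit standard community form."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves of a standard community must fit in 16 bits")
    return (asn << 16) | value


def decode_community(raw: int) -> str:
    """Unpack a 32-bit standard community into 'ASN:value'."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


# NO_EXPORT is a well-known community defined in RFC 1997.
NO_EXPORT = 0xFFFFFF01
print(decode_community(NO_EXPORT))  # 65535:65281

# Hypothetical operator-defined tag, round-tripped through the encoding.
tag = encode_community("64500:100")
print(hex(tag), decode_community(tag))
```

The 16-bit AS field is the reason large communities exist: a 32-bit AS number does not fit in the upper half of a standard community.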

Route leaks and hijacks

BGP trusts what its neighbours tell it. If a neighbour announces a prefix they don't actually own, and your policy doesn't filter it, your router happily believes them and forwards traffic for that prefix toward them.

Two famous failure modes:

Route leak. An AS accidentally re-announces routes it learned from one peer or transit provider to another peer or transit provider, in violation of valley-free. Traffic that was supposed to take a direct path now goes through the leaking AS, which often can't carry it. The 2019 incident where a small Pennsylvania ISP leaked routes that pulled traffic for Google and Cloudflare into its network was a route leak.
Route hijack. An AS announces a prefix it doesn't own, intentionally or accidentally. Other ASes accept it because they don't have a way to verify. Pakistan Telecom briefly hijacked YouTube's prefix in 2008 trying to block YouTube domestically, and the hijack escaped to the global internet for several hours.

RPKI — the partial fix

RPKI (Resource Public Key Infrastructure, RFC 6480 et al.) is a system of signed records that lets anyone cryptographically verify which ASes are authorised to originate which prefixes. Each prefix holder publishes a Route Origin Authorization (ROA), signed under a certificate chain rooted at their regional registry: "AS15169 may originate 8.8.8.0/24".

A router doing RPKI route origin validation checks every received announcement against the ROA database and:

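Valid: the origin AS matches a published ROA and the prefix is no longer than that ROA's maximum length. Accept normally.
Invalid: a ROA covers the prefix, but the origin AS differs or the announcement is more specific than the ROA allows. Drop the announcement.
Not found: no ROA covers the prefix. Accept (transitional).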
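The three outcomes can be sketched as a small function over an in-memory list of ROAs. This follows the general shape of the origin-validation procedure in RFC 6811, including the maximum-length check, but it is a simplified illustration: the `ROA` tuple and `validate` function are hypothetical names, and a real validator works from a cryptographically verified cache rather than a Python list.

```python
import ipaddress
from typing import List, NamedTuple


class ROA(NamedTuple):
    prefix: str       # authorised prefix, e.g. "8.8.8.0/24"
    max_length: int   # longest prefix length the holder allows
    asn: int          # AS authorised to originate it


def validate(prefix: str, origin_as: int, roas: List[ROA]) -> str:
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA is relevant only if it covers the announced prefix.
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_as and announced.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [ROA("8.8.8.0/24", 24, 15169)]
print(validate("8.8.8.0/24", 15169, roas))      # valid
print(validate("8.8.8.0/24", 64500, roas))      # invalid (wrong origin)
print(validate("8.8.8.0/25", 15169, roas))      # invalid (too specific)
print(validate("203.0.113.0/24", 64500, roas))  # not-found
```

The maximum-length check is what stops a hijacker from winning with a more-specific announcement that still names the legitimate origin's prefix range.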

By 2024, large transit networks (NTT, Lumen, Telia, GTT, Hurricane Electric) drop RPKI-invalid routes by default. Cloudflare, Amazon, and Google all sign their prefixes with ROAs. The shared incentive is real: a hijack that announces an RPKI-protected prefix from the wrong origin is now dropped by every operator that performs origin validation.

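RPKI origin validation doesn't solve route leaks: a leaked route still carries the correct origin, and a forged path can end in the legitimate origin AS and validate just the same. The follow-on, BGPsec, signs the path itself; deployment is minimal. ASPA (Autonomous System Provider Authorization) is a lighter-weight approach to path validation that is still being standardised and has only early deployments.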

BGP in datacentres

Most hyperscalers run BGP not just between networks but inside their datacentre fabrics. RFC 7938 describes the Clos-topology pattern: each leaf (top-of-rack) switch gets its own private AS number, spines in the same tier typically share one, and eBGP carries reachability information across the fabric. ECMP at each layer fans traffic out across multiple equal-cost paths.

This reuses BGP's well-tested implementation instead of stretching IGPs (OSPF / IS-IS) to handle the scale. It also gives you BGP's mature traffic-engineering primitives (communities, route reflection, ECMP) for free. Most modern Kubernetes networking fabrics — Calico in BGP mode, Cilium with BGP, MetalLB BGP — extend this idea inside the cluster.

Tools

Tool / Site	Use for
RIPE Atlas	Run ping, traceroute, and DNS measurements from thousands of probes worldwide.
BGPmon / RouteViews	Looking glasses — see the global routing table from a particular vantage point.
bgp.tools	Quick AS info, peer relationships, prefix lookups. The first thing to type.
radar.cloudflare.com	Live internet observability, including BGP announcements and incidents.
peeringdb.com	The directory of who peers with whom and where. Indispensable for network engineers.
`show bgp neighbors` (Cisco), `show route receive-protocol bgp <neighbor-address>` (Juniper)	What your own router knows: session state on the one hand, routes received from a neighbour on the other. Different syntax per vendor; same concepts.

Path attributes — how BGP picks one route

A real router with multiple paths to the same prefix has to pick one. BGP's selection algorithm runs through path attributes in a fixed order; the first attribute that differs decides. The Cisco implementation (which became the de-facto standard) goes roughly like this:

1. Weight. Cisco-specific local-only attribute. Highest wins. Used to force traffic out a specific interface regardless of everything else.

2. LOCAL_PREF. Local-preference within an AS. Higher wins. The operator's way to express "prefer this exit point". Set high on customer routes, lower on peer routes, lowest on transit — encodes the economic preference automatically.

3. Locally originated. Prefer routes the local router originated over routes learned from neighbours.

4. AS_PATH length. Shorter wins. The textbook BGP rule; in practice usually overridden by LOCAL_PREF before reaching this step.

5. Origin type. IGP-learned beats EGP-learned beats incomplete. Historical, rarely the deciding factor anymore.

6. MED (Multi-Exit Discriminator). Lower wins. A hint from one AS to another about which entry point to prefer for inbound traffic. By default it is compared only between paths received from the same neighbouring AS; the "always-compare-med" knob lifts that restriction. Behaviour is notoriously inconsistent across vendors.

7. eBGP over iBGP. Prefer paths learned from external peers over internal ones.

8. IGP metric to the next hop. Lower wins. The "hot-potato routing" step — exit traffic at the closest internal egress point.

9. Router ID as the tiebreaker. Lowest router ID wins. Arbitrary but deterministic.

Knowing this list cold is the BGP-operator interview filter. Knowing which knob to turn first when you have to influence a route — LOCAL_PREF for outbound preference, AS_PATH prepending or community tags for inbound preference — is the working skill.
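The ordered tie-breakers above map naturally onto a tuple sort key, where the first differing element decides. The sketch below is a simplification under stated assumptions: the `Path` class and its field names are hypothetical, MED is compared across all paths (as if "always-compare-med" were on), and the later vendor-specific steps are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    name: str
    as_path: List[int]
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    router_id: int = 0


def preference_key(p: Path) -> Tuple:
    """Smaller tuple = more preferred; fields are in decision order."""
    return (
        -p.weight,                 # 1. highest weight
        -p.local_pref,             # 2. highest LOCAL_PREF
        not p.locally_originated,  # 3. locally originated first
        len(p.as_path),            # 4. shortest AS_PATH
        ORIGIN_RANK[p.origin],     # 5. IGP < EGP < incomplete
        p.med,                     # 6. lowest MED (simplified, see lead-in)
        not p.ebgp,                # 7. eBGP over iBGP
        p.igp_metric,              # 8. lowest IGP metric to next hop
        p.router_id,               # 9. lowest router ID
    )


def best_path(paths: List[Path]) -> Path:
    return min(paths, key=preference_key)


# A longer customer path beats a shorter transit path because LOCAL_PREF is checked first.
customer = Path("via-customer", [64501, 64502, 64503, 64504], local_pref=200)
transit = Path("via-transit", [64510, 64511], local_pref=100)
print(best_path([customer, transit]).name)  # via-customer
```

With equal LOCAL_PREF the same two paths would resolve the other way, on AS_PATH length.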

Route convergence — why incidents take minutes, not seconds

BGP converges slowly by design. When a route disappears, the news propagates one AS-hop at a time, slowed by hold timers, MRAI (Minimum Route Advertisement Interval) delays, and path hunting, where routers try increasingly long paths before declaring a prefix unreachable.

The defaults that govern convergence:

Hold timer (180s default on Cisco; RFC 4271 suggests 90s, which Juniper uses as its default). If no KEEPALIVE or UPDATE arrives within the hold time, the session is declared dead and the routes from that peer are withdrawn. Typical operator tuning: 30s hold, 10s keepalive.

MRAI (30s default for eBGP, 5s for iBGP). A router waits MRAI seconds before sending another update for the same prefix to the same peer. Damps update storms; slows convergence.

Path hunting. When the primary route to a prefix dies, the router tries alternative paths it has heard. If those also die, longer paths are tried. During a network-wide outage, paths can grow temporarily to 10+ AS hops before BGP gives up.

BGP graceful restart. A negotiated extension that lets a peer restart its BGP daemon without its neighbours withdrawing its routes: the session itself drops, but the data plane keeps forwarding on the routes it already had while the control plane restarts and resynchronises. Saves convergence pain for routers that need to restart their BGP process for upgrades.

BFD (Bidirectional Forwarding Detection). A separate protocol that detects link failure in milliseconds and triggers BGP withdrawal. Without BFD, BGP waits the full hold timer (default 180s) before declaring a peer dead. Modern data centres run BFD with sub-second timers and converge much faster than BGP defaults suggest.
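The gap between the two detection mechanisms is simple arithmetic. The sketch below assumes the worst case for the hold timer (the peer dies just after its last keepalive) and the usual BFD rule that a session is declared down after a configured number of consecutive missed packets; the 300 ms interval and multiplier of 3 are example values, not defaults from the text.

```python
def hold_timer_detection_s(hold_time_s: float) -> float:
    """Worst-case time to notice a silent peer using only the BGP hold timer."""
    return hold_time_s


def bfd_detection_s(interval_ms: float, multiplier: int) -> float:
    """BFD detection time: packet interval times the number of allowed misses."""
    return interval_ms * multiplier / 1000


print(hold_timer_detection_s(180))  # 180
print(hold_timer_detection_s(30))   # 30
print(bfd_detection_s(300, 3))      # 0.9
```

Even an aggressively tuned 30-second hold timer is more than an order of magnitude slower than the example BFD settings.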

Famous BGP incidents — and what each one teaches

A short tour of incidents that every BGP operator knows:

YouTube hijack (Pakistan, 2008). Pakistan Telecom tried to block YouTube domestically by announcing a more-specific route (/24 covering YouTube IPs). Their upstream did not filter the announcement; it propagated to the global table; YouTube's traffic was blackholed worldwide for hours. Lesson: filter customer announcements aggressively.

Google leak (2018). Google's traffic was routed through Russia and China for 74 minutes after a Nigerian ISP, MainOne, leaked Google prefixes to China Telecom. Lesson: RPKI helps, but route leaks (announcements via the wrong path, not falsified origins) are harder to detect.

Facebook 2021. A backbone maintenance command disconnected Facebook's data centres, and its authoritative DNS servers, unable to reach them, withdrew their own routes from the global table. Result: every Facebook property was unresolvable for about six hours. Internal tools and remote management depended on the same network and DNS that had gone away, locking engineers out of recovery. Lesson: out-of-band management matters; BGP changes affect everything.

Rogers (Canada, 2022). A configuration change during a core network upgrade removed a route filter, overwhelmed the core routers, and led to a complete BGP withdrawal of Rogers' prefixes. Took down 911, banking, and Interac payments across Canada for most of a day. Lesson: change-control matters; failover plans for the failover plans matter.

The 2024 AS_PATH-length attribute overflow. A bug in a major vendor caused malformed updates with extremely long AS_PATHs to be propagated; several major ISP routers crashed processing them. Lesson: BGP messages are still parsed by C code that has occasional CVEs; defence in depth via diverse implementations matters.

BGP message types — what's on the wire

BGP uses TCP port 179. After the TCP handshake, four message types do all the work:

OPEN. The first message exchanged. Each side announces its AS number, router ID, hold time, and supported capabilities (multi-protocol BGP, route refresh, graceful restart, four-byte AS numbers, ADD-PATH).

KEEPALIVE. Empty 19-byte message (header only) sent periodically, typically every third of the hold time: 60 seconds with a 180-second hold timer. Tells the peer "I'm still here, don't tear down the session". If the hold timer elapses without a KEEPALIVE or UPDATE arriving, the session is torn down.

UPDATE. The actual route advertisements. Contains withdrawn routes (prefixes to remove) and announced routes (prefixes with their path attributes). A single UPDATE message can carry hundreds of prefixes as long as they share the same path attributes; large BGP tables generate millions of UPDATEs during convergence events.

NOTIFICATION. Sent before tearing down a session, with an error code explaining why. "Hold timer expired", "bad message length", "unsupported capability". Useful for debugging; the most common notification a BGP operator sees is "Cease" (administrative shutdown).
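Every one of these messages starts with the same fixed header defined in RFC 4271: a 16-byte marker of all ones, a 2-byte length, and a 1-byte type. The sketch below builds a KEEPALIVE and parses the header back; the function names are hypothetical, and the 4096-byte ceiling is the classic limit (the extended-message capability raises it).

```python
import struct

HEADER = struct.Struct("!16sHB")  # marker, length, type (network byte order)
MARKER = b"\xff" * 16
TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def build_keepalive() -> bytes:
    """A KEEPALIVE is just the header: length 19, type 4, no body."""
    return HEADER.pack(MARKER, HEADER.size, 4)


def parse_header(data: bytes):
    """Return (length, type name) for the message starting at data[0]."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 19 bytes for a BGP header")
    marker, length, msg_type = HEADER.unpack_from(data)
    if marker != MARKER:
        raise ValueError("bad marker: connection not synchronised")
    if not HEADER.size <= length <= 4096:
        raise ValueError("bad message length")
    return length, TYPES.get(msg_type, f"UNKNOWN({msg_type})")


keepalive = build_keepalive()
print(len(keepalive))           # 19
print(parse_header(keepalive))  # (19, 'KEEPALIVE')
```

The two `ValueError` cases correspond to the "connection not synchronized" and "bad message length" header errors a NOTIFICATION can report.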

Memory footprint of a full table — what makes BGP heavy

The full IPv4 routing table in 2026 has ~970K prefixes; the IPv6 table has ~210K.

One Adj-RIB-In per peer holds an unfiltered copy of what that neighbour sent. Best-path selection picks a single winner per prefix into the Loc-RIB, which programs the FIB the data plane forwards on. Three peers means three copies of the table in memory.

A router holding the full table needs to:

Store every Adj-RIB-In. One copy of the table per peer, before policy is applied. A router with 5 transit peers holds 5× the table.

Compute the Loc-RIB. The best-path selection across all Adj-RIB-In entries. CPU-heavy during convergence; trivial in steady state.

Maintain the FIB. The forwarding table actually used by the data plane. Typically hashed/longest-prefix-match-optimised. The FIB is what gets blown away when a router runs out of memory for the routing table — symptoms include specific destinations becoming unreachable while others work fine.

Practical numbers: a Cisco ASR9K router running full tables from 5 peers needs ~2-4 GB of RAM just for BGP state, plus FIB capacity. Cheap commodity routers (Ubiquiti EdgeRouter, older Mikrotik) cannot hold a full table without crashing; they run on default routes pointed at their upstream.

This is why prefix aggregation matters to everyone, not just the AS doing it. An operator that owns a contiguous block — say 198.51.0.0/16 — can announce that one /16 rather than 256 separate /24s, and every router on the internet then carries one entry instead of 256. Deaggregation, announcing the more-specific pieces, is sometimes deliberate (for inbound traffic engineering, since a more-specific route always wins regardless of AS-path) and sometimes sloppy. The global table grows a few percent a year, and a meaningful slice of that growth is needless deaggregation. Every operator pays for it in router memory, which is the quiet tragedy-of-the-commons running underneath BGP: there is no per-prefix cost to the AS that announces it, but a global cost to everyone who has to store it.
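Both halves of that argument, the table savings from aggregation and the way a more-specific route wins, can be checked with Python's standard `ipaddress` module. The `longest_match` helper is a hypothetical, deliberately naive linear scan; real forwarding tables use tries or hardware lookups.

```python
import ipaddress

block = ipaddress.ip_network("198.51.0.0/16")

# Deaggregated: the same address space announced as individual /24s.
specifics = list(block.subnets(new_prefix=24))
print(len(specifics))  # 256

# Aggregated: contiguous /24s collapse back into the single covering /16.
aggregated = list(ipaddress.collapse_addresses(specifics))
print(aggregated)  # [IPv4Network('198.51.0.0/16')]


def longest_match(destination: str, table):
    """Pick the most specific prefix in the table that contains the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None


# With both the /16 and one /24 in the table, the /24 attracts its own traffic.
table = [block, ipaddress.ip_network("198.51.7.0/24")]
print(longest_match("198.51.7.20", table))  # 198.51.7.0/24
print(longest_match("198.51.9.20", table))  # 198.51.0.0/16
```

The forwarding lookup never consults AS-path length, which is why a deliberately announced more-specific is such an effective, and costly, traffic-engineering tool.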

Common mistakes

Accepting full routes from a customer. A multi-homed customer might mistakenly advertise the entire internet table to you. Set a maximum-prefix limit on the session and apply a prefix-list of the routes you expect from that customer.
Forgetting to publish ROAs after acquiring a prefix. Your prefix becomes RPKI-invalid relative to the previous owner's ROA, and large networks drop it. Watch ARIN / RIPE notifications when prefixes change hands.
Using iBGP without route reflectors at scale. Full mesh works for five routers; at fifty, you need route reflectors or BGP confederations.
Setting local-preference based on temporary conditions. Local preference is sticky; setting it during an incident and forgetting to revert is how you spend the next year explaining why traffic patterns are weird.
Trusting BGP without monitoring. A successful BGP session does not mean you're getting good routes. Alert on prefix counts, on AS-path distributions, on any prefix from your own AS appearing from somewhere else.
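The last check in that list, spotting your own address space announced by someone else, is easy to prototype against a feed of observed announcements. The sketch below assumes the feed has already been reduced to `(prefix, origin_as)` pairs; the function name, the input shape, and the sample data are all hypothetical.

```python
import ipaddress
from typing import Iterable, List, Tuple


def foreign_origins(
    my_as: int,
    my_prefixes: Iterable[str],
    observed: Iterable[Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Return observed announcements that overlap our space but name another origin AS."""
    mine = [ipaddress.ip_network(p) for p in my_prefixes]
    alerts = []
    for prefix, origin_as in observed:
        net = ipaddress.ip_network(prefix)
        overlaps_ours = any(net.version == m.version and net.overlaps(m) for m in mine)
        if overlaps_ours and origin_as != my_as:
            alerts.append((prefix, origin_as))
    return alerts


# Made-up sample feed: one legitimate route, one more-specific from a foreign AS.
observed = [
    ("198.51.100.0/24", 64500),
    ("198.51.100.128/25", 64666),
    ("203.0.113.0/24", 64501),
]
print(foreign_origins(64500, ["198.51.100.0/24"], observed))
# [('198.51.100.128/25', 64666)]
```

Checking for overlap rather than exact equality matters, because a hijack usually arrives as a more-specific prefix rather than a copy of the original.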
