10 / 12

Stack / 10

Interior routing

BGP is what autonomous systems use to talk to each other. Inside an AS — your campus, your data centre, your ISP's backbone — you need something that reacts to a fibre cut in tens of milliseconds and computes optimal paths with no operator in the loop. Two link-state protocols have won that argument: OSPF, which dominates enterprises, and IS-IS, which every Tier-1 ISP runs at the core. They share the same mathematics — flood the topology, run Dijkstra, install the shortest paths — but differ on encoding, hierarchy, and how cleanly they bolt on the next decade of features.

Why interior routing exists

BGP, the protocol covered in the previous deep dive, decides how packets cross between autonomous systems. It's policy-driven, slow to converge by design, and answers a single question per prefix: "which neighbour AS do I hand this packet to?" Inside your own AS — between the routers you actually own — that's not enough. You need to compute the shortest path through the topology you built, react to a fibre cut without a human, and balance load across parallel links. That's the job of an interior gateway protocol (IGP).

Static routes are the simplest answer and work fine for three routers. They fall apart the moment the topology has more than one path, because someone has to hand-pick which one is "shortest" and update every router when a link fails. RIP, the first dynamic IGP, did better — routers exchanged "I can reach prefix X at cost N" messages — but it counts hops, ignores bandwidth, caps at 15 hops, and takes minutes to converge after a failure thanks to the count-to-infinity problem.

Two link-state protocols replaced RIP in production networks and have held that spot for thirty years. OSPF (Open Shortest Path First) was standardised in RFC 2328 in 1998 for IPv4, with RFC 5340 adding IPv6 in 2008. IS-IS (Intermediate System to Intermediate System) was ISO 10589, originally written for the OSI CLNP stack, then adopted by the IETF in RFC 1142 and extended to carry IP routes. They look almost identical in operation; the differences only matter at scale and at the edges of what each protocol can express.

Link-state versus distance-vector

Distance-vector protocols — RIP, EIGRP, BGP itself — tell each neighbour "here are the prefixes I can reach, and my cost for each." A router only ever knows what its neighbours have told it. It doesn't see the whole topology. When a link goes down, the router that noticed has to tell its neighbours, who tell theirs, and until the news propagates everyone is making decisions on stale data. Worse, routers can briefly form loops, each pointing at the other as the way out. The count-to-infinity problem is the classic case: router A loses its direct route to X, learns from B that B has a path to X (which routes through A, but A doesn't know that), and the cost climbs by one per advertisement round until it hits the protocol's infinity (16 in RIP).

Link-state protocols flip this around. Every router floods a description of its own links to every other router in the area. Every router ends up with the same link-state database (LSDB) — the same map of the topology. Then each router independently runs Dijkstra's shortest-path algorithm with itself as the root and computes the shortest path to every destination. Because every router works from the same map, there are no loops. The moment a router updates its LSDB and re-runs SPF, its routing table agrees with what every other router will compute from the same data.

The trade-off is simple. Link-state needs more memory (the full LSDB) and more CPU (Dijkstra every time the topology changes). Distance-vector needs almost nothing but converges slowly and is prone to loops. For a network with more than a few dozen routers, the CPU and memory budget is trivial and the fast, loop-free convergence is worth everything. Distance-vector survived inside BGP because BGP is policy-driven and runs at internet scale, where flooding the whole topology would be impossible.

Why BGP is still distance-vector. BGP is technically a path-vector protocol — each advertisement carries the full AS path, so loops are caught by inspection rather than computation. The reason BGP doesn't use link-state is scale: there are roughly 100,000 ASes and 950,000+ IPv4 prefixes in the global table. No router could hold a link-state database that big, let alone re-run SPF on every change.

How OSPF floods — LSAs and the LSDB

In OSPF, every router originates one or more Link State Advertisements (LSAs) describing what it knows. A Router-LSA (Type 1) lists the router's interfaces, their neighbours, and the cost of each link. When two routers on a broadcast segment elect a Designated Router, the DR generates a Network-LSA (Type 2) summarising who's on that segment. Routes from outside OSPF — redistributed from BGP or static — arrive as External-LSAs (Type 5) from Autonomous System Boundary Routers (ASBRs). Inter-area routes cross area boundaries as Summary-LSAs (Type 3 and 4).

LSAs flood reliably. When a router receives a new LSA, it acknowledges it and forwards a copy out every interface except the one it came in on. The next router does the same. Within an area, every router eventually holds every LSA. Because each LSA carries a sequence number and an age, routers can tell whether an LSA is newer than the one they already have. Stale LSAs age out at 3600 seconds (MaxAge) if their originator doesn't refresh them; live LSAs are refreshed every 1800 seconds (LSRefreshTime) just to prove they're still valid.

# OSPF adjacency state machine, abridged
Down  →  Init  →  2-Way  →  ExStart  →  Exchange  →  Loading  →  Full

# Once Full, the two routers have synchronised LSDBs and will
# exchange any new LSAs reliably. The hello timer (default 10s on
# point-to-point, 30s on broadcast) keeps adjacency alive; missing
# 4 hellos (RouterDeadInterval, default 40s) tears the adjacency
# down and triggers a topology update.

Once all routers in the area share the same LSDB, each one runs Dijkstra independently with itself as the root. The output is a shortest-path tree: for every other router and every prefix in the area, the next hop and the total cost. That tree becomes the router's contribution to the RIB.

Areas — how OSPF scales

Flooding LSAs to every router in a single area is fine up to a point. Past a few hundred routers, the bookkeeping starts to hurt: every link flap generates an LSA, every router has to flood it, every router has to re-run SPF. OSPF's answer is hierarchy. A network is split into areas; the backbone is Area 0, and every other area must connect to it through an Area Border Router (ABR). LSAs flood within an area but stop at area boundaries. The ABR generates Type-3 Summary-LSAs that say "I can reach prefix X at cost N" — the topology inside the source area stays invisible to the destination area.

This trades visibility for scale. A router in Area 5 doesn't know how many hops the ABR takes to reach a destination in Area 3; it only knows the ABR's advertised cost. The win is huge: a link flap in Area 3 doesn't make every router in every area re-run SPF, only the routers in Area 3. The cost is that ECMP across area boundaries breaks down — the destination-area routers see one ABR-generated summary, not the multiple equal-cost paths inside the source area.

OSPF gives operators several area types to suppress LSAs even more aggressively:

Area type	What it suppresses
Normal	Nothing — sees Type-1, 2, 3, 4, 5 LSAs
Stub	Drops Type-5 (external). ABR injects a default route.
Totally stubby	Drops Type-3, 4, and 5. Only the default route survives.
NSSA	Stub, but the area's own ASBR can inject Type-7 LSAs that the ABR translates to Type-5.

A small branch office with one uplink doesn't need the full external table — a default route to its ABR will do. Making it a totally stubby area shrinks its LSDB from thousands of entries to a handful, which matters on older hardware with limited memory.

LSA flooding visualised

The flood is reliable but not synchronous. A router receiving a new LSA acknowledges it, updates its LSDB, then forwards it out every interface except the one it arrived on. Each hop adds a few milliseconds. Across a continental network with a dozen hops, the full flood finishes in roughly 100 ms. Routers run SPF on a short hold-down timer (typically 50 ms) so they don't recompute for every LSA in a flurry — they batch a few together and run SPF once.

IS-IS — what ISPs prefer

IS-IS does the same job as OSPF — flood the topology, run Dijkstra, install routes — but the design choices land differently. The most important one: IS-IS runs directly over the data-link layer. There's no IP header on an IS-IS packet. It's encapsulated in an Ethernet frame with an LLC SAP of 0xFEFE, and the rest is IS-IS PDUs. So the control plane keeps running even when the IP control plane is broken — handy when a misconfiguration takes IPv4 down across a backbone and you still need the routing protocol to deliver the rollback. OSPF, by contrast, runs over IP protocol 89, so an IP outage takes OSPF with it.

IS-IS uses two levels instead of OSPF's many area types. Level 1 is intra-area, Level 2 is the backbone. A router can be L1, L2, or both (L1/L2). The L1/L2 router is roughly OSPF's ABR. There are no stub, totally-stubby, or NSSA variants — IS-IS keeps the hierarchy simpler and pushes more flexibility into how you redistribute between levels.

The real reason ISPs prefer IS-IS is encoding. IS-IS carries topology information in TLVs (Type-Length-Value records). When the IETF wanted IS-IS to carry IPv6 prefixes, they added a new TLV. When MPLS-TE needed traffic-engineering extensions, another TLV. When Segment Routing arrived, more TLVs. No protocol revision, no new packet type, no flag day. Routers that don't understand a TLV just skip it. OSPF, by contrast, needed a separate protocol (OSPFv3) for IPv6 — different LSA encoding, different adjacency state — because the original LSA format was IPv4-bound.

Every major Tier-1 ISP (AT&T, NTT, Telia, Lumen, Cogent) and most large service-provider networks run IS-IS at the core. The mix of cleaner extensibility, shorter LSP (Link State PDU) refresh intervals, and a control plane that survives an IP outage matters more at internet-backbone scale than at campus scale. Enterprises stuck with OSPF largely because every networking textbook used it, every CCNA course taught it, and once your operators know one protocol there's no compelling reason to switch.

Why no IS-IS in the enterprise. IS-IS uses ISO addresses (NSAPs) for router identity, which look alien to anyone raised on IP. The vendor docs and the open-source tooling lean heavily toward OSPF. None of this is a technical problem at campus scale, and either protocol will serve you fine. The choice is mostly cultural.

Dijkstra in production — the SPF computation

Both OSPF and IS-IS run the same algorithm. Dijkstra's shortest-path-first takes a weighted graph and a source vertex and returns the shortest distance from the source to every other vertex. The graph is the LSDB. The source vertex is the router itself. Edge weights are the link costs the operator configured (or that the router derived from interface bandwidth, a reference bandwidth divided by the interface speed).

# <a class="il" href="/simulators/dijkstra-pathfinding">Dijkstra</a>, in eight lines
def spf(graph, root):
    dist = {v: float('inf') for v in graph}
    dist[root] = 0
    pq = [(0, root)]
    while pq:
        d, u = heappop(pq)
        if d > dist[u]: continue
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                heappush(pq, (dist[v], v))
    return dist

On a 500-router network with a few thousand links, a full SPF computation takes roughly 50 ms on modern routing-engine hardware. That's fast in absolute terms, but it adds up when every link flap triggers a fresh recompute. Modern implementations use incremental SPF (iSPF): when a single link goes up or down, only the affected subtree of the shortest-path tree is recomputed. iSPF on the same 500-router topology for a single link event finishes in about 5 ms.

Link costs deserve thought. The OSPF default is reference-bandwidth (100 Mbps) divided by interface speed, capped at 1. A 100 Mbps interface costs 1; a 1 Gbps interface costs 1; a 100 Gbps interface costs 1. That's useless on any modern network. The fix is to raise the reference bandwidth on every router to something like 100 Gbps so the math means something. IS-IS uses a wide-metric mode (24-bit costs instead of the original 6-bit) for the same reason.

RIB and FIB — the two tables routers actually use

SPF produces routes; the routes have to get installed somewhere. There are two distinct tables, and conflating them is a common source of confusion. The Routing Information Base (RIB) is the kernel's view of all known routes — what OSPF computed, what BGP advertised, what static routes the operator pinned, what IS-IS learned. The RIB lives in the routing daemon's memory (or, in Linux, in the kernel routing tables you can dump with ip route). It's organised by destination prefix and carries administrative-distance metadata so the router can pick between two protocols that both claim the same prefix.

The Forwarding Information Base (FIB) is the data-plane copy. It lives in the line-card ASIC or in software fast-path code, optimised for one thing: the longest-prefix-match lookup that happens once per packet at wire speed. Modern routers store the FIB in TCAM, a trie, or some hybrid that can do 100+ million lookups per second. The FIB carries no administrative metadata; it's been reduced to the single best next-hop (or set of equal-cost next-hops) per prefix.

When OSPF recomputes, the routing daemon updates the RIB. The RIB-to-FIB programming step pushes the changes down to the data plane. On modern routers this is fast — a few milliseconds — but on older platforms or for very large FIBs, programming the hardware tables can take hundreds of milliseconds. FIB programming is often the slowest step in the whole convergence chain, not the SPF computation people assume is the bottleneck.

Table	Lives in	Lookup rate	Updated by
RIB	Control plane (CPU + DRAM)	N/A — not in data path	Routing daemons (OSPF, BGP, IS-IS)
FIB	Data plane (TCAM, NPU, or fast-path code)	100M+ lookups/sec	RIB-to-FIB programming

ECMP — equal-cost multipath

Dijkstra's classical formulation returns one shortest path per destination. Production implementations return all paths of the same minimum cost, and the router installs them all in the FIB. When a packet arrives, the FIB lookup returns a set of next-hops and the forwarding code picks one. To preserve TCP ordering (you really don't want packets of one connection arriving out of order), the choice is driven by a hash of the IP five-tuple — source IP, destination IP, protocol, source port, destination port — so every packet of a given flow takes the same path.

ECMP is the whole reason data-centre Clos fabrics work. In a typical leaf-spine topology with 32 spines, every leaf has 32 equal-cost paths to every other leaf. A flow hashes to one spine and stays there; the next flow hashes to a different spine; the aggregate bandwidth across the fabric is 32× a single link. Without ECMP you'd be limited to one path and 31 of the 32 spines would sit idle.

The hash matters. A poor hash gives uneven distribution — some links saturated, others empty. Operators usually pick a hash that includes the L4 ports so different TCP connections between the same hosts spread across paths. The most common production pain point: an "elephant" flow that pushes more bandwidth than a single ECMP path can carry. Some fabrics solve this with flowlet-based hashing (re-hash after a quiet interval) or per-packet load-balancing on lossless fabrics, but neither is universal.

BGP itself, by contrast, picks one best path per prefix through its decision process. To get ECMP from BGP you have to explicitly enable multipath and accept that paths from different ASes count as equal only under specific conditions. In a BGP-everywhere data-centre design (the Microsoft Azure and Meta pattern), the multipath knobs are on by default and the config explicitly forces every peer's paths to look identical so the multipath logic engages.

Convergence — the metric that actually matters

Operators don't care about the theoretical elegance of Dijkstra. They care about one number: how long is traffic black-holed when a link fails? "Black-holed" means the FIB still points at the dead link and packets are getting dropped. End-to-end convergence time is the sum of four phases, and any of them can dominate.

Phase	Typical time	Notes
Failure detection	50 ms (BFD) vs 40 s (default OSPF hello)	The default OSPF hello is 10 s and the dead interval is 4×. BFD sub-second timers are mandatory in any modern deployment.
LSA flood	~100 ms across a continent	Reliable flood with per-hop ack. Dominated by speed-of-light + per-hop processing.
SPF recompute	~10–50 ms (full) or ~5 ms (iSPF)	Modern routing engines run iSPF for single-link changes.
FIB programming	~100–500 ms on legacy hardware, <10 ms on modern	Pushing routes into TCAM is the slowest step on older platforms.

Add it up: without BFD you're looking at 40+ seconds of black-holing for a missed hello. With BFD enabled and modern hardware, sub-second convergence is achievable, but everything still waits for SPF to finish and the FIB to be reprogrammed.

Loop-Free Alternates (LFA), Topology-Independent LFA (TI-LFA), and Fast Reroute (FRR) sit one level lower and aim for sub-50 ms repair. Every router precomputes, for each primary next-hop, a backup next-hop that won't loop back. When the primary link goes down, the data plane switches to the backup without waiting for any SPF computation. Once SPF catches up and the RIB stabilises, the pre-computed backup is replaced by the new primary. TI-LFA plus Segment Routing makes the backup paths express-routable through arbitrary topology, which is what gets you to "the fibre cut is invisible to the application."

Areas, levels, ABRs, ASBRs

Every non-backbone area attaches to Area 0 through an ABR. Type-3 summary LSAs cross the boundary; Type-1 router LSAs don't. A stub area like Area 3 doesn't even get the external (Type-5) LSAs — the ABR injects a default route on its behalf. This is the pattern every textbook draws, because it's the pattern every production OSPF deployment uses.

When OSPF stops being enough

A single OSPF area scales to a few hundred routers before things get uncomfortable. Past that, every link flap floods to every router and triggers an SPF recompute on every one — even routers with no traffic anywhere near the failure. The LSDB grows past what older line cards can hold, the flood interval lengthens, convergence times climb past the operator's SLA, and someone has to make a decision.

Three escape valves, in roughly increasing severity:

Partition into more areas. The classic OSPF response. Take the oversized Area 0 and carve out functional regions as Area 1, Area 2, and so on. ABRs summarise. Each area's SPF is bounded by that area's size. Downside: ECMP across area boundaries breaks (the destination area sees only the ABR summary, not the full set of equal-cost paths inside the source area), and the topology gets less flexible — you can't add a link between routers in different areas without thinking about the hierarchy.

Migrate to IS-IS. IS-IS handles flat networks better than OSPF, largely because the encoding is more efficient and the LSP refresh interval (15 minutes) is shorter than OSPF's LSA refresh (30 minutes — but with a 1-hour MaxAge that forces work). For an ISP backbone where you want flat L2 across the country, IS-IS is the default and has been for two decades.

Drop the IGP and use BGP everywhere. The Microsoft Azure and Meta pattern. Every router runs eBGP with every neighbour; each rack is its own AS; the fabric is a Clos with BGP doing the path computation. No link-state flood, no SPF, no LSDB. Convergence runs at BGP's pace (slower than IS-IS for a single failure), but the protocol is so well understood at scale, and the data-centre topology so regular, that operators accept the trade. The reference for this design is RFC 7938 ("Use of BGP for Routing in Large-Scale Data Centers") and Dinesh Dutt's "BGP in the Data Center" (O'Reilly, 2017).

Common mistakes

Putting every link in Area 0. A 200-router Area 0 means every link flap floods to every router and triggers SPF on every router. Carve out functional areas; keep Area 0 to the routers that need full topology visibility.
Leaving hello/dead timers at the defaults. Default OSPF hello is 10 seconds with a 40-second dead interval. A fibre cut therefore black-holes traffic for up to 40 seconds before OSPF even notices. Enable BFD with 50 ms intervals so failure detection completes in 150 ms and convergence is bounded by SPF + FIB, not by the hello timer.
Redistributing the full BGP table into OSPF without filters. The LSDB grows by the number of redistributed routes; every router has to hold them and re-run SPF on every change. The router crashes or convergence dies. Redistribute only what's needed — typically a small set of summarised prefixes — and use route maps to bound the damage.
Not setting auto-cost reference-bandwidth on modern links. Default reference is 100 Mbps. Any interface at or above 100 Mbps gets cost 1, so a 10 Gbps link and a 100 Gbps link look identical to SPF. Bump the reference to 100,000 (100 Gbps) on every router so the arithmetic produces meaningful preferences.
Forgetting that ECMP needs symmetric configuration. Two paths only count as equal-cost if both ends of every link agree on the cost. A typo on one router that bumps a link cost from 10 to 100 silently disables ECMP on that path. Audit costs with show ip ospf database when convergence behaves surprisingly.
Assuming OSPF authentication is on. By default it isn't. A rogue router on the same segment can inject LSAs and quietly poison the LSDB. Enable MD5 or HMAC-SHA authentication on every adjacency in production.

Tools and commands worth knowing

Tool / command	Use for
`show ip ospf neighbor`	Cisco IOS — which adjacencies are Full and which are stuck in ExStart. The first thing to check when a router isn't seeing routes.
`show ip ospf database`	Dump the LSDB. Compare across routers to confirm they're synchronised.
`show isis adjacency`	IS-IS equivalent. Verify L1/L2 adjacency state.
`show ip route ospf`	Just the OSPF-learned routes in the RIB, with cost and next-hop.
`vtysh` (FRRouting)	The open-source routing suite. Used by Cumulus Linux, SONiC, and a lot of home labs. CLI feels Cisco-ish.
`BIRD`	Lightweight routing daemon, popular with route servers at IXPs. Configuration is text-based and pleasant.
`GNS3 / Containerlab`	Spin up a virtual OSPF or IS-IS topology to test convergence and config changes before touching production.
`tcpdump -i eth0 'proto ospf'`	Watch raw OSPF packets — hellos, LSAs, acks — on a single interface.

Interior routing

Why interior routing exists

Link-state versus distance-vector

How OSPF floods — LSAs and the LSDB

Areas — how OSPF scales

LSA flooding visualised

IS-IS — what ISPs prefer

Dijkstra in production — the SPF computation

RIB and FIB — the two tables routers actually use

ECMP — equal-cost multipath

Convergence — the metric that actually matters

Areas, levels, ABRs, ASBRs

When OSPF stops being enough

Common mistakes

Tools and commands worth knowing

Further reading