Day-0 → Month-5
Study path

Computer networking

A lot of networking courses end where the interesting problems start. You finish the textbook knowing what a TCP segment is, but the first time something breaks in production it's about retransmits or a TLS handshake or a DNS cache. This study path bridges that gap. It walks the textbook in order, then keeps going into the operational details that show up in incidents and design reviews. The reading list is the one working network engineers actually recommend, and the deep dives are protocol-by-protocol walkthroughs you can come back to when you need a specific answer.


Pick a starting tier
Twelve mental models

A small set of ideas that the rest of the field rests on.

Day-zero

Layered model

Link, network, transport, application. Every layer offers a service to the one above and uses the one below. The OSI seven-layer model is more granular than what actually ships in any production stack; the four-layer TCP/IP model is closer to how engineers usually think about the boundaries.

Day-zero

Packet

A finite chunk of bytes with a header (which says where it is going, where it came from, and which layer this is) and a payload (which carries the data for the next layer up). Headers nest: an HTTP request travels inside a TLS record, inside a TCP segment, inside an IP packet, inside an Ethernet frame.

Day-zero

Addressing

Three identifiers at three layers. The MAC address picks the next hop on the local network; the IP address picks the destination machine across networks; the port number picks the receiving socket on that machine. Each layer makes decisions without knowing the layers above.

Day-zero

Connections vs datagrams

TCP delivers an ordered, retransmitted, congestion-controlled byte stream — bytes go in, the same bytes come out the other end, in order, eventually. UDP delivers individual messages, best-effort: they might arrive, they might arrive out of order, they might be duplicated. A lot of higher-layer engineering is about turning one of these into something closer to the other.

Practitioner

Sockets

A socket is a kernel data structure exposed to programs as a file descriptor. The kernel demultiplexes incoming packets to the right socket using the five-tuple: protocol, source IP, source port, destination IP, destination port. No two TCP sockets on the same machine can share that five-tuple, and everything else is configurable.
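As a rough illustration of the five-tuple, the sketch below opens one TCP connection over loopback and prints how each end sees it. The helper name `five_tuple` is made up for this example, and the port is whatever the kernel happens to assign.

```python
import socket

def five_tuple(sock):
    """Return (protocol, src_ip, src_port, dst_ip, dst_port) for a connected TCP socket."""
    src_ip, src_port = sock.getsockname()[:2]
    dst_ip, dst_port = sock.getpeername()[:2]
    return ("tcp", src_ip, src_port, dst_ip, dst_port)

# Listener on loopback; port 0 asks the kernel to pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
server_port = listener.getsockname()[1]

# The handshake completes in the kernel, so connect() can return before accept().
client = socket.create_connection(("127.0.0.1", server_port))
accepted, _ = listener.accept()

client_view = five_tuple(client)
server_view = five_tuple(accepted)
print("client fd", client.fileno(), client_view)
print("server fd", accepted.fileno(), server_view)

for s in (client, accepted, listener):
    s.close()
```

The two views are mirror images: the client's source address and port are the server's destination, and vice versa. Both sockets are ordinary file descriptors, which is why `fileno()` returns a small integer.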

Practitioner

TCP state machine

Eleven well-defined states with well-defined transitions. For a client that opens the connection and closes it first, the happy path is SYN_SENT → ESTABLISHED → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT. CLOSE_WAIT means the peer has closed its side and your application has not yet closed its own; sockets piling up in that state almost always indicate a bug. TIME_WAIT is normal: a socket sits there for around a minute after an active close (60 seconds on Linux) so the kernel can absorb any straggling packets.
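One way to internalise the transitions is to write them down as a lookup table and walk it. The sketch below covers only a subset of the state machine in RFC 9293 (it omits CLOSING and simultaneous open), and the event names are illustrative labels rather than kernel identifiers.

```python
# Simplified subset of the TCP state machine, keyed by (state, event).
TRANSITIONS = {
    ("CLOSED", "active_open"): "SYN_SENT",
    ("CLOSED", "passive_open"): "LISTEN",
    ("LISTEN", "recv_syn"): "SYN_RECEIVED",
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",
    ("SYN_RECEIVED", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # we close first
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # peer closes first
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",         # the application must call close()
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout"): "CLOSED",
}

def walk(start, events):
    """Apply events in order and return every state visited."""
    state = start
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

active_closer = walk(
    "CLOSED",
    ["active_open", "recv_syn_ack", "close", "recv_ack", "recv_fin", "timeout"],
)
passive_closer = walk(
    "CLOSED",
    ["passive_open", "recv_syn", "recv_ack", "recv_fin", "close", "recv_ack"],
)
print(" -> ".join(active_closer))
print(" -> ".join(passive_closer))
```

The second walk shows why CLOSE_WAIT is an application problem: the only way out of it in this table is the local `close` event.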

Practitioner

Congestion control

TCP has to share each network path with every other connection on it without overwhelming the slowest link. Reno halves the window on a loss and grows linearly. CUBIC, the modern Linux default, scales the growth curve to high-bandwidth networks. BBR estimates the bottleneck bandwidth and minimum round-trip time directly and paces against those rather than reacting to loss. On long or lossy paths, the choice of algorithm can matter more for throughput than the hardware does.
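The Reno behaviour described above (exponential growth up to a threshold, then linear growth, then a halving on loss) can be sketched as a toy per-RTT model. This is a simplification under stated assumptions: the window is counted in whole segments, loss is injected at chosen RTTs, and fast recovery and timeouts are ignored. The function name and parameters are invented for the example.

```python
def reno_trace(rtts, loss_at, initial_cwnd=1.0, ssthresh=64.0):
    """Return the congestion window (in segments) at the start of each RTT."""
    cwnd = initial_cwnd
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            # Multiplicative decrease: halve the window and remember the new threshold.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double every RTT until the threshold is reached.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of one segment per RTT.
            cwnd += 1
    return trace

trace = reno_trace(rtts=12, loss_at={6}, initial_cwnd=1.0, ssthresh=16.0)
print(trace)
```

Plotting the trace gives the familiar sawtooth. CUBIC and BBR replace the growth rule, not the overall idea of probing for capacity and backing off.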

Practitioner

TLS handshake

The two endpoints agree on a cipher suite, the server presents its certificate (the client can too, with mutual TLS), both sides derive a shared secret, and encryption starts. TLS 1.3 does this in a single round-trip; if the two have spoken before, 0-RTT is possible (with some replay caveats). The SNI extension tells the server which hostname, and therefore which certificate, the client wants; the ALPN extension picks which application protocol (usually HTTP/1.1 vs HTTP/2) to speak inside the encrypted tunnel.
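To see SNI and ALPN from the client side, the following sketch uses Python's standard `ssl` module. The function names `make_client_context` and `negotiate` are hypothetical. The ALPN list and the TLS 1.3 floor are choices made for illustration, not recommendations.

```python
import socket
import ssl

def make_client_context():
    """Client context that verifies certificates, requires TLS 1.3, and offers ALPN."""
    ctx = ssl.create_default_context()           # certificate and hostname checks on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.set_alpn_protocols(["h2", "http/1.1"])   # offered in preference order
    return ctx

def negotiate(host, port=443, timeout=5.0):
    """Connect and return (TLS version, cipher suite name, selected ALPN protocol)."""
    ctx = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # server_hostname populates the SNI extension and is the name
        # checked against the certificate the server presents.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0], tls.selected_alpn_protocol()

context = make_client_context()
print(context.minimum_version, context.check_hostname)
```

Calling `negotiate("example.com")` needs outbound network access; against a server that supports TLS 1.3 it returns a tuple whose first element is `"TLSv1.3"`, and the ALPN value is `None` if the server selected nothing.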

Practitioner

DNS resolution

Translates names to IP addresses using a tree of zones with explicit delegation between them. The stub resolver on the client asks a recursive resolver, which walks the tree from the root down to the authoritative nameserver for the name and caches what it learns along the way. Caches sit at every layer, which is both why DNS is fast on average and why "it sometimes returns the wrong answer" is a recurring production story.
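The "sometimes returns the wrong answer" story usually comes down to TTL-based caching. A minimal sketch, assuming a hypothetical upstream resolver passed in as a callable and a fake clock so the behaviour is repeatable:

```python
import time

class TtlCache:
    """Cache answers from an upstream resolver until their TTL expires."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve    # callable: name -> (address, ttl_seconds)
        self._clock = clock
        self._entries = {}         # name -> (address, expires_at)

    def lookup(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]          # served from cache, possibly stale
        address, ttl = self._resolve(name)
        self._entries[name] = (address, now + ttl)
        return address

# Hypothetical upstream: a dict standing in for the authoritative answer.
records = {"app.example.test": ("192.0.2.10", 30)}
calls = []

def fake_upstream(name):
    calls.append(name)
    return records[name]

now = [0.0]
cache = TtlCache(fake_upstream, clock=lambda: now[0])

first = cache.lookup("app.example.test")
records["app.example.test"] = ("192.0.2.20", 30)   # the record changes upstream
now[0] = 10.0
stale = cache.lookup("app.example.test")           # TTL not expired: old answer
now[0] = 31.0
fresh = cache.lookup("app.example.test")           # TTL expired: new answer
print(first, stale, fresh, len(calls))
```

Between the change and the expiry, the cache is fast and wrong. Stack several such caches (browser, OS, recursive resolver) and the window in which different clients see different answers is bounded only by the longest TTL in the chain.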

Operator

Routing

Inside a single network, an interior protocol like OSPF or IS-IS computes shortest paths over a shared link-state database. Between networks, BGP carries reachability information between autonomous systems, choosing routes based on each operator's policy rather than shortest path. The internet works because the policies are mostly compatible; outages tend to be moments when they aren't.
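The interior-routing half of that picture can be sketched with Dijkstra's algorithm over a toy link-state database. Real OSPF and IS-IS implementations add areas, equal-cost multipath, and incremental recomputation; the router names and link costs below are invented.

```python
import heapq

def shortest_paths(lsdb, source):
    """Dijkstra over a link-state database shaped as {router: {neighbour: cost}}.

    Returns (distance to each router, first hop to use from the source).
    """
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist[node]:
            continue  # stale heap entry; a shorter path was already found
        for neighbour, link_cost in lsdb[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                # Inherit the first hop from the node we came through.
                first_hop[neighbour] = neighbour if node == source else first_hop[node]
                heapq.heappush(heap, (new_cost, neighbour))
    return dist, first_hop

# Hypothetical four-router topology with symmetric link costs.
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}
dist, first_hop = shortest_paths(lsdb, "A")
print(dist)
print(first_hop)
```

Every router in the area runs the same computation over the same database, which is why their forwarding decisions agree. BGP has no equivalent shared view, so policy decides instead.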

Operator

NAT

Network Address Translation maps many private IP addresses to one public address by keeping a table of port mappings. Different flavours — full-cone, restricted-cone, port-restricted, symmetric — determine whether direct peer-to-peer connections will work or whether you need to relay through a third party. WebRTC, online games, and torrent clients all rely on STUN, TURN, and ICE to negotiate around NAT.
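The port-mapping table at the heart of NAT can be sketched in a few lines. This toy version uses one mapping per internal endpoint and no inbound filtering, which is roughly the full-cone behaviour; the class name, method names, and addresses are all illustrative, and real NATs also expire idle mappings.

```python
import itertools

class Napt:
    """Toy port-translating NAT: many private endpoints behind one public address."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = itertools.count(first_port)
        self._out = {}   # (private_ip, private_port) -> public_port
        self._in = {}    # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet, creating a mapping if needed."""
        key = (private_ip, private_port)
        if key not in self._out:
            port = next(self._next_port)
            self._out[key] = port
            self._in[port] = key
        return (self.public_ip, self._out[key])

    def translate_inbound(self, public_port):
        """Find the private endpoint for an incoming packet, or None to drop it."""
        return self._in.get(public_port)

nat = Napt("203.0.113.7")
a = nat.translate_outbound("10.0.0.5", 51000)
b = nat.translate_outbound("10.0.0.6", 51000)   # same private port, different host
again = nat.translate_outbound("10.0.0.5", 51000)
print(a, b, nat.translate_inbound(a[1]), nat.translate_inbound(12345))
```

A symmetric NAT would key the outbound table on the destination as well, so the public port a STUN server observes is not the one a different peer would need to reach. That is the case where a relay becomes necessary.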

Operator

Load balancing

The general problem of distributing incoming connections across many backends. Layer-4 load balancing forwards on the connection tuple (addresses and ports) without parsing the payload, and is fast; layer-7 inspects the HTTP request itself and is more flexible. Anycast does the work at the routing layer; DNS load balancing does it in the resolver answer; consistent hashing maps each connection to a backend in a way that tolerates backend changes. Each picks a different trade-off between speed, flexibility, and connection stability.
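Consistent hashing is easiest to appreciate by counting how many flows move when a backend is added. The sketch below uses a classic hash ring with virtual nodes, which is not the lookup-table scheme Maglev itself uses. Backend names, the flow-key format, and the vnode count are assumptions made for the example.

```python
import bisect
import hashlib

def _point(key):
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring; each backend owns several virtual nodes."""

    def __init__(self, backends, vnodes=100):
        self._ring = sorted(
            (_point(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, flow_key):
        """Return the backend owning the first ring point after the key's hash."""
        i = bisect.bisect_right(self._points, _point(flow_key)) % len(self._ring)
        return self._ring[i][1]

flows = [f"198.51.100.{i % 250}:{10000 + i}" for i in range(1000)]
before = HashRing(["backend-a", "backend-b", "backend-c"])
after = HashRing(["backend-a", "backend-b", "backend-c", "backend-d"])

moved = [f for f in flows if before.pick(f) != after.pick(f)]
print(f"{len(moved)} of {len(flows)} flows moved")
```

Only the flows that land on the new backend change owner; everything else stays put. A plain `hash(flow) % n` would instead reshuffle most flows whenever `n` changes.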

The reading list

Books worth keeping on the shelf.

  • Computer Networking: A Top-Down Approach
    Kurose & Ross
    The undergraduate textbook most CS programmes use. The top-down structure — start with HTTP, work down to the link layer — makes the motivation for each layer clear before its mechanics. The current edition covers HTTP/3 and QUIC properly.
  • TCP/IP Illustrated, Vol 1: The Protocols
    Stevens & Fall
    The slower, deeper read. Walks through packet captures byte by byte and explains every header field. Useful as a reference once you have a question about what's actually on the wire.
  • High Performance Browser Networking
    Ilya Grigorik
    Free online. Covers TCP, UDP, TLS, HTTP/1.x and HTTP/2, and mobile networks in one volume, with the focus on how real applications use the network rather than on protocol mechanics in isolation.
  • BGP
    Iljitsch van Beijnum
    A friendly introduction to BGP. The RFC is dense and operational practice has moved beyond it; this book bridges the gap with examples and motivation.
  • BPF Performance Tools
    Brendan Gregg
    A general performance book rather than a networking one, but the TCP tracing chapters are the best reference for finding out why a particular connection is slow on Linux.
  • Routing TCP/IP, Volumes I & II
    Jeff Doyle
    For people running production routers. Covers OSPF, IS-IS, BGP, and MPLS at the depth the vendor docs assume you already have.
  • Beej's Guide to Network Programming
    Brian Hall
    Free online. The most accessible introduction to the sockets API, with C examples that compile. If you've never written a TCP client by hand, this is the right starting point.
Courses and longer reads

Courses that work better than most.

  • Stanford CS144 — Introduction to Computer Networking
    Nick McKeown, Philip Levis
    The lectures are on YouTube and the lab assignments have you build a working TCP/IP stack in C++ over the course of the term. About six weeks of evenings if you do the labs alongside the videos. The closest you can get to a graduate-quality networking course for free.
  • MIT 6.829 — Computer Networks
    Hari Balakrishnan, Mohammad Alizadeh
    Graduate-level and paper-heavy. A reasonable second course after CS144 — it covers a lot of the research that became BBR, segment routing, and the modern internet-measurement literature.
  • Cloudflare's engineering blog
    Cloudflare engineers
    Not a course in any structured sense, but a deep archive of production-networking writeups. A useful exercise: pick five posts each on TLS, BGP, and DDoS, read them, and take notes on what surprised you.
  • Julia Evans' networking zines
    Julia Evans
    Short, illustrated, and well-pitched at the level of "I know something is wrong but don't know where to look". The TCP/IP, DNS, and HTTP zines are good starting points.
  • INE / Cisco Learning Network
    multiple authors
    Aimed at people working towards CCNA or CCNP certifications. Heavily Cisco-flavoured material, which is the right framing if you're heading into network engineering as a career.
The paper canon

Papers worth reading at least once.

  • End-to-End Arguments in System Design
    Saltzer, Reed, Clark (1984)
    The argument for keeping function (reliability, encryption, application semantics) at the endpoints rather than inside the network. Useful to reread every couple of years; the principle still informs current debates about middleboxes and observability.
  • Congestion Avoidance and Control
    Jacobson, Karels (1988)
    The paper that introduced slow-start and congestion avoidance to TCP after the 1986 congestion collapses. Every modern congestion-control algorithm — Reno, CUBIC, BBR — descends from the framework laid out here.
  • BBR: Congestion-Based Congestion Control
    Cardwell et al., Google (2016)
    Describes the model behind BBR — estimate the bottleneck bandwidth and minimum round-trip time, pace sends to fill the pipe without filling the buffer. Worth reading before tuning BBR's production knobs.
  • The QUIC Transport Protocol: Design and Internet-Scale Deployment
    Langley et al., Google (2017)
    A retrospective on QUIC after several years of production use at Google. RFC 9000 is the formal spec; this paper is where the design choices are motivated and the deployment lessons live.
  • Maglev: A Fast and Reliable Software Network Load Balancer
    Eisenbud et al., Google (2016)
    How Google's software L4 load balancer works. The interesting bit is the consistent-hashing scheme that survives backend churn without breaking established connections.
  • Jupiter Rising: A Decade of Clos Topologies and Centralized Control
    Singh et al., Google (2015)
    A high-level account of how Google built and evolved its datacentre network fabric over a decade. Reads well alongside RFC 7938 (BGP for data centres).
  • On the Resemblance and Containment of Documents
    Andrei Broder (1997)
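    Best known for introducing MinHash, a way to estimate how similar two documents are by comparing the minimum hash values of their shingles. Consistent hashing, the technique Maglev-style load balancers build on to decide which backend handles which connection, is not covered here; it comes from Karger et al. (1997), "Consistent Hashing and Random Trees".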
Talks worth watching
  • Van Jacobson — Congestion Avoidance and Control
    SIGCOMM 1988 (transcript online)
    The talk that paired with the 1988 paper above. Nearly four decades on, it's still a useful starting point for understanding what TCP is doing to your throughput on a particular link.
  • Geoff Huston — The Death of Transit
    NANOG, multiple recent years
    On why the modern internet looks less and less like the textbook tier-1 / tier-2 / tier-3 model — hyperscalers building out their own backbones, the rise of peering, and the long economic decline of the pure-transit business.
  • Marek Majkowski talks — Cloudflare TV
    YouTube
    A long catalogue of production-networking talks delivered as post-mortems. The ones on connection coalescing and on TCP_FASTOPEN's real-world deployment are worth seeking out.
  • Brendan Gregg — Linux Performance Analysis
    YouTube, several years' worth
    The networking sections of these performance talks are a practical guide to answering "why is this connection slow" on Linux, with the eBPF and bpftrace tooling demonstrated.
Hands-on tools

Tools to spend time with.

  • Wireshark
    The graphical packet inspector. Capture from any interface and dissect more or less every protocol the IETF has ever defined. If you have never opened a TCP handshake in Wireshark and clicked through the field tree, that's the highest-value first hour you can spend.
  • tcpdump
    A lighter, terminal-friendly capture tool. Scripts well, runs on nearly any Unix-like system, and produces pcap files you can download and open in Wireshark later. The right answer for "capture this on a remote box".
  • mininet
    A network emulator that runs arbitrary topologies of virtual switches, hosts, and links inside a single Linux box. The CS144 labs run on it, and most academic SDN research uses it as the test environment.
  • GNS3 / EVE-NG
    Emulators that boot real vendor router and firewall images so you can practise BGP, OSPF, and MPLS without buying hardware. Free editions are usable; commercial tiers add things like cluster management.
  • iperf3 / netperf
    Bandwidth and latency measurement tools. The honest way to answer "is the network actually slow" before blaming it for your application latency.
  • BCC / bpftrace
    eBPF-based tools for tracing kernel network paths. Tools such as tcpconnect, tcplife, tcpretrans, and sslsniff show which processes are connecting where, how long connections live, what's retransmitting, and what data is passing through the TLS library before encryption.
A packet, drawn

IPv4 + TCP — what's actually on the wire.

[Diagram: header layout, one 32-bit row per line, bit offsets 0 to 31 left to right]

IPv4 header (20 bytes)
  Row 1: Ver | IHL | DSCP | Total length
  Row 2: Identification | Flags | Fragment offset
  Row 3: TTL | Protocol | Header checksum
  Row 4: Source IP address (32 bits)
  Row 5: Destination IP address (32 bits)

TCP header (20 bytes)
  Row 1: Source port | Destination port
  Row 2: Sequence number (32 bits)
  Row 3: Acknowledgement number (32 bits)
  Row 4: DOff | Rsv | Flags | Window size
  Row 5: Checksum | Urgent pointer

Payload follows.

Every packet starts here. 40 bytes of overhead before your first byte of payload. Add TLS, an HTTP/2 frame, etc., and overhead climbs to 100+ B.
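The same 40 bytes can be packed and unpacked with Python's `struct` module, which is a quick way to check your reading of the layout. This is a sketch only: there are no options, both checksums are left at zero, and the function names and addresses are invented for the example.

```python
import socket
import struct

IPV4_FMT = "!BBHHHBBH4s4s"   # 20 bytes, network byte order
TCP_FMT = "!HHIIHHHH"        # 20 bytes, network byte order

def build_headers(src_ip, dst_ip, src_port, dst_port, seq, payload_len=0):
    """Pack a minimal IPv4 + TCP SYN header pair (checksums not computed)."""
    ipv4 = struct.pack(
        IPV4_FMT,
        (4 << 4) | 5,         # version 4, IHL 5 words = 20 bytes
        0,                    # DSCP / ECN
        40 + payload_len,     # total length of the IP packet
        0,                    # identification
        0x4000,               # flags: don't fragment; fragment offset 0
        64,                   # TTL
        socket.IPPROTO_TCP,   # protocol number 6
        0,                    # header checksum (left as zero here)
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
    )
    tcp = struct.pack(
        TCP_FMT,
        src_port, dst_port, seq, 0,
        (5 << 12) | 0x002,    # data offset 5 words, SYN flag set
        65535,                # window size
        0, 0,                 # checksum, urgent pointer
    )
    return ipv4 + tcp

def parse_headers(raw):
    """Unpack the fields a capture tool would show for the first 40 bytes."""
    ver_ihl, _, total_len, _, _, ttl, proto, _, src, dst = struct.unpack(IPV4_FMT, raw[:20])
    ip_len = (ver_ihl & 0x0F) * 4
    sport, dport, seq, _, off_flags, window, _, _ = struct.unpack(TCP_FMT, raw[ip_len:ip_len + 20])
    return {
        "version": ver_ihl >> 4,
        "ip_header_len": ip_len,
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "src_port": sport,
        "dst_port": dport,
        "seq": seq,
        "tcp_header_len": (off_flags >> 12) * 4,
        "syn": bool(off_flags & 0x002),
        "window": window,
    }

raw = build_headers("192.0.2.1", "198.51.100.2", 51000, 443, seq=1000)
fields = parse_headers(raw)
print(len(raw), fields)
```

The `!` prefix in the format strings selects network (big-endian) byte order, which is what both headers use on the wire.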
Numbers worth knowing

Latencies to keep in your head.

What                                          Value
Same datacentre RTT                           ~0.2–0.5 ms
Same metro RTT                                ~1–3 ms
Cross-continent RTT (US east ↔ west)          ~60–80 ms
Trans-Atlantic RTT (NYC ↔ London)             ~70–80 ms
Trans-Pacific RTT (US ↔ Tokyo)                ~110–130 ms
TCP three-way handshake                       1 × RTT
TLS 1.3 handshake (full)                      1 × RTT
TLS 1.3 with 0-RTT data                       0 × RTT (with replay risk)
DNS lookup (cached)                           ~1–10 ms
DNS lookup (uncached, 4 hops)                 ~50–200 ms
Slow-start to fill a 100 Mbps × 80 ms pipe    ~5–7 RTTs

These are order-of-magnitude figures. The exact numbers depend on the path, the hardware on either end, and the time of day. Treat them as a sanity check — "is the latency I'm measuring in roughly the right neighbourhood" — rather than as targets to optimise against.
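The slow-start figure can be reproduced with a few lines of arithmetic. The sketch assumes a 1460-byte MSS, an initial window of 10 segments (the value proposed in RFC 6928), a clean doubling every RTT, and no loss or delayed-ACK effects; the function names are invented.

```python
import math

def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
    return bandwidth_bps / 8 * rtt_s

def slow_start_rtts(bandwidth_bps, rtt_s, mss=1460, initial_cwnd=10):
    """Count the doublings needed for cwnd to reach the BDP, one per RTT."""
    target_segments = math.ceil(bdp_bytes(bandwidth_bps, rtt_s) / mss)
    cwnd, rtts = initial_cwnd, 0
    while cwnd < target_segments:
        cwnd *= 2
        rtts += 1
    return rtts

pipe = bdp_bytes(100e6, 0.080)          # about 1 MB in flight
rtts = slow_start_rtts(100e6, 0.080)    # 10, 20, 40, ... segments
print(round(pipe), rtts)
```

Under these assumptions the answer is 7 RTTs, at the upper end of the range in the table, which at 80 ms per round trip is over half a second before the connection is using the full link.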

Common mistakes

Six that cost real outage time.

  1. Saying "the network is unreliable" without measuring it.
    It's a useful framing for design — every network call can fail, so your code should expect that. But during an incident it's a cul-de-sac. Reach for numbers instead. The retransmit rate per minute, the connection-setup time at p50 and p99, the per-region tail latency. Once you have a graph, you can compare it to last week and decide whether the network is actually misbehaving or whether something in your application changed.
  2. Copy-pasting TCP buffer sysctls from old blog posts.
    Linux has auto-tuned send and receive buffers since the 2.6 days. The values from 2008 tutorials are usually smaller than the current defaults, and applying them on a modern kernel can make throughput worse, not better. There are real cases where manual tuning helps — very high bandwidth-delay product paths, very lossy links — but they're narrow. Measure first, then tune; never the other way round.
  3. Assuming DNS lookups are instant.
    A warm cache on a healthy resolver returns in single-digit milliseconds. A cold lookup against a slow or misconfigured authoritative server can take seconds, and the latency you see is the worst case across whichever resolver your platform happens to ask. When a request "sometimes takes 5 seconds", DNS is one of the first places to look. Cache at every layer where it's safe to, and pick TTLs that match how often you actually plan to change records.
  4. Forgetting about NAT in service-to-service code.
    Inside a typical home or corporate network, your machine has a private address that the public internet can't route to. Code that reports its own address to a peer (so the peer can connect back) breaks the moment NAT is in the path. WebRTC, VoIP signalling, and peer-to-peer protocols all run into this; client-initiated protocols such as WebSockets usually don't, because the connection is opened outbound. The standard answers are STUN (to discover your public address), TURN (to relay traffic when the NAT type makes direct connection impossible), and ICE (the algorithm that tries each in turn).
  5. Treating BGP as something that just works.
    It mostly does, but the failure modes are big. Facebook went offline globally in October 2021 after a configuration change on its backbone led to the BGP routes for its own DNS servers being withdrawn. AWS has had multiple regional outages traced to BGP. The protocol trusts whatever its neighbours say, so a misconfigured advertisement from one network can pull traffic from many others through it. RPKI helps, because networks that validate route origins drop the obviously invalid ones, but only to the extent that operators have opted in. Worth knowing what your transit provider is actually doing.
  6. Treating cloud networking as if it were the same as physical networking.
    AWS VPC, GCP VPC, and Azure VNet each have their own model layered on top of standard IP. Security groups behave differently from network ACLs. Default MTUs vary, and an MTU mismatch on a path through a VPN gateway can silently drop any packet larger than the path MTU, which often shows up as TLS handshakes that hang once the server sends its certificate chain. Hairpinning (a request from inside that goes out to the load balancer and back in) is sometimes free, sometimes blocked. The textbook gets you most of the way; the cloud documentation is the rest.
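For the first mistake in the list, "reach for numbers" can start as simply as computing p50 and p99 over connection-setup times. The sketch uses the nearest-rank definition of a percentile, and the sample values are hypothetical, chosen to show how a single slow connection hides from the median.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical connection-setup times in milliseconds for one minute of traffic.
connect_ms = [12, 11, 13, 12, 14, 11, 12, 13, 12, 250]

p50 = percentile(connect_ms, 50)
p99 = percentile(connect_ms, 99)
print(f"p50={p50} ms  p99={p99} ms")
```

With these samples the median looks healthy while the tail does not, which is the comparison to make against last week's graph before deciding the network is at fault.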
Suggested sequences

Reading progressions

Three ordered paths through this material — pick the one that matches where you are now.

Path 01 · Beginner
The transport stack, bottom up

Start at the physical layer and climb to the application. Each deep dive builds directly on the one before it.

  1. Bytes on the wire — encoding & framing
  2. Sockets — the OS abstraction
  3. TCP — reliable delivery & congestion
  4. TLS 1.3 — the secure envelope
  5. DNS — resolving names to addresses
Path 02 · Intermediate
Protocols & production traffic

How the application-layer protocols you use every day are built on top of that transport stack.

  1. HTTP — request/response & versions
  2. HTTPS / TLS — securing the channel
  3. CDN — global edge & TLS termination
  4. Load Balancing — L4 vs L7
  5. HTTP Flow Simulator ↗
  6. TCP Congestion Simulator ↗
Path 03 · Infrastructure
Routing, BGP & global resilience

The systems that make the internet work at scale: autonomous systems, anycast, service meshes.

  1. BGP — inter-domain routing
  2. DNS — anycast & resolution at scale
  3. Service Discovery — Consul & Kubernetes
  4. Reverse Proxy — NGINX & Envoy
  5. API Gateway — edge enforcement
Where to go next

Keep going.

The deep dives below are protocol-by-protocol companions to this curriculum. The recommended pace is one or two a week alongside the textbook reading. That gives the spec time to make sense and lets the operational details settle in. The two adjacent study paths — distributed systems and API design — are useful next steps once the networking layer is comfortable.