IP
The Internet Protocol is the thin layer every other protocol assumes works. It
delivers a datagram from one host to another across any number of networks in
between — usually. It doesn't promise the packet will arrive, arrive in order, or
arrive only once. TCP, QUIC, and every reliability mechanism above it exist to fill
that gap. This page walks IPv4 and IPv6 side by side: the addresses, the headers, how
a host finds the next hop, why packets get fragmented, and what happens inside the
Linux kernel when you call send().
What IP actually does
IP delivers a single datagram from a source address to a destination address. That's the whole job. No connection, no acknowledgement, no retransmission, no ordering guarantee. The network can drop the packet, duplicate it, reorder it against other packets, or hold onto it long enough that it arrives after something you sent later. Every reliability property you associate with the internet lives in a higher layer.
This is on purpose. The 1974 Cerf and Kahn paper that became TCP/IP made a deliberate split: the network does the smallest job it can (route a datagram), and the endpoints do the rest (sequence numbers, retransmissions, flow control). That division — sometimes called the end-to-end principle — is why the internet scaled to billions of hosts. Routers keep no per-flow state. Add a new transport protocol on top (QUIC, SCTP) and nothing in the middle has to change.
ping and traceroute tell
you whether the IP layer is delivering at all. If those work but TCP doesn't, the
problem is in TCP, the firewall, or the application. Mixing layers in your head is how
you waste an afternoon.
IPv4 addressing and CIDR
An IPv4 address is 32 bits, usually written as four decimal octets separated by dots:
192.0.2.5. That gives you about 4.3 billion addresses, which seemed plenty
in 1981 and ran out in practice on 3 February 2011, when IANA handed the last
/8 blocks to the regional registries. The RIRs drained their own free
pools over the next few years; ARIN (North America) ran out in 2015.
Addresses were originally classful — class A was a /8 (16 million
hosts), class B was a /16 (65 thousand), class C was a /24
(254 useable). This wasted huge swathes of address space; a university that needed
2,000 hosts had to take a full /16 and waste more than 63,000 addresses. CIDR
(RFC 1519, 1993; obsoleted by RFC 4632 in 2006) replaced that with variable-length prefixes. 10.0.5.0/24
means "the 24-bit prefix 10.0.5, then 8 bits of host", giving 256
addresses (254 useable — first is the network, last is the broadcast). A
/30 gives four addresses (two useable). A /16 gives 65,536.
Every bit shorter doubles the block; every bit longer halves it.
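The block arithmetic above can be checked with Python's standard ipaddress module. This is a minimal sketch: the helper name describe is made up for illustration, and the "useable" count assumes the classic convention of reserving the network and broadcast addresses.

```python
import ipaddress


def describe(cidr: str) -> dict:
    """Summarise an IPv4 block: its edges, its size, and its useable host count."""
    net = ipaddress.ip_network(cidr)
    total = net.num_addresses  # 2 ** (32 - prefix length)
    # Assumption: prefixes up to /30 reserve the first and last address.
    useable = total - 2 if net.prefixlen <= 30 else total
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "total": total,
        "useable": useable,
    }


for cidr in ("10.0.5.0/24", "10.0.5.0/30", "10.0.0.0/16"):
    print(cidr, describe(cidr))
```

Each bit removed from the prefix doubles the total, which is why the /16 reports 65,536 where the /24 reports 256.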
Three ranges are reserved as private (RFC 1918) and never appear on the public
internet: 10.0.0.0/8, 172.16.0.0/12, and
192.168.0.0/16. Every home router NATs hosts inside one of these onto a
single public IP. There's also the carrier-grade NAT range
100.64.0.0/10 (RFC 6598), used by ISPs that have run out of public IPv4
and now NAT many customers behind one address. Loopback is
127.0.0.0/8 — yes, a whole /8 for one machine to talk to
itself. Link-local autoconfiguration sits at 169.254.0.0/16, which is what
you see when DHCP fails.
10.0.0.0/8         16,777,216 addresses  private (RFC 1918)
172.16.0.0/12       1,048,576 addresses  private (RFC 1918)
192.168.0.0/16         65,536 addresses  private (RFC 1918)
100.64.0.0/10       4,194,304 addresses  CGNAT (RFC 6598)
127.0.0.0/8        16,777,216 addresses  loopback
169.254.0.0/16         65,536 addresses  link-local (APIPA)
224.0.0.0/4       268,435,456 addresses  multicast
0.0.0.0/0       4,294,967,296 addresses  "everything" (default route)
IPv6 addressing
IPv6 is 128 bits: 2^128 addresses, or roughly 3.4 × 10^38. The
usual stat is that this is enough to give every grain of sand on Earth several
trillion addresses. The point isn't to use them all. It's that you can hand out a
/64 to every LAN and never worry about address scarcity again. A single
/64 has 2^64 addresses, about four billion times the entire
IPv4 address space.
Addresses are written as eight groups of four hex digits separated by colons:
2001:0db8:0000:0000:0000:0000:0000:0001. That's painful, so two
shorthands collapse it. Leading zeros in a group drop: 2001:db8:0:0:0:0:0:1.
A single run of all-zero groups collapses to ::, which gives
2001:db8::1. You only get to use :: once per address —
otherwise the parser can't tell how many groups you dropped.
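Both shorthands are implemented by Python's standard ipaddress module, which makes it a quick way to check a hand-written address. A small sketch; is_valid_ipv6 is a hypothetical helper name.

```python
import ipaddress

addr = ipaddress.IPv6Address("2001:0db8:0000:0000:0000:0000:0000:0001")
print(addr.compressed)  # 2001:db8::1
print(addr.exploded)    # 2001:0db8:0000:0000:0000:0000:0000:0001


def is_valid_ipv6(text: str) -> bool:
    """Return True if the text parses as a single IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ipaddress.AddressValueError:
        return False


# A second "::" is rejected because the number of dropped groups is ambiguous.
print(is_valid_ipv6("2001:db8::1"))     # True
print(is_valid_ipv6("2001:db8::1::2"))  # False
```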
A few prefixes worth recognising. fe80::/10 is link-local — every IPv6
interface gets one automatically, and it's only valid on that one link, so you have to
name the interface when you use it: ping6 fe80::1%eth0.
fc00::/7 (in practice fd00::/8) is the unique local address
range — IPv6's answer to RFC 1918. Global unicast lives in 2000::/3;
most addresses you see in the wild start with 2 or 3. The
documentation prefix is 2001:db8::/32 — use it in examples instead of
real space.
Hosts pick their own address through SLAAC (Stateless Address Autoconfiguration). The
router advertises the /64 prefix, the host appends an interface
identifier, and that's the address. Classically the interface ID was EUI-64 — the MAC
address with fffe stuffed in the middle and one bit flipped — which
leaked your MAC across every network you joined. Privacy extensions (RFC 4941, since replaced by RFC 8981) swap
that for a random ID that rotates, typically daily, and mainstream desktop and mobile operating systems enable them by default. So
a laptop on a home network usually has at least two global IPv6 addresses on the same interface: a
stable one and a temporary one, and outbound connections prefer the temporary one.
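The classic EUI-64 construction is mechanical enough to write out. The sketch below assumes a 48-bit MAC in colon-separated hex and a /64 prefix; the function names are illustrative, and the MAC and prefix are arbitrary examples (the prefix comes from the documentation range).

```python
import ipaddress


def eui64_interface_id(mac: str) -> bytes:
    """Build the 64-bit interface ID: flip one bit, insert ff:fe in the middle."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    return bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Append the EUI-64 interface ID to an advertised /64 prefix."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("SLAAC expects a /64 prefix")
    return net.network_address + int.from_bytes(eui64_interface_id(mac), "big")


print(slaac_address("2001:db8:1:2::/64", "a4:2b:b0:11:22:33"))
# 2001:db8:1:2:a62b:b0ff:fe11:2233
```

The MAC is plainly recoverable from the result, which is the leak that privacy extensions were introduced to close.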
Use a /64 for every LAN. SLAAC
requires it. Most networking gear assumes it. If you find yourself wanting a
/120 to "save space", stop: you have 2^64 addresses, the LAN
holds maybe 200 hosts, and saving address space in IPv6 is the one optimisation that
pays nothing and breaks tooling.
Walking an IPv4 header
The IPv4 header is 20 bytes when there are no options, which there usually aren't. It's prepended to every packet that leaves your machine, and routers along the way look at maybe four of its fields — destination IP, TTL, header checksum, and total length — before forwarding.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |   DSCP    |ECN|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|     Fragment Offset     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      TTL      |    Protocol   |        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source IP Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Destination IP Address                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options (variable, rare)                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Field by field: Version is 4 for IPv4. IHL is the header length in 32-bit words, normally 5, meaning 20 bytes. DSCP (originally Type of Service) is the QoS marking; routers can prioritise certain classes. ECN is the two-bit Explicit Congestion Notification, which lets routers mark packets instead of dropping them when queues fill. Total Length is header plus payload, max 65,535 bytes.
Identification, Flags, and Fragment
Offset are the fragmentation machinery — if a packet has to be split, all
fragments share the same ID, and the offset says where in the original this fragment
belongs. The flags include "more fragments" and "don't fragment". TTL
(time to live) is a hop counter that starts at some value (usually 64 or 128) and
decrements at every router; when it hits zero, the router drops the packet and sends
back an ICMP Time Exceeded. That's what traceroute exploits — it sends
packets with TTL 1, then 2, then 3, and reads the ICMP replies to map the path.
Protocol says what's in the payload: 6 is TCP, 17 is UDP, 1 is ICMP, 47 is GRE. Header Checksum covers the header only; the payload is the higher layer's problem. Because TTL changes at every hop, the checksum has to be recomputed at every hop — which is one of the reasons IPv6 dropped it.
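To make the field layout and the per-hop checksum work concrete, here is a minimal sketch using only the standard library. It handles the 20-byte, no-options case only; build_header, parse_header, and forward are illustrative names, and the addresses come from the documentation ranges.

```python
import socket
import struct

IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")  # 20 bytes, no options


def checksum(header: bytes) -> int:
    """Ones'-complement sum of the header's 16-bit words, complemented."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(src, dst, payload_len, ttl=64, proto=6, ident=0, df=True):
    flags_frag = 0x4000 if df else 0  # DF bit set, fragment offset 0
    raw = IPV4_HEADER.pack((4 << 4) | 5, 0, 20 + payload_len, ident,
                           flags_frag, ttl, proto, 0,
                           socket.inet_aton(src), socket.inet_aton(dst))
    return raw[:10] + struct.pack("!H", checksum(raw)) + raw[12:]


def parse_header(raw: bytes) -> dict:
    (ver_ihl, _dscp_ecn, total_len, ident, flags_frag,
     ttl, proto, _csum, src, dst) = IPV4_HEADER.unpack(raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_bytes": (ver_ihl & 0x0F) * 4,
        "total_length": total_len,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        "fragment_offset": (flags_frag & 0x1FFF) * 8,
        "ttl": ttl,
        "protocol": proto,
        # Summing a header that includes a correct checksum yields zero.
        "checksum_ok": checksum(raw[:20]) == 0,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


def forward(raw: bytes) -> bytes:
    """What each router does: decrement TTL, then recompute the checksum."""
    if raw[8] <= 1:
        raise ValueError("TTL expired; a router would send ICMP Time Exceeded")
    stale = raw[:8] + bytes([raw[8] - 1]) + raw[9:10] + b"\x00\x00" + raw[12:]
    return stale[:10] + struct.pack("!H", checksum(stale)) + stale[12:]


hdr = build_header("192.0.2.5", "198.51.100.7", payload_len=100)
print(parse_header(hdr))
print(parse_header(forward(hdr))["ttl"])  # 63
```

Changing the TTL byte without recomputing leaves checksum_ok false, which is the per-hop cost the text describes.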
Walking an IPv6 header
The IPv6 header is a fixed 40 bytes: 16 for the source address, 16 for the destination, and 8 for everything else. Those 8 bytes carry the version, traffic class, flow label, payload length, next header, and hop limit. There are no inline options and no checksum, which simplifies header processing in a router.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |               Flow Label              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Payload Length        |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                   Source Address (16 bytes)                   |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                 Destination Address (16 bytes)                |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Version is 6. Traffic Class is the DSCP/ECN equivalent. Flow Label is a 20-bit identifier the source can set so that load-balancers and ECMP routers keep packets in a flow on the same path without having to inspect L4 headers. Payload Length covers only the payload (not the header). Next Header serves the same role as IPv4's Protocol field, but it can also point at an extension header, which has its own Next Header, and so on: a linked list of optional headers between the IP header and the transport. Hop Limit is exactly TTL with a more honest name.
What's missing matters as much as what's there. No header checksum: the IPv6 spec assumes the link layer and the transport layer (TCP/UDP, which both fold a pseudo-header into their checksum) already do the work, so re-checksumming at every hop is wasted effort. No in-header fragmentation fields: fragmentation can only happen at the source, never at a router, and even there it lives in an extension header rather than the main one. No options: anything optional goes into an extension header that intermediate routers can skip.
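The fixed layout means the whole header unpacks with one struct call and a few shifts. A minimal sketch, with illustrative function names and documentation-range addresses; it does not walk extension headers.

```python
import ipaddress
import struct


def build_ipv6_header(src, dst, payload_len, next_header=6, hop_limit=64,
                      traffic_class=0, flow_label=0):
    # Version (4 bits), Traffic Class (8 bits), Flow Label (20 bits) share one word.
    first_word = (6 << 28) | (traffic_class << 20) | (flow_label & 0xFFFFF)
    return (struct.pack("!IHBB", first_word, payload_len, next_header, hop_limit)
            + ipaddress.IPv6Address(src).packed
            + ipaddress.IPv6Address(dst).packed)


def parse_ipv6_header(raw: bytes) -> dict:
    if len(raw) < 40:
        raise ValueError("an IPv6 header is always 40 bytes")
    first_word, payload_len, next_header, hop_limit = struct.unpack("!IHBB", raw[:8])
    return {
        "version": first_word >> 28,
        "traffic_class": (first_word >> 20) & 0xFF,
        "flow_label": first_word & 0xFFFFF,
        "payload_length": payload_len,
        "next_header": next_header,  # 6 = TCP, 17 = UDP, 44 = Fragment header
        "hop_limit": hop_limit,
        "src": ipaddress.IPv6Address(raw[8:24]),
        "dst": ipaddress.IPv6Address(raw[24:40]),
    }


hdr6 = build_ipv6_header("2001:db8::1", "2001:db8::2", payload_len=20,
                         flow_label=0xABCDE)
print(len(hdr6))  # 40
print(parse_ipv6_header(hdr6))
```

Note that there is no checksum to verify and nothing to recompute when the hop limit changes.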
ARP and NDP — finding the next hop's MAC
IP gives you a destination address. The Ethernet card needs a MAC address. Something has to translate between them. On IPv4 that's ARP (Address Resolution Protocol, RFC 826, 1982); on IPv6 it's NDP (Neighbor Discovery Protocol, RFC 4861), which rides on ICMPv6.
The IPv4 flow: the host wants to send to 10.0.0.7 and consults its
routing table to decide that this is on the local link, not via a gateway. It
broadcasts an ARP request to the Ethernet broadcast address
ff:ff:ff:ff:ff:ff: "who has 10.0.0.7, tell
10.0.0.5?". Every host on the segment receives it; the one that owns
10.0.0.7 replies unicast with its MAC. The requester caches the answer
in its ARP table, usually for a few minutes, and uses it for every subsequent packet
to that IP. arp -n (or ip neigh on modern Linux) shows the
table.
Gratuitous ARP is a host announcing its own MAC unprompted, usually when it boots or its address changes. It's how VRRP/keepalived signals a failover: the new primary sends a gratuitous ARP for the virtual IP so every switch and host updates its caches and starts sending traffic to the new MAC.
NDP does the same job over ICMPv6 messages. A Neighbor Solicitation (NS) is sent to a solicited-node multicast address, derived from the low 24 bits of the target's IPv6 address, so only hosts that might own that address pay attention. The owner replies with a Neighbor Advertisement (NA). Router Solicitation (RS) and Router Advertisement (RA) handle gateway discovery and SLAAC prefix announcement. Duplicate Address Detection (DAD) runs before a host claims a tentative address: it sends an NS for the address it wants, and if anybody answers, the address is already taken and the host must not use it. A host using randomly generated interface IDs simply generates another.
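The solicited-node address is a fixed prefix, ff02::1:ff00:0/104, with the low 24 bits of the target appended, and the Ethernet multicast MAC is 33:33 followed by the low 32 bits of that group. A short sketch with illustrative function names:

```python
import ipaddress

SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(target: str) -> ipaddress.IPv6Address:
    """Multicast group a Neighbor Solicitation for `target` is sent to."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Ethernet destination for an IPv6 multicast group: 33:33 + low 32 bits."""
    low32 = (int(group) & 0xFFFFFFFF).to_bytes(4, "big")
    return "33:33:" + ":".join(f"{b:02x}" for b in low32)


group = solicited_node("fe80::a62b:b0ff:fe11:2233")
print(group)                 # ff02::1:ff11:2233
print(multicast_mac(group))  # 33:33:ff:11:22:33
```

Only hosts whose addresses end in the same 24 bits join that group, which is why the NS reaches far fewer machines than an ARP broadcast.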
Fragmentation, MTU, and PMTUD
Every link has a Maximum Transmission Unit — the largest IP packet it can carry in one frame. Ethernet is 1500 bytes. Jumbo-frame Ethernet (when both ends and every switch in between agree) is 9000. PPPoE for DSL clips off 8 bytes for a 1492 MTU. Tunnels (GRE, IPsec, WireGuard, VXLAN) add encapsulation overhead and shrink the effective MTU further. The spec sets the minimum IPv6 path MTU at 1280 bytes; no IPv6 link is allowed to carry less.
What happens when a packet is bigger than the next link's MTU depends on the IP version. In IPv4, by default, the router fragments: it chops the packet into MTU-sized pieces, copies the IP header onto each with the fragment-offset field adjusted, and forwards them. The destination then waits for all the fragments and reassembles. This is slow, it multiplies the chance of total loss (lose any one fragment and the whole packet is gone), and it hides packet-loss signals from TCP. In practice almost every modern stack sets the "don't fragment" bit and asks the network to complain instead.
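The fragment arithmetic is easy to sketch. The version below assumes a 20-byte header with no options and works in payload bytes; fragment and datagram_loss are illustrative names, and the loss model assumes fragments are lost independently.

```python
def fragment(payload_len: int, mtu: int, header_len: int = 20):
    """Return (offset_in_8_byte_units, data_bytes, more_fragments) per fragment."""
    # The offset field counts 8-byte units, so every fragment except the
    # last must carry a multiple of 8 bytes of data.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    frags, offset = [], 0
    while offset < payload_len:
        size = min(max_data, payload_len - offset)
        more = offset + size < payload_len
        frags.append((offset // 8, size, more))
        offset += size
    return frags


def datagram_loss(per_packet_loss: float, fragments: int) -> float:
    """Chance the whole datagram is lost if any single fragment is lost."""
    return 1 - (1 - per_packet_loss) ** fragments


print(fragment(4000, 1500))
# [(0, 1480, True), (185, 1480, True), (370, 1040, False)]
print(round(datagram_loss(0.01, 3), 6))  # 0.029701
```

With 1% loss per packet, three fragments put the datagram's loss rate at about 3%, and the cost grows with every extra fragment.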
That complaint is Path MTU Discovery (PMTUD, RFC 1191 for IPv4, RFC 8201 for IPv6). If a router can't forward because the packet is too big and DF is set, it drops the packet and sends back ICMP "Fragmentation Needed" (IPv4) or ICMPv6 "Packet Too Big" (IPv6). The ICMP message includes the MTU of the next link. The source caches that MTU for the destination, retransmits smaller, and the connection continues.
In IPv6 there is no in-path fragmentation at all. Routers never fragment. If a packet is too big, the router sends ICMPv6 Packet Too Big and that's it. Only the source can fragment, and even then only via an extension header — which modern stacks almost never do. They just lower the MTU and re-send.
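The discovery loop can be simulated in a few lines. This is a toy model, not a network test: link_mtus is a hypothetical list of per-hop MTUs, the sender starts at its own interface MTU, and icmp_filtered models a path where the error message never makes it back.

```python
def discover_path_mtu(link_mtus, initial_mtu, icmp_filtered=False, max_tries=10):
    """Return the MTU the sender settles on, or None if it never gets through."""
    size = initial_mtu
    for _ in range(max_tries):
        # The first link too small for the packet drops it.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return size  # the packet fit every link and was delivered
        if icmp_filtered:
            continue     # no error arrives, so the sender retries at the same size
        size = bottleneck  # the ICMP error carried the next-hop MTU
    return None


path = [1500, 1492, 1420, 1500]  # e.g. Ethernet, PPPoE, a tunnel, Ethernet
print(discover_path_mtu(path, 1500))                      # 1420
print(discover_path_mtu(path, 1500, icmp_filtered=True))  # None
```

The second call is the failure mode to watch for: with ICMP filtered the sender never learns a smaller size and every full-size packet is dropped.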
How a packet leaves a Linux host
You call send() on a TCP socket. A lot has to happen before the bytes hit
the NIC, and the path is the same whether the data is HTTP, gRPC, or a custom protocol.
First the kernel copies the data into the socket's send buffer, and the TCP layer
frames it, attaches a TCP header with sequence numbers, and hands it down. The IP
layer asks the FIB (Forwarding Information Base — the in-kernel routing table) where
the destination lives: which output interface and which next-hop gateway. You can see
what the kernel would decide with ip route get 8.8.8.8, which returns the
interface, source address, and gateway it would pick for that destination.
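The lookup itself is a longest-prefix match: of all routes that contain the destination, the most specific wins. The sketch below is a linear scan over a made-up table (the real FIB uses a trie); the routes, gateways, and device names are hypothetical.

```python
import ipaddress

# Hypothetical routing table: (prefix, gateway or None for on-link, device)
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1", "wlan0"),
    ("192.168.1.0/24", None, "wlan0"),
    ("10.8.0.0/16", "10.8.0.1", "wg0"),
    ("10.8.5.0/24", None, "wg1"),
]


def lookup(dest: str, routes=ROUTES):
    """Return the (prefix, gateway, device) of the most specific matching route."""
    addr = ipaddress.ip_address(dest)
    matches = [(ipaddress.ip_network(prefix), gw, dev)
               for prefix, gw, dev in routes
               if addr in ipaddress.ip_network(prefix)]
    if not matches:
        return None  # no route to host
    net, gw, dev = max(matches, key=lambda route: route[0].prefixlen)
    return (str(net), gw, dev)


print(lookup("8.8.8.8"))      # ('0.0.0.0/0', '192.168.1.1', 'wlan0')
print(lookup("192.168.1.7"))  # ('192.168.1.0/24', None, 'wlan0')
print(lookup("10.8.5.9"))     # ('10.8.5.0/24', None, 'wg1')
```

A gateway of None stands for an on-link destination, where the next hop is the destination itself.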
With the route chosen, the netfilter hooks fire. These are the points where iptables /
nftables rules get to inspect, modify, or drop the packet: OUTPUT and then POSTROUTING
for locally generated traffic; PREROUTING, FORWARD, and POSTROUTING for transit.
Connection tracking lives here, which is how stateful firewalls and NAT work.
Next the kernel needs the next hop's MAC address. It checks the neighbour
cache (ip neigh). On a hit, it stamps the destination MAC onto the frame.
On a miss, it queues the packet and triggers ARP or NDP; when the reply arrives, the
packet is released.
Finally the packet goes through traffic control (tc) for any queueing
discipline you have configured, then into the driver's transmit ring, and the NIC
DMA-pulls the frame out of memory and puts it on the wire. Inbound, the path runs in
reverse: NIC interrupt, driver, netfilter PREROUTING, routing decision (local?
forward?), netfilter INPUT or FORWARD, then either delivered to a socket or sent back
out via the FIB. tcpdump taps in between the netfilter hooks and the
driver, which is why it shows traffic that iptables would otherwise drop on the way in.
$ ip route get 8.8.8.8
8.8.8.8 via 192.168.1.1 dev wlan0 src 192.168.1.42 uid 1000
$ ip neigh
192.168.1.1 dev wlan0 lladdr a4:2b:b0:11:22:33 REACHABLE
fe80::a62b:b0ff:fe11:2233 dev wlan0 lladdr a4:2b:b0:11:22:33 router STALE
$ ip -s link show wlan0
2: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
    link/ether 14:ab:c5:de:ef:01
    RX: bytes  packets  errors  dropped overrun mcast
    4128947131 3984512  0       12      0       18342
IPv6 transition mechanisms
IPv6 was first specified in 1995 (RFC 1883), was revised as a Draft Standard in 1998 (RFC 2460), and became a full Internet Standard in 2017 (RFC 8200), yet adoption is still only partial. Google's IPv6 adoption graph shows roughly half of its traffic arriving over IPv6, with country-level numbers ranging from strong majorities (India, France) down to single digits (China). So the transition has stretched across decades and produced several ways for IPv4 and IPv6 to coexist.
Dual-stack is the default. The host runs both protocols at once, gets an IPv4 and an IPv6 address, and uses whichever the destination supports. DNS returns AAAA records (IPv6) and A records (IPv4), and the resolver hands both to the application. The application then decides which to try first, and what to do when one is slow or broken. Happy Eyeballs (RFC 8305) is the algorithm modern stacks use: fire off a connection attempt over IPv6, start a parallel IPv4 attempt after a short delay (typically 250 ms), and use whichever connects first. Without it, a broken IPv6 path makes every connection feel like it hangs for thirty seconds before falling back.
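The ordering half of Happy Eyeballs can be sketched without touching the network: interleave the two address families, IPv6 first, and stagger the attempts. This simplification ignores the RFC's destination-address sorting and DNS timing rules; interleave and attempt_schedule are illustrative names, and the 250 ms delay is the typical value mentioned above.

```python
import ipaddress
from itertools import zip_longest


def interleave(addresses, prefer_v6=True):
    """Alternate address families so one broken family cannot starve the other."""
    v6 = [a for a in addresses if ipaddress.ip_address(a).version == 6]
    v4 = [a for a in addresses if ipaddress.ip_address(a).version == 4]
    first, second = (v6, v4) if prefer_v6 else (v4, v6)
    ordered = []
    for a, b in zip_longest(first, second):
        if a is not None:
            ordered.append(a)
        if b is not None:
            ordered.append(b)
    return ordered


def attempt_schedule(addresses, delay=0.25):
    """(start time in seconds, address) for each staggered connection attempt."""
    return [(round(i * delay, 3), addr)
            for i, addr in enumerate(interleave(addresses))]


resolved = ["192.0.2.1", "2001:db8::1", "192.0.2.2", "2001:db8::2"]
for start, addr in attempt_schedule(resolved):
    print(f"t+{start:.2f}s  connect to {addr}")
```

In real code you would not schedule attempts by hand; Python's asyncio loop.create_connection, for instance, accepts a happy_eyeballs_delay argument that does the racing for you.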
Older tunnel-based transition mechanisms — 6to4, Teredo, ISATAP — are largely dead.
6to4 (RFC 3056) wrapped IPv6 packets in IPv4 and routed them through anycast relays
at 192.88.99.1; the relays were unreliable and the prefix was deprecated
in 2015. Teredo wrapped IPv6 in UDP to cross NATs and was Microsoft's answer for
Windows XP; it's been off by default since Windows 10.
The modern answer is NAT64 with DNS64. The network is IPv6-only. When a client looks
up example.com and there's no AAAA record, the DNS64 resolver synthesises
one by prepending a well-known prefix (64:ff9b::/96, RFC 6052) to the
IPv4 address. The client connects to that synthesised IPv6 address; the NAT64 gateway
at the edge translates the IPv6 packet into IPv4 and forwards it to the real server,
then translates the replies back. T-Mobile US has run its mobile network this way
since 2014, and most large carriers in Asia followed. It's why Apple's App Store
requires apps to work on a NAT64/IPv6-only network.
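With the well-known /96 prefix, DNS64 synthesis is just placing the 32-bit IPv4 address in the low bits of the IPv6 address. A minimal sketch with illustrative function names; other prefix lengths defined in RFC 6052 use a different bit layout and are not handled here.

```python
import ipaddress

WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize(ipv4: str) -> ipaddress.IPv6Address:
    """What DNS64 returns as the AAAA answer for an IPv4-only name."""
    base = int(WELL_KNOWN_PREFIX.network_address)
    return ipaddress.IPv6Address(base | int(ipaddress.IPv4Address(ipv4)))


def extract(ipv6: str) -> ipaddress.IPv4Address:
    """What the NAT64 gateway recovers before forwarding over IPv4."""
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in WELL_KNOWN_PREFIX:
        raise ValueError("not a NAT64 well-known-prefix address")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)


print(synthesize("192.0.2.33"))      # 64:ff9b::c000:221
print(extract("64:ff9b::c000:221"))  # 192.0.2.33
```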
Subnetting in practice
Most software engineers don't subnet by hand. Where it shows up is cloud VPCs, Kubernetes pod and service CIDRs, on-prem data centre layouts, and the occasional home lab with too many VLANs. The pattern is always the same: take a big block, cut it into smaller ones, assign one per zone or one per workload.
A typical AWS VPC is a /16 (65,536 addresses). Split that into four
/18s for four availability zones, then carve each /18 into
public and private /19s. Inside each /19 you might break
off /24s for individual subnets. Every cut in half adds one bit to the prefix
length. The math is just powers of two: /24 is 256
addresses, /23 is 512, /22 is 1024, and so on up to a
/16's 65,536.
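The same carve-up can be done with the standard ipaddress module rather than by hand. The VPC block and the zone names below are placeholders chosen for illustration.

```python
import ipaddress

vpc = ipaddress.ip_network("10.42.0.0/16")  # hypothetical VPC block

# Two extra prefix bits give four /18s, one per availability zone.
per_az = list(vpc.subnets(new_prefix=18))

layout = {}
for az, block in zip(("az-a", "az-b", "az-c", "az-d"), per_az):
    # One more bit splits each /18 into two /19 halves.
    public, private = block.subnets(prefixlen_diff=1)
    layout[az] = {"public": public, "private": private}

for az, tiers in layout.items():
    print(az, tiers["public"], tiers["private"])

# Each /19 holds 8,192 addresses, or 32 /24 subnets.
first_public = layout["az-a"]["public"]
print(first_public.num_addresses, len(list(first_public.subnets(new_prefix=24))))
```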
Two edge prefixes are worth knowing. A /31 (RFC 3021) has exactly two
addresses and no network or broadcast — both are useable hosts. It's built for
point-to-point links, where you'd otherwise waste a /30's four-address
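block on two ends of a wire. A /32 is a single host route, and it's what
anycast services and load-balancer VIPs typically advertise inside a network: "route this
one address to me", whatever subnet it would naturally belong to. On the public internet the same trick works at a coarser grain: BGP anycast for the DNS root servers and Cloudflare's edge announces a covering prefix (commonly a /24, or a
/48 for IPv6) from many sites, because most operators filter anything more specific.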
The pod CIDR is often left at a
/16, which is 65,536 IPs: fine for a small cluster, painfully tight
once you're running thousands of pods and half the IPs are stuck on dead deployments
waiting on garbage collection. Plan the pod CIDR like a real subnet, usually a
/14 or larger, and keep it separate from the VPC CIDR so you can change
one without renumbering the other.
Common mistakes
- Assuming PMTUD always works. It depends on ICMP getting back to the sender. ICMP is filtered all over the place: corporate firewalls, careless ACLs, some misconfigured load balancers. Tunnels (WireGuard, IPsec, GRE) make it worse because the inner MTU is smaller than the outer one. When in doubt, MSS-clamp at the tunnel endpoint and stop fighting it.
- Forgetting IPv6 link-local scope. fe80::1 on its own is ambiguous, because every interface has a link-local. You always need the scope: fe80::1%eth0 on Linux, fe80::1%en0 on macOS. Tools that forgot to support this for years (some old SSH clients, some HTTP libraries) just couldn't talk to link-local destinations.
- Confusing CIDR with subnet mask. /24 and 255.255.255.0 are the same thing written two ways. CIDR counts bits of network prefix; the mask spells out the bit pattern. /25 is 255.255.255.128, /26 is 255.255.255.192. If you don't have the table memorised, write the mask in binary and count the ones.
- Picking a VPC CIDR that overlaps something else. Sooner or later you'll want to peer this VPC with another, or VPN into a corporate network that already uses part of 10.0.0.0/8, and the overlap will be painful to undo. Pick something obscure inside the private ranges (10.42.0.0/16, 172.20.0.0/16) before you commit.
- Treating the broadcast and network addresses as useable hosts. In a /24 the first address (.0) is the network and the last (.255) is the broadcast. Some DHCP misconfigurations hand them out and then things go strange. IPv6 doesn't have broadcast at all (multicast replaces it), so this particular trap is IPv4-only.
- Believing NAT is security. NAT exists because IPv4 ran out of addresses. It happens to have a side effect: unsolicited inbound connections from the public internet can't reach your hosts, because there's no port mapping for them. Stateful firewall rules give you the same property with a clearer model. IPv6 hosts have public addresses; what stops random people from connecting is the firewall, not the absence of NAT.
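Two of these mistakes, mask confusion and overlapping ranges, are cheap to check mechanically before committing to a layout. A minimal sketch using the standard ipaddress module; the helper names and the list of ranges already in use are hypothetical.

```python
import ipaddress


def prefix_to_mask(prefixlen: int) -> str:
    """/25 -> 255.255.255.128"""
    return str(ipaddress.ip_network(f"0.0.0.0/{prefixlen}").netmask)


def mask_to_prefix(mask: str) -> int:
    """255.255.255.192 -> 26"""
    return ipaddress.ip_network(f"0.0.0.0/{mask}").prefixlen


def overlapping(candidate: str, existing) -> list:
    """Return every existing block that shares addresses with the candidate."""
    cand = ipaddress.ip_network(candidate)
    return [block for block in existing
            if cand.overlaps(ipaddress.ip_network(block))]


in_use = ["10.0.0.0/16", "10.1.0.0/16", "192.168.0.0/24"]  # hypothetical
print(prefix_to_mask(25))                   # 255.255.255.128
print(mask_to_prefix("255.255.255.192"))    # 26
print(overlapping("10.0.0.0/8", in_use))    # ['10.0.0.0/16', '10.1.0.0/16']
print(overlapping("10.42.0.0/16", in_use))  # []
```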
CIDR, visualised
One way to picture CIDR is a tree of nested blocks. A /16 contains
256 /24s; each /24 contains 256 individual addresses.
Cutting bits off the right of the prefix grows the block; adding bits shrinks it.
Take one slice of that tree: 10.0.0.0/16 holds
10.0.5.0/24, which holds 10.0.5.42/32.
The same principle applies to IPv6. A /48, the usual allocation to a
site, contains 65,536 /64s. Each /64 is one LAN with
2^64 addresses on it. You almost never subnet below /64; the
address space is big enough that there's no need.
IPv4 vs IPv6 headers at a glance
Same core job, redesigned. IPv6 trades a 32-bit address for a 128-bit one, drops the header checksum, drops inline options, and adds a Flow Label that intermediate routers can use for ECMP without parsing into the TCP or UDP header.
Path MTU Discovery, visualised
The feedback loop only works if every intermediate router can return ICMP to the source. Drop it anywhere and the sender keeps trying full-size, the bottleneck router keeps dropping, and the connection stalls. This is the PMTUD black hole.
Tools — inspecting the IP layer
| Tool | Use for |
|---|---|
| ip addr | Every address on every interface, IPv4 and IPv6, scope and lifetime. |
| ip route / ip -6 route | The FIB. ip route get DEST shows what the kernel would do with a hypothetical packet. |
| ip neigh | The neighbour cache: ARP entries for IPv4, NDP entries for IPv6. |
| ping / ping6 | Confirm reachability and round-trip. ping -M do -s 1472 sets DF and a payload size to probe MTU manually. |
| traceroute / mtr | Walk the path one hop at a time. mtr runs continuously and shows loss per hop. |
| tcpdump -i any -n | Capture packets at the kernel level. Add -w file.pcap and load it in Wireshark for byte-level inspection. |
| tracepath | PMTUD-aware traceroute. Reports the path MTU as it discovers each hop. |
| ss -t -i | Per-socket info including the path MTU the kernel is using for that connection. |
| bpftrace -e 'kprobe:ip_rcv { ... }' | Hook the kernel's IP receive path directly. The right tool when you've exhausted the usual ones. |
Further reading
- RFC 791 — Internet Protocol (IPv4) — Jon Postel, 1981. The original spec. Short and surprisingly readable; you can get through it in an afternoon.
- RFC 8200 — Internet Protocol, Version 6 — the current IPv6 specification, replacing RFC 2460. Worth reading alongside 4861 (NDP) and 4862 (SLAAC).
- RFC 4632 — Classless Inter-domain Routing — CIDR, the prefix-aggregation mechanism that bought IPv4 another two decades.
- RFC 8201 — Path MTU Discovery for IPv6 — and RFC 1191 for IPv4. The mechanism every TCP connection assumes works.
- RFC 8305 — Happy Eyeballs Version 2 — how dual-stack clients pick between IPv4 and IPv6 without making the user wait for a broken path to time out.
- RFC 6052 — IPv6 Addressing of IPv4/IPv6 Translators — the well-known prefix NAT64 and DNS64 use to encode IPv4 addresses inside IPv6.
- Stevens — TCP/IP Illustrated, Volume 1 — the second edition (Fall and Stevens, 2011) covers IPv6 alongside IPv4 and is still the canonical book for the layer.
- Comer — Internetworking with TCP/IP, Volume 1 — a textbook with more historical and architectural framing. Useful as a complement to Stevens.
- Geoff Huston — The ISP Column — monthly long-form on how the actual internet works at the routing and addressing layer. Indispensable for the political and operational side of IP.