UDP

UDP is the smallest thing you can call a transport. Eight bytes of header, no connection, no retries, no flow control, no congestion control. You hand the kernel a datagram and the kernel hands it to IP. If it arrives, it arrives; if it doesn’t, you find out by not getting a reply. That sounds like a liability and often is — but it’s also what makes UDP the right envelope for DNS, QUIC, real-time video, multiplayer games, NTP, DHCP, and SIP. When reliability has to live in the application, UDP gets out of the way.

What UDP actually is

UDP — the User Datagram Protocol — was published as RFC 768 in August 1980 by Jon Postel. The RFC is three pages long. That isn’t a defect; it’s the whole spec. UDP adds the bare minimum on top of IP: a pair of port numbers so the kernel can route packets to the right socket, a length field, and an optional checksum. Everything else — connections, retries, ordering, flow control, congestion control — is somebody else’s problem.

TCP is what people reach for by default and what most application protocols sit on. UDP is what you reach for when TCP’s guarantees cost more than they’re worth. The interesting workloads on UDP today fall into two camps: protocols where the data is naturally one shot (a DNS query, an NTP request, a DHCP discover), and protocols that build their own reliability on top because the kernel’s TCP doesn’t fit (QUIC, RTP, custom game transports).

Why this matters. Calling UDP "unreliable" is technically true but misses the point. UDP is a substrate. Whether the system on top is reliable depends on what you build. QUIC over UDP gives you the same reliable, ordered delivery that TCP does; a naive UDP echo is exactly as reliable as the link beneath it.

The UDP header — eight bytes

The entire UDP header is four 16-bit fields. Source port, destination port, length, checksum. That’s it. Nothing about sequence numbers, windows, flags, options.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-------------------------------+-------------------------------+
|          Source Port          |       Destination Port        |
+-------------------------------+-------------------------------+
|             Length            |           Checksum            |
+-------------------------------+-------------------------------+
|                          payload  ...                          |
+----------------------------------------------------------------+

Source port and destination port are 16-bit unsigned integers, 0 to 65535. The destination port is the one the kernel uses to find the socket that should receive the datagram; the source port lets the receiver reply. Unlike in TCP, the source port is optional: RFC 768 allows a value of 0 when the sender doesn’t expect a reply.

Length covers the header and the payload together, so the minimum value is 8. The 16-bit field tops out at 65535, but an IPv4 packet is itself capped at 65535 bytes including its own header, so the largest UDP payload over IPv4 is 65535 − 20 − 8 = 65507 bytes. You almost never want to get anywhere near that, for reasons covered below.

Checksum is computed over a "pseudo-header" (parts of the IP header), the UDP header, and the payload. On IPv4 the checksum is optional — a value of 0 means "not computed". This is a holdover from 1980 when CPUs were slow and IP itself had a header checksum, but it’s a long-standing sin we kept: a corrupted IPv4 UDP datagram with checksum 0 will be delivered to your application intact-looking but actually mangled. IPv6 removed the IP header checksum entirely and so made the UDP checksum mandatory; a UDP/IPv6 packet with checksum 0 is usually dropped (RFC 6935 carved out a narrow exception for tunnels).

The pseudo-header trick. The UDP checksum doesn’t cover the real IP header. It covers a synthetic "pseudo-header" containing the IP source and destination addresses, the protocol number, and the UDP length. That way the checksum also catches packets misdelivered to the wrong host. It’s a small detail that NAT and IPsec implementations have to handle carefully, because changing an address in flight invalidates the checksum unless it is recalculated.
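To make the checksum rule concrete, here is a minimal sketch of building a UDP segment and its IPv4 checksum by hand. The function names are illustrative, not from any library, and it assumes IPv4 without IP options. In normal use the kernel or the NIC does this for you.

```python
import socket
import struct


def ones_complement_sum16(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum_ipv4(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over pseudo-header + UDP header + payload.

    `segment` is the UDP header (checksum field zeroed) followed by the payload.
    """
    pseudo = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,                   # reserved zero byte
        socket.IPPROTO_UDP,  # protocol number 17
        len(segment),        # UDP length, repeated in the pseudo-header
    )
    checksum = ~ones_complement_sum16(pseudo + segment) & 0xFFFF
    # On IPv4 a transmitted 0 means "not computed", so a computed 0 goes out as 0xFFFF.
    return checksum or 0xFFFF


def build_udp_segment(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                      payload: bytes) -> bytes:
    """Return the eight-byte header plus payload, checksum filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = udp_checksum_ipv4(src_ip, dst_ip, header + payload)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload
```

A receiver verifies by summing the pseudo-header and the full segment, checksum included; an intact datagram sums to 0xFFFF. Because the addresses are part of the sum, the same bytes delivered to a different destination fail the check.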

UDP and TCP, side by side at the header level

The eight-byte UDP header is roughly the smallest transport header you can imagine. TCP’s base header is 20 bytes — and that’s before any options, which can push it to 60 bytes. The difference is visible at a glance:

Figure: UDP and TCP headers drawn to scale. UDP carries four fields, TCP carries ten plus optional options. The width is the byte count.

Twelve fewer bytes per packet sounds small until you’re shipping 100k packets per second through a NAT box — that’s roughly 1.2 MB/s of pure header overhead saved before you count the work the kernel doesn’t do (no connection table, no retransmit queue, no window calculation, no Nagle timer).

The datagram model

TCP is a byte stream: you call send with 1000 bytes, and the peer might call recv three times and get 400, 300, 300. The kernel reserves the right to split and merge writes; the bytes always arrive in order, but there are no message boundaries.

UDP is a datagram service. One sendto produces one IP packet on the wire (possibly fragmented by IP itself, but reassembled before delivery), and the other side gets exactly one datagram out of recvfrom or it gets nothing. There’s no concept of a partial datagram. If the receive buffer you pass is smaller than the datagram, the excess is discarded: recvmsg reports it by setting MSG_TRUNC in the returned flags, while a plain recvfrom on most systems just hands you a quietly shortened read.

The datagram model means message framing is free. A 200-byte DNS query is one send, one recv. You don’t need length prefixes or delimiters or any of the protocol gymnastics that TCP forces. In exchange you accept that datagrams can be lost, reordered, or duplicated — sometimes all three at once — and that any retransmit logic is yours to write.
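The "one send, one recv" property is easy to see on loopback. The sketch below is illustrative only: the function name is made up, and it relies on loopback being effectively lossless and in-order, which real networks are not.

```python
import socket


def demo_message_boundaries() -> list[bytes]:
    """Send three datagrams over loopback and read them back, one per recvfrom."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    receiver.settimeout(1.0)          # never block forever if a datagram is lost
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for message in (b"first", b"second message", b"3"):
            sender.sendto(message, receiver.getsockname())
        received = []
        for _ in range(3):
            data, _peer = receiver.recvfrom(2048)  # one call, one whole datagram
            received.append(data)
        return received
    finally:
        sender.close()
        receiver.close()
```

Each recvfrom returns exactly the bytes of one sendto, with no length prefix or delimiter in sight. The same three writes on a TCP socket could come back as one 20-byte read.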

TCP hands you a continuous, in-order stream and hides loss. UDP hands you discrete messages and tells you nothing about the ones that vanished.

This is the real fork in the road, and it is worth saying plainly: the question is not "reliable or unreliable", it is who owns reliability. With TCP the kernel owns it. With UDP your application owns it, or nobody does. That ownership is exactly the thing you are choosing when you pick a transport — see the TCP vs UDP comparison for the decision laid out side by side, and the TCP deep dive for what the kernel is doing on your behalf when you let it.

Atomic, not reliable. Each datagram is atomic in the sense that it arrives whole or not at all. That’s not the same as reliable. A 4 KB UDP datagram that triggers IP fragmentation will be reassembled by the receiver kernel only if every fragment arrives; lose one fragment and the whole datagram is dropped, usually with no signal at all.

When you actually reach for UDP

Four recurring shapes of workload make UDP the right answer:

Latency matters more than reliability. A real-time voice call runs at 50 packets per second of 20 ms audio frames. If one frame is lost, you don’t want to retransmit it — by the time the retransmit arrives, that audio is ancient history and would just make the call sound choppy. Concealing the lost frame (interpolation, packet-loss concealment in the codec) is strictly better than waiting for it. Online games make the same trade for position updates: a stale position is worse than a missing one. This is the whole logic of real-time communication: when data has a short shelf life, dropping it beats waiting for it.

The data is self-describing and idempotent. A DNS query is a few hundred bytes that says "what is the A record for example.com". The response is a few hundred bytes that says "93.184.216.34". If the query or response is lost, you re-ask — no state to recover, no half-sent message, no connection to tear down. Setting up a TCP connection to ask a one-shot question is overkill; the handshake alone would double the latency.

You will handle reliability yourself. QUIC, the transport beneath HTTP/3, is implemented in userspace and runs over UDP. It does retransmission, ordering, flow control, congestion control, and TLS — but it does them in the application, not in the kernel. Sitting on UDP lets a QUIC deployment ship a new congestion controller every few months without waiting for a kernel upgrade on every machine in the world.

You can’t afford a handshake. NTP queries, DHCP discovery, SNMP polls, classic RPC frameworks like the original Sun RPC: these are all single-message exchanges where the cost of a TCP handshake would swamp the work. UDP makes a one-shot request a one-shot request.

Why there is no congestion control

TCP’s congestion control isn’t an optional extra. It’s the reason the internet doesn’t collapse. Every TCP sender starts at a tiny congestion window, grows it until it sees loss, backs off, and grows it again. The collective effect is that flows share a bottleneck link roughly fairly and the queue at that bottleneck doesn’t grow without bound.

UDP has no equivalent. A program calling sendto 10 million times a second will happily try to put 10 million packets a second on the wire, no matter what’s happening downstream. The kernel will eventually start dropping from its own send queue once it fills, but UDP itself does nothing to slow the application down. Two consequences:

First, if you blast UDP at 10 Gbps you’ll just blast it. There’s no backoff signal that magically reaches the sender. A misconfigured UDP sender can saturate a link and starve every TCP flow sharing it, because the TCP flows will back off politely while the UDP flow does not. This is why most ISPs and exchanges either rate-limit or de-prioritise unclassified UDP at peak times.

Second, the IETF has a long-standing best-current-practice document — RFC 8085, "UDP Usage Guidelines" — that essentially says: if you build something on UDP, you must implement congestion control or a strict rate cap yourself. Real-world implementations: QUIC ships a TCP-equivalent congestion controller (CUBIC, BBR, etc.) at the application layer; WebRTC and most video-call stacks use RTP/RTCP feedback to ratchet down their encoder bitrate when loss is observed; Google’s QUIC deployments famously experiment with new congestion controllers on UDP because they don’t have to ship a kernel patch to do so.

The non-obvious rule. "UDP doesn’t do congestion control" is not the same as "your UDP-based application doesn’t have to". A production UDP service that doesn’t cap its sending rate is one outage away from being labelled abusive by every peering link it touches.
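The simplest form of the "strict rate cap" mentioned above is a token bucket in front of sendto. This is a minimal sketch with hypothetical names. It caps the average rate but does not react to loss, so it is a floor for good behaviour and not a congestion controller.

```python
import time


class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock      # injectable so the bucket can be tested without sleeping
        self.tokens = burst     # start with a full bucket
        self.last = clock()

    def try_consume(self, nbytes: int) -> bool:
        """Return True and spend tokens if `nbytes` may be sent right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A sender would call try_consume(len(payload)) before each sendto and either queue or drop the datagram when it returns False. Which of the two is right depends on whether the data goes stale.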

GSO and GRO — how Linux makes UDP fast

UDP looks simple, and at low rates it is — a quick lookup into the per-socket receive queue and you’re done. At high rates it stops being simple. A QUIC edge server handling a million concurrent connections might process more than a million UDP packets per second per core. Each packet costs a syscall round-trip, a kernel buffer allocation, a netfilter walk, a checksum, and a socket lookup. Multiply by a million and the core is gone.

Linux’s answer is segmentation and reassembly offload, originally built for TCP and extended to UDP around kernel 4.18 (2018).

UDP-GSO (Generic Segmentation Offload) lets an application submit one giant "super-packet" of up to 64 KB of payload and tell the kernel "segment this into chunks of N bytes". The kernel walks the socket layer once, runs the netfilter hooks once, and either splits into MTU-sized UDP datagrams itself at the very end of the stack or — better — hands the super-packet to the NIC and lets the hardware do the slicing. Either way the per-packet cost of getting into the stack is paid once.

UDP-GRO (Generic Receive Offload) is the inverse on the receive side. The NIC or kernel collects multiple incoming UDP packets that share the same 4-tuple and pass certain heuristic checks, glues them into one big "super-packet", and pushes that up the stack as a single delivery. The application gets back a single datagram of, say, 60 KB, and a hint telling it how to slice it back into individual messages.

A QUIC server can use both: GSO to write outgoing batches, GRO to read incoming batches, with a corresponding cut in per-packet CPU. The Cloudflare and Google QUIC deployments both lean on this heavily. You can check whether your NIC and kernel support hardware UDP segmentation with:

# show offload features
ethtool -k eth0 | grep udp

# enable hardware UDP segmentation
sudo ethtool -K eth0 tx-udp-segmentation on

# send with GSO from userspace (Linux)
setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof gso_size);

Why this matters. Without UDP GSO/GRO, a QUIC server tops out at maybe 200–300k packets per second per core on modern hardware. With GSO/GRO enabled, the same core can push past 1 million pps. That’s the difference between needing four servers per edge and needing one.
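The slicing that GSO performs on send, and that the application has to undo after a GRO receive, is plain fixed-size chunking. The sketch below shows only that arithmetic; the function name is hypothetical and it does not touch the socket API, so the segment size is assumed to have been obtained already.

```python
def split_coalesced(buffer: bytes, segment_size: int) -> list[bytes]:
    """Slice a coalesced buffer back into individual datagram payloads.

    Every segment is `segment_size` bytes long except possibly the last,
    which carries whatever is left over.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        buffer[offset:offset + segment_size]
        for offset in range(0, len(buffer), segment_size)
    ]
```

For example, a 4100-byte buffer with a segment size of 1200 comes back as three 1200-byte datagrams and one of 500 bytes. On the send side the kernel or NIC performs the same split, after the stack has been traversed once for the whole buffer.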

Datagram size limits and fragmentation

The UDP length field is 16 bits, so the maximum datagram size is 65535 bytes (65507 of payload after subtracting the 8-byte UDP header and the 20-byte IPv4 header). You can send a datagram that big and it will work, but you almost never want to. Here is why.

The Ethernet MTU is 1500 bytes. Subtract the 20-byte IPv4 header and the 8-byte UDP header and you have 1472 bytes of UDP payload that fits in a single IP packet on a normal LAN. Send anything larger and IP must fragment the datagram into multiple smaller packets, each carrying a slice of the original UDP payload. The receiver kernel reassembles them before handing the result to the socket.

Reassembly is where it gets ugly. If any single fragment is lost or reordered beyond a small window, the whole datagram is dropped — silently, from the application’s perspective. Worse, fragmentation interacts badly with NAT (only the first fragment carries the UDP header with the port number; later fragments are just IP fragments and the NAT box can’t tell where they belong), with stateful firewalls (which often drop non-initial fragments by default), and with security middleboxes (which can’t see into a fragment).

Safe ceiling numbers to keep in mind:

Path	Safe UDP payload	Why
LAN, IPv4, no tunnels	1472 bytes	1500 MTU − 20 IPv4 − 8 UDP
Internet, IPv4	1432–1452 bytes	Allow for IPsec, GRE, PPPoE encapsulation
Internet, IPv6	1232 bytes	1280 minimum MTU − 40 IPv6 − 8 UDP
DNS over UDP, classic	512 bytes	RFC 1035 hard limit
DNS over UDP, EDNS0	up to 4096 bytes (often 1232 in practice)	RFC 6891 negotiation

DNS deserves a footnote. The original 1987 spec capped UDP responses at 512 bytes; if the answer didn’t fit, the server set the TC (truncated) bit and the client was supposed to retry over TCP. EDNS0 in 1999 let clients advertise a larger buffer (most resolvers offer 4096), but real-world middleboxes drop fragmented DNS often enough that the DNS Flag Day 2020 settled on a default of 1232 bytes — small enough to fit in one IPv6 packet, no fragmentation, no middlebox surprises.
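The payload ceilings and fragment counts in this section all come from the same subtraction. Here is a small sketch of it, assuming an IPv4 header with no options; the function and constant names are illustrative.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40
UDP_HEADER = 8


def max_unfragmented_payload(mtu: int, ipv6: bool = False, encap_overhead: int = 0) -> int:
    """Largest UDP payload that fits in a single IP packet on a path with this MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - encap_overhead - ip_header - UDP_HEADER


def ipv4_fragment_count(payload_len: int, mtu: int = 1500) -> int:
    """Number of IPv4 packets needed to carry one UDP datagram."""
    ip_payload = UDP_HEADER + payload_len  # the UDP header rides in the first fragment
    room = mtu - IPV4_HEADER               # IP payload bytes available per packet
    if ip_payload <= room:
        return 1                           # fits, no fragmentation
    per_fragment = room // 8 * 8           # non-final fragments carry a multiple of 8 bytes
    # Ceiling division without floats: the final packet holds up to `room` bytes.
    return 1 + -(-(ip_payload - room) // per_fragment)
```

With a 1500-byte MTU, a 1472-byte payload needs one packet, 1473 bytes needs two, and a 4096-byte payload needs three. Any one of those fragments going missing costs the whole datagram.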

A real-world UDP server in Python

The Python sockets API for UDP is even smaller than for TCP. There’s no listen, no accept, no per-connection FD. You bind a socket, call recvfrom in a loop, and reply with sendto to whichever address the datagram came from.

# echo_udp.py — receive a datagram, send it back to its sender
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# SO_REUSEPORT lets multiple processes bind the same port; the
# kernel hashes 4-tuples across them — the right way to scale a
# UDP server across cores.
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)

s.bind(("0.0.0.0", 5300))
print("listening on udp/5300")

while True:
    data, peer = s.recvfrom(65535)         # one datagram, whole or not at all
    print(f"got {len(data)} bytes from {peer!r}")
    s.sendto(data, peer)                   # reply to the same address

Try it: in one terminal run python3 echo_udp.py, in another echo "hello" | nc -u -w1 127.0.0.1 5300. (The server binds IPv4 only, and on some systems localhost resolves to ::1 first, so the explicit address is safer.) You’ll see the message arrive, get echoed, and the netcat client print it back. There is no "connection". There is no "session". The server has no idea whether the same client will send another datagram a millisecond later or never again.

A few production-grade additions are worth knowing. SO_REUSEPORT (shown above) lets you run N copies of the server on the same port and let the kernel shard load across them by hashing the 4-tuple; this is how high-end UDP services scale with the number of cores. recvmmsg and sendmmsg (note the double-m) let you submit up to ~1000 datagrams in a single syscall, which together with GSO/GRO is how QUIC implementations push real packet rates. And IP_PKTINFO lets your server learn which local address and interface a datagram arrived on, which matters when one socket is bound to 0.0.0.0 but needs to reply with the right source address.

How DNS uses UDP

DNS is the canonical UDP success story. A typical query is 30–80 bytes; a typical response is 100–400 bytes. Setting up a TCP connection to ask for an A record would cost a round-trip handshake before the query even left, doubling the latency you feel. With UDP, the resolver sends one datagram and waits for one datagram back.

The retransmission strategy is built into the resolver, not the protocol. The classic behaviour: send a query, wait ~5 seconds, retry against the same server once, then try another server, then give up. Modern resolvers are more aggressive — BIND and Unbound will try parallel queries against multiple authoritative servers and take the first answer. Either way the timer lives in userspace, because UDP doesn’t have one.
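The classic timer-and-retry behaviour fits in a few lines. This is a minimal sketch, not how any particular resolver is implemented: the function name and defaults are assumptions, and a real resolver would also match the reply’s query ID against the one it sent.

```python
import socket


def query_with_retries(payload: bytes, servers, timeout: float = 5.0,
                       attempts_per_server: int = 2):
    """Send `payload` to each (host, port) in turn, retrying on timeout.

    Returns (reply, server) for the first answer, or raises TimeoutError.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for server in servers:
            for _ in range(attempts_per_server):
                sock.sendto(payload, server)
                try:
                    reply, peer = sock.recvfrom(4096)
                except socket.timeout:
                    continue          # lost query or lost reply: we cannot tell which
                if peer == server:    # ignore datagrams from unexpected addresses
                    return reply, server
    raise TimeoutError("no server answered")
```

The whole retransmission policy (how long to wait, how many tries, which server next) is ordinary application code, which is why different resolvers can behave so differently over the same protocol.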

Truncation is the other half. If a server has more data to send than fits in the negotiated UDP buffer, it sends what it can with the TC bit set in the DNS header. The client sees TC and immediately retries the same query over TCP. This is rare for normal queries but common for DNSSEC, where signature records can push responses well past 512 bytes.
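To show how little there is to a DNS message at the byte level, here is a sketch that encodes a one-question query and checks the TC bit on a response. The helper names are hypothetical, and the encoder skips validation (label length limits, non-ASCII names) that a real client needs.

```python
import struct


def build_dns_query(name: str, query_id: int, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A record, class IN)."""
    flags = 0x0100  # standard query with "recursion desired" set
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # Each label is a length byte followed by its characters; a zero byte ends the name.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def is_truncated(message: bytes) -> bool:
    """True if the TC bit (0x0200 in the header flags word) is set."""
    (flags,) = struct.unpack_from("!H", message, 2)
    return bool(flags & 0x0200)
```

A query for example.com encodes to 29 bytes: a 12-byte header plus a 17-byte question. A client that sees is_truncated(response) return True would repeat the same query over TCP.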

DNS-over-HTTPS (DoH, RFC 8484) and DNS-over-TLS (DoT, RFC 7858) move DNS off UDP entirely, onto TCP/TLS or HTTP/2. The latency cost is real (an extra handshake on cold connections, larger payloads), but the operational benefits (privacy, no middlebox interference, no fragmentation worries, blends into normal web traffic) usually win, especially on client-to-resolver paths where the connection is long-lived. DNS-over-QUIC (DoQ, RFC 9250) brings DNS back to UDP, but as a QUIC stream rather than a raw datagram.

How games use UDP

Multiplayer games settled on UDP early. The architecture that John Carmack ironed out for QuakeWorld in 1996 — client prediction, server-side reconciliation, fire-and-forget position updates at ~30 Hz — is still the textbook approach today.

The shape: the client samples input (move forward, look left, fire) and sends the latest input state to the server roughly 30–60 times per second. The server runs the authoritative simulation, computes everyone’s state, and sends each client a snapshot of nearby entities, also at 30–60 Hz. Each snapshot is a single UDP datagram, ideally well under the MTU. If one is lost, the next one arrives 16–33 ms later carrying fresher state. Retransmitting a stale position would be worse than useless — it would arrive after the next snapshot and confuse the client.

On top of this the client does interpolation (smoothing between snapshots) and prediction (running the simulation locally so your own movement feels instant), and the server does reconciliation (comparing what the client predicted against the authoritative state and correcting drift). Modern engines (Unreal’s networking, Unity’s Netcode, Valve’s Source, Riot’s custom stack for Valorant) all share this skeleton. The reliable channel — chat messages, inventory changes, level loads — is usually a separate stream on the same UDP socket, with sequence numbers and ACKs reimplemented on top.
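The "a stale position is worse than a missing one" rule usually comes down to a sequence number on each snapshot and a filter on the receiving side. The sketch below assumes a 16-bit sequence field that wraps around; the class name and the field width are illustrative, not taken from any particular engine.

```python
class LatestSnapshotFilter:
    """Accept a snapshot only if its 16-bit sequence number is newer than the last applied."""

    def __init__(self):
        self.last_seq = None

    def accept(self, seq: int) -> bool:
        if self.last_seq is not None:
            # Distance forward from the last applied snapshot, modulo 2**16.
            ahead = (seq - self.last_seq) & 0xFFFF
            if ahead == 0 or ahead >= 0x8000:
                return False  # duplicate, or older than what we already applied
        self.last_seq = seq
        return True
```

Gaps are simply skipped: if snapshot 11 is lost and 12 arrives, the client applies 12, and a late-arriving 11 is discarded instead of rewinding the world. The modular comparison keeps this working when the counter wraps from 65535 back to 0.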

When to pick UDP over TCP for a game. If a single dropped packet would force the protocol to stall waiting for retransmit (TCP’s head-of-line blocking), you can’t use TCP for the realtime channel. A position update from 100 ms ago blocking the next one from being processed is catastrophic for feel. UDP sidesteps the problem entirely.

What sits on UDP today

UDP is the substrate for a surprising number of foundational protocols. A rough map:

Figure: one thin envelope, many protocols. Some use UDP for one-shot exchanges, some rebuild reliability on top.

Some of those protocols (DNS, DHCP, NTP) use UDP because they fit a one-shot request-response shape and can’t afford a handshake. Some (RTP/RTCP for WebRTC media, custom game traffic) use UDP because they need real-time delivery and rebuilt loss handling at the application layer. And a growing share — notably QUIC — use UDP as a generic, kernel-bypass-friendly envelope while rebuilding everything TCP gave them in userspace.

How modern QUIC sits on UDP

QUIC is the most interesting thing on UDP today. Standardised as RFC 9000 in 2021 (after a long Google deployment under the same name from 2012), QUIC is a connection-oriented, reliable, ordered, multi-stream, encrypted transport. In practice it’s a TCP+TLS replacement. But it runs entirely in userspace, on top of UDP.

The reasoning is practical. TCP lives in the kernel; shipping a new congestion-control algorithm or a new TLS feature means a kernel patch, followed by years of waiting for it to roll out across the internet. Middleboxes (NATs, carrier-grade NATs, "TCP accelerators") have ossified what TCP can do: adding a new TCP option often gets the packet dropped because some box decided it was suspicious. UDP is featureless enough that middleboxes mostly pass it through unchanged.

By moving the connection state, the reliability logic, the congestion controller, and the TLS handshake into a userspace library, QUIC deployments (Google’s, Cloudflare’s, Meta’s, the major CDNs) can iterate on transport behaviour at the cadence of an application deploy. That’s why HTTP/3 — which is just HTTP semantics over QUIC streams — exists. The UDP envelope underneath it isn’t doing much; it’s providing the eight bytes of port and length and otherwise getting out of the way.

QUIC is a long story in its own right — 0-RTT resumption, stream multiplexing without head-of-line blocking, connection migration across IP changes — and it’s the next deep dive in this stack. The QUIC page picks up exactly where this one stops.

Common mistakes

Assuming UDP is reliable on a LAN. On an idle gigabit switch, dropped UDP datagrams are vanishingly rare and a sloppy service appears to work fine. The first time the switch sees a microburst, or the first time the receive queue overflows, or the first time someone enables QoS marking, you discover the gap. If correctness depends on every datagram arriving, you needed retries from day one.
Sending datagrams larger than the path MTU. A 4 KB UDP datagram fragments into three IP packets. Lose any one and the receiver drops the whole thing silently. Worse, lots of stateful firewalls drop non-initial fragments by default, so the loss may not even happen on a network you control; it can be halfway across the internet. Keep payloads at or below 1232 bytes (IPv6-safe) or 1472 bytes (IPv4-LAN) and you’ll never have to debug this.
Forgetting to handle EAGAIN on a non-blocking socket. When you set O_NONBLOCK, recvfrom returns EAGAIN/EWOULDBLOCK if there’s no datagram waiting. New code sometimes treats this as a fatal error and crashes the server under any load that empties the queue between epoll events. It’s normal; loop back to epoll.
Ignoring RFC 8085 and shipping without a rate cap. A UDP service that sends as fast as it can is a denial-of-service generator waiting to happen — for the network and for yourself. At minimum, cap your per-flow rate. If you’re doing anything sustained, implement a real congestion controller (or use a library that already has one).
Using one socket and one thread for the whole server. A single UDP socket on a single thread is a packet-rate bottleneck — you’ll cap at maybe 100k packets per second per core. SO_REUSEPORT with one socket per CPU is usually a one-line change for a 10x throughput improvement.
Trusting the source address. The source IP and port in a UDP datagram come from the sender, who can lie. Without DTLS, QUIC, or some application-layer authentication, you can’t prove that the datagram you’re replying to actually came from the address it claims. This is what makes UDP attractive for reflection-amplification attacks — and what makes any UDP service that returns a much larger response than it received a future participant in one.
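For the EAGAIN mistake in the list above, the fix is to treat "nothing to read" as the normal end of a drain loop. In Python that condition surfaces as BlockingIOError. This is a minimal sketch with a hypothetical helper name; the cap on datagrams per call is an arbitrary choice to keep one busy socket from starving the others.

```python
import socket


def drain(sock: socket.socket, max_datagrams: int = 64) -> list[tuple[bytes, tuple]]:
    """Read the datagrams currently queued on a non-blocking UDP socket."""
    received = []
    for _ in range(max_datagrams):
        try:
            received.append(sock.recvfrom(65535))  # (data, peer_address)
        except BlockingIOError:
            break  # EAGAIN/EWOULDBLOCK: the queue is empty, return to the event loop
    return received
```

An event loop would call drain(sock) each time the socket is reported readable, handle whatever comes back, and go straight back to waiting. An empty result is not an error.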
