TCP

TCP turns IP's "best-effort packet delivery" into "ordered, retransmitted, congestion-aware byte streams". Most of the production behaviour that surprises people — slow connection ramp-up, sudden retransmit storms, sockets stuck in TIME_WAIT — is TCP doing exactly what the spec asks. Reading the spec is what makes it click. This page is the working summary.

What TCP actually guarantees

IP gives you one thing: it will try to deliver a packet to an address. It makes no promise that the packet arrives, that it arrives once, that it arrives in order, or that what arrives matches what was sent. Packets can be dropped by a full router queue, duplicated by a retransmitting link, reordered by taking different paths, or corrupted in transit. TCP sits on top of that and presents the application with something far easier to program against: a reliable, ordered stream of bytes that flows in both directions.

Three words in that sentence carry the weight. Reliable means every byte you write either arrives or the connection breaks and you find out; TCP never silently loses data. Ordered means the bytes come out the far end in exactly the order they went in, even if the packets carrying them took different routes and showed up scrambled. Byte stream means there are no message boundaries: if you write "hello" then "world", the reader may receive "hello", or "helloworld", or "hel" then "loworld". TCP preserves the sequence of bytes, not the framing of your writes. If your protocol needs messages, you add length prefixes or delimiters yourself. Forgetting this is one of the most common bugs in hand-rolled network code: people assume one send() equals one recv(), and it does not.
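A minimal sketch of length-prefixed framing, assuming a 4-byte big-endian length header (one common choice, not something TCP defines). The helper names are hypothetical. The point is the loop in recv_exact: recv() may return fewer bytes than asked for, so the reader keeps calling until it has the whole message.

```python
import socket
import struct

# Assumed framing: each message is preceded by its length as a 4-byte big-endian integer.
HEADER = struct.Struct("!I")


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Write one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock: socket.socket) -> bytes:
    """Read one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

With these helpers, two send_msg() calls always come back as two recv_msg() results, however the bytes were split or merged on the wire.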

Everything below is the machinery that produces those three guarantees over a network that offers none of them. Sequence numbers and acknowledgements give ordering and detect loss; retransmission gives reliability; the receive window stops a fast sender from drowning a slow receiver; congestion control stops every sender together from drowning the network.

The three-way handshake

Before either side sends any data, they exchange three packets that synchronise sequence numbers and confirm the path is open in both directions. The reason it takes three and not two is that each side has to show it can both send and receive, and each side has to learn the other's starting sequence number:

Client                                  Server
  | --- SYN  seq=x ----------------------> |   active open
  | <-- SYN-ACK seq=y, ack=x+1 ----------- |   passive open
  | --- ACK  ack=y+1 --------------------> |   established
  |                                        |
  |     ... data ...                       |

The client picks a random initial sequence number x and sends a SYN. The server picks its own random y, acknowledges the client's number with ack=x+1, and sends its own SYN in the same packet; that is the SYN-ACK. The client acknowledges the server's number with ack=y+1, and the connection is established. The initial numbers are randomised on purpose: a predictable starting sequence lets an off-path attacker inject packets into someone else's connection, so the kernel derives them from a clock plus a keyed hash of the four-tuple and a secret (RFC 6528).

Three packets, one round-trip. Each side moves through its own states, and data can only flow once both reach ESTABLISHED.
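The sequence arithmetic in the exchange can be sketched as below. This is a toy model with hypothetical names: it only shows how each ack is the peer's initial sequence number plus one, modulo 2^32, and says nothing about options, timers, or state transitions.

```python
MOD = 2**32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list[dict]:
    """Return the three handshake segments as plain dicts."""
    syn = {"flags": "SYN", "seq": client_isn}
    syn_ack = {
        "flags": "SYN-ACK",
        "seq": server_isn,
        "ack": (client_isn + 1) % MOD,  # the SYN consumes one sequence number
    }
    ack = {
        "flags": "ACK",
        "seq": (client_isn + 1) % MOD,
        "ack": (server_isn + 1) % MOD,
    }
    return [syn, syn_ack, ack]
```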

The cost is one full round-trip before any data can flow, and that round-trip is paid on every new connection. TCP Fast Open (RFC 7413) lets the client include data in the initial SYN if it has a server-issued cookie from a previous connection, saving the round-trip. It is supported but rare in practice, partly because middleboxes mangle unfamiliar SYNs. QUIC's 0-RTT is the real successor — see the QUIC deep dive.

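A connection that has completed the handshake sits in the kernel's accept queue until the application calls accept(). If that queue overflows, Linux by default drops the client's final ACK, the ListenOverflows counter (TcpExtListenOverflows in nstat) ticks, and clients see timeouts, or RSTs if net.ipv4.tcp_abort_on_overflow is set. On listening sockets, ss -lnt shows the current queue depth in Recv-Q and the configured backlog in Send-Q; check them if connections stall or are reset under load.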

The state machine

Eleven states. Most of them you'll never see. Four matter:

State	Meaning	What it tells you
ESTABLISHED	Connection open, data flowing	The happy path
TIME_WAIT	You closed; waiting 2×MSL before reusing the port	Many of these = many short connections; usually fine
CLOSE_WAIT	Peer closed; you haven't called close()	Application bug. File descriptor leak.
FIN_WAIT_2	You closed; peer hasn't yet	Peer is slow or buggy; usually transient

The TIME_WAIT story is worth knowing. After your side closes, the kernel keeps the socket around for about 60 seconds (nominally 2 × MSL = 2 × 30s; Linux hard-codes the 60 s) so it can absorb any straggling packets from the closed connection. A web server closing a few thousand short connections per second can hold a hundred thousand or more TIME_WAITs at any moment (close rate × 60 s). This is normal. Tuning net.ipv4.tcp_tw_reuse can help in a few specific cases; tcp_tw_recycle was always a footgun and was removed in Linux 4.12.

CLOSE_WAIT is your problem. The peer has sent its FIN and your kernel has acknowledged it, but your application has not yet called close(). Sockets in CLOSE_WAIT consume memory and a file descriptor and never recover on their own. It is almost always an app bug: usually "forgot to close in a code path" or "library that doesn't close on error".
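One way to avoid leaking sockets into CLOSE_WAIT is to treat an empty recv() as the peer's FIN and to close in a finally block so the error path closes too. A minimal sketch with a hypothetical helper name:

```python
import socket


def read_until_peer_closes(sock: socket.socket) -> bytes:
    """Read until EOF, then close our side so the socket can leave CLOSE_WAIT."""
    received = bytearray()
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # b"" means the peer has finished sending
                break
            received += chunk
    finally:
        sock.close()  # runs on the error path as well
    return bytes(received)
```

The same effect can be had with a `with sock:` block; the essential part is that no code path returns without closing.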

Sequence numbers and acknowledgements

Every byte of every TCP stream has a 32-bit sequence number. The sender includes the sequence number of the first byte in each segment; the receiver acknowledges the sequence number of the next byte it expects. ACKs are cumulative: an ack=5000 means "I have everything up to and including byte 4999, send me 5000 next." A single ACK can therefore confirm many segments at once, and a lost ACK is harmless as long as a later one gets through, because the later number covers everything the earlier one did.

This is the entire basis for both ordering and loss detection. If segments arrive out of order, the receiver holds the later ones in a buffer and waits for the gap to fill before handing anything above the gap to the application — that is what produces in-order delivery from out-of-order packets. And because the acknowledged number stops advancing the moment a byte goes missing, the sender can tell from the ACK stream alone that something was lost, without any explicit "I didn't get it" message.
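The buffering and cumulative-ACK behaviour can be sketched as a toy receiver. It assumes segments never partially overlap and ignores sequence wraparound and window limits; the class and method names are hypothetical.

```python
class Receiver:
    """Toy in-order reassembly with cumulative ACKs."""

    def __init__(self, initial_seq: int) -> None:
        self.next_expected = initial_seq
        self.out_of_order: dict[int, bytes] = {}  # seq -> payload held above a gap
        self.delivered = bytearray()  # what the application would see

    def on_segment(self, seq: int, payload: bytes) -> int:
        """Accept one segment and return the cumulative ACK number to send back."""
        if seq > self.next_expected:
            self.out_of_order[seq] = payload  # hold it until the gap fills
        elif seq == self.next_expected:
            self._deliver(payload)
            # Drain buffered segments that have become contiguous.
            while self.next_expected in self.out_of_order:
                self._deliver(self.out_of_order.pop(self.next_expected))
        # seq < next_expected is a duplicate: drop it and re-send the same ACK.
        return self.next_expected

    def _deliver(self, payload: bytes) -> None:
        self.delivered += payload
        self.next_expected += len(payload)
```

While a gap is open, every call returns the same ACK number, which is exactly the stalled ACK stream the sender uses to infer loss.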

Reliability — retransmission, RTO, and fast retransmit

There are two ways the sender decides to resend a segment, and they fire on different timescales. The first is the retransmission timeout (RTO). When the sender transmits a segment it starts a timer; if no ACK covering that segment arrives before the timer fires, it assumes the segment was lost and resends it. The timeout is not a fixed value — TCP continuously measures the round-trip time and computes the RTO from a smoothed average plus a margin for variance (RFC 6298). If the network's latency jumps around, the RTO grows to avoid spurious resends. Each time a segment times out and is resent, the RTO doubles (exponential backoff), so a path that has gone dark backs off rather than hammering it.
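The RTO calculation from RFC 6298 can be sketched as follows. The constants (alpha = 1/8, beta = 1/4, K = 4, a 1-second initial and minimum RTO) come from the RFC; the class name, the 60-second cap, and the 1 ms clock granularity are assumptions for illustration. Real stacks differ in detail, and Linux uses a 200 ms floor.

```python
class RtoEstimator:
    """Smoothed RTT plus variance margin, with exponential backoff on timeout."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, min_rto: float = 1.0, max_rto: float = 60.0,
                 granularity: float = 0.001) -> None:
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.granularity = granularity  # assumed clock granularity G, in seconds
        self.srtt: float | None = None
        self.rttvar: float | None = None
        self.rto = 1.0  # initial RTO before any measurement

    def on_rtt_sample(self, r: float) -> None:
        """Fold one round-trip measurement (seconds) into the estimate."""
        if self.srtt is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        candidate = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(candidate, self.min_rto), self.max_rto)

    def on_timeout(self) -> None:
        """Exponential backoff: double the RTO, up to the cap."""
        self.rto = min(self.rto * 2, self.max_rto)
```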

Waiting for a timeout is slow (the RTO has a floor of 200 ms on Linux and one second in RFC 6298, usually far longer than the actual round-trip), so TCP has a faster path. If a single segment goes missing but later segments keep arriving, the receiver keeps sending the same acknowledgement number (the first byte of the gap) over and over. These are duplicate ACKs. On the third duplicate ACK, the sender concludes the segment was lost rather than merely delayed and retransmits it immediately, without waiting for the timer. This is fast retransmit (RFC 5681), and it is the common case for isolated loss on an otherwise flowing connection. The timeout is the fallback for when so much is lost that there is no returning ACK stream to count duplicates in.
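The duplicate-ACK rule can be sketched as a counter over the incoming ACK numbers. This is a simplification with hypothetical names: a real stack also checks that the ACK carries no data and does not change the advertised window before counting it as a duplicate.

```python
DUP_ACK_THRESHOLD = 3  # third duplicate triggers fast retransmit


def fast_retransmits(acks: list[int]) -> list[int]:
    """Return the sequence numbers resent via fast retransmit for an ACK stream."""
    last_ack = None
    dup_count = 0
    triggered = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                triggered.append(ack)  # resend the segment starting at this byte
        else:
            last_ack = ack
            dup_count = 0
    return triggered
```

Note that three duplicates means four identical ACK numbers in a row: the original plus three repeats.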

Selective Acknowledgement (SACK, RFC 2018) sharpens this further. A plain cumulative ACK can only point at the first hole; if bytes 1000–1999 are missing but 2000–5000 arrived, a cumulative ACK still just says "send 1000," and a naive sender might resend everything from 1000 onward. SACK lets the receiver name the ranges it does have, so the sender resends only the actual gaps and leaves the rest alone. Modern Linux negotiates SACK by default (net.ipv4.tcp_sack=1); it matters most on paths with multiple losses per window. If throughput collapses under loss, confirm that neither end nor a middlebox has disabled it.
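How a sender might turn a cumulative ACK plus SACK blocks into a resend list is sketched below. Blocks are treated as half-open ranges (left edge is the first byte held, right edge is one past the last), which matches the edge definitions in RFC 2018; the function name is hypothetical.

```python
def missing_ranges(cumulative_ack: int,
                   sack_blocks: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the [start, end) byte ranges below the highest SACKed byte that are still missing."""
    gaps = []
    cursor = cumulative_ack
    for left, right in sorted(sack_blocks):
        if left > cursor:
            gaps.append((cursor, left))  # hole between what we know was received
        cursor = max(cursor, right)
    return gaps
```

For the example in the text, a cumulative ACK of 1000 with one block covering 2000 through 5000 yields a single gap, 1000 up to 2000, and nothing else is resent.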

Flow control — the receive window

Reliability handles loss in the network. Flow control handles a different problem: the receiver might be slower than the sender. If a server streams data faster than the client application reads it, the client's kernel buffer fills up, and without a brake the sender would keep transmitting bytes the receiver has nowhere to put. TCP's brake is the receive window. In every ACK, the receiver advertises how much free buffer space it has left — "I can accept this many more bytes right now." The sender is not allowed to have more unacknowledged data in flight than that advertised window. As the application drains the buffer, the window reopens and the sender may continue.

The window is the span of bytes the sender may have outstanding. As bytes are acknowledged on the left, the window slides right and new bytes become sendable.

There is a subtlety on high-bandwidth, high-latency paths. The window field in the TCP header is only 16 bits, capping it at 65,535 bytes, which is far too small to fill a fast long-distance link. The window scaling option (RFC 7323) fixes this by negotiating a left-shift factor at handshake time, multiplying the effective window up to a gigabyte. It is on by default everywhere modern, but it is negotiated only in the SYN, so a middlebox that strips the option from the handshake silently caps the connection at 64 KB and tanks throughput on long paths. That failure mode is worth knowing because it looks like a slow server when it is really a broken option.
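The arithmetic behind that failure mode is short enough to check directly. A sketch with hypothetical function names: the effective window is the 16-bit field shifted left by the negotiated factor (at most 14), and a sender can move at most one window per round-trip.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit header field
MAX_SHIFT = 14                # largest shift RFC 7323 allows


def effective_window(advertised: int, shift: int) -> int:
    """Window in bytes after applying the negotiated scale factor."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("window scale shift must be between 0 and 14")
    return advertised << shift


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: one full window per round-trip."""
    return window_bytes * 8 / rtt_seconds
```

With the option stripped, an 80 ms path tops out near 6.5 Mbit/s (65,535 bytes per round-trip) no matter how fast the link is.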

Two windows, one cap. The receive window protects the receiver; the congestion window (next section) protects the network. The sender may have in flight no more than the smaller of the two. If throughput is low, the first question is which window is the binding constraint — ss -tin shows both.

Congestion control — slow start and AIMD

Flow control keeps one sender from overrunning one receiver. Congestion control solves the harder collective problem: keeping every sender on a shared network from overrunning the routers in the middle. There is no central coordinator. Each connection has to guess, on its own, how much the path can carry, and adjust as conditions change. The signal it uses for that guess is loss. When a router's queue fills, it drops packets, and TCP reads a drop as "I am sending too fast" and backs off. The whole scheme is built around the assumption that loss means congestion.

The sender keeps a congestion window (cwnd) — its own private estimate of how many bytes the network can absorb — separate from the receive window the other side advertises. A new connection has no idea what the path can carry, so it does not start fast. Slow start begins with a small window (Linux uses 10 segments, about 14.6 KB) and doubles it every round-trip. Doubling sounds gentle but it is exponential, so the window climbs quickly toward the path's capacity. The first phase of every connection is this ramp, which is why a fresh connection is slow even on a fast link.

Slow start cannot double forever. Once the window reaches a threshold (ssthresh) or the first loss appears, the sender switches to congestion avoidance, where it grows the window by roughly one segment per round-trip instead of doubling it. This is the additive-increase half of AIMD: additive increase, multiplicative decrease. The probing is slow and linear; the response to a loss is sharp. When a loss is detected, the sender cuts the window — classically in half — and resumes the slow linear climb. Repeated, that produces the characteristic sawtooth: a gentle ramp up until a packet drops, an abrupt cut, then another ramp. The connection is constantly nudging upward to find the ceiling and backing off the moment it hits it.

The congestion window over time: an exponential ramp during slow start, then the additive-increase / multiplicative-decrease sawtooth. Each tooth is a probe up to the limit and a cut on loss.
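The sawtooth can be reproduced with a few lines of simulation. This is a Reno-style toy under stated assumptions: the window is counted in whole segments, losses are injected at chosen round-trips rather than arising from a queue, and the function name and default ssthresh are hypothetical.

```python
def simulate_cwnd(rtts: int, loss_at: set[int],
                  initial_cwnd: int = 10, ssthresh: int = 64) -> list[int]:
    """Return the congestion window (in segments) at the start of each round-trip."""
    cwnd = initial_cwnd
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd // 2, 2)    # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per round-trip
        else:
            cwnd += 1                       # congestion avoidance: one segment per round-trip
    return history
```

Plotting the returned list for a long run with periodic losses shows the exponential ramp followed by repeated linear climbs and halvings.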

Two windows compose to set the real limit: the receive window (advertised by the receiver, "I have room for N more bytes") and the congestion window (chosen by the sender, "I think the network can handle N bytes"). The minimum of the two is the actual cap on bytes in flight. Watch this interact with the congestion-control simulator to see the sawtooth form under different loss rates.

How the sender chooses the congestion window — exactly how fast it grows and how hard it cuts — is the congestion control algorithm, and several are in use:

Reno (1990). Slow-start to fill the pipe (cwnd doubles each RTT until a loss); on loss, halve cwnd and grow linearly. Simple, ubiquitous, but conservative on long-fat networks.
CUBIC (2008). A cubic-curve growth function instead of linear, so the window grows faster when you're far from the last loss point. Default on Linux since 2.6.19; still default on most distros today.
BBR (2016). Doesn't use loss as the signal at all. Estimates the bottleneck bandwidth and minimum round-trip time, then paces sends to fill the pipe without filling buffers. Used by Google for YouTube and most of GFE. Better throughput on lossy paths; more aggressive against loss-based competitors, which is occasionally controversial.

How to find out what you're running. sysctl net.ipv4.tcp_congestion_control shows the current default; cat /proc/sys/net/ipv4/tcp_available_congestion_control shows what's available. Per-socket, setsockopt(TCP_CONGESTION). Many production stacks now set BBR on egress for video and large object transfer.

Slow start and the bandwidth-delay product

A new TCP connection doesn't use the full pipe right away. It starts with a small congestion window (Linux default 10 segments, ~14.6 KB) and doubles it each RTT until it hits a loss or the receive window. On a 100 Mbps × 80 ms transcontinental path, that's about 7 RTTs to reach steady state — half a second of ramp-up before you're using the pipe.

Bandwidth-delay product (BDP) = bandwidth × RTT. It's the amount of data "in flight" when the pipe is full:

Path	Bandwidth × RTT = BDP
1 Gbps × 0.5 ms (datacentre)	~62 KB
1 Gbps × 80 ms (cross-country)	~10 MB
10 Gbps × 100 ms (transatlantic)	~125 MB

Your TCP buffers must be at least BDP-sized for the connection to actually fill the path. Linux auto-tunes (net.ipv4.tcp_rmem / tcp_wmem); the default upper bound is usually adequate, but very high-BDP paths benefit from raising it.
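The table rows and the ramp-up estimate can be checked with a short calculation. A sketch with hypothetical function names; it assumes decimal units (1 Gbps = 10^9 bits per second), an initial window of 14,600 bytes, and pure doubling with no loss.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes in flight when the pipe is full."""
    return bandwidth_bps * rtt_seconds / 8


def rtts_to_fill(bdp: float, initial_window: int = 14_600) -> int:
    """Round-trips of doubling needed before the window covers the BDP."""
    window = initial_window
    rtts = 0
    while window < bdp:
        window *= 2
        rtts += 1
    return rtts
```

For the 100 Mbps × 80 ms path, the BDP is 1 MB and the window needs 7 doublings to cover it, which is the roughly half-second ramp described above.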

Connection teardown — FIN and TIME_WAIT

Closing a TCP connection is not symmetric with opening it. Each direction of the stream is shut down independently, because either side might still have data to send after the other has finished. When a side is done sending, it sends a FIN. The peer acknowledges that FIN and may keep sending its own data; when it too is done, it sends its own FIN, which the first side acknowledges. That is four packets in the general case, often collapsed to three when the ACK of one FIN rides along with the other FIN.

Client                                  Server
  | --- FIN ----------------------------> |   client done sending
  | <-- ACK ----------------------------- |
  | <-- FIN ----------------------------- |   server done sending
  | --- ACK ----------------------------> |   -> TIME_WAIT (client)
  |        (wait 2 x MSL, then close)     |

The side that sends the last ACK lands in TIME_WAIT and stays there for twice the maximum segment lifetime — about 60 seconds on Linux. This is not a bug or a leak; it serves two real purposes. It lets the final ACK be retransmitted if it was lost (otherwise the peer would sit in LAST_ACK forever), and it stops a stray, delayed segment from the old connection from being mistaken for data on a brand-new connection that happens to reuse the same four-tuple. The cost is that the port pair is held for that window.

A busy client or proxy opening thousands of short connections per second can accumulate a hundred thousand or more sockets in TIME_WAIT. That is usually harmless, but it can exhaust the local port range when one machine makes many connections to a single destination. The right fixes are net.ipv4.tcp_tw_reuse=1 (safe; lets the kernel reuse a TIME_WAIT socket for a new outbound connection when timestamps make it safe), connection pooling so you stop churning connections, or widening ip_local_port_range. The old tcp_tw_recycle knob was a footgun that broke clients behind NAT and was removed in Linux 4.12; do not reach for it.
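The port-exhaustion limit is simple arithmetic: each connection to the same destination address and port ties up one local port for the TIME_WAIT period. A sketch with a hypothetical function name, assuming no tcp_tw_reuse, a single source address, and the common Linux default port range of 32768 to 60999 (check ip_local_port_range on your own hosts).

```python
TIME_WAIT_SECONDS = 60  # how long Linux holds a socket in TIME_WAIT


def max_new_connections_per_second(port_low: int, port_high: int,
                                   time_wait: float = TIME_WAIT_SECONDS) -> float:
    """Sustained connection rate to one destination before local ports run out."""
    ports = port_high - port_low + 1
    return ports / time_wait
```

With the assumed default range that works out to roughly 470 new connections per second to a single destination, which is why pooling helps far more than any sysctl.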

TIME_WAIT is on the side that closes first. If your servers are drowning in TIME_WAIT, it usually means the server is initiating the close. Having the client close first moves the burden off your fleet, which is one reason HTTP keep-alive and connection reuse matter.

Nagle and delayed ACK — the classic 200ms stall

Nagle's algorithm batches small writes: a segment smaller than a full MSS is held back while any previously sent data is still unacknowledged. Delayed ACK batches acknowledgements: a receiver that gets one segment waits up to ~40-200 ms before ACKing, in case more data arrives (so it can ACK it all together) or the application writes a reply the ACK can ride on.

Each behaviour is sensible on its own. Together, they produce the classic 200 ms stall: the client sends a small segment and holds the next one back under Nagle; the server has no outgoing data to piggyback the ACK on, so it waits for the delayed-ACK timer and finally ACKs; only then does the client send the next segment. RPC over TCP without explicit tuning often hits this.

The fix is one of:

setsockopt(TCP_NODELAY) on the sender — disable Nagle.
setsockopt(TCP_QUICKACK) on the receiver, re-applied after each recv() because the flag is not permanent, to disable delayed ACK for this exchange.
Use a protocol that batches at the application level (HTTP/2, gRPC) so segments are never small.
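The two socket options look like this from Python. A minimal sketch with hypothetical helper names; TCP_QUICKACK is Linux-specific, so the code guards on the constant being present.

```python
import socket


def disable_nagle(sock: socket.socket) -> None:
    """Send small segments immediately instead of waiting for outstanding ACKs."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def recv_quickack(sock: socket.socket, bufsize: int) -> bytes:
    """recv() that re-arms quick ACKs afterwards, since the flag does not persist."""
    data = sock.recv(bufsize)
    if hasattr(socket, "TCP_QUICKACK"):  # Linux only
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return data
```

Disabling Nagle shifts the batching responsibility to the application: write each request with a single send rather than several small ones.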

Head-of-line blocking and why HTTP/3 moved to QUIC

The in-order guarantee that makes TCP easy to program against has a sharp edge. Because the receiver hands bytes to the application strictly in order, a single dropped segment in the middle of a stream forces everything behind it to wait, even though those later bytes have already arrived and are sitting in the buffer. The application cannot see them until the hole is filled by a retransmit. That is head-of-line blocking: one missing packet holds the whole line hostage.

On a single logical stream this is the cost of ordering and there is nothing to do about it. The problem gets worse when you multiplex many independent streams over one TCP connection, which is exactly what HTTP/2 does to load a page's many resources in parallel. HTTP/2 has its own stream IDs, but they all ride one TCP byte stream, and TCP does not know they are independent. One lost packet belonging to one image stalls every other stream on that connection, because TCP refuses to deliver any byte past the gap. HTTP/2 solved request multiplexing at the application layer but inherited TCP's single-stream ordering underneath it.

This is why HTTP/3 abandoned TCP and runs over QUIC, which is built on UDP. UDP gives no ordering and no reliability at all, so QUIC rebuilds both — but it rebuilds them per stream rather than for the whole connection. QUIC carries many streams, each with its own sequence space and its own reliability, inside one encrypted UDP flow. A lost packet that affects stream A is retransmitted and delivered to stream A, while stream B, whose packets arrived fine, is handed to the application without waiting. The head-of-line block is contained to the stream that actually lost data. QUIC also folds the TLS handshake into the transport handshake, cutting connection setup to one round-trip or zero on resumption. The full picture is in the QUIC deep dive, and the trade-offs between the two transports are laid out in TCP vs UDP.
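The difference in what the application can see after a loss can be shown with a toy model. Packets are (stream, data) pairs in arrival order, and `lost` holds the indexes that have not arrived yet; the function names are hypothetical and neither function models retransmission.

```python
def tcp_deliverable(packets: list[tuple[str, str]],
                    lost: set[int]) -> list[tuple[str, str]]:
    """One shared byte stream: delivery stops at the first missing packet."""
    out = []
    for index, packet in enumerate(packets):
        if index in lost:
            break  # everything behind the hole waits, whatever stream it belongs to
        out.append(packet)
    return out


def quic_deliverable(packets: list[tuple[str, str]],
                     lost: set[int]) -> list[tuple[str, str]]:
    """Per-stream ordering: a loss blocks only later packets of the same stream."""
    blocked = set()
    out = []
    for index, (stream, data) in enumerate(packets):
        if index in lost:
            blocked.add(stream)
        elif stream not in blocked:
            out.append((stream, data))
    return out
```

If the first packet of stream A is lost, the TCP model delivers nothing until the retransmit arrives, while the per-stream model still hands over all of stream B.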

Tools — what to reach for when something is wrong

Tool	Use for
`ss -tin`	Per-socket state, RTT, cwnd, retransmits — start here
`tcpdump -i eth0 -w cap.pcap`	Packet capture; open in Wireshark
`nstat -a TcpRetransSegs`	Host-wide retransmit counter
`bpftrace tools/tcpretrans.bt`	Per-flow retransmits in real time
`iperf3 -c host -t 30`	Steady-state throughput between two hosts
`mtr host`	Per-hop latency and loss; the right first tool for "is it me or the path"

A surprising amount of production debugging is "open ss, find the socket, look at the cwnd and retransmit count". Slow-start hasn't finished → cwnd is small → throughput is capped. Lots of retransmits → loss somewhere → the pipe isn't really as wide as you think.

Common mistakes

Tuning tcp_rmem / tcp_wmem by copying from a 2008 blog post. Linux auto-tunes; the manual values from that era are usually smaller than current defaults. Verify before pasting.
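Setting SO_REUSEADDR for performance. It lets a restarted server bind its listening port while old connections sit in TIME_WAIT; it does nothing for outbound connection churn, and TIME_WAIT is rarely the actual problem anyway. The right fix is usually tcp_tw_reuse=1 or solving the application-level reason you're churning connections.
Calling close() when you meant shutdown(SHUT_WR). shutdown(SHUT_WR) sends your FIN but lets you keep reading what the peer still has to send; close() gives up both directions, and data that arrives afterwards is answered with a RST.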
Believing "the network is dropping packets" without measuring. Datacentre paths drop tens of packets per million. A 1% retransmit rate in your app probably means something else: a hot link, a tail-drop ECMP path, an MTU mismatch, a buggy NIC offload.
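The half-close pattern from the shutdown(SHUT_WR) item can be sketched as follows. The helper names and the upper-casing peer are hypothetical, and the demo uses a local socket pair as a stand-in for a TCP connection; the calls are the same on a real TCP socket.

```python
import socket
import threading


def request_with_half_close(sock: socket.socket, request: bytes) -> bytes:
    """Send a request, signal end-of-request with FIN, then read the whole reply."""
    sock.sendall(request)
    sock.shutdown(socket.SHUT_WR)  # done sending; the read side stays open
    reply = bytearray()
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer has finished sending too
            break
        reply += chunk
    sock.close()
    return bytes(reply)


def upper_case_server(sock: socket.socket) -> None:
    """Hypothetical peer for the demo: read to EOF, reply in upper case, close."""
    data = bytearray()
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.sendall(bytes(data).upper())
    sock.close()


def demo() -> bytes:
    client, server = socket.socketpair()
    worker = threading.Thread(target=upper_case_server, args=(server,))
    worker.start()
    reply = request_with_half_close(client, b"ping")
    worker.join()
    return reply
```

Had the client called close() instead of shutdown(SHUT_WR), it would have had no socket left to read the reply from.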
