TCP Congestion Control Visualizer: the classic sawtooth.

Congestion control is how a TCP sender guesses how fast the network can go without overloading it, adjusting its window every round trip. Slow start ramps up fast, congestion avoidance creeps, a loss halves the window, then it climbs again. That cycle draws the sawtooth shape of every well-behaved TCP flow.

[Interactive simulator. Readouts: cwnd (starts at 1), ssthresh (starts at 16), phase (starts in slow start). Plot: cwnd over time, with packet bursts. Legend: slow start, cwnd × 2 per RTT; congestion avoidance, cwnd + 1 per RTT; loss, cwnd halved or reset to 1.]

What you're looking at

The graph plots the congestion window (cwnd) over time, one dot per round trip, coloured by phase — accent for slow start, green for congestion avoidance, plum for a loss event. The readouts up top show the current cwnd, the slow-start threshold (ssthresh), and which phase the flow is in. Tick advances one round trip by hand, Inject loss forces a drop, and Auto runs it continuously. The legend spells out the rule each phase follows: double per RTT, add one per RTT, or cut on loss.

Hit Auto and let it run a while. Early on cwnd doubles each tick — that is slow start, which despite the name is the fastest the algorithm ever opens. Once it crosses ssthresh the growth flattens to plus-one per RTT. Then watch a loss land: cwnd is roughly halved (or, less often, slammed back to 1) and ssthresh drops to match. What should surprise you is the shape that emerges on its own — the classic sawtooth — and that the flow never settles on a fixed rate. It is always probing upward until something breaks, then backing off and climbing again.
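
The three rules in the legend are small enough to sketch in a few lines. The version below is a minimal approximation, not the simulator's actual code: the function name and the loss schedule are made up for illustration, and it assumes the "halve" variant of loss and that the last slow-start doubling is capped at ssthresh.

```python
def simulate(ticks, loss_ticks, cwnd=1, ssthresh=16):
    """Return one (cwnd, phase) pair per round trip.

    loss_ticks is a hypothetical set of tick indices at which a loss is injected.
    """
    history = []
    for tick in range(ticks):
        if tick in loss_ticks:
            # Loss: ssthresh drops to half the current window, cwnd follows it.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
            phase = "loss"
        elif cwnd < ssthresh:
            # Slow start: double per RTT, capped at ssthresh (an assumption).
            cwnd = min(cwnd * 2, ssthresh)
            phase = "slow start"
        else:
            # Congestion avoidance: one extra segment per RTT.
            cwnd += 1
            phase = "congestion avoidance"
        history.append((cwnd, phase))
    return history


if __name__ == "__main__":
    for cwnd, phase in simulate(ticks=10, loss_ticks={8}):
        print(f"{cwnd:3d}  {phase}")
```

Running it with a loss at tick 8 shows the steep ramp to 16, the slow climb to 20, and the cut back to 10: one tooth of the sawtooth.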


What is TCP congestion control?

Don't send faster than the pipe.

Imagine ten people downloading large files through one shared internet uplink. Each computer wants to send as fast as it can; the uplink can carry only a fixed number of bytes per second. Without any coordination, every sender pumps packets into a router queue that fills, overflows, and starts dropping. Drops trigger retransmits. Retransmits add to the queue. The queue stays full forever. Throughput collapses, latency balloons, and the link delivers a fraction of its rated capacity. This is the original "congestion collapse" problem.

The trouble is that no individual sender can see the bottleneck directly. The router does not send a friendly "please slow down" message — it just silently drops packets when its buffer is full. The sender has to infer that congestion is happening from the only signals it can see: round-trip time and acknowledgements that never arrive. Each TCP connection has to play this guessing game on its own, in real time, while staying roughly fair to other connections sharing the same path.

TCP's answer is the congestion window — usually shortened to cwnd — a per-connection cap on how many bytes the sender is willing to have in flight before any acknowledgement comes back. Start small. After every successful round trip, raise it. The moment a packet is lost, cut it sharply. The result is the famous "sawtooth" pattern visible in the simulator above: cwnd ramps up, hits a ceiling, drops by half, ramps up again, hits a slightly higher ceiling, drops, and so on. Two flows on the same bottleneck eventually settle into roughly equal shares because each obeys the same rules. The protocol is, at heart, a feedback controller built from drop signals.

A few numbers anchor the scale. The default initial window in modern Linux and Windows is 10 segments — about 14.6 KB on the first round trip. After 4 RTTs of slow start it has grown to 160 segments — 234 KB. On a 50 ms link that takes 200 ms; on a transcontinental 100 ms link, 400 ms. This is why a fresh HTTPS connection feels slow even on gigabit fibre: TCP is still climbing the cwnd ladder. Reusing a warm connection (HTTP keep-alive, HTTP/2 multiplexing) skips that climb and is the largest single win in front-end performance work. The simulator above lets you watch the climb live; the slow-start segment is the steep early ramp before the line bends into the gentler congestion-avoidance phase.

[Figure: cwnd over time, the sawtooth. Axes: cwnd against time, with a capacity line. Slow start (×2 per RTT) runs into a loss; congestion avoidance (+1 per RTT) climbs to the next loss.] Climb until something breaks; halve and try again. The shape is the algorithm.

Origins of congestion control — the 1986 Internet meltdown

A protocol born from a meltdown.

In October 1986 the academic internet between Lawrence Berkeley Lab and UC Berkeley — a 400-yard hop — collapsed from a measured throughput of 32 kbps to 40 bits per second, a thousandfold drop that the operators could not explain. The same pathology spread across the wider ARPANET and NSFNet over the following months. A handful of saturated paths drove every flow into endless retransmission, the network's effective capacity disappeared, and the early internet briefly looked unviable as a research platform. The phenomenon was named congestion collapse.

Van Jacobson, then at LBL, working with Mike Karels at Berkeley, diagnosed the problem and patched it in the BSD kernel through 1987 and 1988. The resulting paper, Congestion Avoidance and Control, presented at ACM SIGCOMM 1988 in Stanford, remains the founding document of TCP congestion control. Jacobson identified four mechanisms that the original TCP (designed by Cerf and Kahn in 1974 and specified in RFC 793 in 1981) had been missing: slow start, congestion avoidance, fast retransmit, and exponential RTO backoff. The combination, deployed as TCP Tahoe in 4.3BSD-Tahoe (1988), pulled the network back from the brink.

TCP Reno (1990) added fast recovery: after a fast retransmit, skip slow start and resume from a halved window. TCP NewReno (first specified in RFC 2582 in 1999, now RFC 6582) refined behaviour during multiple losses in one window. SACK (Selective Acknowledgement, RFC 2018, 1996) added a precise hint about which segments arrived, removing Reno's guesswork. The combined Reno + NewReno + SACK was the de-facto standard for two decades. RFC 5681 (Allman, Paxson, Blanton, 2009) is the current canonical specification of slow start, congestion avoidance, fast retransmit, and fast recovery: the essence of Jacobson's original algorithm with twenty years of clarification.

The intellectual move that mattered most was Jacobson's framing: TCP must infer network state from end-to-end signals, because no router will tell it directly. A retransmission timeout is interpreted as severe congestion. Three duplicate ACKs are interpreted as a single dropped segment. The window grows on each ACK, shrinks on each apparent loss. The protocol is a feedback controller built from the only signals available, and the entire subsequent literature is variations on which signals to use and how to weight them. It is worth pausing on the academic context: the paper appeared the same year as the Morris worm, before Tim Berners-Lee's first proposals for the web, when the internet user base measured in tens of thousands of researchers and the BSD socket layer was newer than C++. Decisions made in 1988 shaped every later round-trip on the planet. There is a school of network research devoted to "would we still pick this if we were starting from scratch", and the consistent answer is that the AIMD spine is hard to improve on without changing what signals are available.


AIMD — additive increase, multiplicative decrease

Additive increase, multiplicative decrease.

TCP congestion control is the algorithm a sender uses to avoid overwhelming the network. Van Jacobson introduced the modern approach (slow start, AIMD, congestion avoidance) in 1988 after the 1986 Internet "meltdown." Today's stacks ship CUBIC by default; Google's BBR (2016) takes a model-based approach instead. Every TCP connection on the internet runs through one of these.

The Reno strategy in one phrase: add 1 to cwnd on success, halve on loss. Two flows sharing a bottleneck converge to equal share regardless of starting point. This is why TCP is "fair" — and why a single non-AIMD flow (UDP, naive QUIC implementations, an aggressive proprietary stack) can crowd out compliant ones. The algorithm is a social contract enforced by the operating system, not a network feature.

The mathematical proof is elegant. Chiu and Jain's 1989 Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks showed that under linear growth and proportional decrease, both flows trend toward equality on the line cwnd₁ = cwnd₂; any other increase/decrease pair (additive/additive, multiplicative/multiplicative) either fails to converge or oscillates. Linear-additive plus multiplicative-cut is the only stable point on the design space. Every modern congestion controller starts from this AIMD spine and adds heuristics for one or another shortcoming of the original signal model.
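
The convergence argument can be checked numerically. The sketch below is a toy model, not Chiu and Jain's analysis: it assumes two flows that see loss at the same moment whenever their combined window exceeds a fixed capacity, and the function name and parameters are illustrative. Additive increase leaves the gap between the two windows unchanged, while each multiplicative cut halves it, so the gap shrinks toward zero.

```python
def aimd_two_flows(w1, w2, capacity, rounds, increase=1.0, decrease=0.5):
    """Run two synchronised AIMD flows and return their final windows."""
    for _ in range(rounds):
        if w1 + w2 > capacity:
            # Both flows see the loss and cut multiplicatively.
            w1 *= decrease
            w2 *= decrease
        else:
            # No loss this round: both grow by the same additive step.
            w1 += increase
            w2 += increase
    return w1, w2


if __name__ == "__main__":
    a, b = aimd_two_flows(w1=1.0, w2=80.0, capacity=100.0, rounds=1000)
    print(f"flow 1: {a:.3f}  flow 2: {b:.3f}  gap: {abs(a - b):.6f}")
```

Starting from windows of 1 and 80, the two flows end up with practically identical windows. Setting decrease to 1.0 (no multiplicative cut) removes the mechanism that closes the gap, which is the intuition behind the result.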

The fairness guarantee depends critically on the bottleneck queue absorbing the bursts produced by the cwnd dynamics. If the queue is too short (a small router buffer), bursts cause loss before fairness has time to converge. If it is too long (the bufferbloat case), latency balloons before loss happens at all. The Goldilocks zone is roughly one bandwidth-delay product of buffer; less drops too aggressively, more delays too much. Modern routers tune to this rule of thumb either explicitly or implicitly via active queue management.

A second, less-discussed property: AIMD is RTT-unfair. Two flows on the same bottleneck with different round-trip times converge to bandwidth shares proportional to the inverse of their RTTs, not to equal shares. A 10 ms flow on the same link as a 100 ms flow takes roughly ten times the throughput at steady state. CUBIC partially repairs this by replacing the linear additive increase with a function that scales independently of RTT. BBR sidesteps it entirely by targeting bandwidth and latency directly. The Reno fairness story is "fair to flows of similar RTT"; once you broaden the comparison to globally-mixed RTTs, the deviation from equal-share gets uncomfortable, and a long-haul TCP user pays a real penalty against a co-located one on the same link.

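[Figure: cwnd over time, from slow start to congestion avoidance to loss. Axes: cwnd against time, with the ssthresh line marked. Slow start (×2 per RTT) ends in a loss; congestion avoidance (+1 per RTT) climbs to the next loss, tracing the AIMD sawtooth.]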

Slow start — why a fresh connection is slow

Why a fresh connection is slow.

Every new TCP connection begins with a tiny congestion window. The original 1988 algorithm started at one segment; RFC 3390 (2002) raised it to 2–4; RFC 6928 (Chu, Dukkipati, Cheng, Mathis, 2013) raised it to 10 segments after Google's experiments at the time showed faster page loads with no measurable harm. Linux adopted the larger initial window in 2.6.39 (2011); Windows in 2012; macOS shortly after. The 10-segment default is now universal across the major server OSes, though older middleware sometimes still negotiates lower values.

Each successful round-trip doubles cwnd: 10 → 20 → 40 → 80 → 160. Despite the name, slow start is exponential, the fastest opening the algorithm allows. The doubling continues until either ssthresh is reached (then the flow moves to congestion avoidance, with linear additive increase) or a packet is lost (then it cuts back). The cost shows up in user-visible latency. With an MSS of 1460 bytes and cwnd 10, the first round-trip can carry 14.6 KB; after four doublings the window is 160 segments, roughly 234 KB per round trip. A 1 MB file over a 50 ms link needs about seven round trips (350 ms) even when slow start never loses a packet. That is why HTTP keep-alive matters: reusing a warm connection saves several RTTs of warm-up, and reusing a long-warm one saves the slow-start phase entirely.

| RTT | cwnd (segments) | Cumulative bytes (1460 MSS) |
| --- | --- | --- |
| 0 | 10 | 14.6 KB |
| 1 | 20 | 43.8 KB |
| 2 | 40 | 102.2 KB |
| 3 | 80 | 219 KB |
| 4 | 160 | 453 KB |
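
The table's arithmetic is easy to reproduce. The sketch below assumes a loss-free path, an initial window of 10 segments, and the 1460-byte MSS used above; the function names are illustrative. It also ignores the connection handshake, so real transfers take at least one more round trip.

```python
MSS = 1460  # bytes per segment, as in the table


def slow_start_rows(initial_window=10, rounds=5, mss=MSS):
    """Return (rtt, cwnd_segments, cumulative_bytes) for each round trip."""
    rows = []
    cwnd, sent_segments = initial_window, 0
    for rtt in range(rounds):
        sent_segments += cwnd
        rows.append((rtt, cwnd, sent_segments * mss))
        cwnd *= 2  # slow start doubles the window every round trip
    return rows


def round_trips_to_send(total_bytes, initial_window=10, mss=MSS):
    """Count the round trips slow start needs to deliver total_bytes."""
    cwnd, sent_segments, rounds = initial_window, 0, 0
    while sent_segments * mss < total_bytes:
        sent_segments += cwnd
        cwnd *= 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    for rtt, cwnd, total in slow_start_rows():
        print(f"RTT {rtt}: cwnd {cwnd:4d}  cumulative {total / 1000:.1f} KB")
    print("round trips for 1 MB:", round_trips_to_send(1_000_000))
```

For a 1 MB transfer this gives seven round trips, which at 50 ms each is the 350 ms quoted in the text.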

Two important refinements limit slow start's downside. Hybrid Slow Start (HyStart, Ha and Rhee 2008) exits slow start when ACK trains start to compress, indicating the path's capacity is near; this dramatically reduces the spike of loss that historically marked the slow-start-to-congestion-avoidance transition. Congestion Window Validation (RFC 2861) resets cwnd back toward the initial value if the connection has been idle for more than one retransmission timeout, on the reasoning that the path may have changed. In Linux, HyStart is enabled by default in CUBIC, and the idle reset is enabled by default through the net.ipv4.tcp_slow_start_after_idle sysctl.


Loss-based vs model-based — Reno, CUBIC, BBR

Loss-based vs model-based.

Reno was designed for an internet where loss meant congestion. On modern wireless and long-fat-pipes, loss often means radio interference or random buffer overflow — neither of which warrants halving the window. Two responses to this gap define modern congestion control. CUBIC stayed loss-based but reshaped the growth curve. BBR abandoned loss as a primary signal and modelled the bottleneck directly.

CUBIC (Sangtae Ha, Injong Rhee, Lisong Xu, ACM SIGOPS 2008) replaces Reno's linear cwnd growth with a cubic function of time-since-last-loss centred at the previous maximum window. The shape is concave near the maximum (cautious as the window approaches its previous ceiling), then convex past it (aggressive when probing higher). The result is fairness across long-running flows that share a bottleneck even at very different RTTs — the longer-RTT flow does not get starved by the shorter-RTT one as it does under Reno. CUBIC has been Linux's default since kernel 2.6.19 (November 2006); Windows adopted it as the default in Windows 10 1709 (Fall Creators Update, 2017). With both major OS families defaulting to it, CUBIC carries most of the public internet's TCP traffic.
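
The cubic curve itself is a one-line formula. The sketch below uses the window function and the constants C = 0.4 and β = 0.7 as given in RFC 8312; those constants do not come from the text above, and real implementations add a Reno-friendly region and other details omitted here. Time is in seconds since the last loss and windows are in segments.

```python
C = 0.4      # scaling constant from RFC 8312
BETA = 0.7   # fraction of w_max kept after a loss (RFC 8312)


def cubic_k(w_max):
    """Seconds needed to climb back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t, w_max):
    """Target window t seconds after a loss that occurred at window w_max."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


if __name__ == "__main__":
    w_max = 100.0
    k = cubic_k(w_max)
    for t in (0.0, k / 2, k, k + 1, k + 3):
        print(f"t={t:5.2f}s  window={cubic_window(t, w_max):7.2f}")
```

At t = 0 the window starts at 0.7 × w_max, it flattens out as it approaches w_max at t = K (the concave part), and it accelerates again once past it (the convex part). Because the curve depends on elapsed time and not on the number of ACKs received, growth does not speed up for short-RTT flows.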

BBR (Bottleneck Bandwidth and Round-trip propagation time, Cardwell, Cheng, Gunn, Yeganeh, Jacobson — yes, the same Van Jacobson — published in ACM Queue, October 2016) ignores loss as a primary signal. It probes the path continuously: every few RTTs it sends a small burst above its current rate to measure the actual bottleneck bandwidth, and it tracks the minimum RTT it has observed in a sliding window to estimate base propagation delay. The pacing rate is set to the measured bandwidth; the cwnd is set to bandwidth × min-RTT (the bandwidth-delay product). BBR is the default on Google's edge serving google.com and YouTube, on Spotify, on parts of Cloudflare, and on the Linux kernel for any sysadmin who flips the sysctl.
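
The two estimates at the heart of BBR can be sketched as a pair of sliding-window filters. This is a simplification under stated assumptions, not BBR's actual state machine: the class and method names are hypothetical, the windows count samples where the real algorithm uses time-based windows, and the 1.25 probing gain is the one shown in the probe-phase diagram.

```python
from collections import deque


class BbrModel:
    """Toy path model: max-filtered bandwidth and min-filtered RTT."""

    def __init__(self, bw_window=10, rtt_window=100):
        # deque(maxlen=...) drops the oldest sample once the window is full.
        self.bw_samples = deque(maxlen=bw_window)
        self.rtt_samples = deque(maxlen=rtt_window)

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        """Record one delivery-rate sample and one RTT sample."""
        self.bw_samples.append(delivered_bytes / interval_s)
        self.rtt_samples.append(rtt_s)

    def btl_bw(self):
        return max(self.bw_samples)   # bytes per second

    def min_rtt(self):
        return min(self.rtt_samples)  # seconds

    def bdp_bytes(self):
        """Bandwidth-delay product: the cwnd target described above."""
        return self.btl_bw() * self.min_rtt()

    def pacing_rate(self, gain=1.0):
        """Bytes per second; gain 1.25 probes for more bandwidth."""
        return gain * self.btl_bw()


if __name__ == "__main__":
    model = BbrModel()
    model.on_ack(delivered_bytes=125_000, interval_s=0.01, rtt_s=0.050)
    model.on_ack(delivered_bytes=100_000, interval_s=0.01, rtt_s=0.080)
    print(f"BDP: {model.bdp_bytes():.0f} bytes")
    print(f"probe rate: {model.pacing_rate(1.25):.0f} bytes/s")
```

Note that the second sample, with a longer RTT and lower delivery rate, changes neither estimate: a queue building up does not lower the bandwidth estimate or raise the base RTT until the older samples age out.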

The arithmetic of BBR's advantages can be dramatic. Cardwell's 2016 paper reported BBR delivering up to 2700× better throughput than CUBIC on lossy paths (e.g., transcontinental with 1% random loss), while delivering similar throughput at lower latency on clean paths. The cost is that BBRv1 was famously aggressive against CUBIC neighbours on shared bottlenecks — measured 16:1 throughput skew in some lab setups, because BBRv1 ignored the implicit "back off" signal that loss provides. BBRv2 (2019) and BBRv3 (2023) added explicit fairness mechanisms, including reaction to ECN marks and a soft response to loss, narrowing the unfairness gap. The internet is still finishing the migration.

Specialised variants exist for specific environments. DCTCP (Alizadeh et al, SIGCOMM 2010, designed at Microsoft Research and Stanford for datacenter use) couples ECN marks with a fine-grained cwnd response, achieving sub-millisecond latencies in datacenter fabrics where Reno or CUBIC would suffer. Compound TCP (Microsoft, Vista era, 2007) ran a Reno-style window in parallel with a delay-sensitive window and sent up to the sum of the two; it was the Windows default until Windows 10 adopted CUBIC. HighSpeed TCP (RFC 3649, Floyd, 2003) was an early proposal for raising the window aggressively on very-high-bandwidth paths, largely superseded by CUBIC.
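
DCTCP's "fine-grained" response fits in a few lines. The sketch below follows the update rule from the DCTCP paper: a running estimate alpha of the fraction of marked packets, and a cut of alpha / 2 in place of Reno's fixed one half. The gain g = 1/16 is the value suggested in the paper; the function name and the once-per-window calling convention are assumptions for illustration.

```python
def dctcp_update(cwnd, alpha, marked, acked, g=1 / 16):
    """Apply one window's worth of ECN feedback.

    marked / acked is the fraction of packets that carried a congestion mark.
    Returns the new (cwnd, alpha).
    """
    fraction = marked / acked if acked else 0.0
    # Exponentially weighted estimate of how congested the path is.
    alpha = (1 - g) * alpha + g * fraction
    if marked:
        # Cut in proportion to the congestion estimate, never below 1 segment.
        cwnd = max(cwnd * (1 - alpha / 2), 1.0)
    return cwnd, alpha


if __name__ == "__main__":
    print(dctcp_update(cwnd=100.0, alpha=0.0, marked=1, acked=16))
    print(dctcp_update(cwnd=100.0, alpha=1.0, marked=10, acked=10))
```

A single mark in a window barely dents cwnd, while a fully marked window with alpha at 1 produces the same halving Reno would apply.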

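[Figure: BBR, bandwidth × min-RTT, probe phases. Axes: rate against time, around the estimated bottleneck bandwidth (BtlBw). Pacing gain cycles through ×1.25 probes and a ×0.75 drain. Phases: probe bandwidth, probe RTT; loss is ignored until delay grows.]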

RTT, bandwidth, queue — three numbers that decide everything

Three numbers that decide everything.

Three measurable quantities determine the behaviour of any modern TCP flow: the bandwidth-delay product, the depth of the path's bottleneck buffer, and whether explicit congestion notification is honoured along the way.

The bandwidth-delay product is the amount of data in flight on the wire at steady state. A 100 Mbps link with 50 ms RTT carries about 625 KB at any instant; a 1 Gbps transcontinental at 100 ms RTT carries 12.5 MB. Your cwnd has to reach this number to saturate the link, which is why the initial-window-of-10 default leaves so much capacity untouched on long-haul paths. Linux's net.ipv4.tcp_rmem autotuning raises receive buffers up to 6 MB on modern kernels; for bandwidth-delay products above that, you need to tune higher. The Pittsburgh Supercomputing Center's tcptune documentation has the canonical recipe.
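
The two figures above, and the gap between them and a 10-segment initial window, can be checked with integer arithmetic. This is a back-of-the-envelope sketch: the function names are illustrative, and it assumes a 1460-byte MSS and loss-free doubling from an initial window of 10.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bytes that must be in flight to keep the link full."""
    # bits/s * ms / 1000 gives bits; dividing by 8 gives bytes.
    return bandwidth_bps * rtt_ms // 8000


def doublings_to_fill(bandwidth_bps, rtt_ms, initial_window=10, mss=1460):
    """Slow-start doublings before cwnd covers the bandwidth-delay product."""
    target_segments = -(-bdp_bytes(bandwidth_bps, rtt_ms) // mss)  # ceiling
    cwnd, rounds = initial_window, 0
    while cwnd < target_segments:
        cwnd *= 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    for name, bps, rtt in (("100 Mbps, 50 ms", 100_000_000, 50),
                           ("1 Gbps, 100 ms", 1_000_000_000, 100)):
        print(name, bdp_bytes(bps, rtt), "bytes,",
              doublings_to_fill(bps, rtt), "doublings")
```

The transcontinental gigabit path needs ten doublings, a full second of round trips at 100 ms each, before the window can fill the pipe.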

Bufferbloat, named by Jim Gettys in his 2011 series of blog posts and ACM Queue articles, describes routers and modems that absorb congestion into oversized buffers instead of dropping packets. Loss-based controllers never get the back-pressure signal; latency balloons by 100–1000 ms while throughput appears nominal. Voice calls become unintelligible during a simultaneous file transfer; gaming becomes unplayable. The home-router industry shipped multi-megabyte buffers throughout the 2000s, in the well-meaning belief that bigger buffers were better. CoDel (Controlled Delay, Nichols and Jacobson, ACM Queue 2012) and PIE (Proportional Integral controller Enhanced, RFC 8033, 2017) are active queue management algorithms that drop or mark packets early to give TCP the feedback it needs. Both are widely deployed in modern routers, in Linux's fq_codel and cake qdiscs, in DOCSIS 3.1 cable modems, and in OpenWrt.

Explicit Congestion Notification (RFC 3168, Ramakrishnan, Floyd, Black, 2001) lets routers mark packets with a "congestion experienced" bit instead of dropping them. The TCP receiver echoes the mark back to the sender, which responds the same way it would to a loss (halve cwnd) without any actual packet being lost. ECN was deployed slowly because of buggy middleboxes that scrambled the ECN bits and broke connections; usable deployments only became common after 2015. Apple enabled ECN by default in iOS 9 and macOS 10.11 (2015); Linux supports it but does not enable it by default; path support has improved steadily but remains far from universal. ECN's modern incarnation is L4S (Low Latency, Low Loss, Scalable, RFC 9330–9332, 2023), which combines a more aggressive ECN signal with a tightly-paced congestion controller (TCP Prague or similar) to deliver sub-millisecond latencies on the open internet — the future Apple, Comcast, and Nokia have committed to.

| Algorithm | Year | Signal | Fairness | Where |
| --- | --- | --- | --- | --- |
| Tahoe | 1988 | loss → cwnd=1 | fair at similar RTT | historical |
| Reno | 1990 | loss → cwnd/2 | fair at similar RTT | legacy |
| NewReno | 1999 | loss · partial ACKs | fair at similar RTT | RFC 6582 default |
| CUBIC | 2008 | loss · cubic curve | RTT-independent | Linux + Windows default |
| BBR (v1/v2/v3) | 2016+ | bandwidth · min-RTT | model-based | Google, YouTube, Spotify |
| DCTCP | 2010 | ECN marks (fine-grained) | datacenter only | Azure, hyperscale fabrics |

Why bufferbloat hides

Loss-based controllers only react when packets drop. Oversized router buffers absorb every burst silently, so cwnd keeps growing while latency climbs into the seconds. Throughput dashboards show "fine"; voice calls fall apart. Test by running ping during a saturating download — if RTT grows by hundreds of milliseconds, your path has bloat.
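
Once you have two sets of ping samples, one taken on an idle link and one during a saturating download, the comparison is mechanical. The sketch below assumes you have already collected the RTTs in milliseconds; the function names and the 100 ms cut-off are illustrative choices matching the "hundreds of milliseconds" rule of thumb, not a standard.

```python
from statistics import median


def added_latency_ms(idle_rtts_ms, loaded_rtts_ms):
    """Median RTT under load minus median RTT when idle."""
    return median(loaded_rtts_ms) - median(idle_rtts_ms)


def looks_bloated(idle_rtts_ms, loaded_rtts_ms, threshold_ms=100.0):
    """True when load adds at least threshold_ms of delay (assumed cut-off)."""
    return added_latency_ms(idle_rtts_ms, loaded_rtts_ms) >= threshold_ms


if __name__ == "__main__":
    idle = [20, 21, 19, 22]
    loaded = [250, 310, 280, 400]
    print(added_latency_ms(idle, loaded), looks_bloated(idle, loaded))
```

Medians are used so that a single outlier ping does not decide the verdict.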


TCP tuning knobs an operator owns — initcwnd, sysctls, qdiscs

The knobs an operator owns.

Tuning TCP for a real fleet comes down to a small set of high-leverage decisions. Initial Window 10 (RFC 6928) is the right default for any modern server; Linux has no sysctl for it, and the value is set per route with the initcwnd option of ip route. Pacing spreads outgoing segments across the RTT instead of bursting them; under CUBIC, Linux's tcp_pacing_ca_ratio controls how aggressively, and BBR paces by design. Pacing reduces tail-drop incidents at downstream routers and is one of the largest practical wins for any high-throughput sender.
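
To make the effect of pacing concrete, the sketch below computes a pacing rate from cwnd, MSS, and RTT, then the gap between packets that rate implies. It is an approximation of the idea, not the kernel's code: the function names are illustrative, and the 120 percent default for the congestion-avoidance ratio is an assumption about tcp_pacing_ca_ratio that you should verify on your own kernel.

```python
def pacing_rate_bps(cwnd_segments, mss_bytes, rtt_s, ratio_percent=120):
    """Bits per second: one window per RTT, scaled by the pacing ratio."""
    window_bits = cwnd_segments * mss_bytes * 8
    return window_bits / rtt_s * ratio_percent / 100


def inter_packet_gap_s(rate_bps, mss_bytes):
    """Seconds between the starts of two full-sized packets at rate_bps."""
    return mss_bytes * 8 / rate_bps


if __name__ == "__main__":
    rate = pacing_rate_bps(cwnd_segments=100, mss_bytes=1460, rtt_s=0.050)
    gap = inter_packet_gap_s(rate, mss_bytes=1460)
    print(f"pacing rate: {rate / 1e6:.2f} Mbps, gap: {gap * 1e6:.0f} microseconds")
```

With a ratio of 100 the gap is exactly RTT / cwnd, so the window is spread evenly over the round trip. A ratio above 100 sends slightly faster than that, which leaves headroom for the window to keep growing.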

The choice of congestion controller is a sysctl. net.ipv4.tcp_congestion_control = bbr on Linux switches the default; tcp_available_congestion_control lists what is loaded; modules can be loaded for specialised needs (vegas, westwood, illinois, dctcp, scalable, hybla). Per-route settings via ip route let you specialise behaviour by destination prefix, useful when shipping bulk data to a specific cloud while keeping CUBIC for the rest.

TFO (TCP Fast Open, RFC 7413, 2014) lets the client send data in the SYN, eliminating one round-trip on repeat connections. Adoption has been uneven because some middleboxes drop SYNs containing payload; a fallback is required. RACK (Recent Acknowledgement, RFC 8985, 2021) replaces fast-retransmit's "three duplicate ACKs" rule with a time-based loss detector, dramatically reducing spurious retransmits at high speeds. BBR's combination with QUIC moves the congestion controller into user space, where it can iterate faster than the kernel — the protocol stack is no longer a single timeline.

Diagnosis is mostly about latency under load. The Stanford Linear Accelerator Center's ping under load test, the WaveForm Bufferbloat test, and Cloudflare's speed.cloudflare.com all measure RTT while saturating the link; a healthy connection adds tens of milliseconds, a bufferbloated one adds hundreds. tcpdump, ss -tin, and BPF tools (Brendan Gregg's BPF Performance Tools, Addison-Wesley 2019) reveal cwnd, ssthresh, retransmit counts, and RTT estimates at runtime. The combination of a load test and on-host telemetry is enough to characterise most production TCP behaviour.

# Linux: which controllers are loaded, which is active
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr

$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic

# Switch to BBR, persist across reboots
$ sudo modprobe tcp_bbr
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
$ echo "net.ipv4.tcp_congestion_control = bbr" | sudo tee -a /etc/sysctl.d/99-tcp.conf

# Per-route override (BBR for one prefix only)
$ sudo ip route change 10.0.0.0/8 via 10.0.0.1 \
    congctl bbr

# Inspect cwnd, ssthresh, retx live
$ ss -tin state established '( dport = :443 )'
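
For telemetry you usually want those fields as numbers. The sketch below pulls cwnd, ssthresh, MSS, and RTT out of one info line using the key:value layout that ss -tin prints. The sample line is hand-written for illustration, not captured output; field sets vary between iproute2 versions, so treat the regular expressions as a starting point.

```python
import re


def parse_ss_info(line):
    """Extract selected TCP fields from one ss -tin info line."""
    # \b keeps "ssthresh" from matching inside "rcv_ssthresh".
    fields = {key: int(value)
              for key, value in re.findall(r"\b(cwnd|ssthresh|mss):(\d+)", line)}
    # rtt is printed as smoothed/variance in milliseconds.
    rtt = re.search(r"\brtt:([\d.]+)/([\d.]+)", line)
    if rtt:
        fields["rtt_ms"] = float(rtt.group(1))
        fields["rttvar_ms"] = float(rtt.group(2))
    return fields


if __name__ == "__main__":
    # Hypothetical sample line, written by hand for this example.
    sample = ("cubic wscale:7,7 rto:204 rtt:1.5/0.75 mss:1448 "
              "rcv_ssthresh:64088 cwnd:10 ssthresh:7 rcv_rtt:20")
    print(parse_ss_info(sample))
```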

Congestion control in user space — QUIC, eBPF, BBRv2

Congestion control in user space.

The most important shift in congestion control since BBR is not a new algorithm but a new platform. QUIC (RFC 9000, 2021) moves transport from kernel TCP into user-space libraries, which means the congestion controller is no longer a feature of the operating system. Cloudflare's quiche, Google's quiche (different code base), Microsoft's MsQuic, and the community-maintained Rust library quinn each ship a selection of Reno, CUBIC, and BBR variants that the application can pick at runtime. Iteration cycles dropped from kernel release pace to library release pace; new algorithms ship in months rather than years.

QUIC also changes which signals are available. Per-stream framing eliminates head-of-line blocking but means packet loss can no longer be inferred from in-order receive gaps. The QUIC congestion controller works on packet numbers, not stream sequence numbers, and tracks loss per acknowledgement frame rather than per duplicate ACK. The arithmetic looks similar; the underlying machinery is rebuilt. RFC 9002 specifies a recommended QUIC congestion controller (similar to TCP NewReno), but implementations are free to swap.
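
Loss detection on packet numbers can be sketched directly from RFC 9002's two rules: a packet is lost if a packet sent at least three packet numbers later has been acknowledged, or if it was sent long enough ago. The constants (packet threshold 3, time threshold 9/8, granularity 1 ms) are the recommended values from that RFC; the function name and the dictionary of send times are illustrative, and timers, PTO, and packet number spaces are left out.

```python
K_PACKET_THRESHOLD = 3      # reordering tolerance in packets (RFC 9002)
K_TIME_THRESHOLD = 9 / 8    # multiple of the RTT to wait (RFC 9002)
K_GRANULARITY_S = 0.001     # timer granularity floor (RFC 9002)


def detect_lost(sent, largest_acked, now_s, smoothed_rtt_s, latest_rtt_s):
    """Return the packet numbers to declare lost.

    sent maps each still-unacknowledged packet number to its send time (s).
    """
    loss_delay = max(K_TIME_THRESHOLD * max(smoothed_rtt_s, latest_rtt_s),
                     K_GRANULARITY_S)
    lost = []
    for packet_number, sent_at in sorted(sent.items()):
        if packet_number > largest_acked:
            continue  # nothing sent later has been acknowledged yet
        too_far_behind = packet_number <= largest_acked - K_PACKET_THRESHOLD
        too_old = sent_at <= now_s - loss_delay
        if too_far_behind or too_old:
            lost.append(packet_number)
    return lost


if __name__ == "__main__":
    in_flight = {5: 1.000, 8: 1.010, 9: 1.011, 12: 1.020}
    print(detect_lost(in_flight, largest_acked=10, now_s=1.030,
                      smoothed_rtt_s=0.050, latest_rtt_s=0.050))
```

In the example only packet 5 is declared lost at first; packets 8 and 9 are within the reordering threshold and become lost only once the time threshold expires.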

Research directions worth following. Machine-learned congestion control — Indigo (Yan et al, NSDI 2018), PCC and PCC-Vivace (Dong et al, NSDI 2015 and 2018), and the broader Pantheon framework (Yan et al, USENIX 2018) — has shown reinforcement-learned controllers outperforming hand-designed ones on specific path classes, though deployment remains rare. Multipath TCP (RFC 8684, 2020, deployed in iOS Siri and various Korean carriers) splits a single connection across multiple paths and runs a coupled controller across them. Real-time streaming controllers (Google Congestion Control for WebRTC, NADA, SCReAM) target media flows where loss tolerance is asymmetric — they will sacrifice video frames to keep audio smooth, a trade-off TCP's stream model cannot express.

What has not changed is the fundamental insight from 1988: a congestion controller is a feedback loop driven by signals available at the endpoint, balancing throughput against fairness against latency. Every algorithm picks different signals and weights them differently. Reno picks loss; CUBIC picks loss with a smarter growth curve; BBR picks measured bandwidth and minimum RTT; DCTCP picks ECN marks. The signal model is the algorithm; understanding which signals are reliable on which paths is most of what an operator needs to know to choose wisely.

