What is the difference between token bucket and leaky bucket?

Token bucket allows bursts up to bucket capacity, then enforces an average rate. Leaky bucket smooths traffic to a constant rate, dropping or queuing anything faster — no burst allowance. Token bucket is what most APIs ship (it's more user-friendly under bursty traffic); leaky bucket is what hard-real-time systems ship (it gives the downstream a predictable rate).

Which rate limiting algorithm should I use?

For an API that needs to allow short bursts but enforce an average rate: token bucket. For traffic shaping toward a backend that can't tolerate spikes: leaky bucket. For regulatory caps that must never be exceeded in any rolling window: sliding window log. Avoid plain fixed-window — it silently allows up to 2× the limit at the boundary between windows.

How does Stripe's rate limiter work?

Stripe published a 2017 engineering post describing four rate limiters used together: a request rate limiter (token bucket per API key), a concurrency limiter (semaphore), a fleet usage load shedder, and a worker utilisation load shedder. The point: rate limiting is rarely one algorithm; it's a stack of limiters protecting different resources.

New to this? · ELI5 · 1 min Read Rate limiting explained simply, in plain English

Rate Limiter Simulator: four ways to say "slow down."

A rate limiter caps how many requests a client can make in a window and rejects or delays the rest. Token bucket, leaky bucket, fixed window, and sliding log each enforce that cap differently. Feed them one arrival rate and watch the fairness curves diverge under burst.

Pass rate

100%

Accepted

Denied

Algorithm

Rate / burst

State

10.0 / 10 tokens

refilling at 5/s

Recent

— quiet —

What you're looking at

Two rows of buttons pick the algorithm and the rate/burst settings; the State panel below then shows whatever that algorithm tracks — tokens left for token bucket, queue depth for leaky, the per-window count for fixed, the timestamps-in-the-last-second for sliding. The "Send 1", "Send burst (8)", and "Auto-load" buttons are your traffic generator, and the recent log marks each request OK or DENY. The counters up top tally accepted, denied, and the running pass rate.

Pick Token bucket and hit "Send burst (8)": the first batch sails through on the stored tokens, then later bursts get denied until the bucket refills. Switch to Fixed window and send a burst right as the window timer is about to reset — fire again the instant it flips and you can push nearly double the rate through in one short span. That boundary doubling is the bug practitioners trip over. Then compare it against Sliding log on the same traffic, which counts the real last second and never lets the spike through.

What is rate limiting?

A free public API. Then the bot shows up.

Rate limiting is the practice of capping how many requests a client can make in a given time window — to protect a backend from abuse, a shared resource from monopolisation, or a partner from itself. The four canonical algorithms are token bucket, leaky bucket, fixed window, and sliding window; the simulator above runs all four on the same traffic so you can see which one a given pattern actually needs. Stripe, Cloudflare, Kong, AWS API Gateway, and Envoy all ship variants of these four.

Imagine you ship a small public API. Every GET /weather hits a downstream service that costs you a tenth of a cent and twenty milliseconds. With ten users sending one request each per second the cost is invisible. Then a researcher discovers your endpoint, points a Python loop at it, and starts hitting you twenty thousand times per second. In one minute they've cost you a thousand dollars; in one hour they've crashed the downstream service. Your other users see 502s.

The first instinct — “just say no after the budget runs out” — is correct in spirit and very wrong in detail. If you count requests per minute and reset the counter on each minute boundary, a clever client can hammer you for one second across the boundary and double their effective rate. If you queue every excess request and drain the queue at a fixed rate, you accidentally introduce minutes of latency on a slow day. If you keep one counter shared across ten replicas, every request becomes a network call to a coordinator and your latency budget evaporates. If you don't, each replica independently lets ten thousand requests through and the downstream service still dies.

This is the problem a rate limiter solves. Concretely: a small piece of code in front of your API that says “this client gets at most N operations per T time” and rejects the rest with HTTP 429 Too Many Requests plus a Retry-After header. The contract is simple; the implementation has surprising depth. The problem is older than the public web — packet networks in the 1980s needed exactly the same primitive to police bursty traffic onto fixed-rate links — and the four canonical algorithms (token bucket, leaky bucket, fixed window, sliding window) were all formalised before HTTP existed.

Concrete numbers anchor the design space. A 100-RPS budget over a minute is six thousand operations; storing every request timestamp gives exact accounting at the cost of six thousand floats per key. A two-bucket sliding-window counter gets you within roughly 3% accuracy at the cost of two integers per key — three orders of magnitude cheaper. A token bucket with a 100-RPS refill and a 200-token burst absorbs short spikes without ever rejecting a well-behaved client; that's the algorithm Stripe, NGINX, and AWS API Gateway all use as their default.

The simulator above lets you try all four algorithms against the same incoming traffic. Pick a request rate, pick a budget, and watch which requests get accepted versus 429'd. The boundary-burst flaw of the fixed window is visible immediately. The smooth output of the leaky bucket vs the bursty output of the token bucket is visible immediately. Once you've seen those four shapes, the rest is engineering.

Origins — token buckets from ATM cells to API gateways

From ATM cells to API gateways.

Rate limiting is older than the public web. The token bucket and leaky bucket were formalised in the 1980s for traffic shaping on packet networks, when carriers needed deterministic ways to police the bursty output of newly statistical-multiplexed lines. The leaky bucket appears in Jonathan Turner's 1986 paper New directions in communications and in ITU-T's I.371 traffic-management recommendation; the token bucket appears in the same generation of work and was later canonised by RFC 2697 (single rate three-color marker, 1999) and RFC 2698 (two rate three-color marker, 1999) for IP differentiated services. Both algorithms answered the same question: given a stream of packets and a contracted rate, which packets are conforming and which are excess?

The most disciplined formalisation is the Generic Cell Rate Algorithm, defined by the ATM Forum in 1996 as part of UNI 3.1 and 4.0 traffic-control specs. GCRA tracks a single floating-point variable — the "theoretical arrival time" of the next conforming cell — and computes the verdict per arrival in constant time and constant memory. It is provably equivalent to a leaky-bucket dual, and is the algorithm that Stripe's blog post by Paul Tarjan, Scaling your API with rate limiters (2017), still recommends as the basis for an HTTP rate limiter today. Two integers per key, microsecond decisions, no clock skew between bucket leak and request arrival because there is no separate timer.

The web inherited these primitives, then re-invented half of them. The HTTP-shaped instance is RFC 6585 (April 2012), which standardised 429 Too Many Requests alongside 428 Precondition Required and 431 Request Header Fields Too Large. The 429 status is paired with Retry-After from RFC 7231 §7.1.3 — either an integer of seconds or an HTTP-date — to tell well-behaved clients when to come back. RFC 8594 added Sunset for deprecation; the IETF draft RateLimit Header Fields for HTTP (Polli, Martinez et al, ongoing) standardises RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers that Twitter, GitHub, and Discord have shipped in slightly different shapes for years. Underneath all of that, the four algorithms in the simulator above are the same four algorithms a switch fabric ran in 1992.

Same load, four algorithms — very different curves

Same load, very different curves.

The token bucket models a reservoir that fills at rate tokens per second up to a cap of burst. Each request takes one token; if the reservoir is empty, the request is denied. Token bucket is the kind algorithm — short bursts above the steady-state rate are allowed, smoothing bursty real-world traffic without painful per-request denials. Almost every modern API gateway exposes it as the default policy. NGINX's limit_req zone=foo burst=20 nodelay is exactly this; AWS API Gateway's "burst capacity" and "steady-state rate" are exactly this; Cloudflare's per-zone rule editor surfaces the same two knobs.

The leaky bucket models a fixed-size FIFO queue drained at a constant rate. Bursts are absorbed by buffering, not by acceptance — incoming requests wait their turn, and only when the queue overflows are they rejected. The output is dead steady. This is the algorithm of choice when downstream stability matters more than perceived latency: payment processors, SMS gateways, anything that must not be hammered. Note the dual: a leaky bucket of size B draining at rate r is mathematically equivalent to a token bucket of size B filling at r, with one shaping output and the other policing input.

The fixed window counts requests per wall-clock minute (or second, or hour). It is the algorithm everyone writes first, in three lines of Redis, and abandons by week three. Memory is one integer per key. The flaw — explored in Part 03 — is that two adjacent windows can stack, doubling the effective ceiling at the boundary. Cloudflare's documentation calls this the “burst at the boundary” pitfall and shows real customer graphs with the characteristic spike.

The sliding window log stores the timestamp of every recent request in a per-key sorted structure, then counts how many fall inside the last window milliseconds. Memory is O(rate · window) per key. Accuracy is exact. Cost is real: at 1000 RPS per key with a 60-second window you store 60,000 timestamps. The cheaper relative is the sliding window counter, which keeps only the previous and current bucket counts and weights them by elapsed fraction; this gives O(1) memory with worst-case ~3% over the contract — the algorithm Cloudflare's Counting things, a lot of different things blog post (2017) describes powering its edge.

# Token bucket. r = refill rate (tokens/s); B = burst cap.
def allow(key, now):
    t, last = state.get(key, (B, now))    # full bucket on first hit
    t = min(B, t + (now - last) * r)      # lazy refill
    if t &lt; 1:
        state[key] = (t, now)
        return False, retry_after = (1 - t) / r
    state[key] = (t - 1, now)
    return True, 0

Why fixed windows silently allow 2× the limit

The fixed-window flaw, in one example.

Pick a 100-requests-per-minute fixed window. At 12:00:59.500 a client fires 100 requests in 200 ms — exactly the contract. At 12:01:00.001 the window resets and the client fires another 100. Inside a one-second wall-clock interval (12:00:59.500 to 12:01:00.500) the client got 200 requests through. The contract was 100 per minute; the client achieved the equivalent of 12,000 per minute over that one second.

This is not a hypothetical. Cloudflare's edge data shows it routinely. The fix has three forms. First, the sliding window log: store every timestamp and recompute. Exact, expensive — O(rate) memory per key. Second, the weighted sliding window counter: keep only two buckets, the current and the previous, and on lookup combine prev_count × (1 - elapsed_fraction) + curr_count. The math assumes uniform distribution within the previous bucket, so the worst case is a few percent over — but the typical case is excellent and the memory is one integer pair per key. Third, randomised jitter: instead of resetting all client windows at the same instant, shift each client's reset point by a per-key hash. Cloudflare published this technique under the name token bucket with random alignment in the same 2017 post.

A subtler version of the same flaw is the thundering herd at the next boundary. If a system rate-limits with a fixed-window minute boundary, every client whose request was rejected at second 59 retries together at second 60. The retry storm hits the next window simultaneously, exhausts the limit instantly, and forces another wave of retries against the next boundary. The classical mitigation is jitter on Retry-After: instead of returning Retry-After: 1, return Retry-After: 1 + uniform(0, 1). Marc Brooker at AWS wrote about this in his 2015 piece Exponential Backoff and Jitter; the Amazon Builder's Library “Timeouts, retries, and backoff with jitter” chapter is the canonical write-up.

If you take only one rule from this article: never set Retry-After to a fixed integer without jitter. The fixed window is a fragile contract; the synchronised retry is the load test you didn't ask for.

Distributed rate limiting — a single bucket across many machines

A single bucket across many machines.

The token bucket in this simulator lives in one process. With ten replicas behind a load balancer and round-robin, each replica sees roughly 1/10 of the traffic — so a per-replica limit of N effectively allows 10N globally. To enforce a real global limit, the bucket has to be shared. The classic shared store is Redis: atomic INCR on a key like rl:user-123:60s with a one-minute TTL gives you fixed-window globally, in constant memory per key, in well under a millisecond per call. Add a Lua script and the whole “test, decrement, log” sequence becomes a single round trip.

Stripe's 2017 post is the most-cited example: four limiters layered in series, each in Redis, each on a different key — request rate, concurrency, fleet usage, and load shedding. The clever bit is that the limiters all live on the same Redis cluster but read different keys, so the additional cost of the layered policy is one extra round trip rather than four. Envoy's ratelimit service (Lyft, open-sourced 2017) takes the same shape: a gRPC service in front of Redis, with declarative descriptors so an Envoy filter can ask “is this request allowed?” in a single RPC. AWS API Gateway, Google Cloud's quota system, and Apigee all run distributed token buckets fronted by sharded counters.

The pure-Redis approach has two failure modes worth naming. First, hot keys. A single high-traffic tenant funnels all of its rate decisions through one key on one Redis shard, and that shard becomes the bottleneck. The fix is sharded counters: pick a hash of the request, take the modulo, and route to one of N sub-buckets each holding 1/N of the budget. You lose precision (a tenant might get 0.99x or 1.01x the contract on the boundary), gain throughput, and avoid concentrating risk. Second, Redis round-trip latency. At 100 µs per call, a service that does five rate-limit checks per request adds 500 µs of pure overhead. The mitigation is local pre-allocation: each replica leases a fraction of capacity for one second, decrements locally, and only consults Redis on lease refresh. NGINX Plus and Envoy local-rate-limit filters work this way.

For very large scales there is a third option: don't use a counter at all. Instead, use an approximate streaming algorithm — a count-min sketch, perhaps backed by a HyperLogLog for cardinality. Cloudflare uses this for its bot-management heuristics, sacrificing exactness for the ability to track millions of distinct attacker fingerprints in fixed memory. The Stripe four-limiter post explicitly trades approximation for precision in three of its four layers; only the topmost one (the contract-bound RPS) uses an exact counter.

-- Lua, atomic in Redis: sliding-window-log limiter
-- KEYS[1] = key, ARGV[1] = now_ms, ARGV[2] = window_ms, ARGV[3] = limit
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)
local count = redis.call('ZCARD', KEYS[1])
if count >= limit then return 0 end
redis.call('ZADD', KEYS[1], now, now)
redis.call('PEXPIRE', KEYS[1], window)
return 1

Rate limiting dimensions — per-IP, per-user, per-tenant, all of the above

Per-IP, per-user, per-tenant, all of the above.

“100 RPS” is meaningless without a key. The interesting design question is which dimension you limit on, and the canonical answer in production is “all of them, in series.” Per-IP is blunt: it protects against single hosts but is trivially defeated by botnets and punishes shared NATs (corporate offices, mobile carriers, Tor exits). Cloudflare data shows that single residential IPs sometimes carry hundreds of legitimate users from a single ISP CGNAT pool, so a strict per-IP cap of even a few hundred RPS can break a small ISP's customers. Per-user is precise but requires authentication; it cannot help on the unauthenticated path where attacks usually live. Per-tenant is the typical B2B SaaS layer (“your account gets X RPS, distribute it among your users”). Per-endpoint protects expensive routes specifically — a 10 RPS cap on /admin/export is reasonable; the same cap on /health would break monitoring.

Production systems run several limiters in series: Cloudflare typically applies an edge-IP limit, then a tenant limit, then an endpoint limit; Stripe's post adds a concurrency limit and a fleet-usage limit on top. Each catches a different attack vector; together they cover the matrix without any single one being too aggressive. The composition is multiplicative: a request must pass all of them to be accepted, but if it fails any one it is rejected with a 429 that names which limiter denied it (typically via a custom header like X-RateLimit-Reason).

The cardinality of keys matters more than people expect. A per-IP limiter on IPv6 must cope with /64-sized address pools per residential customer; the canonical fix is to limit on a /48 or /56 prefix rather than a single /128 address. A per-API-key limiter has bounded cardinality (you have a manageable number of customers); a per-IP limiter on a public-facing service has unbounded cardinality and must use TTLs aggressively to avoid a memory leak. Redis with EXPIRE handles this; an in-process map without expiry is a slow, certain outage. Discord's 2018 engineering post on rate-limiting describes how their original Python in-memory limiter survived for two years and then ate a node's RAM in 90 seconds when a small bot army hit a forgotten endpoint.

One more pattern: the shared budget. If two endpoints are both expensive and both used by the same client class, a single bucket covering both — rather than two independent buckets — produces saner backpressure. The client cannot fan out across endpoints to defeat the limit; the limiter works on cumulative spend, not per-route. Stripe describes this as “weighted credits” where each endpoint has a cost and the bucket holds a budget in those units.

Algorithm	Memory/key	Boundary burst	Used by
Fixed window	1 int	2× on the boundary	naive Redis tutorials
Sliding window log	O(rate · window)	none	low-cardinality APIs
Sliding window counter	2 ints	~3% over	Cloudflare edge (2017)
Token bucket	2 floats (lazy refill)	tunable burst	NGINX, AWS API Gateway
Leaky bucket	queue + drain timer	none (smooth output)	payment gateways
GCRA (virtual scheduling)	1 timestamp	none	ATM, redis-cell module

Always jitter Retry-After

Every client whose request fires Retry-After: 1 retries on the same wall-clock second, exhausting the next window in microseconds. Marc Brooker's Exponential Backoff and Jitter (AWS, 2015) is the canonical fix: delay = base + uniform(0, jitter). The Amazon Builder's Library has the full treatment.

Rate limiting in production — Stripe, Cloudflare, Kong, Envoy, AWS

What real gateways actually run.

NGINX ships limit_req as a leaky-bucket policer. The directive limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s; reserves 10 MB of shared memory (about 160,000 IP slots) and sets a 10 RPS contract. The companion limit_req zone=mylimit burst=20 nodelay; turns the leaky bucket into a token-bucket equivalent by allowing a 20-request burst with no queueing delay. NGINX also has limit_conn for concurrency, often combined with the above. The zone shared-memory size is the only real tuning knob — too small and you start evicting active limiters; too large and you waste memory. NGINX's documentation gives the rule of thumb: each tracked entity needs about 64 bytes.

Envoy separates local and global rate limiting. The local_ratelimit filter is a token bucket per worker thread, configured directly in the listener; it costs nothing per request because there is no network call. The ratelimit filter consults an external gRPC service (Lyft's open-source envoyproxy/ratelimit, backed by Redis), which is the path you use when you need a fleet-wide quota. Most production deployments use both: local for cheap, global for correct. Lyft's original implementation runs in Go, talks to a Redis cluster, and reportedly handles millions of decisions per second across their fleet.

AWS API Gateway exposes “throttling” with two knobs per stage: steady-state rate and burst capacity. Underneath, this is a token bucket. The default account limit is 10,000 RPS with a 5,000-token burst; both can be raised via support ticket. The dollar cost is interesting: API Gateway charges per request whether or not it was throttled, so an unprotected gateway in front of a small Lambda function can run up bills under attack. The mitigation is a Web Application Firewall ahead of the Gateway, which can do its own rate limiting and reject without the per-request charge — the AWS WAF v2 rate-based rule is exactly this primitive.

Cloudflare exposes per-zone, per-path, and per-Worker rate limiting; the marquee feature is “Advanced Rate Limiting”, which keys on arbitrary request fields (any header, any cookie, any URL component). Internally Cloudflare runs the algorithm described in their 2017 post: a sliding-window counter with consistent-hash sharding across the edge, with eventual consistency tolerable because the windows are short. GitHub's public REST API uses a 5,000-requests-per-hour authenticated limit and 60-per-hour unauthenticated; the limit is exposed via the X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset response headers — the de-facto standard the IETF draft is now codifying.

Where rate limiters make outages worse

Where rate limiters make outages worse.

A rate limiter is a load-shedding device, and a poorly implemented one can convert a small overload into a self-amplifying outage. The thundering herd at the boundary is the most famous failure: clients denied at t = 59.9s retry at t = 60.0s, exhaust the new window in milliseconds, retry again at t = 60.001s, and so on. The system spends the entire window saturated by retries rather than by useful work. The fix is jittered backoff, ideally exponential — and ideally on the server side via a randomised Retry-After rather than relying on every client to have implemented backoff correctly. The Marc Brooker / AWS Builder's Library article on this is required reading.

The second classic failure is limiter as critical dependency. If the limiter is in the request path and Redis goes down, you have a choice: fail open (accept all traffic, including any in-progress attack) or fail closed (reject all traffic, including legitimate users). Most production systems fail open with logging, on the theory that an outage of the limiter is worse than a brief overload. But fail-open under attack is exactly the wrong default. The discipline is to put the limiter on the request path with a tight timeout (5–10 ms) and fail open only after that timeout has been exceeded twice in a row, treating the limiter outage as a separate operational event. Envoy's failure_mode_deny flag controls this behaviour; the default is fail-open, deliberately.

The third is retry amplification. A client that retries on 429 with no backoff will just hammer the limiter; a client that retries on 5xx with no backoff will hammer the upstream. Some services confuse the two: a 429 means “you, slow down”; a 503 means “we, are overloaded”. Treating 5xx as a rate-limit signal is the right behaviour for the client; but it requires the server to actually return 503 when overloaded rather than letting the request hang. Brendan Gregg's USE method is helpful here: monitor the limiter's saturation, not just its throughput, so you can see when retries are becoming a self-load.

The fourth, subtler failure is contract drift. A documented limit of “100 RPS per key” with a sliding-window-counter implementation that allows up to 103 RPS on the boundary will quietly let through 3% extra traffic forever. Most users never notice. But every now and then a customer engineers their workload right up to the contract and starts seeing intermittent denials when their actual RPS is 99. Document the algorithm, not just the rate; or pick an algorithm whose worst case matches its advertised case. GCRA is one of the few limiters whose contract and implementation are mathematically identical.