A distributed rate limiter
Algorithmically tiny, operationally interesting. The token-bucket math fits on a napkin; what makes this an interesting question is the engineering around the math — shaving the Redis round-trip with local pre-deduction, surviving a Redis flake without letting bad actors through, handling hot keys without overprovisioning the cluster, and the fail-open-vs-fail-closed choice that defines whether your gateway is a security tool or just a politeness layer.
1 · Clarifying questions
| Where does the limiter run? | API gateway, in front of every public-facing service. ~50K req/s aggregate. |
| What's the limit dimension? | Multiple — per IP, per user, per API key, per (user, endpoint). Rules are configurable, not hard-coded. |
| Strict or eventually-correct? | Strict on average, allow brief overshoot under burst. Internet-scale rate limiters are not transactions. |
| Latency budget? | P99 ≤ 5 ms added by the limiter. Above that and the limiter becomes part of the latency story. |
| Fail-open or fail-closed? | Per-rule. Anti-abuse rules fail-closed; "fairness" rules fail-open. The product owns the choice; the design exposes it. |
| What happens when blocked? | HTTP 429 with Retry-After header and standard rate-limit headers (RFC draft 9). Consistent shape across all services. |
| Multi-region? | Yes. Each region has its own limiter; rules are global config but counters are regional. |
| Burst handling? | Token bucket with capacity ≥ rate. A user with rate=10/s and capacity=100 can send 100 in a burst then refill at 10/s. |
2 · Capacity math, on a napkin
| Number | Calculation | Result |
|---|---|---|
| API gateway QPS | given | ~50K (peak ~150K) |
| Active rate-limit keys | 1 per active user × ~10M concurrent + 100K API keys | ~10M keys |
| Bytes per key (counter + meta) | ~64 B | — |
| Hot working set | ~5% of keys at peak × 64 B | ~32 MB (trivial for Redis) |
| Per-request Redis ops | 1 LUA script (atomic CAS-and-decrement) | 1 op |
| Per-pod local cache hit rate | ~80% (sticky-ish traffic, short windows) | — |
| Effective Redis QPS | peak × (1 − 0.8) | ~30K Redis ops/s |
| Redis cluster | 3 shards × 3 replicas (RAM 32 GB) | Headroom 100× |
| Concurrency, limiter | 150K × 0.005 s | ~750 in-flight |
The math says one Redis cluster easily handles 50K QPS — even with no local pre-deduction. The local-cache-with-sync pattern below is for two reasons: (1) shave the 0.5–1 ms RTT from the hot path, (2) keep working when Redis is unhappy.
3 · API and data model
Limiter check (RPC, called by the gateway)
# Check + atomically deduct
RateLimit.Check(key, n=1) → { allowed, remaining, reset_at }
# A request matches multiple rules — check each in order; deny on first deny.
# Example rules for a single request:
# ip:1.2.3.4 rate=1000/min capacity=2000
# user:u_abc rate=100/min capacity=200
# user_endpoint:u_abc:POST /v1/posts rate=10/min capacity=20
# Response when blocked:
HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 12 # seconds until next token
Retry-After: 12Storage
# Token bucket state, per (rule, key) pair
# Lives in Redis; sharded by hash(rule:key)
bucket:rule_id:key_hash → {
tokens: FLOAT, # current token count
last_refill: TIMESTAMP, # last refill time (ms)
capacity: INTEGER, # max tokens
rate: FLOAT # tokens / second
}
TTL: 2 × window # auto-expires when key goes cold
# Rule definitions live in config (etcd / consul), pushed to gateways.
rules → {
rule_id: "ip-default",
match: {dimension: "ip", scope: "global"},
rate: 1000,
per: "minute",
capacity: 2000,
on_fail: "fail_open" | "fail_closed",
enabled: true
}4 · The four algorithms, ranked
| Algorithm | How it works | Pros | Cons |
|---|---|---|---|
| Token bucket | Bucket of capacity C, refills at rate R. Each request takes 1 token. | Allows bursts up to C. Cheap to compute (2 numbers). | None significant. The default choice. |
| Leaky bucket | Queue of capacity C; processes at rate R; new requests appended. | Smooths traffic to a constant rate. | Can't burst. Adds latency. Usually the wrong product behaviour. |
| Fixed window counter | Increment a counter per window; reset at boundary. | Trivial. | Allows 2× burst at window boundary (last second of W1 + first second of W2). |
| Sliding window log / sliding window counter | Track timestamps in window; or weight previous + current windows. | Avoids the boundary burst. | Log version is memory-heavy; counter version is the standard fix. |
Use token bucket. It models the product behaviour most engineers actually want — "you can do up to N per second, with a burst budget of M". Sliding window counter is a fine alternative when bursts are not allowed.
5 · The hard part — distributed atomicity
Two gateway pods see two requests for the same user simultaneously. They both read the bucket, both find 1 token, both write back 0 tokens, both allow. Counter under-counts; user gets to send 2 instead of 1.
The fix is atomic check-and-decrement. Three options:
| Option | How | Trade-off |
|---|---|---|
| Redis Lua | One EVAL script computes refill + decrement + write atomically. Single round-trip per check. | Standard answer. Lua scripts run single-threaded on the Redis main event loop. ~50 µs per script. |
| Redis WATCH/MULTI | Optimistic — read, compute, write within a transaction; retry on conflict. | Two round-trips on success; more on conflict. Almost never the right answer over Lua. |
| Database row-level lock | SELECT FOR UPDATE on a Postgres row. | Works but ~10× slower than Redis. Use only if Redis is unavailable in your stack. |
The Lua script
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens/sec)
-- ARGV[3] = now (ms since epoch)
-- ARGV[4] = requested tokens (default 1)
local bucket = redis.call("HMGET", KEYS[1], "tokens", "last")
local tokens = tonumber(bucket[1]) or tonumber(ARGV[1])
local last = tonumber(bucket[2]) or tonumber(ARGV[3])
local elapsed = (tonumber(ARGV[3]) - last) / 1000.0
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))
local need = tonumber(ARGV[4])
local allow = (tokens >= need) and 1 or 0
if allow == 1 then tokens = tokens - need end
redis.call("HMSET", KEYS[1], "tokens", tokens, "last", ARGV[3])
redis.call("EXPIRE", KEYS[1], 60)
return { allow, math.floor(tokens) }~25 lines, runs in ~50 µs on the Redis main loop. The atomicity is "Redis processes one Lua call at a time on a single shard" — exactly what we need.
6 · The local-cache-with-sync pattern
The Lua call adds 0.5–1 ms of network RTT to every API request. At 50K QPS, that's a lot of round-trips. The optimisation: each gateway pod holds a local "lease" on a fraction of each user's tokens, and only goes to Redis when its lease runs out.
- On first request for user U: pod calls Redis, reserves K tokens (e.g., 10), gets back
{ tokens, leaseExpires }. - On subsequent requests: decrement the local lease in process memory. No network call.
- When lease runs out or expires: ask Redis for K more.
- On pod shutdown: return remaining lease tokens.
Trade-off: a pod crashing loses its in-flight lease — users effectively get a small burst they didn't earn. Bound the loss by leasing small chunks (5–10 tokens) and short TTLs (1–5 s).
7 · Hot-key sharding
Some keys are very hot — the IP rule for a NAT gateway, the API key for a popular downstream service. Single-shard hot-keying causes one Redis node to saturate while the others idle.
| Approach | How | Notes |
|---|---|---|
| Sub-key sharding | For known-hot keys, split into N sub-keys: ip:1.2.3.4:[0..N]; each request picks one randomly. | Aggregate behaviour matches; per-request limit can drift if N is too large. |
| Local-cache-only | For hot IPs, skip Redis entirely; rely on the per-pod cache with longer leases. | Looser consistency but Redis-bypass when Redis can't keep up. |
| Tier-up to dedicated Redis | Hot keys get pinned to a dedicated cluster. | Operational complexity; rarely worth it under ~1M QPS per key. |
8 · Fail-open vs fail-closed
When the limiter itself fails (Redis down, network partition), what does the gateway do? The two answers solve different problems.
- Fail-open (allow). Use for fairness rules — "you get 100/min". Brief overshoot is fine; refusing all traffic is worse than serving a bit too much.
- Fail-closed (block). Use for anti-abuse rules — "users with this auth get 1/sec". A flood while the limiter is down is worse than blocking briefly.
Make this per-rule, not per-system. The right rule design specifies on_fail
explicitly, not by convention.
9 · Failure modes & runbook
| Failure | Symptom | Mitigation |
|---|---|---|
| Redis primary down | Lua scripts fail; pod falls back to local cache | Per-rule fail-open/closed kicks in. Failover to replica (~30 s). |
| Redis cluster partition | Some shards reachable, others not | Rules whose key hashes into reachable shards work; others use the fallback. Track per-shard error rates. |
| Lua script slow under load | P99 limiter latency above SLO | Script optimisation (reduce HMGETs); shard further; cache rule definitions per-pod. |
| Hot key DOS | One Redis shard at 100% CPU | Sub-key sharding; tier-up the key to a dedicated cluster; tighten the offending rule. |
| Lease pool starvation | Pod's lease pool exhausted, every request goes to Redis | Increase lease size; alert on lease-refresh QPS as a leading indicator. |
| Rule misconfiguration | Whole region returns 429 | Config rollback; rules deployed via canary + rollback automation. |
| Bypass header leaked | Test/admin tokens reach prod traffic | Audit; rotate credentials; add per-bypass-token logging and rate-limit-on-rate-limiter (limit how often a bypass can be used). |
10 · Cost & SLOs
| Line | Estimate | Note |
|---|---|---|
| Redis cluster (3 shards × 3 replicas, 32 GB) | ~$3K / month | Single managed cluster per region |
| Limiter lib in gateway pods | ~0 | Already running on the gateway fleet |
| Config-distribution (etcd) | ~$500 / month | Reused infrastructure |
| Total (per region) | ~$3.5K / month | Cheapest design in the playbook |
SLOs
- P99 added latency: 5 ms. 80% local cache hit + 1 ms Redis budget for the rest.
- Availability: 99.99%. Per-rule fail-open/closed dictates what "available" means.
- Configuration propagation: P99 ≤ 60 s. Rule changes reach all gateways within a minute.
- Accuracy: ≤ 5% drift over a 60-second window. Local-cache-with-sync trades exact accuracy for latency; bound the drift in the SLO.
11 · Trade-offs & "what would you change at 10×"
| If… | Then… |
|---|---|
| 10× QPS (500K/s) | Shard Redis further; bump lease sizes; consider colocating limiter state on the gateway pod (tradeoff: less accurate global view). |
| Strict rate limit (no overshoot) | Drop the local cache. Take the Redis RTT on every request. P99 drifts to 2–3 ms. |
| Global single-bucket (cross-region) | Coordinator service (Redis-on-one-region or Spanner-style) becomes a SPOF. The honest answer is "we don't do this; we use per-region rules and assume regions are independent". |
| Multi-tenant SaaS with per-tenant quotas | Tenant ID becomes another dimension; rules cascade (per-tenant + per-user + per-IP). Watch the rule-count explosion. |
| Rate-limit by request weight (cost-based) | Each request has a "cost" parameter; bucket holds cost units. The Lua script's need parameter already supports this. |
| "What would a more senior answer add?" | The control plane: dynamic rule updates from real-time abuse signals, automated rule rollback on 429-spike alarms, the integration with the security team's incident response. The static rate limiter is the design above; the adaptive one is the next layer up. |
Further reading
- Stripe — "Scaling your API with rate limiters". The canonical engineering writeup. Token bucket, fail-open semantics, the operational details.
- Cloudflare — "How we built rate limiting capable of scaling to millions of domains". The CDN-edge view; useful for the highest-scale shape.
- GitHub — "GitHub's Engineering Team Has Joined Slack" — actually, look up "How we built the GitHub.com rate limiter". A different shape (per-API-key) at meaningful scale.
- Lyft Engineering — "Envoy's local rate limiting". The colocated-on-the-gateway design point — useful counterpoint.
- Marc Brooker — "Exponential Backoff and Jitter". The client-side companion to a rate limiter. Without jitter, retries synchronise and re-saturate.
- RFC draft 9 — "RateLimit Header Fields for HTTP". The standard headers; use them so clients can be polite back.
- Adjacent: API gateway. Where the limiter actually lives.
- Adjacent: Redis. The substrate.