06 / 19

Playbook / 06

A distributed rate limiter

Algorithmically tiny, operationally interesting. The token-bucket math fits on a napkin; what makes this an interesting question is the engineering around the math — shaving the Redis round-trip with local pre-deduction, surviving a Redis flake without letting bad actors through, handling hot keys without overprovisioning the cluster, and the fail-open-vs-fail-closed choice that defines whether your gateway is a security tool or just a politeness layer.

1 · Clarifying questions

Where does the limiter run?	API gateway, in front of every public-facing service. ~50K req/s aggregate.
What's the limit dimension?	Multiple — per IP, per user, per API key, per (user, endpoint). Rules are configurable, not hard-coded.
Strict or eventually-correct?	Strict on average, allow brief overshoot under burst. Internet-scale rate limiters are not transactions.
Latency budget?	P99 ≤ 5 ms added by the limiter. Above that and the limiter becomes part of the latency story.
Fail-open or fail-closed?	Per-rule. Anti-abuse rules fail-closed; "fairness" rules fail-open. The product owns the choice; the design exposes it.
What happens when blocked?	HTTP 429 with `Retry-After` header and standard rate-limit headers (RFC draft 9). Consistent shape across all services.
Multi-region?	Yes. Each region has its own limiter; rules are global config but counters are regional.
Burst handling?	Token bucket with capacity ≥ rate. A user with rate=10/s and capacity=100 can send 100 in a burst then refill at 10/s.

2 · Capacity math, on a napkin

Number	Calculation	Result
API gateway QPS	given	~50K (peak ~150K)
Active rate-limit keys	1 per active user × ~10M concurrent + 100K API keys	~10M keys
Bytes per key (counter + meta)	~64 B	—
Hot working set	~5% of keys at peak × 64 B	~32 MB (trivial for Redis)
Per-request Redis ops	1 LUA script (atomic CAS-and-decrement)	1 op
Per-pod local cache hit rate	~80% (sticky-ish traffic, short windows)	—
Effective Redis QPS	peak × (1 − 0.8)	~30K Redis ops/s
Redis cluster	3 shards × 3 replicas (RAM 32 GB)	Headroom 100×
Concurrency, limiter	150K × 0.005 s	~750 in-flight

The math says one Redis cluster easily handles 50K QPS — even with no local pre-deduction. The local-cache-with-sync pattern below is for two reasons: (1) shave the 0.5–1 ms RTT from the hot path, (2) keep working when Redis is unhappy.

3 · API and data model

Limiter check (RPC, called by the gateway)

# Check + atomically deduct
RateLimit.Check(key, n=1) → { allowed, remaining, reset_at }

# A request matches multiple rules — check each in order; deny on first deny.
# Example rules for a single request:
# ip:1.2.3.4 rate=1000/min capacity=2000
# user:u_abc rate=100/min capacity=200
# user_endpoint:u_abc:POST /v1/posts rate=10/min capacity=20

# Response when blocked:
HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 12 # seconds until next token
Retry-After: 12

Storage

# Token bucket state, per (rule, key) pair
# Lives in Redis; sharded by hash(rule:key)

bucket:rule_id:key_hash → {
 tokens: FLOAT, # current token count
 last_refill: TIMESTAMP, # last refill time (ms)
 capacity: INTEGER, # max tokens
 rate: FLOAT # tokens / second
}
TTL: 2 × window # auto-expires when key goes cold

# Rule definitions live in config (etcd / consul), pushed to gateways.
rules → {
 rule_id: "ip-default",
 match: {dimension: "ip", scope: "global"},
 rate: 1000,
 per: "minute",
 capacity: 2000,
 on_fail: "fail_open" | "fail_closed",
 enabled: true
}

4 · The four algorithms, ranked

Algorithm	How it works	Pros	Cons
Token bucket	Bucket of capacity C, refills at rate R. Each request takes 1 token.	Allows bursts up to C. Cheap to compute (2 numbers).	None significant. The default choice.
Leaky bucket	Queue of capacity C; processes at rate R; new requests appended.	Smooths traffic to a constant rate.	Can't burst. Adds latency. Usually the wrong product behaviour.
Fixed window counter	Increment a counter per window; reset at boundary.	Trivial.	Allows 2× burst at window boundary (last second of W1 + first second of W2).
Sliding window log / sliding window counter	Track timestamps in window; or weight previous + current windows.	Avoids the boundary burst.	Log version is memory-heavy; counter version is the standard fix.

Use token bucket. It models the product behaviour most engineers actually want — "you can do up to N per second, with a burst budget of M". Sliding window counter is a fine alternative when bursts are not allowed.

Interactive · Token-bucket simulator

10.0 / 10

↓ refill 5 / s

Capacity 10

Refill rate / s 5

Arrivals / s 8

3 rps over budget

Allowed 0

Denied 0

Allow % 0%

Recent requests

Try this. Set arrivals = refill — every request fits. Now bump arrivals above refill — denials start. Big capacity means a brief burst is absorbed (the bucket starts full); a small capacity smooths to the rate immediately.

5 · The hard part — distributed atomicity

Two gateway pods see two requests for the same user simultaneously. They both read the bucket, both find 1 token, both write back 0 tokens, both allow. Counter under-counts; user gets to send 2 instead of 1.

The fix is atomic check-and-decrement. Three options:

Option	How	Trade-off
Redis Lua	One `EVAL` script computes refill + decrement + write atomically. Single round-trip per check.	Standard answer. Lua scripts run single-threaded on the Redis main event loop. ~50 µs per script.
Redis WATCH/MULTI	Optimistic — read, compute, write within a transaction; retry on conflict.	Two round-trips on success; more on conflict. Almost never the right answer over Lua.
Database row-level lock	SELECT FOR UPDATE on a Postgres row.	Works but ~10× slower than Redis. Use only if Redis is unavailable in your stack.

The Lua script

-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens/sec)
-- ARGV[3] = now (ms since epoch)
-- ARGV[4] = requested tokens (default 1)

local bucket = redis.call("HMGET", KEYS[1], "tokens", "last")
local tokens = tonumber(bucket[1]) or tonumber(ARGV[1])
local last = tonumber(bucket[2]) or tonumber(ARGV[3])

local elapsed = (tonumber(ARGV[3]) - last) / 1000.0
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))

local need = tonumber(ARGV[4])
local allow = (tokens >= need) and 1 or 0
if allow == 1 then tokens = tokens - need end

redis.call("HMSET", KEYS[1], "tokens", tokens, "last", ARGV[3])
redis.call("EXPIRE", KEYS[1], 60)
return { allow, math.floor(tokens) }

~25 lines, runs in ~50 µs on the Redis main loop. The atomicity is "Redis processes one Lua call at a time on a single shard" — exactly what we need.

6 · The local-cache-with-sync pattern

The Lua call adds 0.5–1 ms of network RTT to every API request. At 50K QPS, that's a lot of round-trips. The optimisation: each gateway pod holds a local "lease" on a fraction of each user's tokens, and only goes to Redis when its lease runs out.

On first request for user U: pod calls Redis, reserves K tokens (e.g., 10), gets back { tokens, leaseExpires }.
On subsequent requests: decrement the local lease in process memory. No network call.
When lease runs out or expires: ask Redis for K more.
On pod shutdown: return remaining lease tokens.

Trade-off: a pod crashing loses its in-flight lease — users effectively get a small burst they didn't earn. Bound the loss by leasing small chunks (5–10 tokens) and short TTLs (1–5 s).

Why this matters. Without it, every API request hits Redis; that's a P99 floor of ~1 ms. With it, the local cache absorbs ~80% of requests at zero cost; Redis QPS drops 5×. At very high scale this is the difference between one Redis cluster and three.

7 · Hot-key sharding

Some keys are very hot — the IP rule for a NAT gateway, the API key for a popular downstream service. Single-shard hot-keying causes one Redis node to saturate while the others idle.

Approach	How	Notes
Sub-key sharding	For known-hot keys, split into N sub-keys: `ip:1.2.3.4:[0..N]`; each request picks one randomly.	Aggregate behaviour matches; per-request limit can drift if N is too large.
Local-cache-only	For hot IPs, skip Redis entirely; rely on the per-pod cache with longer leases.	Looser consistency but Redis-bypass when Redis can't keep up.
Tier-up to dedicated Redis	Hot keys get pinned to a dedicated cluster.	Operational complexity; rarely worth it under ~1M QPS per key.

8 · Fail-open vs fail-closed

When the limiter itself fails (Redis down, network partition), what does the gateway do? The two answers solve different problems.

Fail-open (allow). Use for fairness rules — "you get 100/min". Brief overshoot is fine; refusing all traffic is worse than serving a bit too much.
Fail-closed (block). Use for anti-abuse rules — "users with this auth get 1/sec". A flood while the limiter is down is worse than blocking briefly.

Make this per-rule, not per-system. The right rule design specifies on_fail explicitly, not by convention.

The graceful degradation play. When Redis is unreachable, fall back to the local-cache lease (with no replenishment). For fail-open rules, eventually allow everything; for fail-closed rules, refuse once leases run out. Document this and rehearse it — most teams discover their rate-limiter behaviour under Redis failure during the actual outage.

9 · Failure modes & runbook

Failure	Symptom	Mitigation
Redis primary down	Lua scripts fail; pod falls back to local cache	Per-rule fail-open/closed kicks in. Failover to replica (~30 s).
Redis cluster partition	Some shards reachable, others not	Rules whose key hashes into reachable shards work; others use the fallback. Track per-shard error rates.
Lua script slow under load	P99 limiter latency above SLO	Script optimisation (reduce HMGETs); shard further; cache rule definitions per-pod.
Hot key DOS	One Redis shard at 100% CPU	Sub-key sharding; tier-up the key to a dedicated cluster; tighten the offending rule.
Lease pool starvation	Pod's lease pool exhausted, every request goes to Redis	Increase lease size; alert on lease-refresh QPS as a leading indicator.
Rule misconfiguration	Whole region returns 429	Config rollback; rules deployed via canary + rollback automation.
Bypass header leaked	Test/admin tokens reach prod traffic	Audit; rotate credentials; add per-bypass-token logging and rate-limit-on-rate-limiter (limit how often a bypass can be used).

10 · Cost & SLOs

Line	Estimate	Note
Redis cluster (3 shards × 3 replicas, 32 GB)	~$3K / month	Single managed cluster per region
Limiter lib in gateway pods	~0	Already running on the gateway fleet
Config-distribution (etcd)	~$500 / month	Reused infrastructure
Total (per region)	~$3.5K / month	Cheapest design in the playbook

SLOs

P99 added latency: 5 ms. 80% local cache hit + 1 ms Redis budget for the rest.
Availability: 99.99%. Per-rule fail-open/closed dictates what "available" means.
Configuration propagation: P99 ≤ 60 s. Rule changes reach all gateways within a minute.
Accuracy: ≤ 5% drift over a 60-second window. Local-cache-with-sync trades exact accuracy for latency; bound the drift in the SLO.

11 · Trade-offs & "what would you change at 10×"

If…	Then…
10× QPS (500K/s)	Shard Redis further; bump lease sizes; consider colocating limiter state on the gateway pod (tradeoff: less accurate global view).
Strict rate limit (no overshoot)	Drop the local cache. Take the Redis RTT on every request. P99 drifts to 2–3 ms.
Global single-bucket (cross-region)	Coordinator service (Redis-on-one-region or Spanner-style) becomes a SPOF. The honest answer is "we don't do this; we use per-region rules and assume regions are independent".
Multi-tenant SaaS with per-tenant quotas	Tenant ID becomes another dimension; rules cascade (per-tenant + per-user + per-IP). Watch the rule-count explosion.
Rate-limit by request weight (cost-based)	Each request has a "cost" parameter; bucket holds cost units. The Lua script's `need` parameter already supports this.
"What would a more senior answer add?"	The control plane: dynamic rule updates from real-time abuse signals, automated rule rollback on 429-spike alarms, the integration with the security team's incident response. The static rate limiter is the design above; the adaptive one is the next layer up.

A distributed rate limiter

1 · Clarifying questions

2 · Capacity math, on a napkin

3 · API and data model

Limiter check (RPC, called by the gateway)

Storage

4 · The four algorithms, ranked

5 · The hard part — distributed atomicity

The Lua script

6 · The local-cache-with-sync pattern

7 · Hot-key sharding

8 · Fail-open vs fail-closed

9 · Failure modes & runbook

10 · Cost & SLOs

SLOs

11 · Trade-offs & "what would you change at 10×"

Further reading

Notification system