06 / 19
Playbook / 06

A distributed rate limiter

Algorithmically tiny, operationally interesting. The token-bucket math fits on a napkin; what makes this an interesting question is the engineering around the math — shaving the Redis round-trip with local pre-deduction, surviving a Redis flake without letting bad actors through, handling hot keys without overprovisioning the cluster, and the fail-open-vs-fail-closed choice that defines whether your gateway is a security tool or just a politeness layer.


1 · Clarifying questions

Where does the limiter run?API gateway, in front of every public-facing service. ~50K req/s aggregate.
What's the limit dimension?Multiple — per IP, per user, per API key, per (user, endpoint). Rules are configurable, not hard-coded.
Strict or eventually-correct?Strict on average, allow brief overshoot under burst. Internet-scale rate limiters are not transactions.
Latency budget?P99 ≤ 5 ms added by the limiter. Above that and the limiter becomes part of the latency story.
Fail-open or fail-closed?Per-rule. Anti-abuse rules fail-closed; "fairness" rules fail-open. The product owns the choice; the design exposes it.
What happens when blocked?HTTP 429 with Retry-After header and standard rate-limit headers (RFC draft 9). Consistent shape across all services.
Multi-region?Yes. Each region has its own limiter; rules are global config but counters are regional.
Burst handling?Token bucket with capacity ≥ rate. A user with rate=10/s and capacity=100 can send 100 in a burst then refill at 10/s.

2 · Capacity math, on a napkin

NumberCalculationResult
API gateway QPSgiven~50K (peak ~150K)
Active rate-limit keys1 per active user × ~10M concurrent + 100K API keys~10M keys
Bytes per key (counter + meta)~64 B
Hot working set~5% of keys at peak × 64 B~32 MB (trivial for Redis)
Per-request Redis ops1 LUA script (atomic CAS-and-decrement)1 op
Per-pod local cache hit rate~80% (sticky-ish traffic, short windows)
Effective Redis QPSpeak × (1 − 0.8)~30K Redis ops/s
Redis cluster3 shards × 3 replicas (RAM 32 GB)Headroom 100×
Concurrency, limiter150K × 0.005 s~750 in-flight

The math says one Redis cluster easily handles 50K QPS — even with no local pre-deduction. The local-cache-with-sync pattern below is for two reasons: (1) shave the 0.5–1 ms RTT from the hot path, (2) keep working when Redis is unhappy.

3 · API and data model

Limiter check (RPC, called by the gateway)

# Check + atomically deduct
RateLimit.Check(key, n=1) → { allowed, remaining, reset_at }

# A request matches multiple rules — check each in order; deny on first deny.
# Example rules for a single request:
# ip:1.2.3.4 rate=1000/min capacity=2000
# user:u_abc rate=100/min capacity=200
# user_endpoint:u_abc:POST /v1/posts rate=10/min capacity=20

# Response when blocked:
HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 12 # seconds until next token
Retry-After: 12

Storage

# Token bucket state, per (rule, key) pair
# Lives in Redis; sharded by hash(rule:key)

bucket:rule_id:key_hash → {
 tokens: FLOAT, # current token count
 last_refill: TIMESTAMP, # last refill time (ms)
 capacity: INTEGER, # max tokens
 rate: FLOAT # tokens / second
}
TTL: 2 × window # auto-expires when key goes cold

# Rule definitions live in config (etcd / consul), pushed to gateways.
rules → {
 rule_id: "ip-default",
 match: {dimension: "ip", scope: "global"},
 rate: 1000,
 per: "minute",
 capacity: 2000,
 on_fail: "fail_open" | "fail_closed",
 enabled: true
}

4 · The four algorithms, ranked

AlgorithmHow it worksProsCons
Token bucket Bucket of capacity C, refills at rate R. Each request takes 1 token. Allows bursts up to C. Cheap to compute (2 numbers). None significant. The default choice.
Leaky bucket Queue of capacity C; processes at rate R; new requests appended. Smooths traffic to a constant rate. Can't burst. Adds latency. Usually the wrong product behaviour.
Fixed window counter Increment a counter per window; reset at boundary. Trivial. Allows 2× burst at window boundary (last second of W1 + first second of W2).
Sliding window log / sliding window counter Track timestamps in window; or weight previous + current windows. Avoids the boundary burst. Log version is memory-heavy; counter version is the standard fix.

Use token bucket. It models the product behaviour most engineers actually want — "you can do up to N per second, with a burst budget of M". Sliding window counter is a fine alternative when bursts are not allowed.

Interactive · Token-bucket simulator
10.0 / 10
refill 5 / s
10
5
8
3 rps over budget
Allowed 0
Denied 0
Allow % 0%
Recent requests
Try this. Set arrivals = refill — every request fits. Now bump arrivals above refill — denials start. Big capacity means a brief burst is absorbed (the bucket starts full); a small capacity smooths to the rate immediately.

5 · The hard part — distributed atomicity

Two gateway pods see two requests for the same user simultaneously. They both read the bucket, both find 1 token, both write back 0 tokens, both allow. Counter under-counts; user gets to send 2 instead of 1.

The fix is atomic check-and-decrement. Three options:

OptionHowTrade-off
Redis Lua One EVAL script computes refill + decrement + write atomically. Single round-trip per check. Standard answer. Lua scripts run single-threaded on the Redis main event loop. ~50 µs per script.
Redis WATCH/MULTI Optimistic — read, compute, write within a transaction; retry on conflict. Two round-trips on success; more on conflict. Almost never the right answer over Lua.
Database row-level lock SELECT FOR UPDATE on a Postgres row. Works but ~10× slower than Redis. Use only if Redis is unavailable in your stack.

The Lua script

-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens/sec)
-- ARGV[3] = now (ms since epoch)
-- ARGV[4] = requested tokens (default 1)

local bucket = redis.call("HMGET", KEYS[1], "tokens", "last")
local tokens = tonumber(bucket[1]) or tonumber(ARGV[1])
local last = tonumber(bucket[2]) or tonumber(ARGV[3])

local elapsed = (tonumber(ARGV[3]) - last) / 1000.0
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * tonumber(ARGV[2]))

local need = tonumber(ARGV[4])
local allow = (tokens >= need) and 1 or 0
if allow == 1 then tokens = tokens - need end

redis.call("HMSET", KEYS[1], "tokens", tokens, "last", ARGV[3])
redis.call("EXPIRE", KEYS[1], 60)
return { allow, math.floor(tokens) }

~25 lines, runs in ~50 µs on the Redis main loop. The atomicity is "Redis processes one Lua call at a time on a single shard" — exactly what we need.

6 · The local-cache-with-sync pattern

The Lua call adds 0.5–1 ms of network RTT to every API request. At 50K QPS, that's a lot of round-trips. The optimisation: each gateway pod holds a local "lease" on a fraction of each user's tokens, and only goes to Redis when its lease runs out.

  1. On first request for user U: pod calls Redis, reserves K tokens (e.g., 10), gets back { tokens, leaseExpires }.
  2. On subsequent requests: decrement the local lease in process memory. No network call.
  3. When lease runs out or expires: ask Redis for K more.
  4. On pod shutdown: return remaining lease tokens.

Trade-off: a pod crashing loses its in-flight lease — users effectively get a small burst they didn't earn. Bound the loss by leasing small chunks (5–10 tokens) and short TTLs (1–5 s).

Why this matters. Without it, every API request hits Redis; that's a P99 floor of ~1 ms. With it, the local cache absorbs ~80% of requests at zero cost; Redis QPS drops 5×. At very high scale this is the difference between one Redis cluster and three.

7 · Hot-key sharding

Some keys are very hot — the IP rule for a NAT gateway, the API key for a popular downstream service. Single-shard hot-keying causes one Redis node to saturate while the others idle.

ApproachHowNotes
Sub-key shardingFor known-hot keys, split into N sub-keys: ip:1.2.3.4:[0..N]; each request picks one randomly.Aggregate behaviour matches; per-request limit can drift if N is too large.
Local-cache-onlyFor hot IPs, skip Redis entirely; rely on the per-pod cache with longer leases.Looser consistency but Redis-bypass when Redis can't keep up.
Tier-up to dedicated RedisHot keys get pinned to a dedicated cluster.Operational complexity; rarely worth it under ~1M QPS per key.

8 · Fail-open vs fail-closed

When the limiter itself fails (Redis down, network partition), what does the gateway do? The two answers solve different problems.

  • Fail-open (allow). Use for fairness rules — "you get 100/min". Brief overshoot is fine; refusing all traffic is worse than serving a bit too much.
  • Fail-closed (block). Use for anti-abuse rules — "users with this auth get 1/sec". A flood while the limiter is down is worse than blocking briefly.

Make this per-rule, not per-system. The right rule design specifies on_fail explicitly, not by convention.

The graceful degradation play. When Redis is unreachable, fall back to the local-cache lease (with no replenishment). For fail-open rules, eventually allow everything; for fail-closed rules, refuse once leases run out. Document this and rehearse it — most teams discover their rate-limiter behaviour under Redis failure during the actual outage.

9 · Failure modes & runbook

FailureSymptomMitigation
Redis primary downLua scripts fail; pod falls back to local cachePer-rule fail-open/closed kicks in. Failover to replica (~30 s).
Redis cluster partitionSome shards reachable, others notRules whose key hashes into reachable shards work; others use the fallback. Track per-shard error rates.
Lua script slow under loadP99 limiter latency above SLOScript optimisation (reduce HMGETs); shard further; cache rule definitions per-pod.
Hot key DOSOne Redis shard at 100% CPUSub-key sharding; tier-up the key to a dedicated cluster; tighten the offending rule.
Lease pool starvationPod's lease pool exhausted, every request goes to RedisIncrease lease size; alert on lease-refresh QPS as a leading indicator.
Rule misconfigurationWhole region returns 429Config rollback; rules deployed via canary + rollback automation.
Bypass header leakedTest/admin tokens reach prod trafficAudit; rotate credentials; add per-bypass-token logging and rate-limit-on-rate-limiter (limit how often a bypass can be used).

10 · Cost & SLOs

LineEstimateNote
Redis cluster (3 shards × 3 replicas, 32 GB)~$3K / monthSingle managed cluster per region
Limiter lib in gateway pods~0Already running on the gateway fleet
Config-distribution (etcd)~$500 / monthReused infrastructure
Total (per region)~$3.5K / monthCheapest design in the playbook

SLOs

  • P99 added latency: 5 ms. 80% local cache hit + 1 ms Redis budget for the rest.
  • Availability: 99.99%. Per-rule fail-open/closed dictates what "available" means.
  • Configuration propagation: P99 ≤ 60 s. Rule changes reach all gateways within a minute.
  • Accuracy: ≤ 5% drift over a 60-second window. Local-cache-with-sync trades exact accuracy for latency; bound the drift in the SLO.

11 · Trade-offs & "what would you change at 10×"

If…Then…
10× QPS (500K/s)Shard Redis further; bump lease sizes; consider colocating limiter state on the gateway pod (tradeoff: less accurate global view).
Strict rate limit (no overshoot)Drop the local cache. Take the Redis RTT on every request. P99 drifts to 2–3 ms.
Global single-bucket (cross-region)Coordinator service (Redis-on-one-region or Spanner-style) becomes a SPOF. The honest answer is "we don't do this; we use per-region rules and assume regions are independent".
Multi-tenant SaaS with per-tenant quotasTenant ID becomes another dimension; rules cascade (per-tenant + per-user + per-IP). Watch the rule-count explosion.
Rate-limit by request weight (cost-based)Each request has a "cost" parameter; bucket holds cost units. The Lua script's need parameter already supports this.
"What would a more senior answer add?"The control plane: dynamic rule updates from real-time abuse signals, automated rule rollback on 429-spike alarms, the integration with the security team's incident response. The static rate limiter is the design above; the adaptive one is the next layer up.

Further reading

  • Stripe — "Scaling your API with rate limiters". The canonical engineering writeup. Token bucket, fail-open semantics, the operational details.
  • Cloudflare — "How we built rate limiting capable of scaling to millions of domains". The CDN-edge view; useful for the highest-scale shape.
  • GitHub — "GitHub's Engineering Team Has Joined Slack" — actually, look up "How we built the GitHub.com rate limiter". A different shape (per-API-key) at meaningful scale.
  • Lyft Engineering — "Envoy's local rate limiting". The colocated-on-the-gateway design point — useful counterpoint.
  • Marc Brooker — "Exponential Backoff and Jitter". The client-side companion to a rate limiter. Without jitter, retries synchronise and re-saturate.
  • RFC draft 9 — "RateLimit Header Fields for HTTP". The standard headers; use them so clients can be polite back.
  • Adjacent: API gateway. Where the limiter actually lives.
  • Adjacent: Redis. The substrate.
Found this useful?