Delivering a reliable rate-limiting service is key to protecting backend systems and enforcing usage policies. Below is a straightforward, production-ready design that balances performance, accuracy, and operational simplicity.
1. Goals & Requirements
Functional
Enforce per-key quotas (e.g. per user or API client)
Allow controlled bursts and steady rates
Support dynamic updates to rate rules
Non-Functional
Handle millions of requests per second
Keep p99 latency under 50 ms
Scale horizontally across availability zones
Survive node- and zone-level failures
Maintain 60–70% average utilization, with autoscaling
2. High-Level Architecture
Load Balancer
Routes each client to a consistent limiter node, enabling cache locality.
Rate Limiter Nodes
In-memory cache of rate rules, updated via Pub/Sub.
On each request: fetch rule → invoke Redis Lua script → return allow or reject.
Emit metrics (latency, rejects, cache hits).
Redis Cluster
Sharded by key-hash (e.g. 128 shards × 3 replicas).
Stores minimal per-key state (token counts or window counters).
Scripts perform refill, check, and consume in one atomic call.
Pub/Sub Stream
Publishes rule changes (limit, burst size).
Nodes subscribe and refresh cache entries by version or timestamp.
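A minimal sketch of the rule-refresh path on a limiter node, using the redis-py Pub/Sub API. The channel name rate-limit:rules and the {key, limit, burst, version} payload shape are illustrative assumptions, not part of the design above.

```python
import json
import threading

import redis  # redis-py client; a reachable Redis endpoint is assumed

# In-memory rule cache keyed by rule key (e.g. "user:123"). The payload fields
# below are illustrative, not a fixed schema.
RULE_CACHE: dict[str, dict] = {}

def watch_rule_updates(r: redis.Redis, channel: str = "rate-limit:rules") -> None:
    """Subscribe to rule-change events and refresh newer cache entries."""
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(channel)
    for message in pubsub.listen():
        rule = json.loads(message["data"])
        cached = RULE_CACHE.get(rule["key"])
        # Apply only if newer than what is cached, so a stale or replayed
        # message never overwrites a fresher rule.
        if cached is None or rule["version"] > cached["version"]:
            RULE_CACHE[rule["key"]] = rule

def start_rule_watcher(r: redis.Redis) -> None:
    """Run the subscriber in a daemon thread so request handling is not blocked."""
    threading.Thread(target=watch_rule_updates, args=(r,), daemon=True).start()
```

Refreshing by version (or timestamp) keeps the cache convergent even when Pub/Sub delivers messages out of order.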
3. Counter Storage & Sharding
Raw footprint:
1 billion keys × 1 KB each = 1 × 10¹² bytes ≈ 1 TB
Optimized:
Pack key (8 B) + counter (8 B) → 16 B/entry → ~16 GB raw
Replication: 3× → ~48 GB per region
Shard plan:
128 logical shards → ~8 million keys/shard → ~125 MB/shard raw (~375 MB with 3× replication)
Deploy ~4 shards per node → ~32 nodes total
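The sizing above follows from a few lines of arithmetic; the constants are the ones stated in this section.

```python
# Constants as stated above: 1 B keys, 1 KB raw vs. 16 B packed entries,
# 3x replication, 128 shards, 4 shards per node.
KEYS = 1_000_000_000
RAW_ENTRY_BYTES = 1_024
PACKED_ENTRY_BYTES = 16          # 8 B key hash + 8 B counter/token state
REPLICATION = 3
SHARDS = 128
SHARDS_PER_NODE = 4

raw_total = KEYS * RAW_ENTRY_BYTES                 # ~1e12 B, about 1 TB
packed_total = KEYS * PACKED_ENTRY_BYTES           # 16 GB raw
replicated_total = packed_total * REPLICATION      # 48 GB per region
keys_per_shard = KEYS // SHARDS                    # ~7.8 million keys/shard
shard_bytes = packed_total // SHARDS               # ~125 MB raw per shard
nodes = SHARDS // SHARDS_PER_NODE                  # 32 nodes

print(f"{raw_total / 1e12:.1f} TB raw, {packed_total / 1e9:.0f} GB packed, "
      f"{replicated_total / 1e9:.0f} GB replicated, "
      f"{keys_per_shard / 1e6:.1f} M keys and {shard_bytes / 1e6:.0f} MB per shard, "
      f"{nodes} nodes")
```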
4. Core Algorithms
| Algorithm | Per-key state | Burst support | Accuracy / behavior |
| --- | --- | --- | --- |
| Fixed-Window Counter | O(1) per window | No | ±window size error |
| Sliding-Window Log | O(R) timestamps | Yes | Exact |
| Sliding-Window Counter | O(1) | Approximate | Bounded error |
| Token-Bucket | O(1) (2 fields)| Yes (≤ bucket size)| Smooth |
| Leaky-Bucket | Queue length | No (queue delays) | Depends on queue |
Recommended: Token-Bucket or Sliding-Window Counter with a single Redis Lua script for atomic refill, check, and consume.
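For illustration, here is a minimal in-process sketch of the recommended token-bucket logic: refill by elapsed time, check, and consume in a single step. The distributed version runs the same steps inside a Redis Lua script, sketched in the next section; the rate and burst numbers below are arbitrary.

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Single-process token bucket: refill, check, and consume in one step."""
    rate: float            # tokens added per second (steady rate)
    burst: float           # bucket capacity (maximum burst)
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        if self.last_refill == 0.0:
            self.tokens = self.burst                 # first request: start full
        else:
            elapsed = now - self.last_refill
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: 100 requests/second steady rate with bursts of up to 200.
bucket = TokenBucket(rate=100.0, burst=200.0)
assert bucket.allow()
```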
5. Atomic Request Flow
Look up the rule in the local cache.
Redis Lua script:
Refill tokens or rotate counters based on timestamp
Compare against limit
Decrement and return result in one call
Response: HTTP 200 (allow) or 429 (reject) with headers:
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
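A sketch of this flow with the token-bucket state held in a per-key Redis hash and refill, check, and consume performed in one script call. The key layout, the field names ("tokens", "ts"), and the idle-key expiry are illustrative choices; the timestamp is passed in by the caller to keep the script deterministic.

```python
import time

import redis  # redis-py

# Atomic refill + check + consume. Rate and burst come from the node's
# local rule cache and are passed in as ARGV.
TOKEN_BUCKET_LUA = """
local key   = KEYS[1]
local rate  = tonumber(ARGV[1])   -- tokens per second
local burst = tonumber(ARGV[2])   -- bucket capacity
local now   = tonumber(ARGV[3])   -- caller-supplied timestamp (seconds)
local cost  = tonumber(ARGV[4])

local state  = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1]) or burst
local ts     = tonumber(state[2]) or now

tokens = math.min(burst, tokens + (now - ts) * rate)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, math.ceil(burst / rate) * 2)  -- drop long-idle keys
return {allowed, tokens}
"""

def make_limiter(r: redis.Redis):
    """Build a check function; the script is registered once and reused."""
    script = r.register_script(TOKEN_BUCKET_LUA)
    def check(key: str, rate: float, burst: float, cost: float = 1.0):
        """Returns (allowed, remaining_tokens) for one request."""
        allowed, remaining = script(keys=[key], args=[rate, burst, time.time(), cost])
        return bool(allowed), int(remaining)
    return check
```

The remaining count returned by the script, together with the rule's limit and next refill time, maps directly onto the X-RateLimit-* response headers.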
6. Availability & Scaling
Resilient data store: Redis Cluster with automatic replica promotion, or DynamoDB global tables, for minimal-downtime failover.
Multi-AZ replicas: each shard spans ≥ 3 zones.
Client retries: exponential backoff plus a configurable “fail-open” mode for noncritical paths (see the sketch after this list).
Autoscaling: trigger on p99 latency or reject-rate thresholds.
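One way the retry and fail-open behavior could look from the caller's side, sketched against the redis-py exception types. Here `check` stands for any function that returns the allow/deny decision (for example the limiter built in section 5); the retry count and base delay are arbitrary illustrative values.

```python
import random
import time

import redis  # used only for its exception types; the checker itself is injected

def allow_with_fail_open(check, key: str, retries: int = 3,
                         base_delay: float = 0.05, fail_open: bool = True) -> bool:
    """Call check(key) with exponential backoff; if the limiter stays
    unreachable, fail open (allow) on noncritical paths, else fail closed."""
    for attempt in range(retries):
        try:
            return check(key)
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            # Backoff with jitter: ~50 ms, ~100 ms, ~200 ms, ...
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
    return fail_open
```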
7. Observability & Alerts
Key metrics:
Reject rate (%)
p50/p95/p99 latency
Cache hit ratio
Pub/Sub lag for rule updates
Alerts:
Reject rate > 1% sustained for 5 minutes
p99 latency > 50 ms
Pub/Sub lag > 1 s
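As an illustration, each limiter node could expose these metrics with the Python prometheus_client library. The metric names and scrape port are assumptions, and the alert rules themselves (reject rate, p99 latency, Pub/Sub lag thresholds) would be evaluated in the monitoring stack rather than on the node.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative metric names; alert thresholds from above (reject rate > 1% for
# 5 min, p99 latency > 50 ms, Pub/Sub lag > 1 s) live in the monitoring system.
DECISIONS = Counter("ratelimit_decisions_total", "Allow/reject decisions", ["outcome"])
LATENCY = Histogram("ratelimit_decision_seconds", "End-to-end decision latency",
                    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1])
CACHE_LOOKUPS = Counter("ratelimit_rule_cache_lookups_total", "Rule cache lookups", ["result"])
RULE_LAG = Gauge("ratelimit_rule_stream_lag_seconds", "Age of the newest applied rule update")

def record_decision(allowed: bool, latency_seconds: float, cache_hit: bool) -> None:
    """Record one allow/reject decision and its latency and cache outcome."""
    DECISIONS.labels(outcome="allow" if allowed else "reject").inc()
    LATENCY.observe(latency_seconds)
    CACHE_LOOKUPS.labels(result="hit" if cache_hit else "miss").inc()

# Expose /metrics for scraping (the port is an arbitrary choice).
# start_http_server(9102)
```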



