14. Rate Limiting
Rate limiting controls the number of requests a client can make to a service within a given time window. It protects services from abuse, ensures fair usage, and prevents cascading failures.
Why Rate Limiting?
| Concern | How Rate Limiting Helps |
|---|---|
| DDoS protection | Limits damage from malicious traffic floods |
| Resource protection | Prevents one client from monopolizing resources |
| Cost control | Limits expensive API calls (e.g., third-party APIs) |
| Fair usage | Ensures all users get reasonable access |
| Stability | Prevents cascading failures from traffic spikes |
| Compliance | Enforces SLA-based usage tiers |
Rate Limiting Algorithms
1. Token Bucket
A "bucket" holds tokens; each request consumes a token. Tokens are added at a fixed rate.
Bucket capacity: 10 tokens
Refill rate: 2 tokens/second
Time 0: [■ ■ ■ ■ ■ ■ ■ ■ ■ ■] 10 tokens — bucket full
5 requests arrive: [■ ■ ■ ■ ■ □ □ □ □ □] 5 tokens remain
1 second passes: [■ ■ ■ ■ ■ ■ ■ □ □ □] 7 tokens (refilled 2)
8 requests arrive: All 7 accepted, 1 rejected
Parameters:
- Bucket size: Maximum burst capacity.
- Refill rate: Sustained request rate.
Pros:
- Allows bursts up to the bucket size.
- Smooth, well-understood behavior.
- Memory efficient — one counter and timestamp per key.
Cons:
- Two parameters to tune.
Used by: AWS API Gateway, Stripe, many production systems.
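The refill-on-demand logic above can be sketched in a few lines of Python. This is an illustrative, single-threaded class (the names `TokenBucket` and `allow` are my own, not from any library); the clock is injectable so the behavior is deterministic in tests.

```python
import time

class TokenBucket:
    """Token bucket sketch: capacity caps bursts, refill_rate sets the
    sustained request rate."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # max tokens = max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False
```

With capacity 10 and refill rate 2/s this reproduces the walkthrough above: 5 requests leave 5 tokens, one second later 7 tokens are available, so of 8 arriving requests 7 are accepted and 1 rejected.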
2. Leaky Bucket
Requests enter a FIFO queue (bucket). The queue is processed at a fixed rate. If the queue is full, new requests are rejected.
Requests arrive (variable rate): ■■■■■■■
↓
┌─────────────┐
│ Queue (10) │ ← Overflow → Rejected
└──────┬──────┘
↓
Fixed outflow rate: 2/sec
↓
[Processed]
Parameters:
- Bucket (queue) size: Maximum queued requests.
- Outflow rate: Fixed processing rate.
Pros:
- Perfectly smooth output rate.
- Ideal for services that need constant throughput.
Cons:
- No burst handling — even legitimate bursts are queued or dropped.
- Queued requests wait their turn, so bursts experience added latency before processing.
Used by: Nginx (the limit_req module implements leaky bucket; its nodelay option disables the smoothing delay), network traffic shaping.
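A minimal queue-based sketch of the leaky bucket, assuming a single-threaded caller and an injectable clock (class and method names are illustrative): requests enter a bounded FIFO, and the drain step processes them at the fixed outflow rate.

```python
import collections
import time

class LeakyBucket:
    """Leaky bucket sketch: bounded FIFO drained at a fixed rate."""

    def __init__(self, queue_size, outflow_rate, clock=time.monotonic):
        self.queue = collections.deque()
        self.queue_size = queue_size      # max queued requests
        self.outflow_rate = outflow_rate  # requests processed per second
        self.clock = clock
        self.last = clock()

    def _drain(self):
        now = self.clock()
        # Process as many queued requests as the fixed rate allows.
        n = int((now - self.last) * self.outflow_rate)
        if n:
            self.last = now
            for _ in range(min(n, len(self.queue))):
                self.queue.popleft()

    def offer(self, request):
        self._drain()
        if len(self.queue) >= self.queue_size:
            return False  # overflow: reject the request
        self.queue.append(request)
        return True
```

With a queue of 10 and an outflow of 2/s, an 11th simultaneous request is rejected, but one second later two slots have drained and new requests are accepted again.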
3. Fixed Window Counter
Divide time into fixed windows (e.g., 1 minute). Count requests in each window. Reject when the count exceeds the limit.
Window: 1 minute | Limit: 100 requests
12:00:00 - 12:01:00 → 100 requests allowed
12:01:00 - 12:02:00 → Counter resets, 100 more allowed
The boundary problem:
Limit: 100 requests per minute
12:00:30 - 12:01:00 → 100 requests (allowed — within window 1)
12:01:00 - 12:01:30 → 100 requests (allowed — within window 2)
Result: 200 requests in 1 minute (12:00:30 - 12:01:30) — DOUBLE the limit!
Pros:
- Very simple to implement.
- Memory efficient (one counter per window).
Cons:
- Boundary spike problem — can allow 2x the rate at window edges.
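The algorithm fits in a small closure; this sketch (hypothetical function names, injectable clock) also makes the boundary problem easy to see, since the counter resets the instant a new window begins.

```python
import time

def make_fixed_window_limiter(limit, window_seconds, clock=time.time):
    """Fixed-window counter sketch: one counter, reset at each boundary."""
    state = {"window": None, "count": 0}

    def allow():
        current = int(clock() // window_seconds)  # which window are we in?
        if current != state["window"]:
            state["window"], state["count"] = current, 0  # boundary: reset
        if state["count"] < limit:
            state["count"] += 1
            return True
        return False

    return allow
```

Replaying the boundary scenario above: 100 requests at 12:00:30 exhaust window 1, yet at 12:01:00 the counter resets and 100 more are allowed, so 200 requests pass in 30 seconds.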
4. Sliding Window Log
Keep a timestamped log of all requests. For each new request, count requests in the past window. Reject if over limit.
Limit: 100 requests per minute
Request at 12:01:30:
→ Count all requests from 12:00:30 to 12:01:30
→ If count < 100: Accept, add timestamp to log
→ If count >= 100: Reject
Pros:
- Precise — no boundary spike problem.
- Accurate per-client rate limiting.
Cons:
- High memory: Must store timestamp for every request.
- Expensive computation: Must scan logs for counting.
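A sketch of the log-based approach (illustrative class, injectable clock): one timestamp is stored per accepted request, stale entries are evicted on every check, and the remainder is counted. The deque makes both the memory cost and the scan cost of this algorithm concrete.

```python
import collections
import time

class SlidingWindowLog:
    """Sliding-window log sketch: a timestamp per request, evicted as it
    ages out of the window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = collections.deque()  # one timestamp per accepted request

    def allow(self):
        now = self.clock()
        # Drop timestamps that have fallen out of the sliding window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Unlike the fixed window, there is no reset at a boundary: 100 requests at 12:00:30 still block a request at 12:01:00, and capacity only returns as those timestamps age out of the window.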
5. Sliding Window Counter
A hybrid of fixed window counter and sliding window log. It uses two adjacent windows and a weighted count.
Limit: 100 requests per minute
Previous window (12:00:00 - 12:01:00): 80 requests
Current window (12:01:00 - 12:02:00): 30 requests so far
Current time: 12:01:15 (25% into current window)
Weighted count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 60 + 30 = 90
→ Under limit of 100 → Accept
Pros:
- Good approximation of sliding window.
- Memory efficient (two counters per key).
- Smooths out the boundary spike problem.
Cons:
- Not perfectly accurate (approximation).
Used by: Redis-based rate limiters, Cloudflare.
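The weighted-count formula above translates directly into code. This sketch (class name and rollover handling are my own) keeps only two counters per key and weights the previous window by how much of it still overlaps the sliding window.

```python
import time

class SlidingWindowCounter:
    """Sliding-window counter sketch: two fixed-window counters plus a
    weighted estimate of the true sliding-window count."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_id = None  # which fixed window the counters refer to
        self.current = 0       # requests in the current fixed window
        self.previous = 0      # requests in the previous fixed window

    def allow(self):
        now = self.clock()
        wid = int(now // self.window)
        if self.window_id is None:
            self.window_id = wid
        elif wid != self.window_id:
            # Roll windows forward; a gap of 2+ windows means zero history.
            self.previous = self.current if wid == self.window_id + 1 else 0
            self.current = 0
            self.window_id = wid
        elapsed = (now % self.window) / self.window  # fraction into window
        weighted = self.previous * (1 - elapsed) + self.current
        if weighted < self.limit:
            self.current += 1
            return True
        return False
```

Replaying the worked example: 80 requests in the previous window and 30 so far in the current one give a weighted count of 80 × 0.75 + 30 = 90 at the 25% mark, so the next request is still accepted.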
Algorithm Comparison
| Algorithm | Burst Handling | Accuracy | Memory | Complexity |
|---|---|---|---|---|
| Token Bucket | Allows controlled bursts | Good | Low | Low |
| Leaky Bucket | No bursts (smooth output) | Good | Low | Low |
| Fixed Window | Allows boundary spikes | Poor at edges | Very low | Very low |
| Sliding Window Log | No boundary issues | Exact | High | Medium |
| Sliding Window Counter | Smooth approximation | Good | Low | Low |
Where to Rate Limit
Client → [CDN / Edge] → [API Gateway] → [Service] → [Database]
↑ ↑ ↑ ↑
Edge rate limit Gateway RL Service RL DB connection limit
| Location | What It Protects | Granularity |
|---|---|---|
| Client-side | Prevents accidental overuse | Per-client |
| CDN / Edge | Infrastructure from DDoS | Per-IP, global |
| API Gateway | Backend services from overload | Per-API key, per-user |
| Service level | Individual service limits | Per-endpoint |
| Database level | DB from connection exhaustion | Connection pool limits |
Rate Limiting Identifiers
| Identifier | Description | Pros | Cons |
|---|---|---|---|
| API Key | Unique key per client/app | Per-application limits | Must manage keys |
| User ID | Authenticated user identifier | Fair per-user limits | Requires auth |
| IP Address | Client IP | No auth needed | Shared IPs (NAT, proxies) |
| Combination | IP + API Key + Endpoint | Fine-grained control | More complex |
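Combined identifiers are usually encoded into a single counter key. A possible sketch, assuming a hypothetical "rl:" prefix and colon-separated naming scheme (not a standard):

```python
def rate_limit_key(api_key=None, user_id=None, ip=None, endpoint=None):
    """Build one rate-limit counter key from whichever identifiers are
    available; the key format here is illustrative, not a convention."""
    parts = [p for p in (api_key, user_id, ip, endpoint) if p]
    return "rl:" + ":".join(parts)
```

For example, combining an API key, client IP, and endpoint yields one counter per (app, IP, endpoint) triple, giving the fine-grained control the table describes at the cost of more counters to store.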
Distributed Rate Limiting
In a distributed system with multiple servers, rate limiting must be coordinated.
Centralized Counter (Redis)
Server 1 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
Server 2 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
Server 3 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
-- Redis Lua script for atomic rate limiting (sliding window log)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])  -- window length in seconds
local now = tonumber(ARGV[3])     -- current time in seconds (same unit as window)
-- Evict entries older than the window, then count what remains
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
  -- Random suffix keeps members unique when requests share a timestamp
  redis.call('ZADD', key, now, now .. math.random())
  redis.call('EXPIRE', key, window)
  return 1 -- allowed
else
  return 0 -- rejected
end
Pros: Accurate, consistent across all servers.
Cons: Redis becomes a dependency and a potential single point of failure (unless replicated); adds a network round trip per request.
Local Rate Limiting + Synchronization
Each server maintains local counters and periodically syncs:
Server 1: local_count = 30 ──sync──→ Global count
Server 2: local_count = 25 ──sync──→ Global count
Server 3: local_count = 20 ──sync──→ Global count = 75
Pros: No per-request Redis call.
Cons: Less accurate; may briefly exceed limits.
Sticky Sessions
Route same client to same server → local rate limiting is accurate.
Cons: Uneven distribution, session affinity complexity.
Rate Limiting Response
HTTP 429 Too Many Requests
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705312800
{
"error": {
"code": "RATE_LIMITED",
"message": "Rate limit exceeded. Try again in 30 seconds.",
"retryAfter": 30
}
}
Standard Headers
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed in window |
| X-RateLimit-Remaining | Requests remaining in current window |
| X-RateLimit-Reset | Unix timestamp when the limit resets |
| Retry-After | Seconds to wait before retrying |
Client-Side Handling
Exponential Backoff with Jitter
When receiving a 429 response:
Attempt 1: Wait 1s + random(0, 0.5s)
Attempt 2: Wait 2s + random(0, 1s)
Attempt 3: Wait 4s + random(0, 2s)
Attempt 4: Wait 8s + random(0, 4s)
Max: Cap at 60s
Jitter prevents the thundering herd problem — without it, all rate-limited clients retry at the exact same time.
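The schedule above can be expressed as a small helper. In this sketch (hypothetical function name), the jitter is a uniform draw up to half the exponential delay, matching the "+ random(0, 0.5s)" pattern shown, and the random source is injectable for testing.

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with jitter: base * 2^(attempt-1) plus a
    uniform jitter of up to half that delay, capped at `cap` seconds."""
    exp = base * 2 ** (attempt - 1)
    return min(cap, exp + rng() * exp / 2)
```

A client would sleep for `backoff_delay(attempt)` after each 429 (or for the server's Retry-After value, if present, since the server knows its own reset time better than any client-side guess).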
Tiered Rate Limiting
Free tier: 100 requests/hour
Basic tier: 1,000 requests/hour
Pro tier: 10,000 requests/hour
Enterprise: 100,000 requests/hour (or custom)
Summary
| Concept | Key Point |
|---|---|
| Token bucket | Most popular; allows controlled bursts |
| Leaky bucket | Smooth output rate; no bursts |
| Sliding window counter | Good balance of accuracy and efficiency |
| Distributed RL | Use Redis for centralized counting |
| Response | HTTP 429 + Retry-After + rate limit headers |
| Client handling | Exponential backoff with jitter |
Rule of thumb: Use token bucket for most API rate limiting. Use sliding window counter when precision matters. Always return proper 429 responses with retry information.