14. Rate Limiting

Rate limiting controls the number of requests a client can make to a service within a given time window. It protects services from abuse, ensures fair usage, and prevents cascading failures.


Why Rate Limiting?

Concern               How Rate Limiting Helps
-------               -----------------------
DDoS protection       Limits damage from malicious traffic floods
Resource protection   Prevents one client from monopolizing resources
Cost control          Limits expensive API calls (e.g., third-party APIs)
Fair usage            Ensures all users get reasonable access
Stability             Prevents cascading failures from traffic spikes
Compliance            Enforces SLA-based usage tiers

Rate Limiting Algorithms

1. Token Bucket

A "bucket" holds tokens; each request consumes a token. Tokens are added at a fixed rate.

Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0: [■ ■ ■ ■ ■ ■ ■ ■ ■ ■] 10 tokens — bucket full
5 requests arrive: [■ ■ ■ ■ ■ □ □ □ □ □] 5 tokens remain
1 second passes: [■ ■ ■ ■ ■ ■ ■ □ □ □] 7 tokens (refilled 2)
8 requests arrive: 7 accepted (bucket empty), 1 rejected

Parameters:

  • Bucket size: Maximum burst capacity.
  • Refill rate: Sustained request rate.

Pros:

  • Allows bursts up to the bucket size.
  • Smooth, well-understood behavior.
  • Memory efficient — one counter and timestamp per key.

Cons:

  • Two parameters to tune.

Used by: AWS API Gateway, Stripe, many production systems.
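A minimal single-process sketch of the token bucket above (class and method names are illustrative, not any specific library's API):

```python
import time

class TokenBucket:
    """Token bucket: capacity caps bursts, refill_rate caps the sustained rate."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Bucket of 10, refilled at 2 tokens/second: the first 10 back-to-back
# calls succeed, the 11th is rejected until tokens refill.
bucket = TokenBucket(capacity=10, refill_rate=2)
results = [bucket.allow() for _ in range(11)]
```

Note that refill is computed lazily on each call, so the limiter needs only one counter and one timestamp per key, as stated above.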


2. Leaky Bucket

Requests enter a FIFO queue (bucket). The queue is processed at a fixed rate. If the queue is full, new requests are rejected.

Requests arrive (variable rate): ■■■■■■■
                                      ↓
                            ┌──────────────┐
                            │  Queue (10)  │ ← Overflow → Rejected
                            └──────┬───────┘
                                   ↓
                         Fixed outflow rate: 2/sec
                                   ↓
                              [Processed]

Parameters:

  • Bucket (queue) size: Maximum queued requests.
  • Outflow rate: Fixed processing rate.

Pros:

  • Perfectly smooth output rate.
  • Ideal for services that need constant throughput.

Cons:

  • No burst handling — even legitimate bursts are queued or dropped.
  • Old requests in the queue may cause latency.

Used by: Nginx (the limit_req module), network traffic shaping.
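A minimal sketch of the leaky bucket above, draining the queue lazily on each arrival (names are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up and drain at a fixed outflow rate."""

    def __init__(self, capacity: int, outflow_rate: float):
        self.capacity = capacity          # max queued requests
        self.outflow_rate = outflow_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        # Remove (process) requests that have drained since the last check.
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.outflow_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def offer(self, request) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # bucket overflow: reject

# Queue of 10 drained at 2/sec: a burst of 15 fills the queue,
# and the overflow is rejected immediately.
bucket = LeakyBucket(capacity=10, outflow_rate=2)
accepted = sum(bucket.offer(i) for i in range(15))
```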


3. Fixed Window Counter

Divide time into fixed windows (e.g., 1 minute). Count requests in each window. Reject when the count exceeds the limit.

Window: 1 minute | Limit: 100 requests

12:00:00 - 12:01:00  → 100 requests allowed
12:01:00 - 12:02:00  → Counter resets, 100 more allowed

The boundary problem:

Limit: 100 requests per minute

12:00:30 - 12:01:00 → 100 requests (allowed — within window 1)
12:01:00 - 12:01:30 → 100 requests (allowed — within window 2)

Result: 200 requests in 1 minute (12:00:30 - 12:01:30) — DOUBLE the limit!

Pros:

  • Very simple to implement.
  • Memory efficient (one counter per window).

Cons:

  • Boundary spike problem — can allow 2x the rate at window edges.
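The fixed window counter is simple enough to sketch in a few lines (names are illustrative); note that the reset on a window boundary is exactly what creates the boundary spike described above:

```python
import time

class FixedWindowCounter:
    """Fixed window: one counter per time window, reset at each boundary."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # New window: reset the counter (source of the boundary spike).
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# 100 per minute: a burst of 101 gets 100 accepted, 1 rejected.
limiter = FixedWindowCounter(limit=100, window_seconds=60)
results = [limiter.allow() for _ in range(101)]
```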

4. Sliding Window Log

Keep a timestamped log of all requests. For each new request, count requests in the past window. Reject if over limit.

Limit: 100 requests per minute

Request at 12:01:30:
  → Count all requests from 12:00:30 to 12:01:30
  → If count < 100: Accept, add timestamp to log
  → If count >= 100: Reject

Pros:

  • Precise — no boundary spike problem.
  • Accurate per-client rate limiting.

Cons:

  • High memory: Must store timestamp for every request.
  • Expensive computation: Must scan logs for counting.
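A minimal in-memory sketch of the sliding window log (names are illustrative; a production version would typically keep the log in Redis, as shown later):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: store a timestamp per request, count the last window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that fell out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

# 100 per minute: of 150 back-to-back requests, exactly 100 are accepted.
limiter = SlidingWindowLog(limit=100, window_seconds=60)
results = [limiter.allow() for _ in range(150)]
```

The memory cost is visible here: the deque holds one entry per accepted request still inside the window.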

5. Sliding Window Counter

A hybrid of fixed window counter and sliding window log. It uses two adjacent windows and a weighted count.

Limit: 100 requests per minute

Previous window (12:00:00 - 12:01:00): 80 requests
Current window  (12:01:00 - 12:02:00): 30 requests so far
Current time: 12:01:15 (25% into current window)

Weighted count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 60 + 30 = 90
→ Under limit of 100 → Accept

Pros:

  • Good approximation of sliding window.
  • Memory efficient (two counters per key).
  • Smooths out the boundary spike problem.

Cons:

  • Not perfectly accurate (approximation).

Used by: Redis-based rate limiters, Cloudflare.
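The weighted-count formula above can be sketched as follows (a simplified single-key version; names are illustrative):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window using the previous and current fixed windows."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_start = time.monotonic()
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.current_start
        if elapsed >= self.window:
            # Roll the windows forward. If more than one full window has
            # passed, the previous window is empty.
            self.previous_count = self.current_count if elapsed < 2 * self.window else 0
            self.current_count = 0
            self.current_start += self.window * (elapsed // self.window)
            elapsed = now - self.current_start
        # Weight the previous window by how much of it still overlaps
        # the sliding window ending now.
        weight = 1 - elapsed / self.window
        weighted = self.previous_count * weight + self.current_count
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100, window_seconds=60)
results = [limiter.allow() for _ in range(150)]
```

Only two counters and one timestamp are stored per key, which is where the memory advantage over the sliding window log comes from.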


Algorithm Comparison

Algorithm               Burst Handling             Accuracy       Memory    Complexity
---------               --------------             --------       ------    ----------
Token Bucket            Allows controlled bursts   Good           Low       Low
Leaky Bucket            No bursts (smooth output)  Good           Low       Low
Fixed Window            Allows boundary spikes     Poor at edges  Very low  Very low
Sliding Window Log      No boundary issues         Exact          High      Medium
Sliding Window Counter  Smooth approximation       Good           Low       Low

Where to Rate Limit

Client → [CDN / Edge] → [API Gateway] → [Service] → [Database]
           ↑                ↑              ↑           ↑
     Edge rate limit   Gateway RL     Service RL    DB connection limit

Location        What It Protects                Granularity
--------        ----------------                -----------
Client-side     Prevents accidental overuse     Per-client
CDN / Edge      Infrastructure from DDoS        Per-IP, global
API Gateway     Backend services from overload  Per-API key, per-user
Service level   Individual service limits       Per-endpoint
Database level  DB from connection exhaustion   Connection pool limits

Rate Limiting Identifiers

Identifier   Description                    Pros                    Cons
----------   -----------                    ----                    ----
API Key      Unique key per client/app      Per-application limits  Must manage keys
User ID      Authenticated user identifier  Fair per-user limits    Requires auth
IP Address   Client IP                      No auth needed          Shared IPs (NAT, proxies)
Combination  IP + API Key + Endpoint        Fine-grained control    More complex

Distributed Rate Limiting

In a distributed system with multiple servers, rate limiting must be coordinated.

Centralized Counter (Redis)

Server 1 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
Server 2 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
Server 3 ──→ Redis (INCR rate:user:123) ──→ count → Check limit

-- Redis Lua script for atomic rate limiting (sliding window log)
-- Note: now and window must use the same time unit, and EXPIRE takes seconds.
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)

if count < limit then
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- allowed
else
    return 0  -- rejected
end

Pros: Accurate, consistent across all servers.
Cons: Redis is a single point of failure; adds latency per request.

Local Rate Limiting + Synchronization

Each server maintains local counters and periodically syncs:

Server 1: local_count = 30 ──sync──→ Global count
Server 2: local_count = 25 ──sync──→ Global count
Server 3: local_count = 20 ──sync──→ Global count = 75

Pros: No per-request Redis call.
Cons: Less accurate; may briefly exceed limits.

Sticky Sessions

Route the same client to the same server, so local rate limiting stays accurate.
Cons: Uneven distribution, session affinity complexity.


Rate Limiting Response

HTTP 429 Too Many Requests

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705312800

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "retryAfter": 30
  }
}

Standard Headers

Header                 Description
------                 -----------
X-RateLimit-Limit      Maximum requests allowed in the window
X-RateLimit-Remaining  Requests remaining in the current window
X-RateLimit-Reset      Unix timestamp when the limit resets
Retry-After            Seconds to wait before retrying

Client-Side Handling

Exponential Backoff with Jitter

When receiving a 429 response:

Attempt 1: Wait 1s + random(0, 0.5s)
Attempt 2: Wait 2s + random(0, 1s)
Attempt 3: Wait 4s + random(0, 2s)
Attempt 4: Wait 8s + random(0, 4s)
Max: Cap at 60s

Jitter prevents the thundering herd problem — without it, all rate-limited clients retry at the exact same time.
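The backoff schedule above can be sketched as a small helper (the function name and parameters are illustrative):

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 1.0, cap: float = 60.0):
    """Exponential backoff with jitter: base * 2^attempt, plus up to 50% jitter,
    capped at `cap` seconds."""
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap, base * 2 ** attempt)   # 1, 2, 4, 8, ...
        delay += random.uniform(0, delay / 2)   # jitter: up to +50% of the delay
        delays.append(min(delay, cap))
    return delays

# Example schedule for 5 retries after a 429 response.
delays = backoff_delays(max_attempts=5)
```

In a real client, each delay would be slept before the next retry, and the loop would stop early on success; when the server supplies Retry-After, that value should take precedence over the computed delay.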


Tiered Rate Limiting

Free tier:     100 requests/hour
Basic tier:    1,000 requests/hour
Pro tier:      10,000 requests/hour
Enterprise:    100,000 requests/hour (or custom)

Summary

Concept                 Key Point
-------                 ---------
Token bucket            Most popular; allows controlled bursts
Leaky bucket            Smooth output rate; no bursts
Sliding window counter  Good balance of accuracy and efficiency
Distributed RL          Use Redis for centralized counting
Response                HTTP 429 + Retry-After + rate limit headers
Client handling         Exponential backoff with jitter
Client handling Exponential backoff with jitter

Rule of thumb: Use token bucket for most API rate limiting. Use sliding window counter when precision matters. Always return proper 429 responses with retry information.