14. Rate Limiting

Rate limiting controls the number of requests a client can make to a service within a given time window. It protects services from abuse, ensures fair usage, and prevents cascading failures.


Why Rate Limiting?

Concern               How Rate Limiting Helps
-------               -----------------------
DDoS protection       Limits damage from malicious traffic floods
Resource protection   Prevents one client from monopolizing resources
Cost control          Limits expensive API calls (e.g., third-party APIs)
Fair usage            Ensures all users get reasonable access
Stability             Prevents cascading failures from traffic spikes
Compliance            Enforces SLA-based usage tiers

Rate Limiting Algorithms

1. Token Bucket

A "bucket" holds tokens; each request consumes a token. Tokens are added at a fixed rate.

Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0: [■ ■ ■ ■ ■ ■ ■ ■ ■ ■] 10 tokens — bucket full
5 requests arrive: [■ ■ ■ ■ ■ □ □ □ □ □] 5 tokens remain
1 second passes: [■ ■ ■ ■ ■ ■ ■ □ □ □] 7 tokens (refilled 2)
8 requests arrive: 7 accepted (bucket empty), 1 rejected

Parameters:

  • Bucket size: Maximum burst capacity.
  • Refill rate: Sustained request rate.

Pros:

  • Allows bursts up to the bucket size.
  • Smooth, well-understood behavior.
  • Memory efficient — one counter and timestamp per key.

Cons:

  • Two parameters to tune.

Used by: AWS API Gateway, Stripe, many production systems.
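A minimal single-process sketch of the token bucket above (class and method names are illustrative, not any specific library's API):

```python
import time

class TokenBucket:
    """Token bucket: capacity caps bursts, refill_rate caps the sustained rate."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Bucket of 10, refilled at 2 tokens/second: the first 10 back-to-back
# calls succeed, the 11th is rejected until tokens refill.
bucket = TokenBucket(capacity=10, refill_rate=2)
results = [bucket.allow() for _ in range(11)]
```

Note that refill is computed lazily on each call, so the limiter needs only one counter and one timestamp per key, as stated above.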


2. Leaky Bucket

Requests enter a FIFO queue (bucket). The queue is processed at a fixed rate. If the queue is full, new requests are rejected.

Requests arrive (variable rate): ■■■■■■■
                                      ↓
                            ┌──────────────┐
                            │  Queue (10)  │ ← Overflow → Rejected
                            └──────┬───────┘
                                   ↓
                         Fixed outflow rate: 2/sec
                                   ↓
                              [Processed]

Parameters:

  • Bucket (queue) size: Maximum queued requests.
  • Outflow rate: Fixed processing rate.

Pros:

  • Perfectly smooth output rate.
  • Ideal for services that need constant throughput.

Cons:

  • No burst handling — even legitimate bursts are queued or dropped.
  • Old requests in the queue may cause latency.

Used by: Nginx (the limit_req module), network traffic shaping.
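A minimal sketch of the leaky bucket above, draining the queue lazily on each arrival (names are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up and drain at a fixed outflow rate."""

    def __init__(self, capacity: int, outflow_rate: float):
        self.capacity = capacity          # max queued requests
        self.outflow_rate = outflow_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        # Remove (process) requests that have drained since the last check.
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.outflow_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def offer(self, request) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # bucket overflow: reject

# Queue of 10 drained at 2/sec: a burst of 15 fills the queue,
# and the overflow is rejected immediately.
bucket = LeakyBucket(capacity=10, outflow_rate=2)
accepted = sum(bucket.offer(i) for i in range(15))
```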


3. Fixed Window Counter

Divide time into fixed windows (e.g., 1 minute). Count requests in each window. Reject when the count exceeds the limit.

Window: 1 minute | Limit: 100 requests

12:00:00 - 12:01:00  → 100 requests allowed
12:01:00 - 12:02:00  → Counter resets, 100 more allowed

The boundary problem:

Limit: 100 requests per minute

12:00:30 - 12:01:00 → 100 requests (allowed — within window 1)
12:01:00 - 12:01:30 → 100 requests (allowed — within window 2)

Result: 200 requests in 1 minute (12:00:30 - 12:01:30) — DOUBLE the limit!

Pros:

  • Very simple to implement.
  • Memory efficient (one counter per window).

Cons:

  • Boundary spike problem — can allow 2x the rate at window edges.
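The fixed window counter is simple enough to sketch in a few lines (names are illustrative); note that the reset on a window boundary is exactly what creates the boundary spike described above:

```python
import time

class FixedWindowCounter:
    """Fixed window: one counter per time window, reset at each boundary."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # New window: reset the counter (source of the boundary spike).
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# 100 per minute: a burst of 101 gets 100 accepted, 1 rejected.
limiter = FixedWindowCounter(limit=100, window_seconds=60)
results = [limiter.allow() for _ in range(101)]
```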

4. Sliding Window Log

Keep a timestamped log of all requests. For each new request, count requests in the past window. Reject if over limit.

Limit: 100 requests per minute

Request at 12:01:30:
  → Count all requests from 12:00:30 to 12:01:30
  → If count < 100: Accept, add timestamp to log
  → If count >= 100: Reject

Pros:

  • Precise — no boundary spike problem.
  • Accurate per-client rate limiting.

Cons:

  • High memory: Must store timestamp for every request.
  • Expensive computation: Must scan logs for counting.
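A minimal in-memory sketch of the sliding window log (names are illustrative; a production version would typically keep the log in Redis, as shown later):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: store a timestamp per request, count the last window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that fell out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

# 100 per minute: of 150 back-to-back requests, exactly 100 are accepted.
limiter = SlidingWindowLog(limit=100, window_seconds=60)
results = [limiter.allow() for _ in range(150)]
```

The memory cost is visible here: the deque holds one entry per accepted request still inside the window.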

5. Sliding Window Counter

A hybrid of fixed window counter and sliding window log. It uses two adjacent windows and a weighted count.

Limit: 100 requests per minute

Previous window (12:00:00 - 12:01:00): 80 requests
Current window  (12:01:00 - 12:02:00): 30 requests so far
Current time: 12:01:15 (25% into current window)

Weighted count = 80 * (1 - 0.25) + 30 = 80 * 0.75 + 30 = 60 + 30 = 90
→ Under limit of 100 → Accept

Pros:

  • Good approximation of sliding window.
  • Memory efficient (two counters per key).
  • Smooths out the boundary spike problem.

Cons:

  • Not perfectly accurate (approximation).

Used by: Redis-based rate limiters, Cloudflare.
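The weighted-count formula above can be sketched as follows (a simplified single-key version; names are illustrative):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window using the previous and current fixed windows."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_start = time.monotonic()
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.current_start
        if elapsed >= self.window:
            # Roll the windows forward. If more than one full window has
            # passed, the previous window is empty.
            self.previous_count = self.current_count if elapsed < 2 * self.window else 0
            self.current_count = 0
            self.current_start += self.window * (elapsed // self.window)
            elapsed = now - self.current_start
        # Weight the previous window by how much of it still overlaps
        # the sliding window ending now.
        weight = 1 - elapsed / self.window
        weighted = self.previous_count * weight + self.current_count
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100, window_seconds=60)
results = [limiter.allow() for _ in range(150)]
```

Only two counters and one timestamp are stored per key, which is where the memory advantage over the sliding window log comes from.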


Algorithm Comparison

Algorithm               Burst Handling             Accuracy       Memory    Complexity
---------               --------------             --------       ------    ----------
Token Bucket            Allows controlled bursts   Good           Low       Low
Leaky Bucket            No bursts (smooth output)  Good           Low       Low
Fixed Window            Allows boundary spikes     Poor at edges  Very low  Very low
Sliding Window Log      No boundary issues         Exact          High      Medium
Sliding Window Counter  Smooth approximation       Good           Low       Low

Where to Rate Limit

Client → [CDN / Edge] → [API Gateway] → [Service] → [Database]
           ↑                ↑              ↑           ↑
     Edge rate limit   Gateway RL     Service RL    DB connection limit

Location        What It Protects                Granularity
--------        ----------------                -----------
Client-side     Prevents accidental overuse     Per-client
CDN / Edge      Infrastructure from DDoS        Per-IP, global
API Gateway     Backend services from overload  Per-API key, per-user
Service level   Individual service limits       Per-endpoint
Database level  DB from connection exhaustion   Connection pool limits

Rate Limiting Identifiers

Identifier   Description                    Pros                    Cons
----------   -----------                    ----                    ----
API Key      Unique key per client/app      Per-application limits  Must manage keys
User ID      Authenticated user identifier  Fair per-user limits    Requires auth
IP Address   Client IP                      No auth needed          Shared IPs (NAT, proxies)
Combination  IP + API Key + Endpoint        Fine-grained control    More complex

Distributed Rate Limiting

In a distributed system with multiple servers, rate limiting must be coordinated.

Centralized Counter (Redis)

Server 1 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
Server 2 ──→ Redis (INCR rate:user:123) ──→ count → Check limit
Server 3 ──→ Redis (INCR rate:user:123) ──→ count → Check limit

-- Redis Lua script for atomic rate limiting (sliding window log)
-- Note: now and window must use the same time unit, and EXPIRE takes seconds.
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)

if count < limit then
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- allowed
else
    return 0  -- rejected
end

Pros: Accurate, consistent across all servers.
Cons: Redis is a single point of failure; adds latency per request.

Local Rate Limiting + Synchronization

Each server maintains local counters and periodically syncs:

Server 1: local_count = 30 ──sync──→ Global count
Server 2: local_count = 25 ──sync──→ Global count
Server 3: local_count = 20 ──sync──→ Global count = 75

Pros: No per-request Redis call.
Cons: Less accurate; may briefly exceed limits.

Sticky Sessions

Route the same client to the same server, so local rate limiting stays accurate.
Cons: Uneven distribution, session affinity complexity.


Rate Limiting Response

HTTP 429 Too Many Requests

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705312800

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "retryAfter": 30
  }
}

Standard Headers

Header                 Description
------                 -----------
X-RateLimit-Limit      Maximum requests allowed in the window
X-RateLimit-Remaining  Requests remaining in the current window
X-RateLimit-Reset      Unix timestamp when the limit resets
Retry-After            Seconds to wait before retrying

Client-Side Handling

Exponential Backoff with Jitter

When receiving a 429 response:

Attempt 1: Wait 1s + random(0, 0.5s)
Attempt 2: Wait 2s + random(0, 1s)
Attempt 3: Wait 4s + random(0, 2s)
Attempt 4: Wait 8s + random(0, 4s)
Max: Cap at 60s

Jitter prevents the thundering herd problem — without it, all rate-limited clients retry at the exact same time.
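The backoff schedule above can be sketched as a small helper (the function name and parameters are illustrative):

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 1.0, cap: float = 60.0):
    """Exponential backoff with jitter: base * 2^attempt, plus up to 50% jitter,
    capped at `cap` seconds."""
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap, base * 2 ** attempt)   # 1, 2, 4, 8, ...
        delay += random.uniform(0, delay / 2)   # jitter: up to +50% of the delay
        delays.append(min(delay, cap))
    return delays

# Example schedule for 5 retries after a 429 response.
delays = backoff_delays(max_attempts=5)
```

In a real client, each delay would be slept before the next retry, and the loop would stop early on success; when the server supplies Retry-After, that value should take precedence over the computed delay.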


Tiered Rate Limiting

Free tier:     100 requests/hour
Basic tier:    1,000 requests/hour
Pro tier:      10,000 requests/hour
Enterprise:    100,000 requests/hour (or custom)

Summary

Concept                 Key Point
-------                 ---------
Token bucket            Most popular; allows controlled bursts
Leaky bucket            Smooth output rate; no bursts
Sliding window counter  Good balance of accuracy and efficiency
Distributed RL          Use Redis for centralized counting
Response                HTTP 429 + Retry-After + rate limit headers
Client handling         Exponential backoff with jitter
Client handling Exponential backoff with jitter

Rule of thumb: Use token bucket for most API rate limiting. Use sliding window counter when precision matters. Always return proper 429 responses with retry information.