Design a Rate Limiter for AI Services
Rate limiting for AI inference services (ChatGPT, Claude, Grok, Gemini) presents unique challenges beyond traditional API rate limiting. Users pay for different tiers with vastly different quotas—a free user might get 10 messages per day while an enterprise customer gets millions of tokens per minute. Requests have variable costs (a simple greeting vs. a 100K-token context analysis), and long-running streaming responses complicate traditional request counting.
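Token-based metering with tiered quotas is commonly built on a token bucket, where the "tokens" in the bucket correspond to LLM tokens rather than request counts. A minimal single-node sketch, assuming hypothetical tier quotas (the `TIERS` table and its numbers are illustrative, not any provider's real limits):

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TierLimits:
    capacity: float      # max tokens the bucket can hold (burst allowance)
    refill_rate: float   # tokens replenished per second


# Hypothetical tier table for illustration only.
TIERS = {
    "free": TierLimits(capacity=1_000, refill_rate=1_000 / 60),
    "pro": TierLimits(capacity=100_000, refill_rate=100_000 / 60),
}


class TokenBucket:
    """Token bucket where each request's cost is its LLM token count."""

    def __init__(self, limits: TierLimits, now: Optional[float] = None):
        self.limits = limits
        self.tokens = limits.capacity
        self.last_refill = time.monotonic() if now is None else now

    def try_consume(self, cost: float, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Lazily refill based on elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.limits.capacity,
                          self.tokens + elapsed * self.limits.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing `now` explicitly makes the bucket deterministic in tests; in production it would read the clock itself. A distributed version would keep the same state (token count, last refill timestamp) in a shared store such as Redis, updated atomically.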
This walkthrough follows the Interview Framework. We'll design a rate-limiting system that handles tiered quotas, token-based metering, and the distributed challenges of serving millions of users across multiple regions.