Guide
AI API Rate Limiting Best Practices: Handle Limits Like a Pro
April 17, 2026 · 8 min read
Rate limiting is the most common cause of AI API failures in production. Every provider imposes limits — requests per minute, tokens per minute, concurrent connections. If you don't handle them gracefully, your users see errors. This guide covers battle-tested patterns for production rate limit handling.
Common Rate Limits by Provider
| Provider | RPM (Requests/min) | TPM (Tokens/min) | Concurrent |
|---|---|---|---|
| OpenAI (Tier 1) | 500 | 200K | Unlimited |
| OpenAI (Tier 4) | 10,000 | 2M | Unlimited |
| Anthropic (Build) | 1,000 | 400K | Unlimited |
| DeepSeek | 300 | 300K | 50 |
| AIPower | 600 | Unlimited | 100 |
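A simple consequence of an RPM cap is a minimum spacing between requests. A minimal sketch (the 500 RPM figure is taken from the table above; treat it as an example, not a guarantee):

```python
def min_interval(rpm: int) -> float:
    """Minimum spacing between requests, in seconds, to stay under an RPM cap."""
    return 60.0 / rpm

# OpenAI Tier 1 at 500 RPM
print(min_interval(500))  # -> 0.12
```

Pacing at this interval avoids bursts entirely; the token bucket pattern below allows short bursts while keeping the same average rate.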
Pattern 1: Exponential Backoff with Jitter
The most important pattern: never retry immediately. Wait, and add randomness to avoid a thundering herd:
```python
import time
import random

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

def call_with_backoff(messages, model="deepseek/deepseek-chat", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model, messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s + jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
```

Pattern 2: Token Bucket Rate Limiter
Prevent hitting limits in the first place by controlling your request rate:
```python
import time
import threading

class TokenBucket:
    """Rate limiter using the token bucket algorithm."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def wait_and_acquire(self):
        while not self.acquire():
            time.sleep(0.05)

# Usage: 10 requests per second, burst up to 20
limiter = TokenBucket(rate=10, capacity=20)

def rate_limited_call(messages, model="deepseek/deepseek-chat"):
    limiter.wait_and_acquire()
    return client.chat.completions.create(model=model, messages=messages)
```

Pattern 3: Request Queue with Workers

Bound concurrency and spread requests across a rolling one-minute window with an async queue:
```python
import asyncio
import time
from collections import deque

class RequestQueue:
    def __init__(self, max_concurrent=10, rpm_limit=500):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.rpm_limit = rpm_limit
        self.request_times = deque()

    async def process(self, messages, model):
        async with self.semaphore:
            # Stay under the RPM limit: drop timestamps older than 60s,
            # and wait while a full minute's quota is already used.
            now = time.monotonic()
            while len(self.request_times) >= self.rpm_limit:
                oldest = self.request_times[0]
                if now - oldest > 60:
                    self.request_times.popleft()
                else:
                    await asyncio.sleep(0.1)
                    now = time.monotonic()
            self.request_times.append(now)
            return await asyncio.to_thread(
                client.chat.completions.create,
                model=model, messages=messages,
            )
```

Best Practices Checklist
- Always implement exponential backoff: the 429 response includes a `Retry-After` header; use it
- Track token usage: response headers include `x-ratelimit-remaining-tokens`
- Use request queues: don't let burst traffic saturate your limits
- Set client-side timeouts: 30s is a good default; don't wait forever
- Fall back to another model: if GPT is rate-limited, switch to DeepSeek
- Cache repeated queries: the same input should not hit the API twice
- Monitor usage in real time: the AIPower dashboard shows requests/min live
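The first checklist item can be sketched as a delay helper that prefers the server's own hint over a guessed wait. This is a sketch, assuming the provider sends `Retry-After` as a number of seconds (some servers send an HTTP-date instead, which this falls through on):

```python
import random

def backoff_delay(attempt, headers=None, base=1.0, cap=60.0):
    """Delay before retrying: honor Retry-After when present, else exponential backoff + jitter."""
    if headers and headers.get("retry-after") is not None:
        try:
            # Assumes a numeric value in seconds; HTTP-date values raise and fall through
            return min(float(headers["retry-after"]), cap)
        except ValueError:
            pass
    return min(base * (2 ** attempt), cap) + random.uniform(0, 1)
```

In the `openai` client, the caught `RateLimitError` carries the HTTP response, so `e.response.headers` can be passed straight in from the Pattern 1 retry loop.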
AIPower Rate Limit Advantages
- No token-per-minute limits: only request counts are limited
- Higher default limits: 600 RPM vs 500 RPM on OpenAI Tier 1
- Auto-fallback: use `model="auto"` and AIPower routes around rate-limited providers
- Transparent headers: every response includes remaining quota in headers
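Model fallback can also be done client-side. This is a hypothetical, provider-agnostic helper (the injected call function and exception tuple are assumptions, not part of any SDK): it tries each model in order and moves to the next when the call is rate-limited:

```python
def call_with_fallback(call_fn, messages, models, retryable=(Exception,)):
    """Try each model in order; advance to the next when the call raises a retryable error."""
    last_err = None
    for model in models:
        try:
            return call_fn(model=model, messages=messages)
        except retryable as err:
            last_err = err  # rate-limited (or otherwise failed): try the next model
    raise last_err
```

With the `openai` client you might pass `client.chat.completions.create` as `call_fn` and `retryable=(RateLimitError,)`, with a model list such as `["openai/gpt-4o", "deepseek/deepseek-chat"]` (example names).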
Handle rate limits like a pro. Start at aipower.me — generous limits, 50 free API calls, real-time usage monitoring.