AI API Rate Limits Explained: How to Handle Throttling Like a Pro
April 16, 2026 · 6 min read
Every AI API has rate limits. Hit them and your application breaks. Understanding and handling rate limits properly is the difference between a demo and a production application. Here's how to do it right.
Rate Limits by Provider
| Provider | Default RPM | Default TPM | 429 Behavior |
|---|---|---|---|
| OpenAI | 60-10,000 | 60K-2M | Retry-After header |
| Anthropic | 60-4,000 | 80K-400K | Retry-After header |
| DeepSeek | 60 | 1M | Variable wait |
| AIPower | 200 | Unlimited | 429 + retry hint |
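That last column is worth acting on: when a provider sends a `Retry-After` header with a 429, honoring it beats guessing. Here's a minimal parser sketch (the `parse_retry_after` helper is ours, not part of any SDK) that handles both forms the header allows — delay-seconds and an HTTP date:

```python
import email.utils
import time

def parse_retry_after(value, default=1.0):
    """Parse a Retry-After header value into seconds to wait.

    The header may be delay-seconds ("3") or an HTTP date
    ("Wed, 21 Oct 2026 07:28:00 GMT"). Falls back to `default`.
    """
    if value is None:
        return default
    try:
        return max(0.0, float(value))  # delay-seconds form
    except ValueError:
        try:
            dt = email.utils.parsedate_to_datetime(value)  # HTTP-date form
            return max(0.0, dt.timestamp() - time.time())
        except (TypeError, ValueError):
            return default  # Unparseable header: fall back to a safe default
```

Feed the result straight into `time.sleep()` before retrying.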
The Right Way to Handle 429 Errors
```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

def call_with_retry(messages, model="deepseek/deepseek-chat", max_retries=5):
    """Exponential backoff with jitter — the production-grade pattern."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            # Only 429s are retried; other errors propagate on their own
            if attempt == max_retries - 1:
                break  # Out of retries; don't sleep one last time for nothing
            # Exponential backoff: 1s, 2s, 4s, 8s + random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
```

Pattern: Request Queue with Concurrency Control
```python
import asyncio

from openai import AsyncOpenAI

aclient = AsyncOpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

class RateLimitedQueue:
    def __init__(self, max_concurrent=10, rpm_limit=180):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.interval = 60 / rpm_limit  # Minimum seconds between request starts
        self._lock = asyncio.Lock()
        self._next_start = 0.0

    async def call(self, messages, model="deepseek/deepseek-chat"):
        async with self.semaphore:
            # Reserve the next global start slot so requests are spread
            # evenly across all tasks, keeping throughput under rpm_limit
            async with self._lock:
                now = asyncio.get_running_loop().time()
                delay = max(0.0, self._next_start - now)
                self._next_start = max(now, self._next_start) + self.interval
            if delay > 0:
                await asyncio.sleep(delay)
            return await aclient.chat.completions.create(
                model=model, messages=messages,
            )

    async def batch(self, message_list, model="deepseek/deepseek-chat"):
        tasks = [self.call(msgs, model) for msgs in message_list]
        return await asyncio.gather(*tasks, return_exceptions=True)

# Process 1,000 requests without hitting rate limits
queue = RateLimitedQueue(max_concurrent=10, rpm_limit=180)
results = asyncio.run(queue.batch(all_messages))  # all_messages: list of message lists
```

Pattern: Multi-Provider Fallback for Rate Limits
When one provider rate-limits you, fall back to another:
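Done by hand, that's an ordered fallback loop. Here's a sketch (model names after the first are illustrative; the string check on the error keeps it SDK-agnostic, matching the retry example's approach):

```python
def call_with_fallback(client, messages, models):
    """Try each model in order, moving on whenever a provider rate-limits us."""
    last_error = None
    for model in models:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except Exception as e:
            if "429" not in str(e) and "rate" not in str(e).lower():
                raise  # Not a rate-limit error; don't mask it
            last_error = e  # This provider is throttled; try the next
    raise last_error  # Every provider in the list was rate-limited

# Usage (illustrative model IDs):
# call_with_fallback(client, messages,
#                    ["deepseek/deepseek-chat", "qwen/qwen-turbo", "glm/glm-4"])
```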
```python
# With AIPower, use smart routing as automatic fallback
# "auto" routes to available models — if DeepSeek is rate-limited,
# it tries Qwen, then GLM, then others
response = client.chat.completions.create(
    model="auto",  # Rarely rate-limited: requests spread across 10+ backend providers
    messages=[{"role": "user", "content": "Hello!"}],
)
```

Monitoring Rate Limit Usage
```python
import time
from collections import deque

class RateMonitor:
    def __init__(self, window_seconds=60):
        self.calls = deque()
        self.window = window_seconds

    def _prune(self, now):
        # Drop calls that have aged out of the sliding window
        while self.calls and self.calls[0] < now - self.window:
            self.calls.popleft()

    def record(self):
        now = time.time()
        self.calls.append(now)
        self._prune(now)

    @property
    def current_rpm(self):
        # Prune first so stale calls aren't counted between records
        self._prune(time.time())
        return len(self.calls)

    def safe_to_call(self, limit=180):
        return self.current_rpm < limit
```

Best Practices Summary
- Always implement retry with exponential backoff — never retry immediately
- Add jitter — prevents thundering herd when many clients retry simultaneously
- Use a request queue — don't fire all requests at once
- Monitor your RPM — stay under limits proactively
- Use an API gateway — AIPower's smart routing auto-distributes across providers
- Cache responses — identical queries shouldn't hit the API twice
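That last point is cheap to implement. Here's a minimal in-memory sketch (the `cached_chat` helper is ours; production code would want a TTL or eviction policy, e.g. `functools.lru_cache` semantics or Redis):

```python
import hashlib
import json

_cache = {}

def cached_chat(client, messages, model="deepseek/deepseek-chat"):
    """Return a cached response for an identical (model, messages) pair."""
    key = hashlib.sha256(
        json.dumps([model, messages], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        # Only the first occurrence of a query hits the API
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]
```

Identical queries return the stored response instantly and count zero against your rate limit.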
AIPower's gateway distributes your requests across 10 providers, dramatically reducing the chance of hitting any single provider's rate limit. Try it at aipower.me — 200 RPM default, 50 free calls.