Architecture

AI API Rate Limits Explained: How to Handle Throttling Like a Pro

April 16, 2026 · 6 min read

Every AI API has rate limits. Hit them and your application breaks. Understanding and handling rate limits properly is the difference between a demo and a production application. Here's how to do it right.

Rate Limits by Provider

| Provider  | Default RPM | Default TPM | 429 Behavior       |
|-----------|-------------|-------------|--------------------|
| OpenAI    | 60-10,000   | 60K-2M      | Retry-After header |
| Anthropic | 60-4,000    | 80K-400K    | Retry-After header |
| DeepSeek  | 60          | 1M          | Variable wait      |
| AIPower   | 200         | Unlimited   | 429 + retry hint   |

The Right Way to Handle 429 Errors

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

def call_with_retry(messages, model="deepseek/deepseek-chat", max_retries=5):
    """Exponential backoff with jitter — the production-grade pattern."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            # Catch the SDK's typed error rather than string-matching the message.
            # Any other exception propagates to the caller unchanged.
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s + random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
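Per the table above, OpenAI and Anthropic return a Retry-After header on 429, and a server hint beats guessing. A minimal sketch of preferring the hint when present (`backoff_delay` is our own helper name, not part of any SDK):

```python
import random

def backoff_delay(attempt, retry_after=None, cap=60.0):
    """Prefer the server's Retry-After hint; otherwise exponential backoff with jitter."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)  # Header value is usually seconds as a string
        except (TypeError, ValueError):
            pass  # Unparseable (e.g. an HTTP-date): fall through to backoff
    return min((2 ** attempt) + random.uniform(0, 1), cap)
```

Capping the delay keeps a misbehaving header (or a late retry attempt) from stalling your worker for minutes.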

Pattern: Request Queue with Concurrency Control

import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

class RateLimitedQueue:
    def __init__(self, max_concurrent=10, rpm_limit=180):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.interval = 60 / rpm_limit  # Seconds between request starts
        self._pace_lock = asyncio.Lock()  # Serializes pacing across tasks

    async def call(self, messages, model="deepseek/deepseek-chat"):
        async with self.semaphore:
            # Pace under a shared lock: if each task slept independently,
            # max_concurrent tasks would sleep in parallel and the real
            # rate would be max_concurrent times the intended rpm_limit.
            async with self._pace_lock:
                await asyncio.sleep(self.interval)
            return await aclient.chat.completions.create(
                model=model, messages=messages,
            )

    async def batch(self, message_list, model="deepseek/deepseek-chat"):
        tasks = [self.call(msgs, model) for msgs in message_list]
        return await asyncio.gather(*tasks, return_exceptions=True)

# Process 1000 requests without hitting rate limits
queue = RateLimitedQueue(max_concurrent=10, rpm_limit=180)
results = asyncio.run(queue.batch(all_messages))
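One sanity check worth doing before launching a batch: pacing puts a hard floor on wall-clock time regardless of max_concurrent. A quick helper (our own, for illustration):

```python
def min_batch_seconds(n_requests, rpm_limit):
    """Lower bound on batch wall-clock time under pacing (ignores request latency)."""
    return n_requests * 60.0 / rpm_limit
```

At rpm_limit=180, the 1,000-request batch above takes at least ~333 seconds, roughly 5.6 minutes, no matter how high you raise concurrency.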

Pattern: Multi-Provider Fallback for Rate Limits

When one provider rate-limits you, fall back to another:

# With AIPower, use smart routing as automatic fallback
# "auto" routes to available models — if DeepSeek is rate-limited,
# it tries Qwen, then GLM, then others
response = client.chat.completions.create(
    model="auto",  # Rarely rate-limited: requests spread across 10+ backend providers
    messages=[{"role": "user", "content": "Hello!"}],
)
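If you manage fallback yourself instead, the same idea is a loop over an ordered model list. A sketch, where `call_fn` is any create-style callable and the model IDs you pass are placeholders for your own preferences:

```python
def call_with_fallback(call_fn, models, messages):
    """Try models in order; advance on errors, return the first success."""
    last_err = None
    for model in models:
        try:
            return call_fn(model=model, messages=messages)
        except Exception as e:  # In production, catch your SDK's RateLimitError here
            last_err = e
    raise RuntimeError(f"All {len(models)} fallback models failed") from last_err
```

Used with the client above it would look like `call_with_fallback(client.chat.completions.create, ["deepseek/deepseek-chat", ...], messages)`, with the remaining model IDs filled in from whatever your account has access to.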

Monitoring Rate Limit Usage

from collections import deque
import time

class RateMonitor:
    def __init__(self, window_seconds=60):
        self.calls = deque()
        self.window = window_seconds

    def record(self):
        now = time.time()
        self.calls.append(now)
        # Remove calls outside the window
        while self.calls and self.calls[0] < now - self.window:
            self.calls.popleft()

    @property
    def current_rpm(self):
        return len(self.calls)

    def safe_to_call(self, limit=180):
        return self.current_rpm < limit
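Wiring the monitor in front of a call is a simple pre-check gate: wait while over budget, then record and fire. A sketch (the helper name and wait parameters are our own choices):

```python
import time

def throttled_call(monitor, call_fn, limit=180, wait=0.05, max_waits=1200):
    """Block until monitor.safe_to_call(limit), then record the call and run it."""
    for _ in range(max_waits):
        if monitor.safe_to_call(limit):
            monitor.record()
            return call_fn()
        time.sleep(wait)  # Back off briefly; older calls age out of the window
    raise RuntimeError("Still over the rate limit after waiting")
```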

Best Practices Summary

  1. Always implement retry with exponential backoff — never retry immediately
  2. Add jitter — prevents thundering herd when many clients retry simultaneously
  3. Use a request queue — don't fire all requests at once
  4. Monitor your RPM — stay under limits proactively
  5. Use an API gateway — AIPower's smart routing auto-distributes across providers
  6. Cache responses — identical queries shouldn't hit the API twice
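Point 6 in practice is only a few lines: key the cache on the exact (model, messages) payload. A minimal in-memory sketch (no TTL or eviction; a real deployment would reach for an LRU or Redis):

```python
import hashlib
import json

_cache = {}

def cached_call(call_fn, model, messages):
    """Return a cached response for identical (model, messages) pairs."""
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(model=model, messages=messages)
    return _cache[key]
```

Note this only makes sense for deterministic-enough use cases; with temperature > 0, identical prompts legitimately produce different completions.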

AIPower's gateway distributes your requests across 10 providers, dramatically reducing the chance of hitting any single provider's rate limit. Try it at aipower.me — 200 RPM default, 50 free calls.

Ready to try?

50 free API calls. 16 models. One API key.

Create free account