AI API Rate Limiting Best Practices: Handle Limits Like a Pro

April 17, 2026 · 8 min read

Rate limiting is the most common cause of AI API failures in production. Every provider imposes limits — requests per minute, tokens per minute, concurrent connections. If you don't handle them gracefully, your users see errors. This guide covers battle-tested patterns for production rate limit handling.

Common Rate Limits by Provider

Provider          | RPM (requests/min) | TPM (tokens/min) | Concurrent
OpenAI (Tier 1)   | 500                | 200K             | Unlimited
OpenAI (Tier 4)   | 10,000             | 2M               | Unlimited
Anthropic (Build) | 1,000              | 400K             | Unlimited
DeepSeek          | 300                | 300K             | 50
AIPower           | 600                | Unlimited        | 100

Pattern 1: Exponential Backoff with Jitter

The most important pattern: never retry immediately. Wait before retrying, and add random jitter so many clients don't all retry at the same moment (the thundering-herd problem):

import time
import random
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

def call_with_backoff(messages, model="deepseek/deepseek-chat", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model, messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s + jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
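
When the provider does send a Retry-After header with the 429, honoring it beats guessing. A minimal parsing sketch (the numeric-seconds form is standard HTTP; the plain headers dict shown here is an assumption about how your client exposes response headers):

```python
def retry_after_seconds(headers, default=1.0):
    """Return the server-requested wait in seconds, or a default."""
    value = headers.get("retry-after") or headers.get("Retry-After")
    if value is None:
        return default
    try:
        # Retry-After may also be an HTTP-date; we only handle the
        # numeric-seconds form here and fall back to the default otherwise.
        return max(0.0, float(value))
    except ValueError:
        return default
```

In the backoff loop above you could then wait `max(retry_after_seconds(...), wait)` so you never retry sooner than the server asks.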

Pattern 2: Token Bucket Rate Limiter

Prevent hitting limits in the first place by controlling your request rate:

import time
import threading

class TokenBucket:
    """Rate limiter using token bucket algorithm."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now

            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def wait_and_acquire(self):
        while not self.acquire():
            time.sleep(0.05)

# Usage: 10 requests per second, burst up to 20
limiter = TokenBucket(rate=10, capacity=20)

def rate_limited_call(messages, model="deepseek/deepseek-chat"):
    limiter.wait_and_acquire()
    return client.chat.completions.create(model=model, messages=messages)

Pattern 3: Request Queue with Workers

For high-throughput services, cap concurrency with a semaphore and track request timestamps so you stay under the RPM limit:

import asyncio
import time
from collections import deque

class RequestQueue:
    def __init__(self, max_concurrent=10, rpm_limit=500):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.rpm_limit = rpm_limit
        self.request_times = deque()

    async def process(self, messages, model):
        async with self.semaphore:
            # Check RPM
            now = time.monotonic()
            while len(self.request_times) >= self.rpm_limit:
                oldest = self.request_times[0]
                if now - oldest > 60:
                    self.request_times.popleft()
                else:
                    await asyncio.sleep(0.1)
                    now = time.monotonic()

            self.request_times.append(now)
            return await asyncio.to_thread(
                client.chat.completions.create,
                model=model, messages=messages
            )
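
The sliding-window check inside process() can be factored into a standalone async limiter, which makes the behavior easy to exercise without touching the API. A sketch (SlidingWindowLimiter is our name for it; the tiny window is just to make the throttling visible):

```python
import asyncio
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` acquisitions per `window` seconds."""
    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.times = deque()

    async def acquire(self):
        while True:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window
            while self.times and now - self.times[0] > self.window:
                self.times.popleft()
            if len(self.times) < self.limit:
                self.times.append(now)
                return
            await asyncio.sleep(0.01)

async def demo():
    # 3 requests per 0.2s window: requests 4-6 must wait for the window to roll
    limiter = SlidingWindowLimiter(limit=3, window=0.2)
    start = time.monotonic()
    for _ in range(6):
        await limiter.acquire()
    return time.monotonic() - start

elapsed = asyncio.run(demo())
```

With the real 60-second window, you would construct it as SlidingWindowLimiter(limit=500) for a 500 RPM tier.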

Best Practices Checklist

  1. Always implement exponential backoff — 429 responses often include a Retry-After header; honor it when present
  2. Track token usage — response headers include x-ratelimit-remaining-tokens
  3. Use request queues — don't let burst traffic saturate your limits
  4. Set client-side timeouts — 30s is a good default; don't wait forever
  5. Fall back to another model — if GPT is rate-limited, switch to DeepSeek
  6. Cache repeated queries — same input should not hit the API twice
  7. Monitor usage in real time — the AIPower dashboard shows requests/min live
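
Item 6 on the checklist, caching, can be as simple as keying on a hash of the request. A minimal in-memory sketch (cache_key and cached_call are illustrative names; production code would add a TTL and a size bound):

```python
import hashlib
import json

_cache = {}

def cache_key(model, messages):
    """Stable key: the same model + messages always hash identically."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(call, messages, model="deepseek/deepseek-chat"):
    """Only invoke the API (via `call`) on a cache miss."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call(messages, model)
    return _cache[key]
```

Pairing this with Pattern 1, e.g. cached_call(call_with_backoff, messages), means repeated prompts never count against your rate limit at all.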

AIPower Rate Limit Advantages

  • No token-per-minute limits — only request count limits
  • Higher default limits — 600 RPM vs 500 RPM on OpenAI Tier 1
  • Auto-fallback — use model="auto" and AIPower routes around rate-limited providers
  • Transparent headers — every response includes remaining quota in headers
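
The same fallback idea also works client-side: try models in preference order and move on when one is rate-limited. A generic sketch (the model list is illustrative; with the OpenAI SDK you would pass RateLimitError as the error type):

```python
def call_with_fallback(call, models, rate_limit_error=Exception):
    """Try each model in order; skip any that raise a rate-limit error."""
    last_error = None
    for model in models:
        try:
            return call(model)
        except rate_limit_error as e:
            last_error = e
    # Every model was rate-limited; surface the last error.
    raise last_error
```

For example: call_with_fallback(lambda m: client.chat.completions.create(model=m, messages=messages), ["gpt-4o", "deepseek/deepseek-chat"], RateLimitError).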

Handle rate limits like a pro. Start at aipower.me — generous limits, 50 free API calls, real-time usage monitoring.

Ready to try?

50 free API calls. 16 models. One API key.

Create free account