Tutorial

AI API Streaming with Server-Sent Events: Complete Guide

April 17, 2026 · 8 min read

Streaming lets your AI application display responses token-by-token as they are generated, rather than waiting for the entire response. This dramatically improves perceived latency and user experience. Under the hood, AI APIs use Server-Sent Events (SSE) to push tokens to your client in real time.

How SSE Streaming Works

When you set stream=True, the API sends a series of small JSON chunks over an HTTP connection instead of one large response. Each chunk contains one or more tokens. The connection stays open until the response is complete.

  1. Client sends a POST request with stream: true
  2. Server responds with Content-Type: text/event-stream
  3. Server pushes data: {...} events as tokens are generated
  4. Final event: data: [DONE] signals completion
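On the wire, each event is a `data:` line carrying a JSON chunk in the chat-completion format. A minimal sketch of pulling a token out of one event by hand (the payload shown is illustrative; real chunks carry more fields, such as `id` and `model`):

```python
import json

# An illustrative raw SSE event as it appears on the wire.
event = 'data: {"choices": [{"delta": {"content": "Hello"}, "index": 0}]}'

def extract_token(line: str):
    """Return the delta content from one SSE data line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":  # sentinel marking the end of the stream
        return None
    chunk = json.loads(payload)
    delta = chunk["choices"][0].get("delta", {})
    return delta.get("content")

print(extract_token(event))  # prints: Hello
```

The SDK loops below do exactly this parsing for you; seeing the raw frames mostly helps when debugging a proxy or building your own client.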

Python: Basic Streaming

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aipower.me/v1",
    api_key="YOUR_AIPOWER_KEY",
)

# Enable streaming
stream = client.chat.completions.create(
    model="deepseek/deepseek-chat",
    messages=[{"role": "user", "content": "Explain how neural networks learn"}],
    stream=True,
)

# Process tokens as they arrive
for chunk in stream:
    if not chunk.choices:  # guard: some providers send chunks with an empty choices list
        continue
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()  # Newline at the end

Node.js: Basic Streaming

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aipower.me/v1",
  apiKey: "YOUR_AIPOWER_KEY",
});

async function streamResponse() {
  const stream = await client.chat.completions.create({
    model: "deepseek/deepseek-chat",
    messages: [{ role: "user", content: "Explain how neural networks learn" }],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
  console.log();
}

streamResponse();

Building a Real-Time Chat UI (FastAPI + SSE)

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI(base_url="https://api.aipower.me/v1", api_key="YOUR_KEY")

@app.post("/api/chat")
def chat(messages: list[dict]):
    def generate():
        stream = client.chat.completions.create(
            model="deepseek/deepseek-chat",
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content if chunk.choices else None
            if content:
                # Tokens containing newlines would break this simple framing;
                # JSON-encode the payload in production.
                yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"

    # A sync generator keeps the blocking OpenAI client off the event loop;
    # Starlette iterates it in a threadpool.
    return StreamingResponse(generate(), media_type="text/event-stream")
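To smoke-test the endpoint from Python, you can consume the same SSE stream with `requests`. This is a sketch, assuming the app above is running locally on port 8000; the helper names are illustrative:

```python
import requests

def parse_sse_line(line: str):
    """Return the payload of a 'data: ...' line, or None for anything else."""
    if line.startswith("data: "):
        return line[len("data: "):]
    return None

def stream_chat(url, messages):
    """Yield tokens from an SSE chat endpoint like the one above."""
    with requests.post(url, json=messages, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            data = parse_sse_line(line) if line else None
            if data is None:
                continue  # blank keep-alive lines between events
            if data == "[DONE]":
                break
            yield data

if __name__ == "__main__":
    msgs = [{"role": "user", "content": "Explain how neural networks learn"}]
    for token in stream_chat("http://localhost:8000/api/chat", msgs):
        print(token, end="", flush=True)
    print()
```

`stream=True` on the `requests` side is what keeps the connection open so `iter_lines()` can hand you events as they arrive.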

Frontend: Consuming SSE in JavaScript

async function sendMessage(messages) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(messages),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let result = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE events may be split across network chunks, so parse only
    // complete lines and keep the trailing partial line in the buffer.
    const lines = buffer.split("\n");
    buffer = lines.pop();
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6);
      if (data === "[DONE]") return result;
      result += data;
      updateChatUI(result); // Re-render the message in your UI
    }
  }
  return result;
}

Streaming with Error Handling

def stream_with_retry(messages, model="deepseek/deepseek-chat", max_retries=3):
    for attempt in range(max_retries):
        yielded = False
        try:
            stream = client.chat.completions.create(
                model=model, messages=messages, stream=True,
            )
            for chunk in stream:
                content = chunk.choices[0].delta.content if chunk.choices else None
                if content:
                    yielded = True
                    yield content
            return
        except Exception as e:
            if yielded or attempt == max_retries - 1:
                # Once tokens have been yielded, retrying would duplicate
                # output, so surface the error instead of restarting.
                raise
            print(f"Stream error (attempt {attempt + 1}): {e}")

Performance Tips

  • First-token latency: with streaming, the first token typically appears in 200-500 ms, versus 2-5 s to receive a complete non-streaming response
  • Cost is identical: streaming requests are billed exactly like non-streaming ones, for the same input and output tokens
  • Abort early: Cancel the stream if the user navigates away to save output tokens
  • Buffer for rendering: Batch UI updates every 50ms instead of per-token to avoid jank
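The last tip can be sketched as a small helper that groups incoming tokens into ~50 ms batches, so the UI re-renders once per batch instead of once per token. The names here are illustrative, not part of any SDK:

```python
import time

def batch_tokens(tokens, interval=0.05):
    """Group streamed tokens so the UI re-renders at most every `interval` seconds."""
    buffer = []
    last_flush = time.monotonic()
    for tok in tokens:
        buffer.append(tok)
        now = time.monotonic()
        if now - last_flush >= interval:
            yield "".join(buffer)
            buffer.clear()
            last_flush = now
    if buffer:  # flush whatever remains when the stream ends
        yield "".join(buffer)

# Usage: wrap any token stream, render per batch instead of per token.
# for batch in batch_tokens(token_stream):
#     render(batch)
```

No tokens are lost: everything still flows through, just in larger chunks, which keeps fast models from causing render jank.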

All 16 models on AIPower support streaming. Start building real-time AI UIs at aipower.me with 50 free API calls.
