AI Engineering Academy · Lesson

Understanding Token Streaming

Understand how the streaming API sends partial completions as they are generated, how the OpenAI stream=True parameter works, and when streaming improves user experience.

Why Streaming Matters for User Experience

Without streaming, your application must wait for the LLM to generate the complete response before displaying anything, often 5-30 seconds for long answers. With streaming, the first token typically appears within 200-500 ms of sending the request, and subsequent tokens arrive as they are generated. The user watches the answer being written instead of waiting on a blank screen, so the application feels far more responsive even though the total generation time is the same.
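To make the time-to-first-token idea concrete, here is a minimal sketch of consuming a streamed chat completion and recording when the first piece of text arrives. It assumes a client object shaped like the one from the `openai` Python SDK (v1-style `client.chat.completions.create(..., stream=True)`, with text in `chunk.choices[0].delta.content`). The names `stream_completion` and `on_text` are hypothetical helpers introduced for illustration, and the model name is left for the caller to supply.

```python
import time
from typing import Any, Callable, Optional, Tuple


def _print_piece(piece: str) -> None:
    # Show each piece immediately instead of waiting for a newline.
    print(piece, end="", flush=True)


def stream_completion(
    client: Any,
    model: str,
    prompt: str,
    on_text: Callable[[str], None] = _print_piece,
) -> Tuple[str, Optional[float]]:
    """Stream one chat completion.

    Returns (full_text, seconds_until_first_text). The second value is None
    if the stream produced no text at all.
    """
    start = time.perf_counter()
    first_text_after: Optional[float] = None
    pieces = []

    # Assumption: `client` behaves like openai.OpenAI(); with stream=True the
    # call returns an iterable of chunks rather than one finished response.
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. metadata only)
        piece = chunk.choices[0].delta.content
        if not piece:
            continue  # role-only or final chunks have no text
        if first_text_after is None:
            first_text_after = time.perf_counter() - start
        pieces.append(piece)
        on_text(piece)

    return "".join(pieces), first_text_after
```

With the SDK installed, this could be called as `stream_completion(OpenAI(), model="<your-model>", prompt="...")`, where `OpenAI` is imported from the `openai` package and reads the API key from the `OPENAI_API_KEY` environment variable. Passing the client in as an argument keeps the helper easy to exercise with a stub.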

How LLMs Generate Tokens

LLMs are autoregressive: they generate text one token at a time, and each new token is conditioned on the prompt and all previously generated tokens. When the API receives a request, the model first processes the whole prompt and then samples the first token. Each subsequent token takes roughly the same amount of time to produce. Streaming sends each token to the client as soon as it is sampled, rather than buffering all tokens and sending the complete string at the end.

# Conceptual model of autoregressive generation
prompt = 'The capital of France is'

# Step 1: process full prompt, predict next token
# token_1 = sample(logits) → ' Paris'

# Step 2: append token_1 to context, predict next
# token_2 = sample(logits) → '.'

# Step 3: append token_2 to context, predict next
# token_3 = sample(logits) → '<|end|>'

# Total time: roughly time_to_process_prompt + n_tokens * time_per_token
# With streaming: first token arrives after about time_to_process_prompt (time to first token, TTFT)
# Without streaming: nothing arrives until about TTFT + n_tokens * time_per_token
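The conceptual model above can be turned into a small runnable toy. This is only a sketch: the `NEXT_TOKEN` table is a hypothetical stand-in for a real model's forward pass and sampling, and the clock is simulated with made-up millisecond costs rather than measured. It assumes the first token is available as soon as the prompt has been processed and that every later sampling step, including the one that produces the end marker, costs the same amount.

```python
from typing import Iterator, Tuple

END = "<|end|>"

# Hypothetical lookup table: context so far -> next token.
NEXT_TOKEN = {
    "The capital of France is": " Paris",
    "The capital of France is Paris": ".",
    "The capital of France is Paris.": END,
}


def generate(prompt: str, prefill_ms: int, per_token_ms: int) -> Iterator[Tuple[str, int]]:
    """Streaming view: yield (token, simulated_arrival_ms) as each token is sampled."""
    context = prompt
    clock = prefill_ms  # the first token is ready once the prompt is processed
    while True:
        token = NEXT_TOKEN[context]
        if token == END:
            return
        yield token, clock
        context += token       # append the new token to the context
        clock += per_token_ms  # each further sampling step costs the same


def complete_buffered(prompt: str, prefill_ms: int, per_token_ms: int) -> Tuple[str, int]:
    """Non-streaming view: return (full_text, simulated_ms_when_anything_is_visible)."""
    text, done_at = "", prefill_ms
    for token, arrived_at in generate(prompt, prefill_ms, per_token_ms):
        text += token
        # One more step is needed after this token before the end marker is seen.
        done_at = arrived_at + per_token_ms
    return text, done_at


if __name__ == "__main__":
    prompt = "The capital of France is"
    for token, at_ms in generate(prompt, prefill_ms=300, per_token_ms=20):
        print(f"{at_ms:>4} ms  streamed {token!r}")
    text, at_ms = complete_buffered(prompt, prefill_ms=300, per_token_ms=20)
    print(f"{at_ms:>4} ms  buffered {text!r}")
```

In the streaming view the caller sees `' Paris'` at the simulated 300 ms mark, while the buffered view shows nothing until the whole string is ready at 340 ms. With hundreds of tokens instead of two, that gap grows to the multi-second difference described earlier.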

All lessons in this course

  1. Understanding Token Streaming
  2. Consuming Streams with the Python SDK
  3. Streaming in FastAPI with Server-Sent Events
  4. Handling Tool Calls in Streamed Responses