Streaming AI Responses Token-by-Token
Stream LLM completions to the UI with backpressure-aware readers and abort handling.
Why Stream AI Responses?
When you call a large language model (LLM) like GPT-4 or Claude, the model generates tokens one by one. A typical response may take 5–20 seconds to complete. If you wait for the full response before sending anything to the client, the user stares at a blank screen the entire time.
Streaming solves this: you pipe each token to the browser as it is produced, creating the familiar "typewriter" effect used by ChatGPT, Claude.ai, and Gemini.
- Perceived latency drops from time-to-full-response to time-to-first-token (often under 300 ms).
- Users can start reading and even abort early if the answer is already clear.
- Server memory stays flat because you never buffer the whole response.
In Next.js 15 the primitives you need are ReadableStream (part of the Web Streams API) and a way to return it from a Route Handler: either the Vercel AI SDK's StreamingTextResponse helper or a plain Response with a stream body.
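To make the "plain Response with a stream body" option concrete, here is a minimal sketch of a Route Handler that streams a few hard-coded tokens. The file path, the token list, and the 50 ms delay are all assumptions for illustration; no model is called.

```typescript
// app/api/demo/route.ts (hypothetical path, chosen for illustration)
export function GET(): Response {
  const encoder = new TextEncoder();
  const tokens = ["Streaming", " ", "works"]; // stand-in for model output

  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for (const token of tokens) {
        // The response body carries bytes, so each string is encoded first.
        controller.enqueue(encoder.encode(token));
        // Artificial delay so the chunks arrive visibly separated.
        await new Promise((resolve) => setTimeout(resolve, 50));
      }
      controller.close();
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

This version pushes everything from start() and ignores how fast the client reads, which is acceptable for a short fixed list but not for long model output.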
How LLM SDKs Expose Streams
Most LLM SDKs return an async iterable or a ReadableStream when you pass stream: true. The Vercel AI SDK unifies these under a single interface.
With the official OpenAI SDK you receive a stream of ChatCompletionChunk objects. Each chunk carries a delta.content value that may be one token, a few characters, or empty; on the final chunk it can be missing entirely, so check it before using it.
- OpenAI SDK: openai.chat.completions.create({ stream: true }) returns an AsyncIterable.
- Vercel AI SDK: streamText() returns a result with result.toDataStreamResponse() ready for Next.js Route Handlers.
- Anthropic SDK: client.messages.stream() returns an async iterable of MessageStreamEvent.
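As a rough illustration of reading delta.content from OpenAI-style chunks, the sketch below filters out empty and missing pieces. ChunkLike is a simplified, hypothetical stand-in for the SDK's ChatCompletionChunk type, and mockChunks replaces the real network stream so the example runs without an API key.

```typescript
// Simplified, hypothetical stand-in for the SDK's ChatCompletionChunk type;
// only the fields this sketch reads are modeled.
interface ChunkLike {
  choices: Array<{ delta: { content?: string | null } }>;
}

// Yields only the non-empty text pieces from a stream of chunks.
async function* textDeltas(
  chunks: AsyncIterable<ChunkLike>
): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    const piece = chunk.choices[0]?.delta?.content;
    if (piece) yield piece; // skips "", null, and undefined
  }
}

// Mock stream for local experimentation; a real one would come from the SDK.
async function* mockChunks(): AsyncGenerator<ChunkLike> {
  yield { choices: [{ delta: { content: "" } }] };
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: { content: "lo" } }] };
  yield { choices: [{ delta: {} }] }; // final chunk with no content field
}
```

Because textDeltas accepts any AsyncIterable of that shape, the same loop should work when the mock is swapped for a real SDK stream, assuming the chunk fields match.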
Regardless of the SDK, the pattern is the same: iterate over chunks, encode each piece, and enqueue it into a ReadableStream that becomes the HTTP response body.
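The iterate, encode, enqueue pattern can be sketched with nothing but Web Streams. The helper name iterableToStream and the fakeTokens generator are hypothetical, introduced only for this example. The sketch uses pull() rather than start() so that the next chunk is requested only when the consumer has room, and cancel() so the source is released if the client goes away.

```typescript
// Hypothetical helper: wraps any async iterable of text in a byte stream.
function iterableToStream(
  chunks: AsyncIterable<string>
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();

  return new ReadableStream<Uint8Array>({
    // pull() is invoked only when the stream's queue has room, so a slow
    // client slows iteration down instead of growing an in-memory buffer.
    async pull(controller) {
      for (;;) {
        const result = await iterator.next();
        if (result.done) {
          controller.close();
          return;
        }
        if (result.value.length > 0) {
          controller.enqueue(encoder.encode(result.value));
          return;
        }
        // Empty piece: keep looping so every pull() enqueues or closes.
      }
    },
    // cancel() runs when the consumer cancels, e.g. the client disconnects.
    async cancel() {
      await iterator.return?.();
    },
  });
}

// Stand-in for an SDK stream; a real source would yield model output.
async function* fakeTokens(): AsyncGenerator<string> {
  for (const token of ["Hello", ", ", "", "world", "!"]) yield token;
}

// In a Route Handler this could be returned as:
//   new Response(iterableToStream(fakeTokens()))
```

Calling iterator.return() on cancel gives an async generator the chance to run its finally blocks, which is where an upstream request would typically be aborted.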