Next.js 15 Fullstack (App Router + Server Actions) · Lesson

Streaming AI Responses Token-by-Token

Stream LLM completions to the UI with backpressure-aware readers and abort handling.

Why Stream AI Responses?

When you call a large language model (LLM) like GPT-4 or Claude, the model generates tokens one by one. A typical response may take 5–20 seconds to complete. If you wait for the full response before sending anything to the client, the user stares at a blank screen the entire time.

Streaming solves this: you pipe each token to the browser as it is produced, creating the familiar "typewriter" effect used by ChatGPT, Claude.ai, and Gemini.

  • Perceived latency drops from time-to-full-response to time-to-first-token (often under 300 ms).
  • Users can start reading and even abort early if the answer is already clear.
  • Server memory stays flat because each chunk is forwarded as it arrives; you never buffer the whole response.

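In Next.js 15 the primitives you need come from the Web Streams API: a ReadableStream passed as the body of a plain Response. StreamingTextResponse is not part of Next.js; it is a helper from earlier releases of the Vercel AI SDK that has since been deprecated, so a plain Response with a stream body is the safer default.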
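As a minimal sketch of "a plain Response with a stream body", the handler below streams a few hard-coded strings that stand in for model output. The token list and the route path mentioned in the comment are assumptions for illustration only; they do not come from the lesson.

```typescript
// Minimal sketch: a Route Handler-style GET that streams a few hard-coded "tokens".
// In a Next.js app you would add `export` and place this in a route file such as
// app/api/demo/route.ts (hypothetical path).
async function GET(): Promise<Response> {
  const encoder = new TextEncoder();
  const tokens = ["Hello", ", ", "world", "!"]; // stand-in for model output

  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      for (const token of tokens) {
        // The response body carries bytes, so each string is encoded first.
        controller.enqueue(encoder.encode(token));
      }
      controller.close(); // signals the end of the response body
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

Enqueuing everything inside start() is fine for a short fixed list; a real model stream would be fed incrementally instead.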

How LLM SDKs Expose Streams

Most LLM SDKs return an async iterable or a ReadableStream when you pass stream: true. The Vercel AI SDK unifies these under a single interface.

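With the official OpenAI SDK you receive a stream of ChatCompletionChunk objects. Each chunk carries a choices[0].delta whose content string may be one token, a few characters, an empty string, or absent altogether (for example on the final chunk, which only reports the finish reason). Always guard against missing content before appending it.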

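  • OpenAI SDK: await openai.chat.completions.create({ ..., stream: true }) resolves to a Stream object that is an AsyncIterable of ChatCompletionChunk.
  • Vercel AI SDK: streamText() returns a result with result.toDataStreamResponse() ready for Next.js Route Handlers (AI SDK 4.x naming; check the helper name for the major version you have installed).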
  • Anthropic SDK: client.messages.stream() returns an async iterable of MessageStreamEvent.
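To illustrate how the delta.content pieces are typically pulled out of a chunk stream, here is a small sketch. ChunkLike is a hand-written structural subset of the chunk shape described above, not the SDK's own type, and fakeChunks is a hypothetical stand-in so the example runs without a network call or API key.

```typescript
// Structural subset of the chunk shape described above (not the SDK's full type).
interface ChunkLike {
  choices: Array<{ delta: { content?: string | null } }>;
}

// Yields only the non-empty text pieces from a stream of chunks.
async function* textDeltas(
  chunks: AsyncIterable<ChunkLike>,
): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    const piece = chunk.choices[0]?.delta?.content;
    if (piece) yield piece; // skips "", null, and undefined
  }
}

// Hypothetical stand-in for an SDK stream.
async function* fakeChunks(): AsyncGenerator<ChunkLike> {
  yield { choices: [{ delta: { content: "" } }] }; // empty first piece
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: { content: "lo" } }] };
  yield { choices: [{ delta: {} }] }; // final chunk, no content field
}
```

Because textDeltas only relies on the structural shape, the same loop should work with any source whose chunks expose choices[0].delta.content.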

Regardless of the SDK, the pattern is the same: iterate over chunks, encode each piece, and enqueue it into a ReadableStream that becomes the HTTP response body.
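The iterate, encode, enqueue pattern can be sketched as a small adapter. The function name iterableToStream is an assumption made for this example. It uses pull() rather than start() so that the source is only advanced when the consumer has room, which is one way to respect backpressure, and it uses cancel() to stop the source when the client goes away.

```typescript
// Adapts any async iterable of strings into a byte stream usable as a Response body.
function iterableToStream(
  source: AsyncIterable<string>,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = source[Symbol.asyncIterator]();

  return new ReadableStream<Uint8Array>({
    // pull() is called only when the stream's internal queue has room,
    // so a slow client slows down how fast we read from the source.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(value));
    },
    // cancel() is called when the consumer cancels the stream (for example
    // after a client disconnect); returning the iterator lets the source
    // run its cleanup and stop generating.
    async cancel() {
      await iterator.return?.();
    },
  });
}
```

In a Route Handler the result would be returned as new Response(iterableToStream(textSource)), where textSource is whatever async iterable of strings your SDK call produces.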
