Streaming LLM Responses End-to-End
The Problem With Waiting
Users hate staring at a spinner for 8 seconds while your LLM assembles a complete response. Streaming fixes perceived latency — first token in ~400 ms feels fast even when total generation is 12 seconds. But wiring streaming correctly end-to-end is not obvious. SSE semantics, backpressure, cancellation, and partial JSON tokens each have their own failure modes, and most tutorials stop at "call the API with stream: true."
This post walks the entire path from provider API to browser DOM, including the parts nobody covers.
Architecture Overview
The critical insight: you need two separate streaming connections — one from your server to the LLM, and one from the browser to your server. They must be coupled so an abort from the browser actually cancels the upstream LLM request and stops billing you.
Server-Side: Consuming the Provider Stream
Most providers (OpenAI, Anthropic, Gemini) expose an SSE endpoint. The Node.js implementation below uses the Anthropic SDK but the pattern is identical for any provider.
// stream-handler.ts
import Anthropic from "@anthropic-ai/sdk";
import { Request, Response } from "express";
const client = new Anthropic();
export async function streamChat(req: Request, res: Response) {
  const { messages, system } = req.body;
  // Set SSE headers before any await — headers must be sent before first chunk
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.setHeader("X-Accel-Buffering", "no"); // critical for nginx
  res.flushHeaders();
  const abortController = new AbortController();
  // When the client disconnects, abort the upstream LLM request
  req.on("close", () => {
    abortController.abort();
  });
  try {
    const stream = await client.messages.stream(
      {
        model: "claude-opus-4-5",
        max_tokens: 2048,
        system,
        messages,
      },
      { signal: abortController.signal }
    );
    for await (const event of stream) {
      if (event.type === "content_block_delta") {
        const delta = event.delta;
        if (delta.type === "text_delta") {
          // SSE format: "data: <payload>\n\n"
          res.write(`data: ${JSON.stringify({ text: delta.text })}\n\n`);
        }
      }
      if (event.type === "message_stop") {
        res.write(`data: [DONE]\n\n`);
        res.end();
        return;
      }
    }
    // Fallback: if the stream ends without a message_stop event, still close the response
    res.write(`data: [DONE]\n\n`);
    res.end();
  } catch (err: any) {
    if (abortController.signal.aborted || err.name === "AbortError") {
      // Client disconnected — this is normal, not an error
      res.end();
      return;
    }
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
}
The nginx Buffering Problem
If you run behind nginx without X-Accel-Buffering: no, nginx buffers your SSE chunks and delivers them in batches — destroying the streaming effect entirely. Set this header on the response. In the nginx config you can also set proxy_buffering off for your SSE locations.
location /api/chat/stream {
    proxy_pass http://localhost:3001;
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 300s;
    chunked_transfer_encoding on;
}
Client-Side: Consuming SSE
The browser EventSource API is read-only and does not support POST requests or custom headers. Use the Fetch API with a ReadableStream reader instead.
// useStream.ts
export async function streamChat(
  messages: Message[],
  onToken: (text: string) => void,
  onDone: () => void,
  signal: AbortSignal
) {
  const response = await fetch("/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
    signal, // abort signal from the component
  });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE messages are separated by "\n\n"
    const messages_raw = buffer.split("\n\n");
    buffer = messages_raw.pop() ?? ""; // last chunk may be incomplete
    for (const msg of messages_raw) {
      if (!msg.startsWith("data: ")) continue;
      const payload = msg.slice(6).trim();
      if (payload === "[DONE]") {
        onDone();
        return;
      }
      let parsed: { text?: string; error?: string } | undefined;
      try {
        parsed = JSON.parse(payload);
      } catch {
        // Partial JSON — keep in buffer and try next chunk
        buffer = msg + "\n\n" + buffer;
        continue;
      }
      // Throw outside the try/catch so a server-sent error is not
      // mistaken for partial JSON and silently re-buffered
      if (parsed?.error) throw new Error(parsed.error);
      if (parsed?.text) onToken(parsed.text);
    }
  }
}
Partial-Token JSON Parsing
The JSON parse try/catch above handles a real failure mode: a network read can end partway through a data: line, especially when a token contains escaped characters like " or \n, so the JSON payload arrives split across two chunks. The fix is to keep the incomplete chunk in the buffer and let it merge with the next chunk before parsing. Never assume each data: line is valid JSON.
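To make the failure mode concrete, here is a small self-contained illustration of that buffering; the two read strings are invented for the example.
// Two network reads that split one SSE message in the middle of its JSON payload
const reads = ['data: {"text": "hel', 'lo"}\n\ndata: {"text": " world"}\n\n'];
let buffer = "";
const tokens: string[] = [];
for (const read of reads) {
  buffer += read;
  const parts = buffer.split("\n\n");
  buffer = parts.pop() ?? ""; // anything after the last "\n\n" is incomplete and stays buffered
  for (const msg of parts) {
    if (!msg.startsWith("data: ")) continue;
    tokens.push(JSON.parse(msg.slice(6)).text);
  }
}
console.log(tokens); // ["hello", " world"]: nothing is parsed until the full message has arrived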
React Hook with Abort
// useStreamChat.ts
import { useRef, useState, useCallback } from "react";
import { streamChat } from "./useStream";
export function useStreamChat() {
  const [tokens, setTokens] = useState<string>("");
  const [streaming, setStreaming] = useState(false);
  const abortRef = useRef<AbortController | null>(null);
  const send = useCallback(async (messages: Message[]) => {
    // Cancel any in-flight request before starting a new one
    abortRef.current?.abort();
    abortRef.current = new AbortController();
    setTokens("");
    setStreaming(true);
    try {
      await streamChat(
        messages,
        (text) => setTokens((prev) => prev + text),
        () => setStreaming(false),
        abortRef.current.signal
      );
    } catch (err: any) {
      if (err.name !== "AbortError") {
        console.error("Stream error:", err);
      }
      setStreaming(false);
    }
  }, []);
  const abort = useCallback(() => {
    abortRef.current?.abort();
    setStreaming(false);
  }, []);
  return { tokens, streaming, send, abort };
}
The abortRef.current?.abort() at the start of send ensures that if the user sends a new message before the previous stream finishes, the old request is cancelled — not left running in the background consuming tokens.
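For completeness, here is one way a component might consume the hook; the ChatBox component and its markup are illustrative and not part of the original code.
// ChatBox.tsx
import { useState } from "react";
import { useStreamChat } from "./useStreamChat";
export function ChatBox() {
  const [input, setInput] = useState("");
  const { tokens, streaming, send, abort } = useStreamChat();
  return (
    <div>
      <pre>{tokens}</pre>
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      {streaming ? (
        <button onClick={abort}>Stop</button>
      ) : (
        <button onClick={() => send([{ role: "user", content: input }])}>Send</button>
      )}
    </div>
  );
}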
Backpressure
SSE has no built-in backpressure. If the LLM generates tokens faster than the browser can render them, you accumulate tokens in memory. For most use cases this is fine — token throughput (~100 tokens/sec) is far below what innerHTML can handle. But if you're doing expensive DOM operations per token (syntax highlighting, markdown parsing), batch updates:
// Debounced render — accumulate tokens, flush once per animation frame
let pending = "";
let rafId: number | null = null;
function onToken(text: string) {
  pending += text;
  if (!rafId) {
    rafId = requestAnimationFrame(() => {
      setTokens((prev) => prev + pending);
      pending = "";
      rafId = null;
    });
  }
}
requestAnimationFrame batches renders to the display refresh rate (typically 60 Hz), which is usually enough. For markdown parsing, trigger it only on onDone.
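If you do want markdown, one option is to render plain text while streaming and run the parser a single time in onDone. A sketch, where renderMarkdown and the #output element are placeholders for your own parser and DOM node:
// Plain text during streaming, full markdown parse only once the stream ends
declare function renderMarkdown(source: string): string;
const outputEl = document.querySelector("#output") as HTMLElement;
let raw = "";
function onToken(text: string) {
  raw += text;
  outputEl.textContent = raw; // cheap per-token update, no parsing
}
function onDone() {
  outputEl.innerHTML = renderMarkdown(raw); // expensive parse runs exactly once
}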
Timeout and Error Handling
LLM providers can stall mid-stream (network hiccup, provider overload). Add a keep-alive timeout on the server:
let lastChunk = Date.now();
const timeout = setInterval(() => {
  if (Date.now() - lastChunk > 30_000) {
    abortController.abort();
    res.write(`data: ${JSON.stringify({ error: "stream timeout" })}\n\n`);
    res.end();
    clearInterval(timeout);
  }
}, 5_000);
// In the event loop:
lastChunk = Date.now(); // reset on each chunk
// Remember to clearInterval(timeout) when the stream ends normally as well
On the client, surface the error in the UI — never silently drop it.
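One way to do that is a variant of the hook above that exposes an error field for the component to render; the error state is an addition for illustration, not part of the original hook.
// useStreamChatWithErrors.ts — sketch of surfacing stream errors in the UI
import { useCallback, useRef, useState } from "react";
import { streamChat } from "./useStream";
export function useStreamChatWithErrors() {
  const [tokens, setTokens] = useState("");
  const [error, setError] = useState<string | null>(null);
  const [streaming, setStreaming] = useState(false);
  const abortRef = useRef<AbortController | null>(null);
  const send = useCallback(async (messages: Message[]) => {
    abortRef.current?.abort();
    abortRef.current = new AbortController();
    setTokens("");
    setError(null);
    setStreaming(true);
    try {
      await streamChat(
        messages,
        (text) => setTokens((prev) => prev + text),
        () => setStreaming(false),
        abortRef.current.signal
      );
    } catch (err: any) {
      if (err.name !== "AbortError") {
        setError(err.message); // includes server-sent errors such as "stream timeout"
      }
      setStreaming(false);
    }
  }, []);
  return { tokens, error, streaming, send };
}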
Testing Streams
Unit testing streaming endpoints requires a different approach than regular HTTP tests.
// stream.test.ts
import { streamChat } from "./useStream";
test("stream emits tokens then DONE", async () => {
  const tokens: string[] = [];
  let done = false;
  await streamChat(
    [{ role: "user", content: "Hello" }],
    (t) => tokens.push(t),
    () => (done = true),
    new AbortController().signal
  );
  expect(tokens.length).toBeGreaterThan(0);
  expect(done).toBe(true);
});
test("abort stops stream", async () => {
  const ac = new AbortController();
  const tokens: string[] = [];
  const promise = streamChat(
    [{ role: "user", content: "Write a long essay" }],
    (t) => {
      tokens.push(t);
      if (tokens.length === 3) ac.abort(); // abort after 3 tokens
    },
    () => {},
    ac.signal
  );
  // Aborting mid-stream rejects the pending read with an AbortError,
  // so treat that rejection as the expected outcome
  await promise.catch((err) => expect(err.name).toBe("AbortError"));
  expect(tokens.length).toBeLessThan(100); // stream was cut short
});
Key Takeaways
- Set X-Accel-Buffering: no and disable nginx proxy_buffering — without this, streaming silently becomes batched delivery.
- Use Fetch + ReadableStream instead of EventSource — POST support, custom headers, and abort signal integration are not optional.
- Couple browser abort to upstream LLM abort via req.on("close") — otherwise cancelled requests keep billing you.
- Buffer incomplete SSE chunks before JSON parsing — chunk boundaries can split mid-JSON token.
- Batch DOM updates with requestAnimationFrame when doing expensive rendering per token.
- Add a server-side keep-alive timeout — LLM providers can stall mid-stream without sending an error.