Streaming compression
Stream compressed tokens as they're produced over Server-Sent Events for low-latency, user-facing UX.
Use streaming when a human is watching the output land - chat UIs, agent traces, anything where time-to-first-token matters more than the total wall-clock time. The Compresr SDKs expose compress_stream / compressStream as iterators; the underlying HTTP transport is Server-Sent Events on /api/compress/question-specific/stream.
This guide covers the basic streaming pattern in all three languages, the shape of each chunk, how to handle errors that fire mid-stream, and when streaming is the wrong choice.
Why stream
A non-streaming call buffers the entire compressed output server-side and returns it in one HTTP response. Latency to the first visible character equals the total compression time.
Streaming pushes characters to the client the moment the model produces them. First visible output lands much sooner, which matters when:
- A user is watching a chat response build up
- An agent loop displays per-step traces
- A long-running compression sits on the latency critical path of a UI
For background jobs and pipelines that need the full output before doing anything with it, the synchronous compress call is simpler.
Basic streaming
The Python and TypeScript SDKs expose streaming as an iterator. cURL streams Server-Sent Events directly; pass -N to disable output buffering and parse data: lines yourself.
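If you consume the raw Server-Sent Events stream rather than an SDK iterator, parsing the data: lines might look like the sketch below. The helper name iter_sse_chunks is hypothetical, and the sketch assumes each event is a single data: line holding one JSON object, which this guide does not spell out; adjust it if the wire format differs.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE `data:` line.

    Assumption: every event is a single-line JSON payload. Multi-line
    `data:` events are not joined in this sketch.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ":" are comments
        # (often used as keep-alives).
        if not line or line.startswith(":"):
            continue
        # Skip other SSE fields such as event:, id:, and retry:.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):]
        # The SSE format strips one optional space after the colon.
        if payload.startswith(" "):
            payload = payload[1:]
        yield json.loads(payload)
```

The function accepts any iterable of text lines, so it can be fed from a file, a socket wrapper, or the line iterator of whichever HTTP client you use.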
The chunk shape
Every chunk is a small JSON object with the same fields, including content, done, and error.
Concatenate content across chunks until you see done: true. The order is stable: content chunks come first, followed by a single done: true chunk. The terminal chunk's content is always an empty string. If the server aborts mid-stream, the chunk's error field carries a description of the failure.
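The concatenation rule can be sketched as follows. The field names content, done, and error come from the description above; the Python types, the choice to treat a missing field as empty, and the names StreamChunk and join_chunks are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple, TypedDict


class StreamChunk(TypedDict, total=False):
    content: str          # text fragment; empty string on the terminal chunk
    done: bool            # True only on the final chunk
    error: Optional[str]  # description if the server aborted (assumed absent or null otherwise)


def join_chunks(chunks: Iterable[StreamChunk]) -> Tuple[str, Optional[str]]:
    """Concatenate content until done is true.

    Returns (text, error). If an error chunk arrives, the text collected
    so far is returned alongside the error description.
    """
    parts = []
    for chunk in chunks:
        error = chunk.get("error")
        if error:
            return "".join(parts), error
        parts.append(chunk.get("content", ""))
        if chunk.get("done"):
            break  # the terminal chunk is last; ignore anything after it
    return "".join(parts), None
```

Returning the error next to the partial text, rather than raising, leaves the caller free to decide what to do with whatever arrived.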
If you need the token-accounting metadata (original_tokens, compressed_tokens, tokens_saved, actual_compression_ratio, duration_ms) for billing or telemetry, call the non-streaming compress endpoint instead. The streaming endpoint emits only text chunks, not the response envelope.
Handling errors mid-stream
A stream can fail partway. The connection can drop, the server can return 5xx after some content has already been emitted, or a downstream rate limit can interrupt the response. Design for partial output: keep what you have, surface a clear error, and decide whether to retry from scratch or keep the partial result.
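One way to design for partial output is sketched below. It works on any iterator of chunk dictionaries, so it does not depend on a specific SDK call signature. StreamResult, consume, and consume_with_retry are hypothetical names, and the retry policy (start over, otherwise keep the longest partial text) is only one reasonable choice, not the SDK's documented behavior.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamResult:
    text: str                    # everything received before the stream ended
    complete: bool               # True only if a done chunk arrived
    error: Optional[str] = None  # description of what went wrong, if anything


def consume(chunks: Iterable[dict]) -> StreamResult:
    """Drain one stream, keeping the partial text if it fails partway."""
    parts = []
    try:
        for chunk in chunks:
            if chunk.get("error"):
                return StreamResult("".join(parts), False, str(chunk["error"]))
            parts.append(chunk.get("content", ""))
            if chunk.get("done"):
                return StreamResult("".join(parts), True)
    except Exception as exc:  # dropped connection, 5xx raised by the client, etc.
        return StreamResult("".join(parts), False, f"{type(exc).__name__}: {exc}")
    # The iterator ended without a done chunk: treat the output as truncated.
    return StreamResult("".join(parts), False, "stream ended before done chunk")


def consume_with_retry(
    open_stream: Callable[[], Iterable[dict]],
    max_attempts: int = 2,
) -> StreamResult:
    """Retry from scratch on failure; fall back to the longest partial result."""
    best = StreamResult("", False, "no attempts made")
    for _ in range(max_attempts):
        result = consume(open_stream())
        if result.complete:
            return result
        if len(result.text) >= len(best.text):
            best = result
    return best
```

Here open_stream is a zero-argument callable that starts a fresh stream, for example a lambda wrapping your SDK's streaming call. A UI that has already rendered partial text may prefer to skip the retry and show result.text together with result.error.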
When NOT to stream
- The downstream consumer is a non-streaming LLM call. You gain nothing by streaming the compressed text just to buffer it before passing it to a synchronous chat.completions.create. Use plain compress instead.
- Batch / pipeline jobs over many contexts. Per-call streaming overhead is wasted when no human is watching. Use compress_batch to round-trip once instead.
- Tight latency budgets at high concurrency. Streams hold an open HTTP connection per request, so the synchronous compress call is cheaper to schedule when you're firing hundreds of compressions per second.