Streaming compression

Stream compressed tokens as they are produced, over SSE-formatted data: frames, for low-latency, user-facing UX.

Use streaming when a human is watching the output land - chat UIs, agent traces, anything where time-to-first-token matters more than the total wall-clock time. The Compresr SDKs expose compress_stream / compressStream as iterators; the underlying HTTP transport is newline-delimited JSON in SSE-formatted data: frames on /api/compress/question-specific/stream. It's a subset of SSE - no event:, id:, or retry: lines.

This guide covers the basic streaming pattern in Python, TypeScript, and cURL, the shape of each chunk, how to handle errors that fire mid-stream, and when streaming is the wrong choice.

Why stream

A non-streaming call buffers the entire compressed output server-side and returns it in one HTTP response. Latency to the first visible character equals the total compression time.

Streaming pushes characters to the client the moment the model produces them. First visible output lands much sooner, which matters when:

A user is watching a chat response build up
An agent loop displays per-step traces
A long-running compression sits on the latency critical path of a UI

For background jobs and pipelines that need the full output before doing anything with it, the synchronous compress call is simpler.

Basic streaming

The Python and TypeScript SDKs expose streaming as an iterator. With cURL you read the data: frames directly: pass -N to disable output buffering, then parse the frames yourself. Watch for the data: [DONE] sentinel that marks end-of-stream.

python
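If you are reading the raw frames rather than using an SDK, the parsing step can be sketched as follows. This is a minimal sketch, not the SDK's implementation: it assumes each data: frame carries a JSON object with a content key (as described under "The chunk shape"), the function name iter_content is hypothetical, and the sample payloads are made up for illustration. Feeding it lines from a real HTTP response is left to your client library.

```python
import json
from typing import Iterable, Iterator

DONE_SENTINEL = "[DONE]"


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the content string of each data: frame, stopping at [DONE].

    Assumption: every non-sentinel frame is a JSON object with a
    "content" key. Lines that do not start with "data:" (such as the
    blank separators between frames) are skipped.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == DONE_SENTINEL:
            return  # end-of-stream sentinel
        frame = json.loads(payload)
        yield frame["content"]


# Illustrative frames only; real payloads come from the streaming endpoint.
sample = [
    'data: {"content": "Paris is "}',
    "",
    'data: {"content": "the capital."}',
    "",
    "data: [DONE]",
]
compressed = "".join(iter_content(sample))
print(compressed)  # Paris is the capital.
```

Stopping on the sentinel rather than on connection close means a dropped connection is never mistaken for a clean finish.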

The chunk shape

The wire only carries content frames:

text

No done, no error. The server signals end-of-stream with a literal data: [DONE] sentinel and closes the connection.

The SDK generator (Python) and async iterator (TypeScript) wrap the content of each wire frame as {content, done: false}, then yield a final synthetic {content: "", done: true} once the underlying iterator is exhausted. Concatenate content across chunks and stop on done: true.
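The wrapping and the consumer loop described above can be sketched as follows. The StreamChunk class here is a stand-in written for this example, not the SDK's own type; its field names follow the text, and the type of the error field is an assumption. The helper names as_chunks and collect are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamChunk:
    # Stand-in for the SDK chunk type (assumption: error is an optional string).
    content: str
    done: bool = False
    error: Optional[str] = None  # in the schema, but never populated today


def as_chunks(wire_contents: Iterable[str]) -> Iterator[StreamChunk]:
    """Mimic the SDK: one chunk per wire frame, then a synthetic final chunk."""
    for content in wire_contents:
        yield StreamChunk(content=content, done=False)
    yield StreamChunk(content="", done=True)


def collect(chunks: Iterable[StreamChunk]) -> str:
    """Concatenate content across chunks and stop on done=True."""
    parts = []
    for chunk in chunks:
        if chunk.done:
            break
        parts.append(chunk.content)
    return "".join(parts)


text = collect(as_chunks(["Paris is ", "the capital."]))
print(text)  # Paris is the capital.
```

In a UI you would typically render each chunk's content as it arrives instead of collecting it, but the stop condition is the same.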

The error field on StreamChunk is defined in the schema but currently unused - no code path populates it. Server-side aborts propagate as raw httpx exceptions (Python) or ConnectionError (TypeScript), not as chunks with error set.

If you need the token-accounting metadata (original_tokens, compressed_tokens, tokens_saved, actual_compression_ratio, duration_ms) for billing or telemetry, call the non-streaming compress endpoint; the streaming endpoint only emits text chunks, not the response envelope.

Handling errors

Errors split cleanly by when they happen:

Before the first byte. 4xx / 5xx responses are mapped to typed errors (RateLimitError, AuthenticationError, CompresrError, ...) by the initial status check. Catch these to back off or re-authenticate.
After the first byte. Once the SDK is inside the byte-loop, mid-stream drops surface as raw transport exceptions - httpx.HTTPError in Python, ConnectionError in TypeScript. There is no typed RateLimitError after the first byte, since the headers and status have already been consumed.

Design for partial output: keep what you have, surface a clear error, and decide whether to reissue the request from scratch. RetryConfig is bypassed on streaming - the SDK only retries the initial connect, never a mid-stream failure. retry_after / retryAfter is nullable, so check it for a null value before sleeping on it.

python
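The partial-output pattern can be sketched without the SDK installed. In this sketch, RateLimitError and TransportError are stand-in classes defined locally (the latter playing the role of httpx.HTTPError), open_stream is a hypothetical callable that returns an iterator of content strings, and the one-second default delay is an arbitrary choice. Swap in the real SDK call and exception types in your own code.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


class RateLimitError(Exception):
    """Stand-in for the SDK's typed error; retry_after may be None."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


class TransportError(Exception):
    """Stand-in for the raw transport exception raised mid-stream."""


def backoff_delay(retry_after: Optional[float], default: float = 1.0) -> float:
    """Guard the nullable retry_after before sleeping on it."""
    return default if retry_after is None else max(0.0, retry_after)


def stream_once(
    open_stream: Callable[[], Iterable[str]],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[Exception]]:
    """Return (text_so_far, error); error is None on a clean finish."""
    parts = []
    try:
        for piece in open_stream():
            parts.append(piece)
    except RateLimitError as err:
        # Typed errors fire before the first byte, so there is nothing to keep.
        sleep(backoff_delay(err.retry_after))
        return "", err
    except TransportError as err:
        # Mid-stream drop: keep the partial output and let the caller decide.
        return "".join(parts), err
    return "".join(parts), None


def flaky() -> Iterable[str]:
    # Simulated stream that drops after the first piece.
    yield "Paris is "
    raise TransportError("connection dropped")


partial, error = stream_once(flaky, sleep=lambda seconds: None)
print(repr(partial), type(error).__name__)  # 'Paris is ' TransportError
```

Returning the error alongside the partial text, instead of re-raising, leaves the reissue decision to the caller, which matters because the SDK will not retry a mid-stream failure for you.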

SDK caveats

Python is sync-only. compress_stream returns a Generator, not an AsyncGenerator. There is no compress_stream_async. Under FastAPI or any other asyncio application, offload the loop to a thread with asyncio.to_thread or run_in_executor, or use the TypeScript SDK, which is async-native.
TypeScript timeout kills the whole stream. The client arms a single AbortController for timeout ms from connect; the deadline covers the whole stream and is not idle-based. Long streams may need timeout: 0 or a much larger value at client construction.
Agent-layer streaming is not wired up. client.messages.stream(...) and client.chat.completions.stream(...) on the TS agent facades throw CompresrError('streaming not yet implemented', code: 'not_implemented'); the Python facades don't expose .stream(...) at all. It's a Phase 2 work item. The compression-API compress_stream / compressStream shown above is the working streaming path today.
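For the sync-only Python caveat, one way to keep chunk-by-chunk delivery inside an asyncio application is to run the blocking generator in a worker thread and hand items to the event loop through a queue. The sketch below is one possible bridge, not something the SDK provides: iterate_in_thread and fake_stream are hypothetical names, fake_stream stands in for a call that returns the sync generator, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterator

_END = object()  # marks exhaustion of the sync iterator


async def iterate_in_thread(
    make_iter: Callable[[], Iterator[str]],
) -> AsyncIterator[str]:
    """Expose a blocking iterator as an async iterator via a worker thread."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def pump() -> None:
        # Runs in the worker thread; only touches the loop thread-safely.
        try:
            for item in make_iter():
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _END)

    worker = asyncio.create_task(asyncio.to_thread(pump))
    try:
        while True:
            item = await queue.get()
            if item is _END:
                break
            if isinstance(item, Exception):
                raise item  # re-raise the worker's error in the event loop
            yield item
    finally:
        await worker  # the thread runs until the sync iterator finishes


def fake_stream() -> Iterator[str]:
    # Stand-in for a blocking generator of content strings.
    yield "Paris is "
    yield "the capital."


async def main() -> str:
    parts = []
    async for piece in iterate_in_thread(fake_stream):
        parts.append(piece)
    return "".join(parts)


result = asyncio.run(main())
print(result)  # Paris is the capital.
```

Note the limitation: the worker thread cannot be cancelled, so abandoning the async loop early still waits for the underlying stream to finish. If you do not need incremental delivery, joining the whole stream inside a single asyncio.to_thread call is simpler.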

When NOT to stream

The downstream consumer is a non-streaming LLM call. You gain nothing by streaming the compressed text just to buffer it before passing it to a synchronous chat.completions.create. Use plain compress instead.
Batch / pipeline jobs over many contexts. Per-call streaming overhead is wasted when no human is watching. Use compress_batch to round-trip once instead.
Tight latency budgets at high concurrency. Streams hold an open HTTP connection per request, so the synchronous compress call is cheaper to schedule when you're firing hundreds of compressions per second.