LLM provider recipes

Pipe Compresr-compressed context into OpenAI, Anthropic, Gemini, or a local Ollama call.

Compresr doesn't call the provider for you. It returns a shorter string that you drop into the call you already make, in the slot the provider treats as reference material (a system message for OpenAI and Ollama, system_instruction for Gemini, or the top-level system parameter for Anthropic). Pick your provider:

Client setup

Every recipe below assumes a single client built once at module scope, with the same shape as the client in the Python SDK reference.


CompressResponse.data is Optional[CompressResult] in Python and nullable in TypeScript; on error it is None (Python) or null (TypeScript). The recipes below read .data.compressed_context directly and assume the call succeeded. In production, guard the call before piping the string to the provider: in Python, wrap the compress call in try/except CompresrError; in TypeScript, check if (!result.data) throw new Error(result.error ?? 'compress failed'). See errors.
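The None check described above can be sketched as a small helper. The two dataclasses here are stand-ins that only mirror the fields named in this section; they are not the SDK's own classes, and the error attribute on the Python response is an assumption carried over from the TypeScript snippet. The helper name unwrap_compressed is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompressResult:
    """Stand-in for the SDK result; only the field used by the recipes."""
    compressed_context: str


@dataclass
class CompressResponse:
    """Stand-in for the SDK response; data is None when the call failed."""
    data: Optional[CompressResult] = None
    error: Optional[str] = None  # assumed to exist, as in the TS example


def unwrap_compressed(response) -> str:
    """Return the compressed string, or raise instead of passing None downstream."""
    if response.data is None:
        # getattr keeps the helper usable if the response has no error field.
        raise RuntimeError(getattr(response, "error", None) or "compress failed")
    return response.data.compressed_context
```

Calling unwrap_compressed on the response before building the provider request keeps a failed compression from reaching the model as an empty or missing system prompt.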

The OpenAI chat completions API treats the first {"role": "system", ...} message as instructions. Drop the compressed text there; keep the user's actual question in a user message.
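A minimal sketch of that message layout, assuming the compressed string has already been read from .data.compressed_context. The helper names build_openai_messages and ask are illustrative, and client is expected to be an openai.OpenAI instance passed in by the caller, so the sketch itself makes no network call.

```python
from typing import Dict, List


def build_openai_messages(compressed_context: str, question: str) -> List[Dict[str, str]]:
    """Put the compressed context in the system slot and the question in the user slot."""
    return [
        {"role": "system", "content": compressed_context},
        {"role": "user", "content": question},
    ]


def ask(client, model: str, compressed_context: str, question: str) -> str:
    """Send the two-message chat through an OpenAI-style client and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_openai_messages(compressed_context, question),
    )
    return response.choices[0].message.content
```

With the official SDK this would be called as ask(OpenAI(), "gpt-4o", compressed, "What changed in v2?"), where compressed is the string Compresr returned and the model name is only an example.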


Compresr counts tokens with tiktoken.encoding_for_model(compression_model_name) when the model is known and falls back to cl100k_base otherwise. In tiktoken that resolves to cl100k_base for GPT-4 and GPT-4 Turbo and to o200k_base for GPT-4o and the GPT-5 family; the counts usually differ from OpenAI's billed counts by less than 1%. Fewer input tokens also mean a faster prefill, so first-token latency drops noticeably in chat UIs.

When this helps

Token-cost wins, every provider. Hosted providers bill per input token. Compressing a 20k-token RAG payload by 60–80% lands directly on your bill, every call.
Headroom inside the context window. Long retrieval pipelines or accumulated chat history can crowd even 200k+ windows; compression buys room for the model's own reasoning output.
Lower time-to-first-token. Prefill is roughly linear in input length, so fewer input tokens mean the model starts generating sooner. The difference is noticeable in chat UIs and agent loops.
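The cost arithmetic behind the first point can be worked through in a few lines. This sketch reads "compressing by 60–80%" as removing that fraction of the input tokens, and the price of 2.50 USD per million input tokens is a made-up figure for illustration only; substitute your provider's real rate.

```python
def compression_savings(input_tokens: int, fraction_removed: float, usd_per_million_tokens: float):
    """Return (tokens_saved, tokens_remaining, usd_saved_per_call)."""
    tokens_saved = round(input_tokens * fraction_removed)
    tokens_remaining = input_tokens - tokens_saved
    usd_saved = tokens_saved * usd_per_million_tokens / 1_000_000
    return tokens_saved, tokens_remaining, usd_saved


# Hypothetical price, not any provider's published rate.
PRICE_USD_PER_MILLION = 2.50

for fraction in (0.60, 0.80):
    saved, remaining, usd = compression_savings(20_000, fraction, PRICE_USD_PER_MILLION)
    print(f"{fraction:.0%} removed: {remaining} tokens sent, {saved} saved, ${usd:.3f} saved per call")
```

For the 20k-token payload, the provider receives 8,000 tokens at 60% and 4,000 at 80%. The per-call saving scales linearly with both the payload size and the input-token price.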

Notes

Compressed context belongs in the slot the provider treats as reference material: system for OpenAI/Anthropic/Ollama, system_instruction for Gemini. The user's actual question stays in the user-role slot.
Don't compress short, structured system prompts (instructions, output schemas, tool definitions). Compresr is designed for long, semi-redundant retrieved or accumulated context, not for prompts you wrote by hand.
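The slot placement for each provider can be sketched as one helper that returns only the context and question portion of a request. The function name request_kwargs and the provider keys are illustrative. The dictionaries follow the general shape of each provider's chat request rather than any one SDK's exact call signature, and for Gemini in particular, where system_instruction is passed depends on the SDK version you use.

```python
from typing import Any, Dict


def request_kwargs(provider: str, compressed_context: str, question: str) -> Dict[str, Any]:
    """Return the context and question fields of a request for the given provider."""
    user_turn = {"role": "user", "content": question}
    if provider in ("openai", "ollama"):
        # Chat-style APIs: the context rides in a system-role message.
        return {"messages": [{"role": "system", "content": compressed_context}, user_turn]}
    if provider == "anthropic":
        # system is a top-level parameter here, not a message role.
        return {"system": compressed_context, "messages": [user_turn]}
    if provider == "gemini":
        # The context goes in system_instruction; the question is the content.
        return {"system_instruction": compressed_context, "contents": question}
    raise ValueError(f"unknown provider: {provider}")
```

In every branch the question stays in the user-facing slot, so swapping providers changes only where the compressed string is attached.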

For parameter semantics (target_compression_ratio, coarse, and friends) see models. For handling rate-limit and timeout failures from this two-call chain, see errors.