LLM provider recipes
Pipe Compresr-compressed context into OpenAI, Anthropic, Gemini, or a local Ollama call.
Compresr doesn't call the provider for you. It returns a shorter string — you drop that string into the call you already make, in the slot the provider treats as reference material (system, system_instruction, or top-level system). Pick your provider:
OpenAI
The OpenAI chat completions API treats the first {"role": "system", ...} message as instructions. Drop the compressed text there; keep the user's actual question in a user message.
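A minimal sketch of that layout, assuming you already hold the compressed string returned by Compresr (the Compresr call itself is not shown here). The helper names build_openai_messages and ask_openai are hypothetical, and the default model name is only a placeholder; the OpenAI call uses the official openai Python SDK.

```python
def build_openai_messages(compressed_context: str, question: str) -> list[dict]:
    """Put the compressed context in the system slot and the question in the user slot."""
    return [
        {"role": "system", "content": compressed_context},
        {"role": "user", "content": question},
    ]


def ask_openai(compressed_context: str, question: str, model: str = "gpt-4o-mini") -> str:
    """Send the two-message conversation to the chat completions API.

    The model name is a placeholder; substitute the one you already use.
    """
    # Imported lazily so build_openai_messages works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=build_openai_messages(compressed_context, question),
    )
    return response.choices[0].message.content
```

Call ask_openai(compressed, "What changed in the Q3 report?") with the string Compresr returned; the question is passed through uncompressed.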
Compresr's token counts use tiktoken's cl100k_base encoding, the tokenizer for GPT-4 and GPT-3.5 Turbo, so for those models the numbers in the Compresr response closely match what OpenAI bills. Newer OpenAI models such as GPT-4o use the o200k_base encoding, so treat the counts as a close estimate there rather than an exact figure. Fewer input tokens also means a faster prefill, which lowers time-to-first-token in chat UIs.
When this helps
- Token-cost wins on every hosted provider. Hosted providers bill per input token, so compressing a 20k-token RAG payload by 60–80% (down to roughly 4k–8k tokens) cuts the input cost for that payload by the same share on every call.
- Headroom inside the context window. Long retrieval pipelines or accumulated chat history can crowd even 200k+ windows; compression buys room for the model's own reasoning output.
- Lower time-to-first-token. Prefill is roughly linear in input length. Fewer input tokens means the model starts generating sooner — noticeable on chat UIs and agent loops.
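The cost arithmetic in the first bullet can be sketched as below. The function name compression_savings is hypothetical, and the price of 2.50 USD per million input tokens is an illustrative assumption, not any provider's actual rate.

```python
def compression_savings(input_tokens: int, reduction: float, usd_per_million_input: float) -> dict:
    """Estimate tokens and input cost saved per call for a given compression reduction.

    reduction is the fraction of tokens removed, e.g. 0.6 for a 60% reduction.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0.0, 1.0)")
    remaining = round(input_tokens * (1.0 - reduction))
    saved = input_tokens - remaining
    return {
        "remaining_tokens": remaining,
        "saved_tokens": saved,
        "usd_saved_per_call": saved * usd_per_million_input / 1_000_000,
    }


# Hypothetical price: 2.50 USD per million input tokens.
low = compression_savings(20_000, 0.6, 2.50)   # 8,000 tokens remain
high = compression_savings(20_000, 0.8, 2.50)  # 4,000 tokens remain
```

At that assumed price, a 60% reduction on a 20k-token payload saves 12,000 input tokens per call; multiply by your call volume to estimate the effect on the bill.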
Notes
Compressed context belongs in the slot the provider treats as reference material — system for OpenAI/Anthropic/Ollama, system_instruction for Gemini. The user's actual question stays in the user-role slot. Don't compress short, structured system prompts (instructions, output schemas, tool definitions) — Compresr is designed for long, semi-redundant retrieved or accumulated context, not for prompts you wrote by hand.
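As a rough illustration of the slot mapping above, the sketch below builds the JSON request body for each provider from the same two inputs. The function build_payload is hypothetical, the field layouts follow each provider's public chat API as understood at the time of writing, and the max_tokens value is an arbitrary placeholder; check the provider's reference before relying on any of it.

```python
def build_payload(provider: str, model: str, compressed_context: str, question: str) -> dict:
    """Return a request body with compressed context in the provider's system slot."""
    user_turn = {"role": "user", "content": question}
    system_turn = {"role": "system", "content": compressed_context}

    if provider in ("openai", "ollama"):
        # Both accept a system-role message at the start of the messages list.
        payload = {"model": model, "messages": [system_turn, user_turn]}
        if provider == "ollama":
            payload["stream"] = False  # ask Ollama's /api/chat for a single JSON response
        return payload

    if provider == "anthropic":
        # Anthropic takes the system text as a top-level field, not as a message.
        return {
            "model": model,
            "max_tokens": 1024,  # placeholder; required by the Messages API
            "system": compressed_context,
            "messages": [user_turn],
        }

    if provider == "gemini":
        # For Gemini's REST API the model name goes in the URL, not the body.
        return {
            "system_instruction": {"parts": [{"text": compressed_context}]},
            "contents": [{"role": "user", "parts": [{"text": question}]}],
        }

    raise ValueError(f"unknown provider: {provider}")
```

In every branch the user's question travels uncompressed in the user-role slot; only the long reference material goes through Compresr.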
For parameter semantics (target_compression_ratio, coarse, and friends) see models. For handling rate-limit and timeout failures from this two-call chain, see errors.