Introduction
Compresr shrinks the context you send to your LLM without losing the answer.
Compresr is a context-compression API for LLM developers. You send the long text you would otherwise pass to your model - RAG chunks, document bodies, chat history, tool output - together with the query you want answered. Compresr scores every span of your input against that query, keeps the spans that carry the answer, drops the rest, and returns a shorter context you can forward to your LLM exactly as you would the original.
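To make the "score spans against the query, keep the relevant ones" idea concrete, here is a deliberately naive sketch. It is not Compresr's model or algorithm: the sentence splitter, the word-overlap score, and the `threshold` value are all assumptions chosen only to illustrate the shape of query-conditioned compression.

```python
import re


def split_spans(text: str) -> list[str]:
    """Split text into sentence-like spans (a crude stand-in for real span detection)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_span(span: str, query: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the span."""
    query_words = set(re.findall(r"\w+", query.lower()))
    span_words = set(re.findall(r"\w+", span.lower()))
    if not query_words:
        return 0.0
    return len(query_words & span_words) / len(query_words)


def toy_compress(context: str, query: str, threshold: float = 0.3) -> str:
    """Keep spans whose score meets the threshold, preserving original order."""
    kept = [s for s in split_spans(context) if score_span(s, query) >= threshold]
    return " ".join(kept)


context = (
    "The office opened in 2015. "
    "Refunds are issued within 14 days of purchase. "
    "Our mascot is a green parrot."
)
# Only the refund sentence shares enough words with the query to be kept.
print(toy_compress(context, "How many days do refunds take?"))
```

A learned model replaces the word-overlap score in practice, but the contract is the same: long context and a query go in, a shorter context comes out.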
The result is the same answer with fewer input tokens. That means lower cost per call, a longer effective context window for the same model, and faster inference on every downstream request. Compresr sits in front of whatever LLM stack you already use - it does not replace your model, your prompt, or your retrieval layer.
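The cost effect is simple arithmetic on input tokens. The sketch below uses invented numbers (an 8,000-token context compressed to 2,000 tokens, and a placeholder price of 3 USD per million input tokens); none of these figures are measured Compresr results or real provider prices.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one call at a flat per-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of input tokens removed by compression."""
    return 1 - compressed_tokens / original_tokens


# Hypothetical figures, for illustration only.
original, compressed, price = 8_000, 2_000, 3.0

before = input_cost_usd(original, price)
after = input_cost_usd(compressed, price)
print(f"per-call input cost: {before:.4f} -> {after:.4f} USD")
print(f"tokens removed: {token_reduction(original, compressed):.0%}")
```

The same ratio applies to context-window headroom: a context cut to a quarter of its size leaves room for roughly four times as much source material in the same window.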
Where it fits
The shape of "long text + a query about it" shows up in most production LLM workloads. A few places teams plug Compresr in:
- RAG: shrink retrieved chunks before they reach GPT, Claude, or Gemini, so you only pay for tokens that actually answer the user.
- Agent loops: compress accumulating chat history, scratchpads, and tool transcripts so long-running agents stop running into the context window limit.
- Tool output: compress noisy API responses, search results, or file contents before they re-enter the prompt.
- Long-context Q&A: feed compressed documents into smaller, cheaper models without losing the parts that carry the answer.
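As a sketch of the RAG case from the list above, the compression step sits between retrieval and prompt assembly. The `Compressor` type and `build_rag_prompt` helper are hypothetical names for illustration; in a real pipeline the callable would wrap the Compresr SDK call, whose exact signature is documented in the API reference rather than here.

```python
from typing import Callable

# Hypothetical signature: (context, query) -> shorter context.
Compressor = Callable[[str, str], str]


def build_rag_prompt(question: str, chunks: list[str], compress: Compressor) -> str:
    """Join retrieved chunks, compress them against the question, then build the prompt."""
    context = "\n\n".join(chunks)
    shorter = compress(context, question)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{shorter}\n\n"
        f"Question: {question}"
    )
```

Passing the compressor in as a callable keeps the retrieval layer and the prompt template unchanged, which matches the "sits in front of your existing stack" positioning.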
The model: latte_v1
latte_v1 is the public compression model. It is query-specific: every call requires a query, and the model keeps the spans of context that answer it. This is the right default whenever the downstream LLM call already has a clear question, instruction, or retrieval intent.
When to use latte_v1
Use latte_v1 whenever you know what the downstream LLM is being asked. RAG pipelines, agent tool calls with a concrete goal, long-context Q&A, and chat turns with a fresh user message all qualify. If the input is short enough that compression is not worth the round trip, skip it; otherwise latte_v1 is the model you want.
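The "skip it if the input is short" rule can be expressed as a small gate in front of the compression call. This is a minimal sketch: the four-characters-per-token estimate is a rough rule of thumb for English text, and the `min_tokens` default is an arbitrary placeholder, not a Compresr recommendation.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def should_compress(context: str, min_tokens: int = 2000) -> bool:
    """Return True when the context is long enough to justify the extra round trip."""
    return estimate_tokens(context) >= min_tokens
```

Tune the threshold against your own latency and cost measurements; a real tokenizer gives a tighter estimate than character counts.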
Supported knobs on latte_v1:
- query (required) - the question or instruction Compresr scores spans against. Without it, the model has no basis for deciding which spans to keep.
- coarse - skip span-level scoring and compress at a coarser granularity. Faster and cheaper on very long inputs where sentence-level precision is not needed.
- heuristic_chunking - chunk the input with a heuristic splitter before scoring. Helps on structured inputs (logs, transcripts, tables) where the default chunker over- or under-splits.
- disable_placeholders - return only the kept spans, with no placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers.
Full parameter semantics, defaults, and trade-offs - including target_compression_ratio, which controls how aggressively the model drops spans - live in the Models reference.
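One way to keep these knobs in a single place is a small options object that validates the required query before a request is built. `LatteOptions` is a hypothetical helper, not part of the SDK. The parameter names come from the text above, but the types (booleans for the three flags, a float for target_compression_ratio) and the defaults shown are assumptions; the Models reference is the authority.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LatteOptions:
    """Hypothetical container for latte_v1 parameters (types and defaults assumed)."""

    query: str
    coarse: bool = False
    heuristic_chunking: bool = False
    disable_placeholders: bool = False
    target_compression_ratio: Optional[float] = None

    def to_kwargs(self) -> dict:
        """Return keyword arguments, omitting any parameter left unset (None)."""
        if not self.query.strip():
            raise ValueError("latte_v1 requires a non-empty query")
        return {k: v for k, v in asdict(self).items() if v is not None}


# Example: structured log input where placeholder markers are unwanted.
opts = LatteOptions(
    query="Which request failed and why?",
    heuristic_chunking=True,
    disable_placeholders=True,
)
print(opts.to_kwargs())
```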
Start with
Pick your language and send the first request. The shape of the call is identical across all three; only the syntax differs.
Pick a language
- Python - pip install compresr, then call client.compress(...).
- TypeScript - npm install @compresr/sdk, then call client.compress({...}).
- cURL - one POST request, no install required.
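For the Python path, the first request might be wrapped as below. This is a sketch under stated assumptions: the text confirms only that a client exposes compress(...), that query is required, and that the model is named latte_v1. The keyword names `context` and `model`, the `CompressClient` protocol, and the `first_request` helper are hypothetical, and client construction is left to the Quick start because its form is not shown here.

```python
from typing import Any, Protocol


class CompressClient(Protocol):
    """Anything with a compress(...) method, e.g. an SDK client instance."""

    def compress(self, **kwargs: Any) -> Any: ...


def first_request(client: CompressClient, context: str, query: str) -> Any:
    """Send one compression request and return whatever the client returns."""
    # `context` and `model` are assumed keyword names; check the API reference.
    return client.compress(context=context, query=query, model="latte_v1")
```

Typing the client as a protocol lets you swap in a stub during tests and the real SDK client in production without changing the calling code.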
Framework integrations
First-party integrations ship in both SDKs — drop them into existing pipelines without rewriting the surrounding code.
- LangChain - three middlewares (tool output, history summarization, prompt budget), a RAG document compressor, and a single-tool wrapper, for create_agent and ContextualCompressionRetriever.
- LangGraph - adds make_compresr_node for custom state graphs, lossy CompresrCheckpointSerializer + CompresrStore for at-rest compression, and compresr_handoff_tool for supervisor → sub-agent transfers.
- LlamaIndex - CompresrNodePostprocessor for query engines, wrap_tool_with_compresr for FunctionTools, and CompresrMemoryBlock for the Memory API.
- LiteLLM - Python-only pre_call guardrail that auto-compresses tool/function messages before they go upstream; works against every LiteLLM provider.
- LLM provider recipes - manual pattern called directly against OpenAI, Anthropic, Gemini, or local Ollama.
Related reading
- Quick start - the same 30-second example in Python, TypeScript, and cURL.
- Authentication - how cmp_ keys are issued, rotated, and revoked.
- API reference - every endpoint, parameter, and response field.