Compresr docs

Introduction

Compresr shrinks the context you send to your LLM without losing the answer.

Compresr is a context-compression API for LLM developers. You send the long text you would otherwise pass to your model - RAG chunks, document bodies, chat history, tool output - together with the query you want answered. Compresr scores every span of your input against that query, keeps the spans that carry the answer, drops the rest, and returns a shorter context you can forward to your LLM exactly as you would the original.
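To make the "score, keep, drop" contract concrete, here is a toy sketch in Python. It is not how latte_v1 scores spans: the sentence splitter, the word-overlap score, the threshold, and every function name below are assumptions chosen only to show the shape of the input (context plus query) and the output (a shorter context).

```python
import re


def split_spans(text: str) -> list[str]:
    """Split text into sentence-like spans (a crude stand-in for real span detection)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(span: str, query: str) -> float:
    """Fraction of the query's words that also appear in the span."""
    query_words = set(re.findall(r"\w+", query.lower()))
    span_words = set(re.findall(r"\w+", span.lower()))
    if not query_words:
        return 0.0
    return len(query_words & span_words) / len(query_words)


def toy_compress(context: str, query: str, threshold: float = 0.5) -> str:
    """Keep spans scoring at or above the threshold, preserving their original order."""
    kept = [s for s in split_spans(context) if overlap_score(s, query) >= threshold]
    return " ".join(kept)


context = (
    "The invoice was issued on 3 March. "
    "Our office is closed on public holidays. "
    "Payment for the invoice is due within 30 days."
)
query = "When is the invoice payment due?"
compressed = toy_compress(context, query)
print(compressed)  # Payment for the invoice is due within 30 days.
```

Only the sentence that answers the query survives; the other two are dropped because they share too few words with it.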

The result is the same answer with fewer input tokens. That means lower cost per call, a longer effective context window for the same model, and faster inference on every downstream request. Compresr sits in front of whatever LLM stack you already use - it does not replace your model, your prompt, or your retrieval layer.
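The cost claim is simple arithmetic, sketched below. The token counts and the price per million tokens are illustrative placeholders, not Compresr benchmarks or any provider's real pricing, and the calculation ignores whatever the compression call itself costs.

```python
def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single LLM call."""
    return tokens * usd_per_million_tokens / 1_000_000


def savings_per_call(
    original_tokens: int, compressed_tokens: int, usd_per_million_tokens: float
) -> dict[str, float]:
    """Compare the input cost of one call before and after compression."""
    before = input_cost_usd(original_tokens, usd_per_million_tokens)
    after = input_cost_usd(compressed_tokens, usd_per_million_tokens)
    return {
        "tokens_saved": original_tokens - compressed_tokens,
        "usd_saved": before - after,
        "reduction": 1 - compressed_tokens / original_tokens,
    }


# Hypothetical numbers: a 12,000-token context compressed to 3,000 tokens,
# priced at an assumed 2.00 USD per million input tokens.
result = savings_per_call(12_000, 3_000, 2.00)
```

With these assumed numbers the call sends 9,000 fewer input tokens, a 75% reduction.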

Where it fits

The shape of "long text + a query about it" shows up in most production LLM workloads. A few places teams plug Compresr in:

  • RAG: shrink retrieved chunks before they reach GPT, Claude, or Gemini, so you only pay for tokens that actually answer the user.
  • Agent loops: compress accumulating chat history, scratchpads, and tool transcripts so long-running agents stop hitting the context-window limit.
  • Tool output: compress noisy API responses, search results, or file contents before they re-enter the prompt.
  • Long-context Q&A: feed compressed documents into smaller, cheaper models without losing the parts that carry the answer.
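All four cases share one placement: compression runs after the long text is gathered and before the prompt is built. The sketch below shows that placement for RAG. The compressor is passed in as a plain callable so the example runs without any SDK; stub_compress is a hypothetical stand-in, and in a real pipeline that argument would be a thin wrapper around the Compresr client.

```python
from typing import Callable

# (context, query) -> shorter context
Compressor = Callable[[str, str], str]


def build_prompt(query: str, chunks: list[str], compress: Compressor) -> str:
    """Join retrieved chunks, compress them against the query, then build the prompt."""
    context = "\n\n".join(chunks)
    shorter = compress(context, query)
    return f"Context:\n{shorter}\n\nQuestion: {query}"


def stub_compress(context: str, query: str) -> str:
    """Stand-in compressor: keep paragraphs containing a query word of 5+ letters."""
    words = (w.strip("?.,!").lower() for w in query.split())
    keywords = {w for w in words if len(w) >= 5}
    kept = [p for p in context.split("\n\n") if any(k in p.lower() for k in keywords)]
    return "\n\n".join(kept)


chunks = [
    "Refunds are processed within 5 business days.",
    "Our headquarters moved to Lyon in 2019.",
    "Gift cards cannot be refunded.",
]
prompt = build_prompt("How long do refunds take?", chunks, stub_compress)
```

Because the retrieval step and the prompt template are untouched, swapping the stub for a real compressor changes one argument and nothing else.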

The model: latte_v1

latte_v1 is the public compression model. It is query-specific: every call requires a query, and the model keeps the spans of context that answer it. This is the right default whenever the downstream LLM call already has a clear question, instruction, or retrieval intent.

When to use latte_v1

Use latte_v1 whenever you know what the downstream LLM is being asked. RAG pipelines, agent tool calls with a concrete goal, long-context Q&A, and chat turns with a fresh user message all qualify. If the input is short enough that compression is not worth the round trip, skip it; otherwise latte_v1 is the model you want.
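One way to encode the "skip it when the input is short" rule is a small gate in front of the compression call. This is a sketch under assumptions: the 4-characters-per-token estimate is only a rough rule of thumb for English text, and the 2,000-token threshold is an arbitrary placeholder you would tune against your own latency and cost measurements.

```python
def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def should_compress(context: str, query: str, min_tokens: int = 2000) -> bool:
    """Compress only when a query exists and the context is long enough to justify a round trip."""
    if not query.strip():
        # latte_v1 is query-specific, so there is nothing to score against.
        return False
    return rough_token_count(context) >= min_tokens


short_input = should_compress("A two-line note.", "What does the note say?")
long_input = should_compress("x" * 8000, "What changed in this release?")
```

Here short_input is False and long_input is True, so only the second request would be routed through compression.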

Supported knobs on latte_v1:

  • query (required) - the question or instruction Compresr scores spans against. Without it, the model has nothing to score spans against.
  • coarse - skip span-level scoring and compress at a coarser granularity. Faster and cheaper on very long inputs where sentence-level precision is not needed.
  • heuristic_chunking - chunk the input with a heuristic splitter before scoring. Helps on structured inputs (logs, transcripts, tables) where the default chunker over- or under-splits.
  • disable_placeholders - return only the kept spans, with no placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers.

Full parameter semantics, defaults, and trade-offs - including target_compression_ratio, which controls how aggressively the model drops spans - live in the Models reference.
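As a hedged illustration of how these knobs might travel together, the helper below assembles a request body as a plain dictionary. The parameter names (query, coarse, heuristic_chunking, disable_placeholders, target_compression_ratio) come from this page; the flat layout, the "model" and "context" keys, and the helper itself are assumptions, so check the Models and API references for the real shape. Unset knobs are left out so that the service's own defaults apply.

```python
from typing import Any, Optional


def build_compress_request(
    context: str,
    query: str,
    *,
    coarse: Optional[bool] = None,
    heuristic_chunking: Optional[bool] = None,
    disable_placeholders: Optional[bool] = None,
    target_compression_ratio: Optional[float] = None,
) -> dict[str, Any]:
    """Assemble a hypothetical request body for a latte_v1 compression call."""
    if not query or not query.strip():
        raise ValueError("latte_v1 is query-specific: a non-empty query is required")

    # "model" and "context" are assumed key names, not confirmed by this page.
    payload: dict[str, Any] = {"model": "latte_v1", "context": context, "query": query}

    optional = {
        "coarse": coarse,
        "heuristic_chunking": heuristic_chunking,
        "disable_placeholders": disable_placeholders,
        "target_compression_ratio": target_compression_ratio,
    }
    # Send only the knobs the caller set explicitly.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_compress_request(
    "...long tool output...",
    "Which request returned a 500?",
    heuristic_chunking=True,
    disable_placeholders=True,
)
```

Rejecting an empty query on the client side mirrors the rule that every latte_v1 call requires one.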

Start with

Pick your language and send the first request. The shape of the call is identical across all three; only the syntax differs.

Pick a language

  • Python - pip install compresr, then call client.compress(...).
  • TypeScript - npm install @compresr/sdk, then call client.compress({...}).
  • cURL - one POST request, no install required.

Framework integrations

First-party integrations ship in both SDKs. Drop them into existing pipelines without rewriting the surrounding code.

  • LangChain - three middlewares (tool output, history summarization, prompt budget), a RAG document compressor, and a single-tool wrapper, for create_agent and ContextualCompressionRetriever.
  • LangGraph - adds make_compresr_node for custom state graphs, lossy CompresrCheckpointSerializer and CompresrStore for at-rest compression, and compresr_handoff_tool for supervisor → sub-agent transfers.
  • LlamaIndex - CompresrNodePostprocessor for query engines, wrap_tool_with_compresr for FunctionTools, and CompresrMemoryBlock for the Memory API.
  • LiteLLM - Python-only pre_call guardrail that auto-compresses tool/function messages before they go upstream; works with every LiteLLM provider.
  • LLM provider recipes - manual pattern called directly against OpenAI, Anthropic, Gemini, or local Ollama.
  • Quick start - the same 30-second example in Python, TypeScript, and cURL.
  • Authentication - how cmp_ keys are issued, rotated, and revoked.
  • API reference - every endpoint, parameter, and response field.