Compresr docs

Introduction

Compresr shrinks the context you send to your LLM without losing the answer.

Compresr is a context-compression API for LLM developers. Send the long text you'd pass to your model, along with the query you want answered; Compresr keeps the spans that carry the answer, drops the rest, and returns a shorter context that you forward to your LLM. The result: fewer input tokens, lower cost, a longer effective context window, and faster inference. It sits in front of whatever LLM stack you already use and replaces nothing.

Use cases

Anywhere you send long text with a question about it:

  • RAG: compress retrieved chunks before they reach the model, so you pay only for the tokens that carry the answer.
  • Conversations: compress growing chat history so long sessions stay inside the context window.
  • Tool outputs (agents): compress noisy tool results (web search hits, API responses, file dumps) before they re-enter the prompt. Pass the tool call's intent as the query so Compresr keeps only what the agent asked for.
  • Document understanding: ask a question against long, dense documents (legal contracts, medical records, financial filings) and keep only the spans that answer it.
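
The tool-output pattern above can be sketched as follows. This is a minimal illustration, not the SDK's API: `Compressor`, `compress_tool_output`, and `keyword_stub` are hypothetical names, and the stub is a toy stand-in so the example runs without a network call. In a real agent, the `compress` argument would wrap a Compresr request.

```python
from typing import Callable

# Hypothetical signature: (context, query) -> compressed context.
# In a real pipeline this callable would wrap a Compresr SDK or HTTP call.
Compressor = Callable[[str, str], str]


def compress_tool_output(tool_name: str, tool_args: dict, raw_output: str,
                         compress: Compressor) -> str:
    """Compress a tool result, using the tool call's intent as the query."""
    # Turn the tool call into a short description of what the agent asked for.
    intent = " ".join([tool_name, *(str(v) for v in tool_args.values())])
    return compress(raw_output, intent)


def keyword_stub(context: str, query: str) -> str:
    """Toy stand-in (NOT Compresr): keep lines that mention any query term."""
    terms = [t.lower() for t in query.split()]
    kept = [line for line in context.splitlines()
            if any(t in line.lower() for t in terms)]
    return "\n".join(kept)


raw = "\n".join([
    "Subscribe to our newsletter!",
    "Our refund policy allows returns within 30 days.",
    "Cookie settings",
])
shorter = compress_tool_output("web_search", {"q": "refund policy"}, raw, keyword_stub)
print(shorter)
```

Passing the compressor in as a callable keeps the agent code independent of how the request is made, so the stub can be swapped for a real client without touching the call site.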

The models: latte_v1 and latte_v2

Compresr exposes two query-specific compression models on the public API:

  • latte_v1 — the stable, battle-tested model.
  • latte_v2 (beta) — the newer model: up to 5x faster than latte_v1 at the same or better compression quality.

Which one to use

Reach for latte_v2 by default. It's a drop-in for latte_v1: every parameter latte_v1 takes works here too, plus a dynamic mode that picks the compression ratio per input. latte_v1 stays available for the rare case where latte_v2 falls short.

Parameters

Every call takes a query and the context to compress. Everything else is optional:

query (string, required)
The question or instruction Compresr scores against. It keeps the spans of context that carry the information the query asks for and drops the rest.

target_compression_ratio (number, optional)
How aggressively to compress. A value 0 < r ≤ 1 removes that fraction of the input; a value r > 1 targets an r× (r-fold) reduction. Ignored when dynamic is on.

coarse (boolean, optional)
Default: false
Score at a coarser granularity instead of span by span. Faster on very long inputs where fine-grained precision is not needed.

heuristic_chunking (boolean, optional)
Default: false
Split the input with a heuristic chunker before scoring. Helps on structured inputs like logs, transcripts, and tables.

disable_placeholders (boolean, optional)
Default: false
Return only the kept spans, with no placeholder markers between removed regions. Useful when the downstream LLM is sensitive to gap markers.

dynamic (boolean, optional)
Default: false
latte_v2 only. Let Compresr choose the compression ratio per input instead of a fixed target_compression_ratio. A good default for mixed-difficulty inputs; use a fixed ratio when you need a predictable token budget.

dynamic_min_ratio (number, optional)
Default: 1.5
latte_v2 only. Lower bound on the ratio dynamic can pick.

dynamic_max_ratio (number, optional)
Default: 10.0
latte_v2 only. Upper bound on the ratio dynamic can pick.
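
The two forms of target_compression_ratio can be compared with a little arithmetic. The helper below is a hypothetical, client-side estimate that follows the literal reading of the description above (fraction removed for 0 < r ≤ 1, r-fold reduction for r > 1); it is not part of the SDK, and the size Compresr actually returns may differ from the target.

```python
def expected_kept_tokens(input_tokens: int, ratio: float) -> int:
    """Estimate how many tokens remain for a given target_compression_ratio."""
    if ratio <= 0:
        raise ValueError("target_compression_ratio must be greater than 0")
    if ratio <= 1:
        # Fraction form: remove `ratio` of the input, keep the rest.
        return round(input_tokens * (1 - ratio))
    # Multiplier form: shrink the input to roughly 1/ratio of its size.
    return round(input_tokens / ratio)


print(expected_kept_tokens(1000, 0.8))  # fraction form: remove 80% of the input
print(expected_kept_tokens(1000, 5))    # multiplier form: a 5x reduction
```

Both calls describe the same budget of about 200 tokens for a 1,000-token input, so the choice between the two forms is a matter of which reads more naturally in your code.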

The three dynamic* parameters are latte_v2 only. Full semantics, defaults, and the support matrix live in the Models reference.
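
One way to catch a latte_v2-only option sent to latte_v1 before the request leaves your code is a small validation step. This is a sketch under assumptions: `build_payload` is a hypothetical helper, and the `model` and `context` field names are guesses at the request shape rather than confirmed API fields. The option names and defaults come from the list above.

```python
from typing import Any

# Options documented above; the three dynamic* ones are latte_v2 only.
V2_ONLY = ("dynamic", "dynamic_min_ratio", "dynamic_max_ratio")
KNOWN = V2_ONLY + ("target_compression_ratio", "coarse",
                   "heuristic_chunking", "disable_placeholders")


def build_payload(model: str, query: str, context: str, **options: Any) -> dict:
    """Assemble a request body (field names assumed) and validate the options."""
    if not query or not context:
        raise ValueError("query and context are both required")
    unknown = [name for name in options if name not in KNOWN]
    if unknown:
        raise ValueError(f"unknown option(s): {', '.join(unknown)}")
    v2_used = [name for name in V2_ONLY if name in options]
    if v2_used and model != "latte_v2":
        raise ValueError(f"{', '.join(v2_used)} require latte_v2")
    # Documented defaults: 1.5 and 10.0.
    low = options.get("dynamic_min_ratio", 1.5)
    high = options.get("dynamic_max_ratio", 10.0)
    if low > high:
        raise ValueError("dynamic_min_ratio must not exceed dynamic_max_ratio")
    ratio = options.get("target_compression_ratio")
    if ratio is not None and ratio <= 0:
        raise ValueError("target_compression_ratio must be greater than 0")
    return {"model": model, "query": query, "context": context, **options}


payload = build_payload("latte_v2", "What is the notice period?",
                        "long contract text", dynamic=True, dynamic_max_ratio=8.0)
print(payload)
```

Failing early on the client keeps the error next to the code that caused it; the Models reference remains the authority on which options each model accepts.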

Start with

Pick your language and send the first request. The shape of the call is identical across all three; only the syntax differs.

Pick a language

  • Python - pip install compresr, then call client.compress(...).
  • TypeScript - npm install @compresr/sdk, then call client.compress({...}).
  • cURL - one POST request, no install required.

Framework integrations

First-party integrations ship in both SDKs, so you can drop them into existing pipelines without rewriting the surrounding code.

  • LangChain: three middlewares (tool output, history summarization, prompt budget), a RAG document compressor, and a single-tool wrapper, for create_agent and ContextualCompressionRetriever.
  • LangGraph: adds make_compresr_node for custom state graphs, lossy CompresrCheckpointSerializer and CompresrStore for at-rest compression, and compresr_handoff_tool for supervisor → sub-agent transfers.
  • LlamaIndex: CompresrNodePostprocessor for query engines, wrap_tool_with_compresr for FunctionTools, and CompresrMemoryBlock for the Memory API.
  • LiteLLM: Python-only pre_call guardrail that auto-compresses tool/function messages before they go upstream, working against every LiteLLM provider.
  • LLM provider recipes: manual pattern called directly against OpenAI, Anthropic, Gemini, or local Ollama.

  • Quick start - the same 30-second example in Python, TypeScript, and cURL.
  • Authentication - how cmp_ keys are issued, rotated, and revoked.
  • API reference - every endpoint, parameter, and response field.