# https://compresr.ai/docs/introduction

> Human-readable page: https://compresr.ai/docs/introduction

Compresr is a context-compression API for LLM developers. You send the long text you would otherwise pass to your model - RAG chunks, document bodies, chat history, tool output - together with the `query` you want answered. Compresr scores every span of your input against that `query`, keeps the spans that carry the answer, drops the rest, and returns a shorter context you can forward to your LLM exactly as you would the original.

The compressed context keeps the answer-bearing spans while using far fewer input tokens, which means lower cost per call, a longer effective context window for the same model, and faster inference on downstream requests. At light compression, answer quality holds or improves; higher ratios trade some accuracy for even lower cost. Compresr sits in front of whatever LLM stack you already use - it does not replace your model, your prompt, or your retrieval layer.
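
To make the input/output contract concrete, here is a toy illustration of query-aware span filtering. It is not Compresr's model or scoring method: `toy_compress` is a hypothetical stand-in that keeps any sentence sharing a longer word with the query, so the only thing it demonstrates is the shape "context + query in, shorter context out".

```python
import re


def content_words(text: str) -> set[str]:
    # Crude tokenizer: lowercase words longer than three characters.
    return {w for w in re.findall(r"\w+", text.lower()) if len(w) > 3}


def toy_compress(context: str, query: str) -> str:
    """Hypothetical stand-in: keep sentences that share a word with the query."""
    query_words = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    kept = [s for s in sentences if query_words & content_words(s)]
    return " ".join(kept)


context = (
    "The warranty lasts two years. "
    "Our office is in Lisbon. "
    "Shipping takes five days."
)
shorter = toy_compress(context, "How long is the warranty?")
print(shorter)
```

A real compressor scores relevance far more carefully than keyword overlap; the sketch only shows why the output can be forwarded to the LLM in place of the original context.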

## Where it fits

The shape of "long text + a query about it" shows up in most production LLM workloads. A few places teams plug Compresr in:

- **RAG**: shrink retrieved chunks before they reach GPT, Claude, or Gemini, so you only pay for tokens that actually answer the user.
- **Agent loops**: compress accumulating chat history, scratchpads, and tool transcripts so long-running agents stop running into the context-window limit.
- **Tool output**: compress noisy API responses, search results, or file contents before they re-enter the prompt.
- **Long-context Q&A**: feed compressed documents into smaller, cheaper models without losing the parts that carry the answer.
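
As a rough sketch of the RAG case, the compression step can sit between retrieval and prompt assembly. The compressor is injected as a plain callable so the example runs offline; in a real pipeline that callable would wrap the Compresr call. `build_rag_prompt`, `keep_first_chunk`, and the prompt layout are all assumptions made for illustration, not part of the Compresr SDK.

```python
from typing import Callable, Sequence

# (context, query) -> shorter context. Hypothetical interface for this sketch.
Compressor = Callable[[str, str], str]


def build_rag_prompt(chunks: Sequence[str], query: str, compress: Compressor) -> str:
    # Join the retrieved chunks, compress them against the query,
    # then assemble the prompt that goes to the downstream LLM.
    context = "\n\n".join(chunks)
    compressed = compress(context, query)
    return f"Context:\n{compressed}\n\nQuestion: {query}"


def keep_first_chunk(context: str, query: str) -> str:
    # Placeholder compressor for the demo only: keeps the first chunk.
    return context.split("\n\n")[0]


prompt = build_rag_prompt(
    ["Refunds take 5 days.", "Our logo is blue."],
    "How long do refunds take?",
    keep_first_chunk,
)
print(prompt)
```

Keeping the compressor behind a callable means the retrieval layer and the prompt template stay unchanged, which matches the "sits in front of your stack" positioning above.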

## The models: `latte_v1` and `latte_v2`

Compresr exposes two query-specific compression models on the public API:

- **`latte_v1`** - query-specific compression with structural knobs (`coarse`, `heuristic_chunking`, `disable_placeholders`).
- **`latte_v2`** - up to 5x faster than `latte_v1` at the same compression quality. It runs a single relevance pass per request and has no structural knobs.

Both are query-specific: every call requires a `query`, and the model keeps the spans of `context` that answer it. This is the right default whenever the downstream LLM call already has a clear question, instruction, or retrieval intent.

> **When to use these models**
> Use either `latte_v1` or `latte_v2` whenever you know what the downstream LLM is being asked. RAG pipelines, agent tool calls with a concrete goal, long-context Q&A, and chat turns with a fresh user message all qualify. If the input is short enough that compression is not worth the round trip, skip it.
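
The guidance above can be expressed as a small gate in front of the compression call. This is a minimal sketch: `should_compress` and the `min_chars` threshold are hypothetical, and the right cutoff depends on your latency budget and token prices.

```python
def should_compress(context: str, query: str, min_chars: int = 4000) -> bool:
    """Decide whether a compression round trip is worth making.

    min_chars is an illustrative threshold, not a Compresr recommendation.
    """
    if not query.strip():
        # Query-specific models need a query to score spans against.
        return False
    return len(context) >= min_chars


print(should_compress("x" * 5000, "What changed in v2?"))
print(should_compress("A short note.", "What changed in v2?"))
```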

Supported knobs on `latte_v1`:

- `query` *(required)* - the question or instruction Compresr scores spans against. Without it, the model has nothing to score spans against and cannot decide what to keep.
- `coarse` - skip span-level scoring and compress at a coarser granularity. Faster and cheaper on very long inputs where sentence-level precision is not needed.
- `heuristic_chunking` - chunk the input with a heuristic splitter before scoring. Helps on structured inputs (logs, transcripts, tables) where the default chunker over- or under-splits.
- `disable_placeholders` - return only the kept spans, with no placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers.

Supported knobs on `latte_v2`:

- `query` *(required)* - same semantics as on `latte_v1`.

Full parameter semantics, defaults, and trade-offs - including `target_compression_ratio`, which controls how aggressively the model drops spans - live in the [Models reference](/docs/api-reference/models).
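
The per-model knob lists above can be checked client-side before a request is sent. The sketch below is an assumption-heavy illustration: `build_request` is a hypothetical helper, the payload layout (including the `model` key) is a guess, and the knobs are treated as booleans, which the text above does not confirm. `target_compression_ratio` is left out because its per-model support is documented in the Models reference rather than here.

```python
from typing import Any

# Knob names and per-model support are taken from the lists above.
STRUCTURAL_KNOBS = {"coarse", "heuristic_chunking", "disable_placeholders"}
ALLOWED_KNOBS = {"latte_v1": STRUCTURAL_KNOBS, "latte_v2": set()}


def build_request(model: str, context: str, query: str, **knobs: Any) -> dict[str, Any]:
    """Hypothetical helper: validate knobs, then return a request payload."""
    if model not in ALLOWED_KNOBS:
        raise ValueError(f"unknown model: {model!r}")
    if not query.strip():
        raise ValueError("query is required")
    unsupported = set(knobs) - ALLOWED_KNOBS[model]
    if unsupported:
        raise ValueError(f"{model} does not accept: {sorted(unsupported)}")
    # Payload shape is assumed for illustration only.
    return {"model": model, "context": context, "query": query, **knobs}


req = build_request("latte_v1", "long text", "what?", coarse=True)
print(req)
```

Failing fast on an unsupported knob, for example passing `coarse` to `latte_v2`, surfaces the mistake locally instead of after a network round trip.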

## Get started

Pick your language and send the first request. The shape of the call is identical across all three; only the syntax differs.

### Pick a language

- [Python](/docs/sdks/python) - `pip install compresr`, then call `client.compress(...)`.
- [TypeScript](/docs/sdks/typescript) - `npm install @compresr/sdk`, then call `client.compress({...})`.
- [cURL](/docs/sdks/curl) - one `POST` request, no install required.

### Framework integrations

First-party integrations ship in both SDKs, so you can drop them into existing pipelines without rewriting the surrounding code.

- [LangChain](/docs/framework-integration/langchain): three middlewares (tool output, history summarization, prompt budget) + RAG document compressor + single-tool wrapper, for `create_agent` and `ContextualCompressionRetriever`.
- [LangGraph](/docs/framework-integration/langgraph): adds `make_compresr_node` for custom state graphs, lossy `CompresrCheckpointSerializer` + `CompresrStore` for at-rest compression, and `compresr_handoff_tool` for supervisor → sub-agent transfers.
- [LlamaIndex](/docs/framework-integration/llamaindex): `CompresrNodePostprocessor` for query engines, `wrap_tool_with_compresr` for `FunctionTool`s, and `CompresrMemoryBlock` for the Memory API.
- [LiteLLM](/docs/framework-integration/litellm): Python-only `pre_call` guardrail that auto-compresses tool/function messages before they go upstream, working against every LiteLLM provider.
- [LLM provider recipes](/docs/framework-integration/llm-providers): manual pattern called directly against OpenAI, Anthropic, Gemini, or local Ollama.

### Related reading

- [Quick start](/docs/quick-start) - the same 30-second example in Python, TypeScript, and cURL.
- [Authentication](/docs/authentication) - how `cmp_` keys are issued, rotated, and revoked.
- [API reference](/docs/api-reference/conventions) - every endpoint, parameter, and response field.