RAG integration

Drop Compresr between retrieval and the LLM call to shrink retrieved context against the user's question.

A typical RAG pipeline retrieves top-k chunks, joins them, and stuffs them into the LLM prompt. Compresr adds one step in the middle: take those retrieved chunks, score them against the user's question, and pass only the spans that answer it to the LLM.

This guide covers the pipeline shape, first-party integrations for LangChain and LlamaIndex, and a raw vector-DB call for when you're not using either framework.

Pipeline shape

You keep your existing vector DB, embedding model, and LLM. Compresr only filters the chunks before they hit the LLM. The query passed to Compresr is the same string the user typed; the context is the concatenation (or list, if using batch) of the retrieved chunks.
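The pipeline shape described here can be sketched as three swappable steps. This is a minimal illustration, not the Compresr SDK: `retrieve`, `compress`, and `generate` are hypothetical callables standing in for your vector DB lookup, the compression call, and your LLM client. The sketch also assumes the compression step returns one string per input chunk and that a chunk with nothing relevant comes back empty.

```python
from typing import Callable, List


def answer_with_compression(
    question: str,
    retrieve: Callable[[str], List[str]],
    compress: Callable[[List[str], str], List[str]],
    generate: Callable[[str, str], str],
) -> str:
    """Retrieve chunks, filter them against the question, then call the LLM."""
    # Step 1: the existing retrieval step, unchanged.
    chunks = retrieve(question)
    # Step 2: the added step. The query is the same string the user typed.
    compressed = compress(chunks, question)
    # Assumption: chunks with nothing relevant come back empty, so drop them.
    kept = [text for text in compressed if text.strip()]
    # Step 3: join what survived and hand it to the LLM as context.
    context = "\n\n".join(kept)
    return generate(context, question)
```

Because each step is passed in as a callable, the vector DB, embedding model, and LLM stay whatever you already use; only the middle step is new.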

With LangChain: `CompresrExtractor`

The SDK ships a first-party BaseDocumentCompressor; drop it straight into ContextualCompressionRetriever. It batches all eligible documents in a single Compresr call (up to 100 per batch) and tags each compressed document with metadata["compresr"] = True.

See the LangChain integration page for the full reference, including tool-output middleware (CompresrToolMiddleware), history compression (CompresrSummarizationMiddleware), and outbound-prompt budgeting (CompresrPromptMiddleware) for agent loops.

With LlamaIndex: `CompresrNodePostprocessor`

CompresrNodePostprocessor is a BaseNodePostprocessor. Pass it to as_query_engine (or any RetrieverQueryEngine) and the retrieved nodes are compressed against the query before synthesis ever sees them.

The postprocessor batches eligible nodes into a single Compresr call (up to 100 per batch), copies each NodeWithScore so the originals in the index stay untouched, and writes the new text via node.set_content(). If you need to fit inside a hard context window rather than hit a relative ratio, CompresrNodePostprocessor also accepts target_token, an absolute token budget per node that is translated to a ratio via max(avg_tokens / target_token, 1.0). See the LlamaIndex integration page for the full reference, including tool wrapping and memory-block compression.
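To make the target_token translation concrete, here is a small sketch of the formula max(avg_tokens / target_token, 1.0). It assumes avg_tokens is the mean token count over the nodes in the batch; the function name and the empty-batch behavior are illustrative choices, not part of the SDK.

```python
from typing import Sequence


def ratio_from_target_token(token_counts: Sequence[int], target_token: int) -> float:
    """Translate an absolute per-node token budget into a compression ratio."""
    if target_token <= 0:
        raise ValueError("target_token must be positive")
    if not token_counts:
        # Illustrative choice: nothing to compress, so use the floor value.
        return 1.0
    # Assumption: avg_tokens is the mean token count across the batch.
    avg_tokens = sum(token_counts) / len(token_counts)
    # The floor of 1.0 applies when nodes already fit the budget.
    return max(avg_tokens / target_token, 1.0)


# Nodes averaging 1200 tokens with a 300-token budget give a 4x target.
print(ratio_from_target_token([1000, 1400], target_token=300))  # 4.0
```

When the nodes are already under budget, for example an average of 150 tokens against a budget of 300, the formula bottoms out at 1.0 rather than producing a ratio below 1.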

Direct vector DB (no framework)

If you're calling a vector DB directly (pgvector, Pinecone, Qdrant, Chroma, Weaviate), the shape is the same: retrieve, compress, pass to LLM. Use compress_batch / compressBatch to filter each retrieved chunk independently against the same user question in one call.
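A minimal sketch of the compress step for a framework-free pipeline follows. The `compress_batch` argument is any callable that accepts `contexts=` and `queries=` as described above. Its return type (a list of strings in input order) is an assumption, and so is the batch limit of 100, which mirrors the figure quoted for the framework integrations and may not apply to direct calls.

```python
from typing import Callable, List, Sequence

# Assumption: borrowed from the "up to 100 per batch" figure quoted for the
# framework integrations; check the batch reference for direct calls.
BATCH_LIMIT = 100


def compress_chunks(
    chunks: Sequence[str],
    question: str,
    compress_batch: Callable[..., List[str]],
    batch_limit: int = BATCH_LIMIT,
) -> List[str]:
    """Compress every chunk against one question, batch_limit chunks per call."""
    if batch_limit < 1:
        raise ValueError("batch_limit must be at least 1")
    compressed: List[str] = []
    for start in range(0, len(chunks), batch_limit):
        batch = list(chunks[start:start + batch_limit])
        # Keyword names come from the text above; the return shape is assumed
        # to be a list of strings in the same order as the inputs.
        compressed.extend(compress_batch(contexts=batch, queries=question))
    return compressed
```

Passing the client method in as a callable keeps the retrieval code independent of any particular SDK version and makes the step easy to fake in tests.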

The async twin in Python is compress_batch_async, with the same signature. For LangChain LCEL / RunnableSequence setups, CompresrExtractor also implements acompress_documents.

Per-chunk queries with `inputs=`

The contexts= plus queries= form broadcasts one question across all chunks. For hybrid search or multi-hop RAG where each chunk has its own query, use the pair form inputs=[{"context": ..., "query": ...}, ...] instead. The pair form is mutually exclusive with contexts=/queries=.
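As an illustration of the pair form, the helper below zips chunks with their own queries into the list-of-dicts shape shown above. The helper name is hypothetical; only the "context" and "query" keys come from this guide.

```python
from typing import Dict, List, Sequence


def build_pair_inputs(contexts: Sequence[str], queries: Sequence[str]) -> List[Dict[str, str]]:
    """Pair each chunk with its own query for the inputs= form."""
    if len(contexts) != len(queries):
        raise ValueError("contexts and queries must have the same length")
    return [{"context": c, "query": q} for c, q in zip(contexts, queries)]


# Multi-hop example: each hop's chunk is filtered against that hop's sub-question.
pairs = build_pair_inputs(
    ["Chunk retrieved for hop 1", "Chunk retrieved for hop 2"],
    ["Who founded the company?", "Where was the founder born?"],
)
```

The resulting list is what you would pass as inputs=, leaving contexts= and queries= unset.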

Tips

Use the framework-native integration when you can. CompresrExtractor (LangChain) and CompresrNodePostprocessor (LlamaIndex) handle batching, node cloning, partitioning by min_tokens, and error policy for you. Manual calls work, but you end up reimplementing that wiring yourself.
Pick a ratio that matches your token budget. target_compression_ratio has two regimes: 0 < r ≤ 1 removes that fraction of tokens (0.5 = drop half); r > 1 targets an Nx reduction (4 = 4x smaller, 60 = 60x). The server hard-caps at 200. Lighter ratios (around 0.3-0.5) keep enough surrounding context for citation-style or extractive Q&A. Heavier settings (0.7+, or Nx like 4) work for summarization-style answers. Start at 0.5, measure on a held-out set, tune from there.
Put the compressed text in the system message. It's reference material the model should ground its answer in, not a turn in the conversation. The user's actual question stays in the user message. See the LLM provider recipes for the exact slot per provider.
Filter each chunk independently with batch. compress_batch(contexts=[...], queries="single question") filters each retrieved chunk against the same user question in one HTTP round trip, cheaper than N parallel compress() calls.
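To get a feel for the two target_compression_ratio regimes described in the tips above, the sketch below estimates how many tokens a given setting aims to keep. It applies the stated rules literally. Treating values above 200 as clamped to 200 is an assumption about how the server cap behaves, and the function name is illustrative.

```python
MAX_RATIO = 200.0  # server-side hard cap stated in the tip above


def expected_output_tokens(input_tokens: int, ratio: float) -> float:
    """Estimate the token count a target_compression_ratio aims for."""
    if ratio <= 0:
        raise ValueError("ratio must be greater than 0")
    if ratio <= 1:
        # Fraction regime: remove that fraction of the tokens.
        return input_tokens * (1 - ratio)
    # Nx regime: target an N-times reduction.
    # Assumption: values above the cap are clamped rather than rejected.
    return input_tokens / min(ratio, MAX_RATIO)


print(expected_output_tokens(1000, 0.5))  # 500.0 (drop half)
print(expected_output_tokens(1000, 4))    # 250.0 (4x smaller)
```

Note that a 4x reduction keeps 25% of the tokens, the same target as a fraction setting of 0.75, which helps when comparing settings across the two regimes.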

When NOT to compress

Tiny contexts. If your retrieved context is under ~500 tokens, the API call overhead isn't worth the savings; set a higher min_tokens to skip them automatically.
Tightly structured retrieval results (JSON, tool outputs, schemas). Compresr is prose-optimized. For structured payloads you can turn on heuristic_chunking and disable_placeholders to reduce field-loss risk, or route tool outputs through CompresrToolMiddleware instead.
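If you make manual calls and want the same skip-small-inputs behavior that min_tokens provides, a partition step along these lines is one option. The whitespace word count is a crude stand-in for a real tokenizer, and treating a chunk with exactly min_tokens tokens as eligible is an assumption about the boundary.

```python
from typing import Callable, List, Sequence, Tuple


def rough_token_count(text: str) -> int:
    # Stand-in tokenizer: whitespace-separated words, not real model tokens.
    return len(text.split())


def partition_by_min_tokens(
    chunks: Sequence[str],
    min_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Split chunks into (eligible, skipped), keeping each original index."""
    eligible: List[Tuple[int, str]] = []
    skipped: List[Tuple[int, str]] = []
    for index, chunk in enumerate(chunks):
        # Assumption: a chunk with exactly min_tokens tokens is eligible.
        target = eligible if count_tokens(chunk) >= min_tokens else skipped
        target.append((index, chunk))
    return eligible, skipped
```

Keeping the original index alongside each chunk lets you send only the eligible chunks for compression and then merge the results back with the skipped ones in retrieval order.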

Variable chunk sizes: `latte_v2` + dynamic ratios

If your corpus has highly variable chunk sizes (short FAQ snippets alongside long PDF pages), a single target_compression_ratio over- or under-compresses at the extremes. Set compression_model_name="latte_v2" and pass dynamic=True with dynamic_min_ratio / dynamic_max_ratio; the model then picks a per-chunk ratio inside that window based on input size. Dynamic ratios are available on latte_v2 only.
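As a sketch of how these settings fit together, the helper below assembles them into keyword arguments. The parameter names come from this section; the helper itself, its validation rules, and the example window of 0.3 to 0.7 are illustrative assumptions, so check the API reference for the accepted ranges.

```python
from typing import Any, Dict


def dynamic_ratio_kwargs(min_ratio: float, max_ratio: float) -> Dict[str, Any]:
    """Build keyword arguments for dynamic per-chunk ratios on latte_v2."""
    # Assumption: both bounds must be positive and ordered; the real API
    # may enforce different or additional constraints.
    if min_ratio <= 0 or max_ratio <= 0:
        raise ValueError("ratios must be greater than 0")
    if min_ratio > max_ratio:
        raise ValueError("dynamic_min_ratio must not exceed dynamic_max_ratio")
    return {
        "compression_model_name": "latte_v2",
        "dynamic": True,
        "dynamic_min_ratio": min_ratio,
        "dynamic_max_ratio": max_ratio,
    }


# Illustrative window; tune it on a held-out set as with a fixed ratio.
kwargs = dynamic_ratio_kwargs(0.3, 0.7)
```

The resulting dictionary can be unpacked into whichever call or integration constructor you use, in place of a single target_compression_ratio.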

LangChain integration: full middleware + extractor reference.
LlamaIndex integration: postprocessor, tool wrapper, memory block.
Batch compression: full reference for compress_batch.