Guides
RAG integration
Drop Compresr between retrieval and the LLM call to shrink retrieved context against the user's question.
A typical RAG pipeline retrieves top-k chunks, joins them, and stuffs them into the LLM prompt. Compresr adds one step in the middle: take those retrieved chunks, score them against the user's question, and pass only the spans that answer it to the LLM.
This guide covers the pipeline shape, first-party integrations for LangChain and LlamaIndex, and a raw vector-DB call for when you're not using either framework.
Pipeline shape
You keep your existing vector DB, embedding model, and LLM. Compresr only filters the chunks before they hit the LLM. The query passed to Compresr is the same string the user typed; the context is the concatenation (or list, if using batch) of the retrieved chunks.
With LangChain — CompresrExtractor
The SDK ships a first-party BaseDocumentCompressor — drop it straight into ContextualCompressionRetriever. It batches all eligible documents in a single Compresr call (up to 100 per batch) and tags each compressed document with metadata["compresr"] = True.
See the LangChain integration page for the full reference — including tool-output middleware (CompresrToolMiddleware), history compression (CompresrSummarizationMiddleware), and outbound-prompt budgeting (CompresrPromptMiddleware) for agent loops.
With LlamaIndex — CompresrNodePostprocessor
CompresrNodePostprocessor is a BaseNodePostprocessor. Pass it to as_query_engine (or any RetrieverQueryEngine) and the retrieved nodes are compressed query-aware before synthesis ever sees them.
The postprocessor batches eligible nodes in a single Compresr call (up to 100 per batch), copies each NodeWithScore (originals untouched in the index), and writes new text via node.set_content(). See the LlamaIndex integration page for the full reference, including tool wrapping and memory-block compression.
Direct vector DB (no framework)
If you're calling a vector DB directly (pgvector, Pinecone, Qdrant, Chroma, Weaviate), the shape is the same: retrieve, compress, pass to LLM. Use compress_batch / compressBatch to filter each retrieved chunk independently against the same user question in one call.
Tips
- Use the framework-native integration when you can.
CompresrExtractor(LangChain) andCompresrNodePostprocessor(LlamaIndex) handle batching, node cloning, partitioning bymin_tokens, and error policy for you. Manual calls work but you reimplement that wiring. - Pick a ratio that matches your token budget. Lighter ratios (around
0.3-0.5) keep enough surrounding context for the LLM to ground its answer in real source spans — best for citation-style or extractive Q&A. Heavier ratios (0.7+or Nx mode like4) work for summarization-style answers where the model only needs the gist. Start at0.5, measure answer quality on a held-out set, tune from there. - Put the compressed text in the system message. It's reference material the model should ground its answer in, not a turn in the conversation. The user's actual question stays in the user message. See the LLM provider recipes for the exact slot per provider.
- Filter each chunk independently with batch.
compress_batch(contexts=[...], queries="single question")filters each retrieved chunk against the same user question in one HTTP round trip — cheaper thanNparallelcompress()calls.
When NOT to compress
- Tiny contexts. If your retrieved context is under ~500 tokens, the API call overhead isn't worth the savings — set a higher
min_tokensto skip them automatically. - Tightly structured retrieval results (JSON, tool outputs, schemas). Compresr is built for prose. Applied to JSON it may strip fields you need.
Related
- LangChain integration — full middleware + extractor reference.
- LlamaIndex integration — postprocessor, tool wrapper, memory block.
- Batch compression — full reference for
compress_batch.