Compresr docs

LlamaIndex

First-party Compresr node postprocessor, tool wrapper, and memory block for LlamaIndex query engines and chat agents — Python and TypeScript.

The Compresr SDK ships first-party LlamaIndex integrations under compresr.integrations.llamaindex (Python) and @compresr/sdk/integrations/llamaindex (TypeScript). They cover the three places LlamaIndex users typically burn tokens: retrieved nodes feeding a query engine, verbose FunctionTool outputs in an agent, and long-running chat history under the Memory API. Every class is a drop-in subclass of the canonical LlamaIndex base, so they slot into existing pipelines without touching the rest of your graph.

1. Install

bash

2. Compress retrieved nodes (query engines)

CompresrNodePostprocessor is a BaseNodePostprocessor: pass it to as_query_engine or any RetrieverQueryEngine, and the retrieved nodes are compressed in a query-aware way before synthesis ever sees them.

python

The postprocessor copies each NodeWithScore (the originals in the index are untouched), partitions the copies by min_tokens, batches the eligible ones in a single Compresr call (up to 100 per batch), and writes the new text onto the copies via node.set_content(). It falls back to assigning node.text, and finally to storing the result in metadata["compresr_compressed"] with a warning.
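The copy, partition, and batch steps described above can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: the Node class, count_tokens, compress_nodes, and fake_compress are hypothetical stand-ins, and the whitespace token count is an assumption (the tokenizer the SDK uses is not stated here).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for a retrieved node (not the LlamaIndex class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def count_tokens(text: str) -> int:
    # Assumption: crude whitespace count, used only for this illustration.
    return len(text.split())


def compress_nodes(
    nodes: List[Node],
    compress_batch: Callable[[List[str]], List[str]],
    min_tokens: int = 200,
    batch_size: int = 100,
) -> List[Node]:
    # Work on copies so the caller's nodes are never mutated.
    copies = [Node(n.text, dict(n.metadata)) for n in nodes]
    # Nodes shorter than min_tokens are skipped and pass through as-is.
    eligible = [c for c in copies if count_tokens(c.text) >= min_tokens]
    # Send eligible nodes in batches of at most batch_size texts.
    for start in range(0, len(eligible), batch_size):
        batch = eligible[start:start + batch_size]
        compressed = compress_batch([c.text for c in batch])
        for node, new_text in zip(batch, compressed):
            node.text = new_text
    return copies


def fake_compress(texts: List[str]) -> List[str]:
    # Placeholder for the remote call: keeps the first half of each text.
    return [" ".join(t.split()[: len(t.split()) // 2]) for t in texts]


originals = [Node("short note"), Node("word " * 300)]
result = compress_nodes(originals, fake_compress)
print(count_tokens(result[0].text), count_tokens(result[1].text))
```

The short node is returned unchanged, the long one is halved by the placeholder compressor, and the objects in originals keep their full text.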

Options

| Python | TypeScript | Default | Notes |
| --- | --- | --- | --- |
| target_compression_ratio | targetCompressionRatio | 0.5 | 0 < r ≤ 1: removal fraction; r > 1: Nx target. |
| target_token | targetToken | | Alternative: absolute output budget per node. Overrides ratio (ratio = avg_chunk_tokens / target_token). |
| min_tokens | minTokens | 200 | Skip nodes shorter than this. |
| coarse | coarse | None | When None, defers to backend default (paragraph-level). |
| query | query | | Override the query string (otherwise pulled from QueryBundle.query_str). |
| on_error | onError | "passthrough" | Fail-open by default. |
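The two readings of target_compression_ratio, and the way target_token maps onto a ratio, are easier to see with numbers. The helper names below are hypothetical and the arithmetic is only one plausible reading of the notes above; how the backend rounds or enforces these targets is not specified here.

```python
def expected_output_tokens(input_tokens: int, ratio: float) -> float:
    """Approximate output size under the two ratio readings (illustrative only)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio <= 1:
        # 0 < r <= 1: r is the fraction of tokens removed.
        # Under this reading r = 1 would remove everything; the source does
        # not say how that edge case is handled.
        return input_tokens * (1 - ratio)
    # r > 1: r is an Nx target, so the output is about 1/r of the input.
    return input_tokens / ratio


def ratio_from_target_token(avg_chunk_tokens: float, target_token: float) -> float:
    # Formula from the options above: ratio = avg_chunk_tokens / target_token.
    return avg_chunk_tokens / target_token


print(expected_output_tokens(1000, 0.5))   # removal fraction of one half
print(expected_output_tokens(1000, 4))     # 4x target
print(ratio_from_target_token(800, 200))   # absolute budget expressed as a ratio
```

With these inputs, the default ratio of 0.5 leaves about 500 of 1000 tokens, a ratio of 4 leaves about 250, and a target_token of 200 on 800-token chunks corresponds to a 4x ratio.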

If the query can't be resolved, the postprocessor logs a warning and passes nodes through unchanged — latte_v1 requires a query.
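The fail-open behaviour can be sketched as follows. The function compress_or_passthrough, the logger name, and failing_compress are hypothetical; the only on_error value named in this page is "passthrough", so re-raising for any other value is an assumption made for the sketch.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("compresr_sketch")


def compress_or_passthrough(
    texts: List[str],
    query: Optional[str],
    compress: Callable[[List[str], str], List[str]],
    on_error: str = "passthrough",
) -> List[str]:
    # Without a query there is nothing to compress against: warn and pass through.
    if not query:
        logger.warning("No query resolved; passing nodes through unchanged.")
        return list(texts)
    try:
        return compress(texts, query)
    except Exception as exc:
        if on_error == "passthrough":
            # Fail open: a compression failure must not break the pipeline.
            logger.warning("Compression failed (%s); passing through.", exc)
            return list(texts)
        # Assumption: any other setting surfaces the error to the caller.
        raise


def failing_compress(texts: List[str], query: str) -> List[str]:
    raise RuntimeError("backend unavailable")


docs = ["alpha beta", "gamma delta"]
print(compress_or_passthrough(docs, None, failing_compress))
print(compress_or_passthrough(docs, "what is alpha?", failing_compress))
```

Both calls return the input texts unchanged: the first because no query was available, the second because the placeholder backend raised and the default policy is to pass through.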

3. Compress an agent tool's output

wrap_tool_with_compresr / wrapToolWithCompresr takes a FunctionTool (or tool-like duck type in TypeScript) and returns a new one whose return value is compressed transparently. The Python version preserves the original tool's name, description, fn_schema, and return_direct via FunctionTool.from_defaults(...).

python

If the wrapped function returns anything other than a string, it's passed through unchanged. Python's wrapper also automatically wires an async branch when the source tool exposes async_fn.
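A minimal sketch of the pass-through rule, written without LlamaIndex so it runs on its own. The names wrap_output and keep_first_words are hypothetical, and the sketch detects async functions with inspect.iscoroutinefunction, which is a simplification swapped in for the SDK's check on the tool's async_fn.

```python
import asyncio
import inspect
from typing import Any, Callable


def wrap_output(fn: Callable[..., Any], compress: Callable[[str], str]) -> Callable[..., Any]:
    def maybe_compress(result: Any) -> Any:
        # Only string results are compressed; anything else passes through unchanged.
        return compress(result) if isinstance(result, str) else result

    if inspect.iscoroutinefunction(fn):
        # Async branch: await the original function, then apply the same rule.
        async def async_wrapped(*args: Any, **kwargs: Any) -> Any:
            return maybe_compress(await fn(*args, **kwargs))
        return async_wrapped

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        return maybe_compress(fn(*args, **kwargs))
    return wrapped


def keep_first_words(text: str, limit: int = 3) -> str:
    # Placeholder compressor: keeps the first few words.
    return " ".join(text.split()[:limit])


def search(term: str) -> str:
    return f"{term} result one two three four"


def count_hits(term: str) -> int:
    return 42


async def fetch(term: str) -> str:
    return f"{term} page body with many extra words"


print(wrap_output(search, keep_first_words)("cats"))
print(wrap_output(count_hits, keep_first_words)("cats"))
print(asyncio.run(wrap_output(fetch, keep_first_words)("cats")))
```

The string-returning tools come back shortened, while the integer result is returned exactly as the original function produced it.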

Query resolution follows a fixed order. A static query wins if set. Otherwise, query_extractor(args) is called if provided. If you set neither, the wrapper falls back to the tool's arguments: when query_arg is set, it reads args[query_arg] strictly (if the named key is missing, the resolver returns None and does NOT fall back); when query_arg is unset, it smart-picks from common keys (query, question, search_query, q, prompt, input, text).
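The resolution order can be written out as a small function. This is a sketch of the documented precedence, not the SDK's code: resolve_query and COMMON_QUERY_KEYS are hypothetical names, and requiring a non-empty string value is an assumption.

```python
from typing import Any, Callable, Mapping, Optional

# Keys listed in the documentation, checked in this order when query_arg is unset.
COMMON_QUERY_KEYS = ("query", "question", "search_query", "q", "prompt", "input", "text")


def resolve_query(
    args: Mapping[str, Any],
    query: Optional[str] = None,
    query_extractor: Optional[Callable[[Mapping[str, Any]], Optional[str]]] = None,
    query_arg: Optional[str] = None,
) -> Optional[str]:
    # 1. A static query always wins.
    if query is not None:
        return query
    # 2. A custom extractor is used next; its result is final.
    if query_extractor is not None:
        return query_extractor(args)
    # 3a. Strict lookup: a missing key yields None, with no further fallback.
    if query_arg is not None:
        value = args.get(query_arg)
        return value if isinstance(value, str) else None
    # 3b. Smart pick: first common key holding a non-empty string (assumption).
    for key in COMMON_QUERY_KEYS:
        value = args.get(key)
        if isinstance(value, str) and value:
            return value
    return None


print(resolve_query({"q": "a"}, query="fixed"))
print(resolve_query({"question": "x"}, query_arg="topic"))
print(resolve_query({"url": "u", "question": "x"}))
```

The second call shows the strict branch: the arguments contain a usable question key, but because query_arg names a key that is absent, the resolver returns None instead of smart-picking.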

4. Compress chat history (Memory API)

CompresrMemoryBlock is a BaseMemoryBlock[str] — register it on Memory.from_defaults and the long-running buffer is compressed via Compresr when the memory layer needs to free tokens.

Python and TypeScript compress at different moments

Python compresses inside atruncate() — the Memory layer's truncation hook. TypeScript compresses inside get() — LlamaIndex.TS's BaseMemoryBlock has no atruncate. The observable output (a single compressed system message containing the history) is identical.

python

When the buffer overflows, the block calls Compresr with latte_v1, using the last user: line of the buffer as the query (or "conversation history" if no user line exists). Override with query= for a fixed query string, or set target_token for a fixed output budget.
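How the query is chosen from the buffer can be sketched like this. The function pick_history_query is a hypothetical name; stripping the user: prefix, matching it case-sensitively, and skipping empty user lines are assumptions, since the page only says that the last user: line is used.

```python
from typing import Optional

DEFAULT_QUERY = "conversation history"


def pick_history_query(buffer: str, query: Optional[str] = None) -> str:
    # A fixed query string, when given, overrides anything found in the buffer.
    if query is not None:
        return query
    # Walk the buffer backwards so the most recent user line is found first.
    for line in reversed(buffer.splitlines()):
        stripped = line.strip()
        if stripped.startswith("user:"):
            content = stripped[len("user:"):].strip()
            if content:
                return content
    # No user line at all: fall back to the generic query.
    return DEFAULT_QUERY


history = (
    "user: How do I reset my password?\n"
    "assistant: Use the settings page.\n"
    "user: And what about 2FA?\n"
    "assistant: Same page."
)
print(pick_history_query(history))
print(pick_history_query("assistant: hello"))
print(pick_history_query(history, query="billing"))
```

The first call picks the most recent user turn, the second falls back to the generic string because the buffer has no user line, and the third returns the fixed override.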

Other fields on CompresrMemoryBlock: name (default "compresr_compressed_history"), priority (default 2), target_compression_ratio (default 0.5), coarse (default None), on_error (default "passthrough").

When this helps

  • High-recall retrieval: similarity_top_k=20+ plus CompresrNodePostprocessor keeps the synthesis prompt tight without forcing you to throw away nodes.
  • Tool-heavy agents: wrap_tool_with_compresr shaves verbose tool outputs (web pages, API dumps, search hits) down to what's relevant for the user's question.
  • Long-running chat: CompresrMemoryBlock keeps multi-turn chat under the token cap without an extra LLM-summary call.

See also

  • LangChain: equivalent middlewares for LangChain 1.0+ agents and retrievers.
  • Models: latte_v1 parameter semantics.
  • RAG guide: the underlying retrieve → compress → answer pipeline.