LlamaIndex
First-party Compresr node postprocessor, tool wrapper, and memory block for LlamaIndex query engines and chat agents — Python and TypeScript.
The Compresr SDK ships first-party LlamaIndex integrations under compresr.integrations.llamaindex (Python) and @compresr/sdk/integrations/llamaindex (TypeScript). They cover the three places LlamaIndex users typically burn tokens: retrieved nodes feeding a query engine, verbose FunctionTool outputs in an agent, and long-running chat history under the Memory API. Every class is a drop-in subclass of the canonical LlamaIndex base, so they slot into existing pipelines without touching the rest of your graph.
1. Install
2. Compress retrieved nodes (query engines)
CompresrNodePostprocessor is a BaseNodePostprocessor: pass it to as_query_engine or any RetrieverQueryEngine, and the retrieved nodes are compressed against the query before synthesis sees them.
The postprocessor copies each NodeWithScore (originals are untouched in the index), partitions by min_tokens, batches the eligible ones in a single Compresr call (up to 100 per batch), and writes new text on the copies via node.set_content() (falling back to node.text = and finally to metadata["compresr_compressed"] with a warning).
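The partition-and-batch step described above can be sketched in plain Python. This is an illustration only, not the SDK's implementation: `count_tokens`, `partition_by_min_tokens`, and `batches` are hypothetical helper names, and the whitespace-based token count is an assumption standing in for whatever tokenizer the SDK actually uses.

```python
from typing import List, Sequence, Tuple

MAX_BATCH = 100  # per-batch cap stated in the text above


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated word count."""
    return len(text.split())


def partition_by_min_tokens(
    texts: Sequence[str], min_tokens: int = 200
) -> Tuple[List[int], List[int]]:
    """Return (eligible_indices, skipped_indices) for the given node texts.

    Nodes shorter than min_tokens are skipped and would pass through unchanged.
    """
    eligible: List[int] = []
    skipped: List[int] = []
    for i, text in enumerate(texts):
        if count_tokens(text) >= min_tokens:
            eligible.append(i)
        else:
            skipped.append(i)
    return eligible, skipped


def batches(indices: Sequence[int], size: int = MAX_BATCH) -> List[List[int]]:
    """Split eligible indices into consecutive groups of at most `size`."""
    return [list(indices[i:i + size]) for i in range(0, len(indices), size)]
```

Working on indices rather than node objects keeps the original ordering easy to restore once the compressed text comes back.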
Options
| Python | TypeScript | Default | Notes |
|---|---|---|---|
| target_compression_ratio | targetCompressionRatio | 0.5 | 0 < r ≤ 1: removal fraction; r > 1: Nx target. |
| target_token | targetToken | - | Alternative: absolute output budget per node. Overrides ratio (ratio = avg_chunk_tokens / target_token). |
| min_tokens | minTokens | 200 | Skip nodes shorter than this. |
| coarse | coarse | None | When None, defers to backend default (paragraph-level). |
| query | query | - | Override the query string (otherwise pulled from QueryBundle.query_str). |
| on_error | onError | "passthrough" | Fail-open by default. |
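
One way to read the two ratio regimes and the target_token override is sketched below. It assumes that "removal fraction" means the share of tokens removed and that "Nx" means input tokens divided by output tokens; both readings are interpretations of the table, and `expected_tokens` and `nx_from_target_token` are hypothetical helpers, not SDK functions.

```python
from typing import Sequence


def expected_tokens(input_tokens: int, ratio: float = 0.5) -> float:
    """Approximate output size for a given target_compression_ratio.

    Assumed reading: 0 < r <= 1 removes that fraction of tokens,
    r > 1 asks for an r-times smaller output.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio <= 1:
        return input_tokens * (1 - ratio)
    return input_tokens / ratio


def nx_from_target_token(chunk_tokens: Sequence[int], target_token: int) -> float:
    """Ratio implied by target_token: avg_chunk_tokens / target_token."""
    if not chunk_tokens or target_token <= 0:
        raise ValueError("need at least one chunk and a positive target_token")
    return (sum(chunk_tokens) / len(chunk_tokens)) / target_token
```

Under this reading, a 1000-token node at the default 0.5 keeps about half its tokens, while a ratio of 4 asks for roughly a quarter.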
If the query can't be resolved, the postprocessor logs a warning and passes nodes through unchanged — latte_v1 requires a query.
3. Compress an agent tool's output
wrap_tool_with_compresr / wrapToolWithCompresr takes a FunctionTool (or tool-like duck type in TypeScript) and returns a new one whose return value is compressed transparently. The Python version preserves the original tool's name, description, fn_schema, and return_direct via FunctionTool.from_defaults(...).
If the wrapped function returns anything other than a string, it's passed through unchanged. Python's wrapper also automatically wires an async branch when the source tool exposes async_fn.
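The string-only rule and the async branch can be illustrated with a small stand-alone sketch. It does not use the Compresr SDK or LlamaIndex: `compress_if_str`, `wrap_sync`, and `wrap_async` are hypothetical names, and `compress` is any caller-supplied function standing in for the real compression call.

```python
import asyncio
from typing import Any, Awaitable, Callable

Compressor = Callable[[str], str]


def compress_if_str(value: Any, compress: Compressor) -> Any:
    """Compress string results; pass every other type through unchanged."""
    return compress(value) if isinstance(value, str) else value


def wrap_sync(fn: Callable[..., Any], compress: Compressor) -> Callable[..., Any]:
    """Wrap a synchronous tool function."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        return compress_if_str(fn(*args, **kwargs), compress)
    return wrapped


def wrap_async(
    async_fn: Callable[..., Awaitable[Any]], compress: Compressor
) -> Callable[..., Awaitable[Any]]:
    """Wrap an async tool function, mirroring the sync behavior."""
    async def wrapped(*args: Any, **kwargs: Any) -> Any:
        return compress_if_str(await async_fn(*args, **kwargs), compress)
    return wrapped


if __name__ == "__main__":
    def first_five(text: str) -> str:
        return text[:5]  # toy "compressor" for the demo

    async def fetch() -> str:
        return "hello world"

    print(wrap_sync(lambda: {"hits": 3}, first_five)())
    print(asyncio.run(wrap_async(fetch, first_five)()))
```

Keeping the type check in one shared helper means the sync and async paths cannot drift apart.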
Query resolution follows a fixed order. A static query wins if set. Otherwise, query_extractor(args) is called if provided. Only if you set neither does the wrapper look at the tool arguments: when query_arg is set, it reads args[query_arg] (strict: if the named key is missing, the resolver returns None and does NOT fall back); when query_arg is unset, it smart-picks from common keys (query, question, search_query, q, prompt, input, text).
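The precedence above can be written out as a short function. This is a sketch under assumptions, not the wrapper's source: `resolve_query` is a hypothetical name, the common keys are tried in the order listed, and an extractor that returns None is assumed to end resolution rather than fall through.

```python
from typing import Any, Callable, Mapping, Optional

# Keys listed in the text; trying them in this order is an assumption.
COMMON_QUERY_KEYS = ("query", "question", "search_query", "q", "prompt", "input", "text")


def resolve_query(
    args: Mapping[str, Any],
    query: Optional[str] = None,
    query_extractor: Optional[Callable[[Mapping[str, Any]], Optional[str]]] = None,
    query_arg: Optional[str] = None,
) -> Optional[str]:
    # 1. A static query always wins.
    if query is not None:
        return query
    # 2. A user-supplied extractor comes next; its result is final (assumption).
    if query_extractor is not None:
        return query_extractor(args)
    # 3. Strict lookup: a missing key yields None, with no fallback to step 4.
    if query_arg is not None:
        value = args.get(query_arg)
        return value if isinstance(value, str) else None
    # 4. Smart-pick the first non-empty string among the common keys.
    for key in COMMON_QUERY_KEYS:
        value = args.get(key)
        if isinstance(value, str) and value:
            return value
    return None
```

For example, `resolve_query({"url": "u", "question": "why?"})` picks the `question` argument, while `resolve_query({"q": "x"}, query_arg="topic")` returns None because the strict branch never consults `q`.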
4. Compress chat history (Memory API)
CompresrMemoryBlock is a BaseMemoryBlock[str] — register it on Memory.from_defaults and the long-running buffer is compressed via Compresr when the memory layer needs to free tokens.
Python and TypeScript compress at different moments
Python compresses inside atruncate() — the Memory layer's truncation hook. TypeScript compresses inside get() — LlamaIndex.TS's BaseMemoryBlock has no atruncate. The observable output (a single compressed system message containing the history) is identical.
When the buffer overflows, the block calls Compresr with latte_v1, using the last user: line of the buffer as the query (or "conversation history" if no user line exists). Override with query= for a fixed query string, or set target_token for a fixed output budget.
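A minimal sketch of the query-selection rule follows. It assumes the buffer is newline-separated text with `user:` and `assistant:` prefixes, and that the query is the text after the prefix; `pick_memory_query` is a hypothetical helper, not part of the SDK.

```python
from typing import Optional

DEFAULT_QUERY = "conversation history"  # fallback named in the text above


def pick_memory_query(buffer: str, query: Optional[str] = None) -> str:
    """Choose the query used to compress a chat buffer."""
    # An explicit override always wins.
    if query is not None:
        return query
    # Otherwise scan from the end for the most recent user line.
    for line in reversed(buffer.splitlines()):
        stripped = line.strip()
        if stripped.lower().startswith("user:"):
            text = stripped[len("user:"):].strip()
            if text:
                return text
    # No usable user line: fall back to the generic query.
    return DEFAULT_QUERY
```

Using the latest user turn as the query biases compression toward whatever the conversation is currently about.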
Other fields on CompresrMemoryBlock: name (default "compresr_compressed_history"), priority (default 2), target_compression_ratio (default 0.5), coarse (default None), on_error (default "passthrough").
When this helps
- High-recall retrieval: similarity_top_k=20+ plus CompresrNodePostprocessor keeps the synthesis prompt tight without forcing you to throw away nodes.
- Tool-heavy agents: wrap_tool_with_compresr shaves verbose tool outputs (web pages, API dumps, search hits) down to what's relevant for the user's question.
- Long-running chat: CompresrMemoryBlock keeps multi-turn chat under the token cap without an extra LLM-summary call.