LlamaIndex

First-party Compresr node postprocessor, tool wrapper, and memory block for LlamaIndex query engines and chat agents, in Python and TypeScript.

The Compresr SDK ships first-party LlamaIndex integrations under compresr.integrations.llamaindex (Python) and @compresr/sdk/integrations/llamaindex (TypeScript). They cover the three places LlamaIndex users typically burn tokens: retrieved nodes feeding a query engine, verbose FunctionTool outputs in an agent, and long-running chat history under the Memory API. Each class is a drop-in subclass of the corresponding LlamaIndex base class, so it slots into existing pipelines without changes to the rest of your graph.

1. Install

bash

2. Compress retrieved nodes (query engines)

CompresrNodePostprocessor is a BaseNodePostprocessor. Pass it to as_query_engine or any RetrieverQueryEngine, and the retrieved nodes are compressed against the query before they reach synthesis.

python

The postprocessor works on copies of each NodeWithScore, so the originals in the index are untouched. It partitions the copies by min_tokens, batches the eligible nodes into slices of 100, and issues one Compresr call per slice. The compressed text is written to the copies via node.set_content(), falling back to assigning node.text and, as a last resort, to storing the text in metadata["compresr_compressed"] with a warning.
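The partition-and-batch step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (approx_tokens, partition_by_min_tokens, batches) are hypothetical and not part of the Compresr SDK, nodes are represented as bare strings, and the token estimate uses the chars/4 fallback rather than tiktoken.

```python
from typing import Callable, Iterator, List, Tuple


def approx_tokens(text: str) -> int:
    """chars/4 estimate (the documented fallback when tiktoken is unavailable)."""
    return len(text) // 4


def partition_by_min_tokens(
    texts: List[str],
    min_tokens: int = 200,
    estimate: Callable[[str], int] = approx_tokens,
) -> Tuple[List[int], List[int]]:
    """Return (eligible, skipped) index lists; nodes below min_tokens are skipped."""
    eligible: List[int] = []
    skipped: List[int] = []
    for i, text in enumerate(texts):
        (eligible if estimate(text) >= min_tokens else skipped).append(i)
    return eligible, skipped


def batches(indices: List[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` indices (one call per slice)."""
    for start in range(0, len(indices), size):
        yield indices[start:start + size]


# One short node plus 250 long ones (1000 chars is roughly 250 tokens each).
texts = ["short"] + ["x" * 1000] * 250
eligible, skipped = partition_by_min_tokens(texts)
slice_sizes = [len(b) for b in batches(eligible)]
print(len(skipped), slice_sizes)  # 1 [100, 100, 50]
```

With 250 eligible nodes, this sketch would issue three calls, and the short node would pass through without being sent at all.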

Options

Python	TypeScript	Default	Notes
`compression_model`	`compressionModel`	`"latte_v1"`	Compresr model to use. `latte_v1` requires a query.
`target_compression_ratio`	`targetCompressionRatio`	`0.5`	For `0 < r ≤ 1`, `r` is the fraction of tokens to remove; for `r > 1`, `r` is an N× compression target.
`target_token`	`targetToken`	n/a	Alternative: absolute output budget per node. Overrides ratio (`ratio = max(avg_chunk_tokens / target_token, 1.0)`). When `target_token` > current tokens the ratio pins to `1.0` (no compression is applied).
`min_tokens`	`minTokens`	`200`	Skip nodes shorter than this. Estimator uses `tiktoken` when available; otherwise falls back to `chars/4`.
`coarse`	`coarse`	`None`	When `None`, defers to backend default (paragraph-level).
`query`	`query`	n/a	Override the query string (otherwise pulled from `QueryBundle.query_str`).
`on_error`	`onError`	`"passthrough"`	Fail-open by default.
`api_key` / `base_url` / `client`	`apiKey` / `baseUrl` / `client`	n/a	Standard auth knobs.
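As a worked example of the target_token row above, the formula ratio = max(avg_chunk_tokens / target_token, 1.0) can be evaluated directly. The function name below is hypothetical and the input validation is an added assumption; only the formula itself comes from the table.

```python
def ratio_from_target_token(avg_chunk_tokens: float, target_token: int) -> float:
    """Translate an absolute per-node budget into an Nx compression ratio."""
    if target_token <= 0:
        # Assumption: a non-positive budget is treated as a caller error.
        raise ValueError("target_token must be positive")
    return max(avg_chunk_tokens / target_token, 1.0)


print(ratio_from_target_token(800, 200))  # 4.0 (compress to about a quarter)
print(ratio_from_target_token(150, 200))  # 1.0 (budget exceeds input, no compression)
```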

If the query can't be resolved, the postprocessor logs a warning and passes nodes through unchanged; latte_v1 requires a query (latte_v2 treats it as optional).

3. Compress an agent tool's output

wrap_tool_with_compresr / wrapToolWithCompresr takes a FunctionTool (or tool-like duck type in TypeScript) and returns a new one whose return value is compressed transparently. The Python version preserves the original tool's name, description, fn_schema, and return_direct via FunctionTool.from_defaults(...).

python

If the wrapped function returns anything other than a string, it's passed through unchanged. Python's wrapper also automatically wires an async branch when the source tool exposes async_fn. The TypeScript wrapper currently overrides only .call — tools that expose .acall bypass compression on the async path.
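The pass-through rule and the separate async branch can be illustrated with a small stand-alone sketch. Nothing here is the SDK's implementation: wrap_sync, wrap_async, and the compress callable are hypothetical stand-ins, and the "compressor" simply truncates the string so the behavior is easy to see.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def wrap_sync(fn: Callable[..., Any], compress: Callable[[str], str]) -> Callable[..., Any]:
    """Compress string results; return any other type unchanged."""
    @functools.wraps(fn)
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        out = fn(*args, **kwargs)
        return compress(out) if isinstance(out, str) else out
    return wrapped


def wrap_async(
    fn: Callable[..., Awaitable[Any]], compress: Callable[[str], str]
) -> Callable[..., Awaitable[Any]]:
    """Same rule for a coroutine function (the async branch)."""
    @functools.wraps(fn)
    async def wrapped(*args: Any, **kwargs: Any) -> Any:
        out = await fn(*args, **kwargs)
        return compress(out) if isinstance(out, str) else out
    return wrapped


def fake_compress(text: str) -> str:
    """Stand-in compressor: keep the first 10 characters."""
    return text[:10]


def search(query: str) -> str:
    return "result " * 20


def lookup(key: str) -> dict:
    return {"key": key}


async def asearch(query: str) -> str:
    return "result " * 20


print(len(wrap_sync(search, fake_compress)("q")))            # 10
print(wrap_sync(lookup, fake_compress)("a"))                 # {'key': 'a'}
print(len(asyncio.run(wrap_async(asearch, fake_compress)("q"))))  # 10
```

A wrapper that only covered the synchronous path would leave asearch uncompressed, which is the gap the TypeScript note above describes for .acall.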

Query resolution follows a fixed order. A static query wins if set. Otherwise, query_extractor(args) is called if provided. If neither is set, the wrapper reads args[query_arg] when query_arg is set; this lookup is strict, so if the named key is missing the resolver returns None and does NOT fall back. When query_arg is unset, the wrapper smart-picks from common keys (query, question, search_query, q, prompt, input, text).
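The resolution order can be written out as a single function. This is a sketch of the documented precedence, not the SDK's resolver: resolve_query is a hypothetical name, the extractor's return value is used as-is, and treating only non-empty strings as valid smart-pick hits is an assumption.

```python
from typing import Any, Callable, Mapping, Optional

COMMON_KEYS = ("query", "question", "search_query", "q", "prompt", "input", "text")


def resolve_query(
    args: Mapping[str, Any],
    query: Optional[str] = None,
    query_extractor: Optional[Callable[[Mapping[str, Any]], Optional[str]]] = None,
    query_arg: Optional[str] = None,
) -> Optional[str]:
    # 1. A static query always wins.
    if query is not None:
        return query
    # 2. Otherwise defer to the caller-supplied extractor.
    if query_extractor is not None:
        return query_extractor(args)
    # 3. Strict lookup: a missing key yields None, with no smart-pick fallback.
    if query_arg is not None:
        value = args.get(query_arg)
        return value if isinstance(value, str) else None
    # 4. Smart-pick the first common key holding a non-empty string (assumption).
    for key in COMMON_KEYS:
        value = args.get(key)
        if isinstance(value, str) and value:
            return value
    return None


args = {"question": "What changed in v2?", "limit": 5}
print(resolve_query(args))                           # What changed in v2?
print(resolve_query(args, query_arg="topic"))        # None
print(resolve_query(args, query="release notes"))    # release notes
```

The second call shows the strict branch: even though a question key is present, naming a query_arg that does not exist returns None rather than falling through to the smart-pick list.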

4. Compress chat history (Memory API)

CompresrMemoryBlock is a BaseMemoryBlock[str]; register it on Memory.from_defaults and the long-running buffer is compressed via Compresr when the memory layer needs to free tokens.

Python and TypeScript compress at different moments

Python compresses inside atruncate(), the Memory layer's truncation hook. TypeScript compresses inside get(), since LlamaIndex.TS's BaseMemoryBlock has no atruncate. The observable output (a single compressed system message containing the history) is identical.

python

When the buffer overflows, the block calls Compresr with latte_v1, using the last user: line of the buffer as the query (or "conversation history" if no user line exists). Override with query= for a fixed query string, or set target_token for a target output-token budget. The budget is best-effort: it is translated to a compression ratio, not enforced as a hard cap.
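The default query selection can be sketched as follows. The function name is hypothetical, and two details are assumptions rather than documented behavior: the buffer is taken to be newline-separated "role: text" lines, and the user: prefix is stripped before the text is used as the query.

```python
from typing import Optional

DEFAULT_QUERY = "conversation history"


def query_from_buffer(buffer: str, override: Optional[str] = None) -> str:
    """Pick the compression query for a chat buffer."""
    if override is not None:
        return override  # corresponds to passing query= explicitly
    # Scan from the end so the most recent user line wins.
    for line in reversed(buffer.splitlines()):
        stripped = line.strip()
        if stripped.lower().startswith("user:"):
            text = stripped[len("user:"):].strip()
            if text:
                return text
    return DEFAULT_QUERY


history = "user: hi\nassistant: hello\nuser: how do refunds work?\nassistant: ..."
print(query_from_buffer(history))                    # how do refunds work?
print(query_from_buffer("assistant: hello"))         # conversation history
print(query_from_buffer(history, override="billing"))  # billing
```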

Other fields on CompresrMemoryBlock: name (default "compresr_compressed_history"), priority (default 2), target_compression_ratio (default 0.5), coarse (default None), on_error (default "passthrough").

When this helps

High-recall retrieval: similarity_top_k=20+ plus CompresrNodePostprocessor keeps the synthesis prompt tight without forcing you to throw away nodes.
Tool-heavy agents: wrap_tool_with_compresr shaves verbose tool outputs (web pages, API dumps, search hits) down to what's relevant for the user's question.
Long-running chat: CompresrMemoryBlock keeps multi-turn chat under the token cap without an extra LLM-summary call.

Related

LangChain: equivalent middlewares for LangChain 1.0+ agents and retrievers.
Models: latte_v2 parameter semantics.
RAG guide: the underlying retrieve → compress → answer pipeline.
