LiteLLM

Drop the first-party Compresr guardrail into the LiteLLM proxy to auto-compress tool outputs across every provider.

LiteLLM abstracts 100+ providers behind one completion() interface. The Compresr Python SDK ships a first-party guardrail under compresr.integrations.litellm that plugs into the LiteLLM proxy server as a pre_call hook, compressing bulky tool/function messages query-aware before the request is forwarded upstream. No application changes are required: any client that calls the proxy gets the savings transparently, and the same config works against OpenAI, Anthropic, Bedrock, Vertex, Ollama, or anything else LiteLLM supports.

Python only

Because LiteLLM itself is a Python package, the Compresr LiteLLM integration is Python-only; there is no TypeScript counterpart. If you're calling LiteLLM directly from application code (not the proxy), see the manual pattern.

1. Install

In the environment where the LiteLLM proxy runs:

bash

LiteLLM's proxy walks litellm/proxy/guardrails/guardrail_hooks/<name>/ to discover guardrails, so until the upstream PR registering compresr directly merges, it needs to be registered explicitly. The package ships two console scripts that handle this — pick one:

bash

COMPRESR_AUTO_INSTALL_SHIM=1 also installs the shim automatically on first import compresr.integrations.litellm, if you'd rather not run a separate step. Once the LiteLLM PR merges upstream, both paths become no-ops.

2. Enable the guardrail in `config.yaml`

Add a guardrails: entry to your LiteLLM proxy config:

yaml

Start the proxy as you normally would (litellm --config config.yaml). Calls that hit a route with default_on: true (or that explicitly list compresr in their guardrails:) will run through the guardrail before going upstream.

3. Config reference

All keys live under litellm_params in the guardrail entry above.

Key	Default	What it does
`api_key`	env `COMPRESR_API_KEY`	Compresr API key (`cmp_...`). Missing both → guardrail fails to load.
`api_base`	env `COMPRESR_BASE_URL` or `https://api.compresr.ai`	Override for self-hosted/on-prem.
`timeout`	env `COMPRESR_TIMEOUT` or `10.0` (seconds)	HTTP timeout for the compress call. Invalid env values warn-log and fall back to default.
`compression_model_name`	`"latte_v2"`	Query-specific compression. `latte_v1` is also available — pass it explicitly if you want it.
`target_compression_ratio`	`0.5`	`0 < r ≤ 1` removal strength; `r > 1` Nx factor.
`target_ratio_by_role`	unset	Per-role ratio overrides, e.g. `{"system": 0.3, "tool": 0.6}`. Roles not listed fall back to `target_compression_ratio`.
`coarse`	`true`	Paragraph-level (faster); set `false` for token-level.
`min_chars_to_compress`	`500`	Skip messages shorter than this (avoids latency on trivial messages).
`compress_tool_outputs`	`true`	Compress `tool` / `function` result messages.
`compress_system`	`false`	Opt-in: compress the system prompt.
`compress_history`	`false`	Opt-in: compress prior (non-last) user messages.
`compress_last_user`	`false`	Opt-in: also rewrite the last user message (the query sent to Compresr is always the verbatim original).
`fail_closed`	`false`	On a Compresr availability error (timeout/connection/5xx), forward the original uncompressed request. Set `true` to raise instead. Validation and auth errors are always raised.
`cache_ttl`	`300` (seconds)	How long to cache a compression result in LiteLLM's `DualCache`, keyed by `(content, query, model, ratio, coarse)`. Saves a round-trip when the same tool output repeats in an agent loop. Instance-level only — not overridable per request.

4. What gets compressed by default

The guardrail's defaults are deliberately conservative: it only touches the messages that are usually pure data, not reasoning or instructions:

Compressed by default: tool / function result messages (search hits, RAG dumps, API responses).
Skipped by default: system (instructions), assistant (model reasoning, never compressed regardless of flags), prior user turns, the last user message.
Multimodal messages: for list-of-parts content, the text parts are extracted, compressed, and written back into the first text part; non-text parts (images, audio, files) pass through untouched.

To opt into anything more aggressive, flip the matching compress_* flag in config.

Per-target intent query (the important detail)

For tool / function result messages, the query sent to latte_v2 is not the last user message; it's the originating tool call's name + arguments, rendered as a single string:

text

The guardrail walks back from the target message through assistant turns, matches by tool_call_id (or legacy function_call.name), and renders the matched call. If no assistant turn matches, it falls back to the most recent user message. This produces a much more specific compression query than "what the user asked overall."

For system / prior-user / last-user compression, the query is the verbatim last user message.

5. Per-request overrides

Clients can override any optional config field for a single request by setting metadata.guardrail_config:

python

Override-able keys: compression_model_name, target_compression_ratio, target_ratio_by_role, coarse, min_chars_to_compress, compress_tool_outputs, compress_system, compress_history, compress_last_user, and fail_closed. api_key, api_base, timeout, and cache_ttl are instance-only — they can't be overridden per request.

6. Observability

After a successful compression pass, the guardrail stashes aggregate stats under data["metadata"]["compresr_stats"]:

python

It also adds the guardrail to LiteLLM's standard applied-guardrails header. The response will carry:

text

If Compresr is unavailable and fail_closed: false (the default), the guardrail forwards the original uncompressed request instead of failing — and marks the header compresr:fail_open so you can tell "didn't fire" apart from "fired but failed open":

text

Use the header for a quick smoke test that the guardrail fired; use compresr_stats for actual savings telemetry.

Releasing resources

If you construct CompresrGuardrail directly (outside the proxy's own lifecycle), call await guardrail.aclose() to release its underlying HTTP client. The proxy itself doesn't currently call this on shutdown.

7. Manual client-side pattern

If you aren't running the proxy and just want to compress a payload before calling litellm.completion() directly, use the SDK like you would in any other framework:

python

Switching providers is the usual one-line LiteLLM change to model=; the Compresr step is unchanged.

When this helps

Multi-provider deployments behind one proxy: every provider gets the same shorter prompt; savings compound across the whole fleet.
Cost-optimised routing: LiteLLM picks the cheapest model that meets your quality bar, Compresr reduces the input-token bill on whichever model wins.
Drop-in for existing LiteLLM users: no application code changes; the guardrail is config-only.

LLM provider recipes: the manual pattern called directly against OpenAI, Anthropic, Gemini, or a local Ollama.
Models: latte_v2 parameter semantics.