LiteLLM
Drop the first-party Compresr guardrail into the LiteLLM proxy to auto-compress tool outputs across every provider.
LiteLLM abstracts 100+ providers behind one completion() interface. The Compresr Python SDK ships a first-party guardrail under compresr.integrations.litellm that plugs into the LiteLLM proxy server as a pre_call hook — compressing bulky tool/function messages query-aware before the request is forwarded upstream. No application changes are required: any client that calls the proxy gets the savings transparently, and the same config works against OpenAI, Anthropic, Bedrock, Vertex, Ollama, or anything else LiteLLM supports.
Python only
Because LiteLLM itself is a Python package, the Compresr LiteLLM integration is Python-only — there is no TypeScript counterpart. If you're calling LiteLLM directly from application code (not the proxy), see the manual pattern.
1. Install
In the environment where the LiteLLM proxy runs:
LiteLLM's proxy walks litellm/proxy/guardrails/guardrail_hooks/<name>/ to discover guardrails. Until the upstream PR registering compresr directly merges, install the discovery shim once per environment:
The shim is a 20-line module that re-exports guardrail_class_registry, guardrail_initializer_registry, and initialize_guardrail from the real package. Once the LiteLLM PR merges, the shim is no-op.
2. Enable the guardrail in config.yaml
Add a guardrails: entry to your LiteLLM proxy config:
Start the proxy as you normally would (litellm --config config.yaml). Calls that hit a route with default_on: true (or that explicitly list compresr in their guardrails:) will run through the guardrail before going upstream.
3. Config reference
All keys live under litellm_params in the guardrail entry above.
| Key | Default | What it does |
|---|---|---|
api_key | env COMPRESR_API_KEY | Compresr API key (cmp_...). Missing both → guardrail fails to load. |
api_base | env COMPRESR_BASE_URL or https://api.compresr.ai | Override for self-hosted/on-prem. |
timeout | env COMPRESR_TIMEOUT or 10.0 (seconds) | HTTP timeout for the compress call. Invalid env values warn-log and fall back to default. |
compression_model_name | "latte_v1" | Only latte_v1 is currently public. |
target_compression_ratio | 0.5 | 0 < r ≤ 1 removal strength; r > 1 Nx factor. |
coarse | null | Paragraph-level when true/unset; token-level when false. |
min_chars_to_compress | 500 | Skip messages shorter than this (avoids latency on trivial messages). |
compress_tool_outputs | true | Compress tool / function result messages. |
compress_system | false | Opt-in: compress the system prompt. |
compress_history | false | Opt-in: compress prior (non-last) user messages. |
compress_last_user | false | Opt-in: also rewrite the last user message (the query sent to Compresr is always the verbatim original). |
fail_closed | false | On a Compresr availability error (timeout/connection/5xx), forward the original uncompressed request. Set true to raise instead. Validation and auth errors are always raised. |
4. What gets compressed by default
The guardrail's defaults are deliberately conservative — it only touches the messages that are usually pure data, not reasoning or instructions:
- Compressed by default:
tool/functionresult messages (search hits, RAG dumps, API responses). - Skipped by default:
system(instructions),assistant(model reasoning — never compressed regardless of flags), prioruserturns, the lastusermessage. - Always skipped: multimodal
content(lists, image blocks, audio, files) — only strings get compressed.
To opt into anything more aggressive, flip the matching compress_* flag in config.
Per-target intent query (the important detail)
For tool / function result messages, the query sent to latte_v1 is not the last user message — it's the originating tool call's name + arguments, rendered as a single string:
The guardrail walks back from the target message through assistant turns, matches by tool_call_id (or legacy function_call.name), and renders the matched call. If no assistant turn matches, it falls back to the most recent user message. This produces a much more specific compression query than "what the user asked overall."
For system / prior-user / last-user compression, the query is the verbatim last user message.
5. Per-request overrides
Clients can override any optional config field for a single request by setting metadata.guardrail_config:
Override-able keys: every config field from §3 except api_key, api_base, and timeout (which are instance-only).
6. Observability
After a successful compression pass, the guardrail stashes aggregate stats under data["metadata"]["compresr_stats"]:
It also adds the guardrail to LiteLLM's standard applied-guardrails header — the response will carry:
Use the header for a quick smoke test that the guardrail fired; use compresr_stats for actual savings telemetry.
7. Manual client-side pattern
If you aren't running the proxy and just want to compress a payload before calling litellm.completion() directly, use the SDK like you would in any other framework:
Switching providers is the usual one-line LiteLLM change to model=; the Compresr step is unchanged.
When this helps
- Multi-provider deployments behind one proxy — every provider gets the same shorter prompt; savings compound across the whole fleet.
- Cost-optimised routing — LiteLLM picks the cheapest model that meets your quality bar, Compresr reduces the input-token bill on whichever model wins.
- Drop-in for existing LiteLLM users — no application code changes; the guardrail is config-only.
Related
- LLM provider recipes — the manual pattern called directly against OpenAI, Anthropic, Gemini, or a local Ollama.
- Models —
latte_v1parameter semantics.