Skip to content
Compresr docs

LiteLLM

Drop the first-party Compresr guardrail into the LiteLLM proxy to auto-compress tool outputs across every provider.

LiteLLM abstracts 100+ providers behind one completion() interface. The Compresr Python SDK ships a first-party guardrail under compresr.integrations.litellm that plugs into the LiteLLM proxy server as a pre_call hook — compressing bulky tool/function messages query-aware before the request is forwarded upstream. No application changes are required: any client that calls the proxy gets the savings transparently, and the same config works against OpenAI, Anthropic, Bedrock, Vertex, Ollama, or anything else LiteLLM supports.

Python only

Because LiteLLM itself is a Python package, the Compresr LiteLLM integration is Python-only — there is no TypeScript counterpart. If you're calling LiteLLM directly from application code (not the proxy), see the manual pattern.

1. Install

In the environment where the LiteLLM proxy runs:

bash

LiteLLM's proxy walks litellm/proxy/guardrails/guardrail_hooks/<name>/ to discover guardrails. Until the upstream PR registering compresr directly merges, install the discovery shim once per environment:

bash

The shim is a 20-line module that re-exports guardrail_class_registry, guardrail_initializer_registry, and initialize_guardrail from the real package. Once the LiteLLM PR merges, the shim is no-op.

2. Enable the guardrail in config.yaml

Add a guardrails: entry to your LiteLLM proxy config:

yaml

Start the proxy as you normally would (litellm --config config.yaml). Calls that hit a route with default_on: true (or that explicitly list compresr in their guardrails:) will run through the guardrail before going upstream.

3. Config reference

All keys live under litellm_params in the guardrail entry above.

KeyDefaultWhat it does
api_keyenv COMPRESR_API_KEYCompresr API key (cmp_...). Missing both → guardrail fails to load.
api_baseenv COMPRESR_BASE_URL or https://api.compresr.aiOverride for self-hosted/on-prem.
timeoutenv COMPRESR_TIMEOUT or 10.0 (seconds)HTTP timeout for the compress call. Invalid env values warn-log and fall back to default.
compression_model_name"latte_v1"Only latte_v1 is currently public.
target_compression_ratio0.50 < r ≤ 1 removal strength; r > 1 Nx factor.
coarsenullParagraph-level when true/unset; token-level when false.
min_chars_to_compress500Skip messages shorter than this (avoids latency on trivial messages).
compress_tool_outputstrueCompress tool / function result messages.
compress_systemfalseOpt-in: compress the system prompt.
compress_historyfalseOpt-in: compress prior (non-last) user messages.
compress_last_userfalseOpt-in: also rewrite the last user message (the query sent to Compresr is always the verbatim original).
fail_closedfalseOn a Compresr availability error (timeout/connection/5xx), forward the original uncompressed request. Set true to raise instead. Validation and auth errors are always raised.

4. What gets compressed by default

The guardrail's defaults are deliberately conservative — it only touches the messages that are usually pure data, not reasoning or instructions:

  • Compressed by default: tool / function result messages (search hits, RAG dumps, API responses).
  • Skipped by default: system (instructions), assistant (model reasoning — never compressed regardless of flags), prior user turns, the last user message.
  • Always skipped: multimodal content (lists, image blocks, audio, files) — only strings get compressed.

To opt into anything more aggressive, flip the matching compress_* flag in config.

Per-target intent query (the important detail)

For tool / function result messages, the query sent to latte_v1 is not the last user message — it's the originating tool call's name + arguments, rendered as a single string:

text

The guardrail walks back from the target message through assistant turns, matches by tool_call_id (or legacy function_call.name), and renders the matched call. If no assistant turn matches, it falls back to the most recent user message. This produces a much more specific compression query than "what the user asked overall."

For system / prior-user / last-user compression, the query is the verbatim last user message.

5. Per-request overrides

Clients can override any optional config field for a single request by setting metadata.guardrail_config:

python

Override-able keys: every config field from §3 except api_key, api_base, and timeout (which are instance-only).

6. Observability

After a successful compression pass, the guardrail stashes aggregate stats under data["metadata"]["compresr_stats"]:

python

It also adds the guardrail to LiteLLM's standard applied-guardrails header — the response will carry:

text

Use the header for a quick smoke test that the guardrail fired; use compresr_stats for actual savings telemetry.

7. Manual client-side pattern

If you aren't running the proxy and just want to compress a payload before calling litellm.completion() directly, use the SDK like you would in any other framework:

python

Switching providers is the usual one-line LiteLLM change to model=; the Compresr step is unchanged.

When this helps

  • Multi-provider deployments behind one proxy — every provider gets the same shorter prompt; savings compound across the whole fleet.
  • Cost-optimised routing — LiteLLM picks the cheapest model that meets your quality bar, Compresr reduces the input-token bill on whichever model wins.
  • Drop-in for existing LiteLLM users — no application code changes; the guardrail is config-only.
  • LLM provider recipes — the manual pattern called directly against OpenAI, Anthropic, Gemini, or a local Ollama.
  • Modelslatte_v1 parameter semantics.