SDKs
Python SDK
Compress prompts and retrieved context with the official Python client.
The compresr package is the official Python client. It wraps the REST API with typed methods, handles auth, and ships both sync and async variants of every call. Python 3.9+.
1. Install
Install from PyPI — pip, poetry, and uv all work.
The agent client ships in the base install
As of compresr 2.6.0 the agent client layer (client.messages.create, client.chat.completions.create, client.run, WebSearchTool) is part of the base install — pip install compresr is enough. LangChain + provider chat-model + Tavily/Brave deps are pulled in automatically. Old compresr[agents] / compresr[agents-all] brackets still work as no-op aliases.
2. Initialize the client
Construct CompressionClient once at module scope and reuse it — the client keeps an internal httpx connection pool. Read the key from env, never hardcode.
The constructor takes api_key (required), plus optional base_url (defaults to https://api.compresr.ai) and timeout (seconds; default uses the SDK's built-in timeout). Override base_url only for regional or self-hosted endpoints.
See Authentication for key rotation, budgets, and rules.
3. compress
Synchronous single-request compression. Pass context, query, and compression_model_name="latte_v1"; the model keeps the spans that matter for the query. For many chunks against one query, see compress_batch; for token-by-token output, compress_stream.
Parameters
contextstringRequiredquerystringRequiredlatte_v1.compression_model_name"latte_v1"Requiredlatte_v1 is the only public model.target_compression_rationumberOptionalcoarseboolean | NoneOptionalNoneNone = backend default (paragraph-level); True locks paragraph-level; False opts into token-level precision.heuristic_chunkingboolean | NoneOptionalNonedisable_placeholdersboolean | NoneOptionalNone[...] placeholders inserted where content was dropped.Response
compress() returns a typed object — access fields as attributes (result.data.compressed_context). Response field names stay snake_case across every SDK.
dataobjectcompressed_contextstringThe compressed text, ready to drop into your prompt.
original_tokensintegerToken count of the input context (tiktoken cl100k).
compressed_tokensintegerToken count of the compressed output.
tokens_savedintegeroriginal_tokens − compressed_tokens.
actual_compression_rationumberFraction of input tokens actually removed (0–1). e.g. 0.5 = ~50% removed.
duration_msintegerServer-side wall-clock time for the compression pass.
4. Stream
client.compress_stream(...) returns an iterator yielding {content, done} chunks as the model produces them; the final chunk has done=True and empty content. Use it anywhere time-to-first-token matters (UIs, agent loops); for one-shot calls stick with compress.
The iterator is a normal generator — wrap it in itertools.islice, push chunks through a queue, or consume from a worker thread. Same context / query / compression_model_name rules as compress().
5. Batch
client.compress_batch(...) compresses many contexts in one request. Pass contexts: list[str] plus either a single queries: str (applied to every context) or a queries: list[str] matching contexts in length. Cheaper than firing N concurrent compress() calls — ideal for RAG re-ranking or bulk document processing.
queries is either a string (applied to every context) or a list matching contexts in length — mixing the two raises ValidationError. Per-item results carry the same fields as a single compress() call except target_compression_ratio (request-level only). The envelope also exposes aggregates: result.data.count, total_original_tokens, total_compressed_tokens, total_tokens_saved, average_compression_ratio.
6. Agent client
Construct CompressionClient with llm= and you get an agent surface — three call-shapes (Anthropic-style messages.create, OpenAI-style chat.completions.create, native run) that auto-compress every tool output above min_tokens before the LLM sees it. Behind all three sits LangChain 1.0's create_agent + the SDK's CompresrToolMiddleware. Use it as a drop-in for anthropic.Anthropic() / openai.OpenAI(); for raw (context, query) calls stick with compress.
These surfaces are SDK-shaped and have no direct cURL equivalent. The underlying compression is still the same /api/compress/question-specific/ endpoint — it's what the middleware fires whenever a tool returns.
Construct with llm=
Provider lives on the client; model lives at the call site. Swap providers by changing one string — same tools, same code:
The llm string accepts "anthropic" (provider only — every call must pass model="..."), "anthropic:claude-haiku-4-5" (default model, overridable at call site), or "anthropic/claude-haiku-4-5" (Vercel AI SDK convention; both separators accepted). If neither provides a model, the SDK raises CompresrError("model is required …").
Three call shapes
messages.create duck-types anthropic.types.Message, chat.completions.create duck-types openai.types.chat.ChatCompletion, and run returns a native NormalizedResult (.text, .tool_uses, .citations, .stop_reason, .usage).
Python also exposes async variants: acreate, arun. TypeScript is async by default.
Web search — WebSearchTool
Backed by Tavily (default) or Brave. The returned object is a real LangChain BaseTool; its output flows through CompresrToolMiddleware automatically.
Why not Anthropic / OpenAI / Gemini server search?
Provider-native server search tools (web_search_20250305, web_search_preview, google_search) execute server-side and return opaque/encrypted content that Compresr cannot read or compress. Use Tavily or Brave so the result is plaintext. See the Web search guide.
Bring your own tool
Any LangChain @tool-decorated function works. The string return value is compressed before the LLM sees it.
Streaming isn't on the agent layer yet — client.messages.stream(...) / client.chat.completions.stream(...) throw CompresrError('streaming not yet implemented'). The compression-API stream (compress_stream) is unaffected.
Per-call LLM knobs
Forwarded to the underlying chat model: temperature, top_p, top_k, max_tokens, max_output_tokens, stop, stop_sequences, presence_penalty, frequency_penalty, seed, logprobs, top_logprobs. Anything else is silently dropped.
Gemini aliasing
When provider == "google_genai" the SDK renames max_tokens → max_output_tokens automatically. Pass max_tokens from any provider — the SDK will do the right thing.
Compression knobs — compression={...}
Set at client construction. Applies to every tool-output compression the middleware fires.
| Key | Default | Effect |
|---|---|---|
target_compression_ratio | 0.5 | 0–1 removal strength; >1 = Nx factor (same as compress arg). |
min_tokens | 200 | Tool outputs shorter than this skip compression. |
coarse | server default (True) | Paragraph-level vs token-level. |
compression_model_name | "latte_v1" | Backend validates; latte_v1 is the only public model. |
allow_tools | None | Whitelist of tool names to compress. |
ignore_tools | None | Blacklist of tool names to leave untouched. |
on_error | "passthrough" | "raise" to fail loudly on backend errors instead of returning the original tool output. |
7. Async
compress_async and compress_batch_async are the async twins of compress and compress_batch — same params, return awaitables. Streaming is sync-only (no compress_stream_async). Call await client.aclose() when done to release the httpx pool — or use the client as an async context manager (async with CompressionClient(...) as client:). Use these inside event loops (FastAPI handlers, Discord bots, agent runtimes); for scripts the sync methods are simpler.
8. Errors & types
Every Compresr error inherits from CompresrError. Catch the base for a single handler; catch subclasses when recovery differs.
AuthenticationError—401. Missing, malformed, or revoked key. Rotate it.RateLimitError—429. Carries aretry_afterattribute (seconds). Back off and retry.ValidationError—400/422. Request body failed validation (e.g.target_compression_ratioout of range, missingquery). Fix the payload.CompresrError— base class. Network errors,5xx, or anything unexpected.
Always handle 429
The default tier has tight per-minute limits. A retry loop with exponential backoff (respecting retry_after) is the single most important piece of error handling for production.