Python SDK

Compress prompts and retrieved context with the official Python client.

The compresr package is the official Python client. It wraps the REST API with typed methods, handles auth, and ships both sync and async variants of every call. Requires Python 3.9+.

1. Install

Install from PyPI with pip install compresr; poetry and uv work as well.

The agent client ships in the base install

As of compresr 2.8.2, the agent client layer (client.messages.create, client.chat.completions.create, client.run, WebSearchTool) is part of the base install: pip install compresr is enough. The LangChain, provider chat-model, and Tavily/Brave dependencies are pulled in automatically. The old compresr[agents] / compresr[agents-all] extras still work as no-op aliases.

2. Initialize the client

Construct CompressionClient once at module scope and reuse it: the client keeps an internal httpx connection pool. Read the key from the environment; never hardcode it.

The constructor takes api_key, plus optional base_url (defaults to https://api.compresr.ai) and timeout (in seconds; defaults to 300). When api_key is omitted, the SDK resolves it from COMPRESR_API_KEY or the credentials file (see Constructor options). Override base_url only for regional or self-hosted endpoints.


See Authentication for key rotation, budgets, and rules.

Constructor options

api_key (str | None, optional)

Default: None

If omitted, resolved from COMPRESR_API_KEY, then the [default] profile in ~/.compresr/credentials (INI format, populated by compresr-sdk login). Select a different profile with COMPRESR_PROFILE.

base_url (str | None, optional)

Default: None → COMPRESR_BASE_URL → https://api.compresr.ai

API endpoint override. Non-HTTPS URLs are refused unless COMPRESR_ALLOW_INSECURE=1 is set (raises CompresrError("insecure_base_url")).

timeout (int | None, optional)

Default: None → 300

Request timeout in seconds (per HTTP call).

retry_config (RetryConfig | None, optional)

Default: None (built-in policy)

Override the retry policy. Default retries 429 and 503 with exponential backoff (respects Retry-After). Import RetryConfig from compresr: RetryConfig(max_retries=..., retry_on_status=...).

llm (str | None, optional)

Default: None

Provider for the agent surface (e.g. "anthropic", "openai:gpt-4o-mini"). Required for client.messages / client.chat / client.run / client.research. See Section 6.

llm_api_key (str | None, optional)

Default: None

API key for the LLM provider. Falls back to ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY depending on llm.

llm_http_client (httpx.Client | None, optional)

Default: None

Custom sync httpx.Client the SDK uses for the downstream LLM call. Escape hatch for corporate proxies, custom CA bundles, and mTLS: httpx.Client(verify="/etc/ssl/corp-ca.pem", proxy="http://proxy:3128"). Recent httpx releases take proxy=; the older proxies= argument was removed in httpx 0.28.

llm_http_async_client (httpx.AsyncClient | None, optional)

Default: None

Async twin of llm_http_client, used by acreate / arun.

compression (dict | CompressionPolicy | None, optional)

Default: None

Middleware compression policy applied to every tool output. See Compression knobs.

enable_prompt_cache (bool, optional)

Default: True

Enable provider-side prompt caching (Anthropic cache_control, OpenAI prompt_cache_key). No-op for Gemini (implicit caching is always on server-side).

prompt_cache_ttl ("5m" | "1h", optional)

Default: "5m"

Anthropic cache TTL. Longer TTL costs more per cache write but survives longer between calls. On OpenAI, "1h" maps to prompt_cache_retention: "24h".

prompt_cache_min_messages (int, optional)

Default: 2

Skip caching for very short conversations (avoids paying the cache-write premium on trivial prompts).

openai_prompt_cache_key (str | None, optional)

Default: None

Explicit OpenAI prompt_cache_key; when omitted the SDK does not set prompt_cache_key on the OpenAI call and provider defaults apply.

Environment variables

Variable	Purpose
`COMPRESR_API_KEY`	Fallback for `api_key`. Also read by the cURL examples.
`COMPRESR_BASE_URL`	Fallback for `base_url`.
`COMPRESR_ALLOW_INSECURE`	Set to `1` to allow non-HTTPS `base_url` (local dev / self-hosted gateway). Client refuses to start otherwise.
`ANTHROPIC_API_KEY` / `OPENAI_API_KEY` / `GOOGLE_API_KEY`	Fallback for `llm_api_key` when the matching provider is used.
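
The base_url fallback chain and the HTTPS guard described above can be pictured with a small, SDK-free sketch. resolve_base_url is a hypothetical helper name, and ValueError stands in for the documented CompresrError("insecure_base_url") so that the snippet runs without the package installed; the real client may differ in detail.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

DEFAULT_BASE_URL = "https://api.compresr.ai"


def resolve_base_url(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Illustrative lookup: explicit arg, then COMPRESR_BASE_URL, then the default."""
    url = explicit or env.get("COMPRESR_BASE_URL") or DEFAULT_BASE_URL
    scheme = urlsplit(url).scheme.lower()
    if scheme != "https" and env.get("COMPRESR_ALLOW_INSECURE") != "1":
        # Stand-in for the documented CompresrError("insecure_base_url").
        raise ValueError("insecure_base_url")
    return url
```

Passing env as a plain dict keeps the helper easy to exercise in tests without touching the real process environment.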

CLI authentication

Instead of passing api_key= explicitly, run compresr-sdk login to write credentials to the [default] profile in ~/.compresr/credentials (INI format, no extension); the SDK picks them up automatically on the next CompressionClient() call. Set COMPRESR_PROFILE to select a different profile. compresr-sdk logout clears them. Both are also exposed programmatically as login() / logout() from compresr.
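
As a rough illustration of the lookup order described here (explicit argument, then COMPRESR_API_KEY, then the selected profile in the credentials file), the sketch below reimplements it with the standard library only. resolve_api_key is a hypothetical helper, and the option name api_key inside the INI file is an assumption; the source only says the file is INI-formatted with a [default] profile.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    credentials_path: Optional[Path] = None,
) -> Optional[str]:
    """Illustrative lookup order: explicit arg, env var, then INI profile."""
    if explicit:
        return explicit
    if env.get("COMPRESR_API_KEY"):
        return env["COMPRESR_API_KEY"]
    if credentials_path is None:
        credentials_path = Path.home() / ".compresr" / "credentials"
    # COMPRESR_PROFILE selects the INI section; "default" is used otherwise.
    profile = env.get("COMPRESR_PROFILE", "default")
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is skipped, not an error
    if parser.has_section(profile):
        # ASSUMPTION: "api_key" is a guess at the option name in the file.
        return parser.get(profile, "api_key", fallback=None)
    return None
```

A None result here corresponds to the case where the client has nothing to authenticate with.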

3. compress

Synchronous single-request compression. Pass context, query, and compression_model_name="latte_v2"; the model keeps the spans that matter for the query. For many chunks against one query, see compress_batch; for incremental output, compress_stream.


Parameters

latte_v2 accepts every parameter latte_v1 accepts, plus three latte_v2-only knobs for dynamic compression-ratio selection. See the Models reference for the canonical decision guide and the at-a-glance support matrix.

Shared parameters (both models)

context (string, required)

The long text to compress: RAG chunks, document body, chat history.

query (str | None, optional)

The question the compressed context must still answer. Required for latte_v1; optional for latte_v2. Backend validates.

compression_model_name ("latte_v1" | "latte_v2", optional)

Default: "latte_v1"

Routes the call. SDK default is latte_v1 for stability; pass "latte_v2" to opt into the newer backbone. See the Models reference.

target_compression_ratio (float | None, optional)

Removal strength when 0 < r ≤ 1, or Nx target when r > 1 (e.g. 60 = 60×). Server hard-caps at 200. Ignored on latte_v2 when dynamic=True. See Models › target_compression_ratio.

coarse (boolean | None, optional)

Default: None

None = backend default (paragraph-level); True locks paragraph-level; False opts into token-level precision.

heuristic_chunking (boolean | None, optional)

Default: None

Heuristic splitter (paragraphs, code blocks) instead of fixed-size chunks.

disable_placeholders (boolean | None, optional)

Default: None

Skip the [...] placeholders inserted where content was dropped.

`latte_v2`-only parameters

dynamic (bool | None, optional)

Default: None

None = server default; True picks the compression ratio per-input automatically inside [dynamic_min_ratio, dynamic_max_ratio] and overrides target_compression_ratio; False explicitly forces the fixed-ratio path. Rejected on latte_v1 with ValidationError.

dynamic_min_ratio (float | None, optional)

Default: None (server default 1.5)

Floor on the chosen Nx ratio when dynamic=True. Must be ≥ 1.0. Only consulted when dynamic=True.

dynamic_max_ratio (float | None, optional)

Default: None (server default 10.0)

Ceiling on the chosen Nx ratio when dynamic=True. Must be ≥ 1.0. Only consulted when dynamic=True.
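
To make the two regimes of target_compression_ratio and the dynamic window concrete, here is a small arithmetic sketch. Both function names are hypothetical, and reading "removal strength" as the approximate fraction of tokens removed is an assumption; real output sizes depend on the model and the input.

```python
from typing import Optional, Tuple

MAX_NX = 200.0  # documented server-side hard cap on the Nx form


def estimate_output_tokens(original_tokens: int, ratio: float) -> int:
    """Rough size of the output implied by target_compression_ratio."""
    if ratio <= 0:
        raise ValueError("target_compression_ratio must be positive")
    if ratio <= 1:
        # ASSUMPTION: removal strength r means roughly r of the tokens are dropped.
        return round(original_tokens * (1 - ratio))
    # Nx form: 60 means about a 60x reduction, capped at 200x.
    return round(original_tokens / min(ratio, MAX_NX))


def dynamic_window(
    min_ratio: Optional[float] = None,
    max_ratio: Optional[float] = None,
) -> Tuple[float, float]:
    """Window used when dynamic=True; None falls back to the server defaults."""
    low = 1.5 if min_ratio is None else min_ratio
    high = 10.0 if max_ratio is None else max_ratio
    if low < 1.0 or high < 1.0:
        raise ValueError("dynamic ratios must be >= 1.0")
    return low, high
```

For example, a 1,200-token context with a ratio of 60 lands near 20 tokens, while a ratio of 0.25 leaves about 900.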

Response

compress() returns a typed object; access fields as attributes (result.data.compressed_context). Response field names stay snake_case across every SDK.

CompressionResponse

data (object)
- compressed_context (string)
  The compressed text, ready to drop into your prompt.
- original_tokens (integer)
  Token count of the input context (tiktoken cl100k).
- compressed_tokens (integer)
  Token count of the compressed output.
- tokens_saved (integer)
  original_tokens − compressed_tokens.
- actual_compression_ratio (number)
  Fraction of input tokens removed (0..1) when target_compression_ratio was 0..1, or the achieved Nx factor when the Nx form was requested. Mirrors the input regime.
- duration_ms (integer)
  Server-side wall-clock time for the compression pass.
  Server-side wall-clock time for the compression pass.
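
The relationship between the token fields can be checked with a few lines of arithmetic. CompressionStats below is a local stand-in, not the SDK's response class, and computing the Nx factor as original_tokens / compressed_tokens is an assumption about how the achieved factor is derived.

```python
from dataclasses import dataclass


@dataclass
class CompressionStats:
    """Stand-in for the token counts on result.data (not the SDK's class)."""

    original_tokens: int
    compressed_tokens: int

    @property
    def tokens_saved(self) -> int:
        return self.original_tokens - self.compressed_tokens

    @property
    def removed_fraction(self) -> float:
        # The 0..1 regime: share of input tokens that were dropped.
        return self.tokens_saved / self.original_tokens

    @property
    def nx_factor(self) -> float:
        # The Nx regime (assumed definition): input size over output size.
        return self.original_tokens / self.compressed_tokens


stats = CompressionStats(original_tokens=1200, compressed_tokens=300)
```

With these numbers the same result reads as 0.75 in the fractional regime and 4.0 in the Nx regime, which is why it matters which form the request used.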

4. Stream

client.compress_stream(...) returns an iterator yielding {content, done} chunks as the model produces them; the final chunk has done=True and empty content. Use it anywhere time-to-first-token matters (UIs, agent loops); for one-shot calls stick with compress.


The iterator is a normal generator: wrap it in itertools.islice, push chunks through a queue, or consume from a worker thread. Same context / query / compression_model_name rules as compress().
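
The consumption pattern can be tried without network access by swapping in a fake generator. fake_stream is a hypothetical stand-in for client.compress_stream(...), and treating each chunk as a dict with content and done keys is an assumption; the SDK may yield typed objects with the same field names instead.

```python
from itertools import islice
from typing import Iterable, Iterator


def fake_stream() -> Iterator[dict]:
    """Hypothetical stand-in for client.compress_stream(...)."""
    for piece in ("Paris is ", "the capital ", "of France."):
        yield {"content": piece, "done": False}
    # The final chunk is documented as done=True with empty content.
    yield {"content": "", "done": True}


def collect(chunks: Iterable[dict]) -> str:
    """Join streamed content until the done marker arrives."""
    parts = []
    for chunk in chunks:
        if chunk["done"]:
            break
        parts.append(chunk["content"])
    return "".join(parts)


full_text = collect(fake_stream())
# islice caps how many chunks are pulled, e.g. for a preview.
first_two = [c["content"] for c in islice(fake_stream(), 2)]
```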

5. Batch

client.compress_batch(...) compresses many contexts in one request. Pass contexts: list[str] plus either a single queries: str (applied to every context) or a queries: list[str] matching contexts in length. Cheaper than firing N concurrent compress() calls, and ideal for RAG re-ranking or bulk document processing.


queries is either a string (applied to every context) or a list matching contexts in length; mixing the two raises ValidationError. Per-item results carry the same fields as a single compress() call except target_compression_ratio (request-level only). The envelope also exposes aggregates: result.data.count, total_original_tokens, total_compressed_tokens, total_tokens_saved, average_compression_ratio.

Alternate form: `inputs=[{context, query}, ...]`

The wire format is a list of {context, query} pairs. Pass inputs= instead of contexts=/queries= when it matches your data shape more naturally (queues, streaming pipelines, per-item queries). Exactly one of inputs OR contexts is required; passing both — or neither — raises ValidationError.
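
The two calling conventions reduce to the same list of pairs. The sketch below shows one way that normalization could look; build_inputs is a hypothetical helper, ValueError stands in for the SDK's ValidationError, and broadcasting a missing queries value as None is an assumption.

```python
from typing import List, Optional, Sequence, Union


def build_inputs(
    contexts: Optional[Sequence[str]] = None,
    queries: Union[str, Sequence[str], None] = None,
    inputs: Optional[Sequence[dict]] = None,
) -> List[dict]:
    """Normalize either calling form into a list of {context, query} pairs."""
    # Exactly one of inputs / contexts must be supplied.
    if (inputs is None) == (contexts is None):
        raise ValueError("pass exactly one of inputs or contexts")
    if inputs is not None:
        return [dict(item) for item in inputs]
    if queries is None or isinstance(queries, str):
        # A single query (or none) is applied to every context.
        queries = [queries] * len(contexts)
    elif len(queries) != len(contexts):
        raise ValueError("queries must match contexts in length")
    return [{"context": c, "query": q} for c, q in zip(contexts, queries)]
```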


6. Agent client

Construct CompressionClient with llm= and you get an agent surface: three call-shapes (Anthropic-style messages.create, OpenAI-style chat.completions.create, native run) that auto-compress every tool output above min_tokens before the LLM sees it. Behind all three sits LangChain 1.0's create_agent + the SDK's CompresrToolMiddleware. Use it as a drop-in for anthropic.Anthropic() / openai.OpenAI(); for raw (context, query) calls stick with compress.

These surfaces are SDK-shaped and have no direct cURL equivalent. The underlying compression is still the same /api/compress/question-specific/ endpoint; it's what the middleware fires whenever a tool returns.

Construct with `llm=`

Provider lives on the client; model lives at the call site. Swap providers by changing one string: the tools and the calling code stay the same.


The llm string accepts "anthropic" (provider only, every call must pass model="..."), "anthropic:claude-haiku-4-5" (default model, overridable at call site), or "anthropic/claude-haiku-4-5" (Vercel AI SDK convention; both separators accepted). If neither provides a model, the SDK raises CompresrError("model is required …").
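
The three accepted spellings can be illustrated with a short parser. split_llm and pick_model are hypothetical names, and ValueError stands in for the documented CompresrError; splitting on the first ":" or "/" is an assumption about how the SDK reads the string.

```python
from typing import Optional, Tuple


def split_llm(llm: str) -> Tuple[str, Optional[str]]:
    """Split "provider", "provider:model" or "provider/model"."""
    for index, char in enumerate(llm):
        if char in ":/":
            # Everything after the first separator is the default model.
            return llm[:index], llm[index + 1:] or None
    return llm, None


def pick_model(llm: str, call_model: Optional[str] = None) -> Tuple[str, str]:
    """Call-site model wins; otherwise fall back to the default in llm."""
    provider, default_model = split_llm(llm)
    model = call_model or default_model
    if model is None:
        # Stand-in for the documented CompresrError("model is required ...").
        raise ValueError("model is required")
    return provider, model
```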

Three call shapes

messages.create duck-types anthropic.types.Message, chat.completions.create duck-types openai.types.chat.ChatCompletion, and run returns a native NormalizedResult (.text, .tool_uses, .citations, .stop_reason, .usage).


Python also exposes async variants: acreate, arun. TypeScript is async by default.

run() and arun() are keyword-only

client.run(...) and client.arun(...) accept only keyword arguments — client.run("question") raises TypeError. Always pass prompt=..., model=..., tools=... by name. This matches how messages.create and chat.completions.create are called.

Web search: `WebSearchTool`

Three providers ship in the box: Tavily, Brave, and AgentCore (Amazon Bedrock via MCP). All three return a real LangChain BaseTool; their output flows through CompresrToolMiddleware automatically.


Why not Anthropic / OpenAI / Gemini server search?

Provider-native server search tools (web_search_20250305, web_search_preview, google_search) execute server-side and return opaque/encrypted content that Compresr cannot read or compress. Use Tavily, Brave, or AgentCore so the result is plaintext. See the Web search guide.

Provider reference

Tavily (WebSearchTool.tavily) — reads api_key=, then TAVILY_API_KEY env var. Raises ValueError if neither is set. Supports allowed_domains / blocked_domains natively.

Brave (WebSearchTool.brave) — reads api_key=, then BRAVE_SEARCH_API_KEY, then BRAVE_API_KEY. Raises ValueError if none of the three are set. allowed_domains / blocked_domains are not supported (Brave uses Goggles for filtering, out of scope); passing them emits a UserWarning.

AgentCore (WebSearchTool.agentcore) — install with pip install compresr[agentcore]. Talks to an Amazon Bedrock AgentCore gateway over MCP streamable-HTTP, authenticated via a Cognito OAuth 2.0 client-credentials handshake. Bearer tokens are cached; a 401 triggers one automatic re-mint. max_results is clamped to 1..25; responses larger than 1 MB are rejected. allowed_domains / blocked_domains are accepted for signature parity but emit a UserWarning — use Tavily if you need domain filtering.

AgentCore config resolves per field with precedence explicit arg → AgentCore-namespaced env → short env. If any field is unresolved, WebSearchTool.agentcore(...) raises ValueError listing every missing field:

Argument	Env var (primary)	Env var (fallback)
`gateway_url`	`AGENTCORE_GATEWAY_MCP_URL`	`GATEWAY_MCP_URL`
`cognito_token_url`	`AGENTCORE_COGNITO_TOKEN_URL`	`COGNITO_TOKEN_URL`
`client_id`	`AGENTCORE_COGNITO_CLIENT_ID`	`COGNITO_CLIENT_ID`
`client_secret`	`AGENTCORE_COGNITO_CLIENT_SECRET`	`COGNITO_CLIENT_SECRET`
`scope`	`AGENTCORE_COGNITO_SCOPE`	`COGNITO_SCOPE`
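
The per-field precedence in the table above can be sketched as follows. resolve_agentcore_config is a hypothetical helper that mirrors the described behavior (explicit argument, then the AGENTCORE_-prefixed variable, then the short variable, with one ValueError naming every missing field); it is not the SDK's implementation.

```python
import os
from typing import Dict, Mapping, Optional

ENV_NAMES = {
    "gateway_url": ("AGENTCORE_GATEWAY_MCP_URL", "GATEWAY_MCP_URL"),
    "cognito_token_url": ("AGENTCORE_COGNITO_TOKEN_URL", "COGNITO_TOKEN_URL"),
    "client_id": ("AGENTCORE_COGNITO_CLIENT_ID", "COGNITO_CLIENT_ID"),
    "client_secret": ("AGENTCORE_COGNITO_CLIENT_SECRET", "COGNITO_CLIENT_SECRET"),
    "scope": ("AGENTCORE_COGNITO_SCOPE", "COGNITO_SCOPE"),
}


def resolve_agentcore_config(
    env: Mapping[str, str] = os.environ,
    **explicit: Optional[str],
) -> Dict[str, str]:
    """Resolve each field independently, then report all gaps at once."""
    resolved: Dict[str, str] = {}
    missing = []
    for field, (primary, fallback) in ENV_NAMES.items():
        value = explicit.get(field) or env.get(primary) or env.get(fallback)
        if value:
            resolved[field] = value
        else:
            missing.append(field)
    if missing:
        raise ValueError("missing AgentCore config: " + ", ".join(missing))
    return resolved
```

Collecting every missing field before raising means a misconfigured deployment can be fixed in one pass instead of one error at a time.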

Bring your own tool

Any LangChain @tool-decorated function works. The string return value is compressed before the LLM sees it.


Streaming isn't on the agent layer yet: the Python facades expose only .create / .acreate, so client.messages.stream(...) / client.chat.completions.stream(...) raise AttributeError (not a typed CompresrError). The compression-API stream (compress_stream) is unaffected.

Research: `client.research`

When constructed with llm=, the client also exposes client.research, a multi-step search-and-summarize loop that runs a web-search tool for you, compresses each snippet before it enters the LLM's context, and returns a structured result with citations. client.research.run(question) runs the full loop (up to max_steps); client.research.search(question) is the same loop capped at 2 steps for quick lookups.

Accessing client.research when llm= was not passed raises CompresrError — the facade needs a chat model to reason about search results.


search ("tavily" | "brave" | BaseTool, optional)

Default: "tavily"

Provider string (uses env-var fallbacks) or a preconstructed WebSearchTool.

max_steps (int, optional)

Default: 10

Upper bound on search / synthesize iterations. `.search()` overrides this to 2.

model (str | None, optional)

Default: None (uses client.llm)

Override the client-level model for this call.

compress_snippets (bool, optional)

Default: True

Route each search snippet through the compression API before it enters the LLM context.

compression_model (str, optional)

Default: "latte_v1"

Which model runs the snippet compression.

min_compress_tokens (int, optional)

Default: 100

Skip compression for snippets shorter than this many tokens.

max_context_tokens (int, optional)

Default: 120_000

Hard ceiling on total tokens across all compressed snippets before synthesis.

system_prompt (str | None, optional)

Default: None

Override the built-in system prompt (see DEFAULT_RESEARCH_SYSTEM_PROMPT).

ResearchResult fields:
- answer: str
- explanation: str
- confidence: float | None
- text: str
- citations: list[Citation]
- trajectory: list[Step]
- usage: ResearchUsage
- raw: Any (defaults to None; usually the provider's raw response object)

ResearchUsage has int counters input_tokens, output_tokens, cache_read_tokens, cache_creation_tokens, calls, search_calls. Each Citation has url: str, title: str | None, snippet: str | None.

Per-call LLM knobs

Forwarded to the underlying chat model: temperature, top_p, top_k, max_tokens, max_output_tokens, stop, stop_sequences, presence_penalty, frequency_penalty, seed, logprobs, top_logprobs. Anything else is silently dropped.


Gemini aliasing

When provider == "google_genai" the SDK renames max_tokens → max_output_tokens automatically, so you can pass max_tokens regardless of provider.
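
The forwarding rule and the Gemini rename can be pictured as a simple filter. forwardable_kwargs is a hypothetical helper; letting an explicit max_output_tokens win when both names are passed is an assumption, since the source does not say how that case is handled.

```python
from typing import Any, Dict

FORWARDED = frozenset({
    "temperature", "top_p", "top_k", "max_tokens", "max_output_tokens",
    "stop", "stop_sequences", "presence_penalty", "frequency_penalty",
    "seed", "logprobs", "top_logprobs",
})


def forwardable_kwargs(provider: str, **kwargs: Any) -> Dict[str, Any]:
    """Keep only the documented knobs; everything else is dropped silently."""
    kept = {name: value for name, value in kwargs.items() if name in FORWARDED}
    if provider == "google_genai" and "max_tokens" in kept:
        value = kept.pop("max_tokens")
        # ASSUMPTION: an explicitly passed max_output_tokens takes priority.
        kept.setdefault("max_output_tokens", value)
    return kept
```

Because unknown names vanish without an error, a typo such as temprature would go unnoticed; checking the returned dict in development is a cheap safeguard.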

Compression knobs: `compression={...}`

Set at client construction. Applies to every tool-output compression the middleware fires. The model-routing keys mirror compress() — compression_model_name picks the backbone, and the compression-shaping keys forward through to the same /compress/question-specific/ endpoint.

Shared keys (accepted regardless of compression_model_name):

Key	Default	Effect
`compression_model_name`	`"latte_v1"`	Backend validates; `"latte_v1"` and `"latte_v2"` are both public. See Models.
`target_compression_ratio`	`0.5`	0–1 removal strength; `>1` = Nx factor (same as `compress` arg). Ignored on `latte_v2` when `dynamic=True`.
`min_tokens`	`200`	Tool outputs shorter than this skip compression. Middleware-side gate; not forwarded to the API.
`coarse`	server default (`True`)	Paragraph-level vs token-level.
`allow_tools`	`None`	Whitelist of tool names to compress.
`ignore_tools`	`None`	Blacklist of tool names to leave untouched.
`on_error`	`"passthrough"`	`"raise"` to fail loudly on backend errors instead of returning the original tool output.

The middleware policy doesn't expose the dynamic* latte_v2-only knobs. If you need adaptive ratio selection on tool outputs, call client.compress(...) directly with dynamic=True instead of routing through the middleware.
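
The middleware-side keys (min_tokens, allow_tools, ignore_tools, on_error) amount to a gate plus an error policy. The sketch below is one plausible reading of the table, not the middleware's actual code; both function names are hypothetical, and giving ignore_tools priority over allow_tools is an assumption.

```python
from typing import Callable, Collection, Optional


def should_compress(
    tool_name: str,
    token_count: int,
    min_tokens: int = 200,
    allow_tools: Optional[Collection[str]] = None,
    ignore_tools: Optional[Collection[str]] = None,
) -> bool:
    """Decide whether a tool output goes through compression."""
    if ignore_tools is not None and tool_name in ignore_tools:
        return False
    if allow_tools is not None and tool_name not in allow_tools:
        return False
    # Outputs shorter than min_tokens are passed through untouched.
    return token_count >= min_tokens


def apply_policy(
    text: str,
    compress_fn: Callable[[str], str],
    on_error: str = "passthrough",
) -> str:
    """Run compress_fn; on failure either re-raise or return the original."""
    try:
        return compress_fn(text)
    except Exception:
        if on_error == "raise":
            raise
        return text
```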

7. Async

compress_async and compress_batch_async are the async twins of compress and compress_batch: same params, return awaitables. Streaming is sync-only (no compress_stream_async). Call await client.aclose() when done to release the httpx pool, or use the client as an async context manager (async with CompressionClient(...) as client:). Use these inside event loops (FastAPI handlers, Discord bots, agent runtimes); for scripts the sync methods are simpler.
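
A common pattern inside an event loop is to fan out several compressions with a concurrency cap and let the context manager close the pool. The sketch below uses FakeAsyncClient, a hypothetical stand-in that only borrows the documented method names; its compress_async returns a plain string, whereas the real method returns a response object.

```python
import asyncio
from typing import List, Sequence


class FakeAsyncClient:
    """Hypothetical stand-in exposing compress_async / aclose."""

    def __init__(self) -> None:
        self.closed = False

    async def compress_async(self, context: str, query: str) -> str:
        await asyncio.sleep(0)  # yield to the loop like a real network call
        return context[: len(context) // 2]

    async def aclose(self) -> None:
        self.closed = True

    async def __aenter__(self) -> "FakeAsyncClient":
        return self

    async def __aexit__(self, *exc_info) -> None:
        await self.aclose()


async def compress_all(client, contexts: Sequence[str], query: str, limit: int = 4) -> List[str]:
    gate = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def one(context: str) -> str:
        async with gate:
            return await client.compress_async(context=context, query=query)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(c) for c in contexts))


async def main():
    async with FakeAsyncClient() as client:
        results = await compress_all(client, ["aaaa", "bbbbbb"], "q")
    return results, client.closed


results, closed = asyncio.run(main())
```

For many contexts against one query, compress_batch_async is likely the cheaper route; the semaphore pattern is more useful when each item needs its own parameters.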


8. Errors & types

Every Compresr error inherits from CompresrError. Catch the base for a single handler; catch subclasses when recovery differs. Every subclass carries a stable code string and, where relevant, structured attributes you can branch on (e.g. err.retry_after, err.credits_remaining, err.available_models) instead of parsing prose.

Exception	HTTP	`code`	Structured attributes
`AuthenticationError`	401	`authentication_error`	—
`ScopeError`	403	`scope_error`	`required_scope`
`NotFoundError`	404	`not_found`	`resource: str | None`
`RateLimitError`	429	`rate_limit_exceeded`	`retry_after: int | None`
`ValidationError`	400 / 422	`validation_error`	`field: str | None`
`InsufficientCreditsError`	402	`insufficient_credits`	`credits_required`, `credits_remaining`
`BudgetLimitError`	402	`budget_limit_reached`	`current_budget`, `budget_used`
`ApiKeyBudgetError`	402	`api_key_budget_exceeded`	`api_key_budget`, `api_key_used`
`DailyLimitError`	429	`daily_limit_exceeded`	`daily_limit`, `requests_used`
`ModelNotFoundError`	404	`model_not_found`	`model_name`, `available_models`
`ContextWindowExceededError`	413	`context_window_exceeded`	`max_tokens`, `actual_tokens`
`ContentPolicyError`	400	`content_policy_violation`	`provider`
`TargetAuthenticationError`	401	`target_authentication_error`	`provider`
`ServiceUnavailableError`	503	`service_unavailable`	`service`, `retry_after: int | None`
`ServerError`	5xx	`server_error`	—
`CompresrTimeoutError`	—	`timeout`	`timeout_seconds: int | None`. Reserved; the transport currently maps HTTP timeouts to `CompresrConnectionError("Request timed out")`, so catch that class instead.
`CompresrConnectionError`	—	`connection_error`	`service: str | None`
`CompresrError`	—	(varies)	base class; catch it as a fallback

Fractional Retry-After

Servers occasionally emit Retry-After: 1.5. The SDK stores retry_after as Optional[int], so fractional values are dropped (attribute reads as None). Guard with time.sleep(err.retry_after or 1).


Always handle 429

The default tier has tight per-minute limits. A retry loop with exponential backoff (respecting retry_after) is the single most important piece of error handling for production.
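
An application-level retry loop of the kind described here might look like the sketch below. The RateLimitError class is a local stand-in that only borrows the documented retry_after attribute so the snippet runs without the SDK; call_with_backoff and its default delays are illustrative choices, not SDK settings. Keep in mind that the client already retries 429 and 503 by default, so a loop like this sits on top of that policy.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Local stand-in for compresr's RateLimitError (same attribute name)."""

    def __init__(self, retry_after: Optional[int] = None) -> None:
        super().__init__("rate_limit_exceeded")
        self.retry_after = retry_after


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries:
                raise
            # Prefer the server hint; retry_after may be None (for example
            # after a fractional header), so fall back to 1s, 2s, 4s, ...
            delay = err.retry_after or min(base_delay * 2 ** attempt, max_delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

Injecting sleep as a parameter keeps the loop testable without real waiting; in production the default time.sleep applies.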
