Compresr docs


Python SDK

Compress prompts and retrieved context with the official Python client.

The compresr package wraps the REST API with typed methods, handles auth, and ships sync and async variants of most calls (streaming is sync-only). Requires Python 3.9+.

1. Install

Install from PyPI: pip, poetry, and uv all work.


The agent client ships in the base install

As of compresr 2.6.0 the agent client layer (client.messages.create, client.chat.completions.create, client.run, WebSearchTool) is part of the base install: pip install compresr is enough. The LangChain, provider chat-model, and Tavily/Brave dependencies are pulled in automatically. The old compresr[agents] and compresr[agents-all] extras still work as no-op aliases.
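Whether the bracketed extras still matter depends on the installed version. A minimal sketch of that check, assuming the distribution is registered as compresr and uses plain dotted numeric versions; the helper names are illustrative and not part of the SDK:

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '2.6.0' into (2, 6, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def agent_layer_in_base_install(version: str) -> bool:
    """True when the version is 2.6.0 or newer, per the note above."""
    return version_tuple(version) >= (2, 6, 0)


def installed_compresr_version():
    """Return the installed version string, or None if compresr is absent."""
    try:
        return metadata.version("compresr")
    except metadata.PackageNotFoundError:
        return None
```

Pre-release strings such as 2.6.0rc1 are treated conservatively as older than 2.6.0 by this sketch.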

2. Initialize the client

Construct CompressionClient once at module scope and reuse it: the client keeps an internal httpx connection pool. Read the key from the environment; never hardcode it.

The constructor takes api_key (required), plus optional base_url (defaults to https://api.compresr.ai) and timeout (seconds; default uses the SDK's built-in timeout). Override base_url only for regional or self-hosted endpoints.
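One way to keep the key out of source code is to assemble the constructor arguments from the environment. This is a sketch only: the variable names COMPRESR_API_KEY, COMPRESR_BASE_URL, and COMPRESR_TIMEOUT are assumptions made for illustration, as is the import path shown in the trailing comment.

```python
import os


def client_kwargs_from_env(env=None) -> dict:
    """Build CompressionClient keyword arguments from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("COMPRESR_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("COMPRESR_API_KEY is not set")
    kwargs = {"api_key": api_key}
    # Only override the SDK defaults when a value is explicitly configured.
    if env.get("COMPRESR_BASE_URL"):
        kwargs["base_url"] = env["COMPRESR_BASE_URL"]
    if env.get("COMPRESR_TIMEOUT"):
        kwargs["timeout"] = float(env["COMPRESR_TIMEOUT"])  # seconds
    return kwargs


# At module scope, so the connection pool is shared (import path assumed):
# from compresr import CompressionClient
# client = CompressionClient(**client_kwargs_from_env())
```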


See Authentication for key rotation, budgets, and rules.

3. compress

Synchronous single-request compression. Pass context, query, and compression_model_name="latte_v1"; the model keeps the spans that matter for the query. For many chunks against one query, see compress_batch; for token-by-token output, compress_stream.
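A minimal sketch of a single call, written as a helper that accepts an already constructed client. The helper name is hypothetical; the keyword arguments and the result.data.compressed_context attribute are the ones described on this page.

```python
def compress_for_prompt(client, context: str, query: str) -> str:
    """Compress `context` for `query` and return the text to put in a prompt."""
    result = client.compress(
        context=context,
        query=query,
        compression_model_name="latte_v1",
    )
    # The typed response exposes the compressed text as an attribute.
    return result.data.compressed_context
```

Passing the client in, rather than constructing it inside the helper, keeps the module-scope client and its connection pool shared across calls.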


Parameters

latte_v2 accepts every parameter latte_v1 accepts, plus three latte_v2-only knobs for dynamic compression-ratio selection. See the Models reference for the canonical decision guide and the at-a-glance support matrix.

Shared parameters (both models)

context (string, required)
The long text to compress: RAG chunks, document body, chat history.

query (string, required)
The question the compressed context must still answer. Required for both models.

compression_model_name ("latte_v1" | "latte_v2", required)
Routes the call. latte_v2 is the recommended default. See the Models reference.

target_compression_ratio (number, optional)
Removal strength when 0 < r ≤ 1, or Nx target when r > 1. See Models › target_compression_ratio. Ignored on latte_v2 when dynamic=True.

coarse (boolean | None, optional)
Default: None
None = backend default (paragraph-level); True locks paragraph-level; False opts into token-level precision.

heuristic_chunking (boolean | None, optional)
Default: None
Heuristic splitter (paragraphs, code blocks) instead of fixed-size chunks.

disable_placeholders (boolean | None, optional)
Default: None
Skip the [...] placeholders inserted where content was dropped.

latte_v2-only parameters

dynamic (boolean, optional)
Default: False
Pick the compression ratio per-input via Kneedle elbow selection inside [dynamic_min_ratio, dynamic_max_ratio]; overrides target_compression_ratio when True. Rejected on latte_v1 with ValidationError.

dynamic_min_ratio (float | None, optional)
Default: None (server default 1.5)
Floor on the chosen Nx ratio when dynamic=True. Must be ≥ 1.0. Only consulted when dynamic=True.

dynamic_max_ratio (float | None, optional)
Default: None (server default 10.0)
Ceiling on the chosen Nx ratio when dynamic=True. Must be ≥ 1.0. Only consulted when dynamic=True.
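The constraints on the dynamic knobs can be checked client-side before a request is sent. A sketch under stated assumptions: the helper name is hypothetical, ValueError stands in for the SDK's ValidationError, and the floor-not-above-ceiling check is an assumption that the page does not state.

```python
def dynamic_ratio_kwargs(model, dynamic=False, min_ratio=None, max_ratio=None):
    """Validate and assemble the latte_v2-only keyword arguments."""
    if not dynamic:
        return {}  # the bounds are only consulted when dynamic=True
    if model != "latte_v2":
        raise ValueError("dynamic=True is only accepted on latte_v2")
    bounds = {"dynamic_min_ratio": min_ratio, "dynamic_max_ratio": max_ratio}
    for name, value in bounds.items():
        if value is not None and value < 1.0:
            raise ValueError(f"{name} must be >= 1.0")
    # Assumption: the floor should not exceed the ceiling.
    if min_ratio is not None and max_ratio is not None and min_ratio > max_ratio:
        raise ValueError("dynamic_min_ratio must not exceed dynamic_max_ratio")
    kwargs = {"dynamic": True}
    # Omitted bounds fall back to the server defaults (1.5 and 10.0).
    kwargs.update({k: v for k, v in bounds.items() if v is not None})
    return kwargs
```

The returned dict can be splatted into a compress call alongside context, query, and compression_model_name.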

Response

compress() returns a typed object; access fields as attributes (result.data.compressed_context). Response field names stay snake_case across every SDK.

CompressionResponse
  • data (object)
    • compressed_context (string)

      The compressed text, ready to drop into your prompt.

    • original_tokens (integer)

      Token count of the input context (tiktoken cl100k).

    • compressed_tokens (integer)

      Token count of the compressed output.

    • tokens_saved (integer)

      original_tokens − compressed_tokens.

    • actual_compression_ratio (number)

      Fraction of input tokens actually removed (0–1). e.g. 0.5 = ~50% removed.

    • duration_ms (integer)

      Server-side wall-clock time for the compression pass.
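The token fields relate to each other by simple arithmetic. The sketch below recomputes the derived values from the two counts, which can be handy for logging or sanity checks; the function name and the nx_factor key are illustrative, and the server's own rounding may differ.

```python
def compression_metrics(original_tokens: int, compressed_tokens: int) -> dict:
    """Derive savings figures from the two token counts in a response."""
    if original_tokens <= 0 or compressed_tokens < 0:
        raise ValueError("token counts must be positive / non-negative")
    tokens_saved = original_tokens - compressed_tokens
    return {
        "tokens_saved": tokens_saved,
        # Same meaning as actual_compression_ratio: fraction of input removed.
        "removed_fraction": tokens_saved / original_tokens,
        # The equivalent "Nx" view: how many times smaller the output is.
        "nx_factor": (original_tokens / compressed_tokens
                      if compressed_tokens else float("inf")),
    }
```

For example, 1000 input tokens compressed to 250 gives 750 tokens saved, a removed fraction of 0.75, and a 4x factor.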

4. Stream

client.compress_stream(...) returns an iterator yielding {content, done} chunks as the model produces them; the final chunk has done=True and empty content. Use it anywhere time-to-first-token matters (UIs, agent loops); for one-shot calls stick with compress.


The iterator is a normal generator: wrap it in itertools.islice, push chunks through a queue, or consume from a worker thread. Same context / query / compression_model_name rules as compress().
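A sketch of consuming the stream: it forwards each piece of text to an optional callback and returns the assembled result. Whether chunks are dicts or objects with attributes is not specified here, so the accessor below handles both; the helper names are hypothetical.

```python
def _field(chunk, name):
    """Read a field from a chunk that may be a dict or an attribute object."""
    if isinstance(chunk, dict):
        return chunk[name]
    return getattr(chunk, name)


def collect_stream(chunks, on_text=None) -> str:
    """Consume {content, done} chunks and return the joined content."""
    parts = []
    for chunk in chunks:
        if _field(chunk, "done"):
            break  # the final chunk carries done=True and empty content
        text = _field(chunk, "content")
        parts.append(text)
        if on_text is not None:
            on_text(text)  # e.g. push to a UI as soon as text arrives
    return "".join(parts)


# Usage with a constructed client (arguments as for compress()):
# text = collect_stream(
#     client.compress_stream(context=doc, query=question,
#                            compression_model_name="latte_v1"),
#     on_text=print,
# )
```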

5. Batch

client.compress_batch(...) compresses many contexts in one request. Pass contexts: list[str] plus either a single queries: str (applied to every context) or a queries: list[str] matching contexts in length. Cheaper than firing N concurrent compress() calls, and ideal for RAG re-ranking or bulk document processing.


Mixing the two queries forms raises ValidationError. Per-item results carry the same fields as a single compress() call except target_compression_ratio (request-level only). The envelope also exposes aggregates on result.data: count, total_original_tokens, total_compressed_tokens, total_tokens_saved, and average_compression_ratio.
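The pairing rule between contexts and queries can be sketched as a small pre-flight check. This illustrates the documented rule and is not the SDK's own validation; the function name is hypothetical and ValueError stands in for ValidationError.

```python
def pair_queries(contexts, queries):
    """Return (context, query) pairs following the batch pairing rule."""
    contexts = list(contexts)
    if isinstance(queries, str):
        # A single string is applied to every context.
        return [(context, queries) for context in contexts]
    queries = list(queries)
    if len(queries) != len(contexts):
        raise ValueError(
            f"got {len(queries)} queries for {len(contexts)} contexts"
        )
    return list(zip(contexts, queries))
```

Running this before compress_batch surfaces a length mismatch locally instead of spending a request on it.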

6. Agent client

Construct CompressionClient with llm= and you get an agent surface with three call shapes (Anthropic-style messages.create, OpenAI-style chat.completions.create, native run) that auto-compress every tool output above min_tokens before the LLM sees it. Behind all three sit LangChain 1.0's create_agent and the SDK's CompresrToolMiddleware. Use it as a drop-in for anthropic.Anthropic() or openai.OpenAI(); for raw (context, query) calls, stick with compress.

These surfaces are SDK-shaped and have no direct cURL equivalent. The underlying compression is still the same /api/compress/question-specific/ endpoint; it's what the middleware fires whenever a tool returns.

Construct with llm=

Provider lives on the client; model lives at the call site. Swap providers by changing one string; the tools and the calling code stay the same.


The llm string accepts "anthropic" (provider only, every call must pass model="..."), "anthropic:claude-haiku-4-5" (default model, overridable at call site), or "anthropic/claude-haiku-4-5" (Vercel AI SDK convention; both separators accepted). If neither the llm string nor the call site provides a model, the SDK raises CompresrError("model is required …").
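The resolution rule for the llm string can be illustrated with a few lines of parsing. This is a sketch of the described behavior, not the SDK's parser; the function names are hypothetical and ValueError stands in for CompresrError.

```python
def parse_llm(llm: str):
    """Split 'provider', 'provider:model', or 'provider/model'."""
    for index, char in enumerate(llm):
        if char in ":/":
            # Split on the first separator only; an empty model means none.
            return llm[:index], (llm[index + 1:] or None)
    return llm, None


def resolve_model(llm: str, model=None):
    """Pick the call-site model if given, else the default from `llm`."""
    provider, default_model = parse_llm(llm)
    chosen = model or default_model
    if chosen is None:
        raise ValueError("model is required")
    return provider, chosen
```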

Three call shapes

messages.create duck-types anthropic.types.Message, chat.completions.create duck-types openai.types.chat.ChatCompletion, and run returns a native NormalizedResult (.text, .tool_uses, .citations, .stop_reason, .usage).


Python also exposes async variants: acreate, arun. TypeScript is async by default.
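Because each call shape returns a differently shaped object, reading the final text differs per shape. The helpers below sketch the three access paths, assuming the duck-typed results expose the same attributes as anthropic.types.Message and openai.types.chat.ChatCompletion; the helper names are illustrative.

```python
def text_from_message(message) -> str:
    """Anthropic-style result: join the text blocks in `content`."""
    return "".join(
        block.text
        for block in message.content
        if getattr(block, "type", None) == "text"
    )


def text_from_chat_completion(completion) -> str:
    """OpenAI-style result: the first choice's message content."""
    return completion.choices[0].message.content or ""


def text_from_run(result) -> str:
    """Native NormalizedResult: the text is a top-level attribute."""
    return result.text
```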

Web search: WebSearchTool

Backed by Tavily (default) or Brave. The returned object is a real LangChain BaseTool; its output flows through CompresrToolMiddleware automatically.


Why not Anthropic / OpenAI / Gemini server search?

Provider-native server search tools (web_search_20250305, web_search_preview, google_search) execute server-side and return opaque/encrypted content that Compresr cannot read or compress. Use Tavily or Brave so the result is plaintext. See the Web search guide.

Bring your own tool

Any LangChain @tool-decorated function works. The string return value is compressed before the LLM sees it.


Streaming isn't on the agent layer yet: client.messages.stream(...) and client.chat.completions.stream(...) raise CompresrError('streaming not yet implemented'). The compression-API stream (compress_stream) is unaffected.

Per-call LLM knobs

Forwarded to the underlying chat model: temperature, top_p, top_k, max_tokens, max_output_tokens, stop, stop_sequences, presence_penalty, frequency_penalty, seed, logprobs, top_logprobs. Anything else is silently dropped.
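Since unknown keyword arguments are dropped silently, a typo such as temprature goes unnoticed. A small guard, sketched here with a hypothetical helper name, splits call kwargs into the forwarded set listed above and everything else so the leftovers can be logged or rejected:

```python
FORWARDED_KNOBS = frozenset({
    "temperature", "top_p", "top_k", "max_tokens", "max_output_tokens",
    "stop", "stop_sequences", "presence_penalty", "frequency_penalty",
    "seed", "logprobs", "top_logprobs",
})


def split_llm_knobs(kwargs: dict):
    """Return (forwarded kwargs, sorted names that would be dropped)."""
    forwarded = {k: v for k, v in kwargs.items() if k in FORWARDED_KNOBS}
    dropped = sorted(set(kwargs) - FORWARDED_KNOBS)
    return forwarded, dropped
```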


Gemini aliasing

When provider == "google_genai" the SDK automatically renames max_tokens to max_output_tokens. Pass max_tokens regardless of provider; the SDK translates it where needed.
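The renaming amounts to a one-key translation. A sketch of the idea, not the SDK's code: the function name is hypothetical, and letting an explicitly passed max_output_tokens win over max_tokens is an assumption.

```python
def alias_knobs_for_provider(provider: str, knobs: dict) -> dict:
    """Return a copy of `knobs` with max_tokens renamed for Gemini."""
    knobs = dict(knobs)  # never mutate the caller's dict
    if provider == "google_genai" and "max_tokens" in knobs:
        value = knobs.pop("max_tokens")
        # Assumption: an explicit max_output_tokens takes precedence.
        knobs.setdefault("max_output_tokens", value)
    return knobs
```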

Compression knobs: compression={...}

Set at client construction; applies to every tool-output compression the middleware fires. The model-routing key mirrors compress(): compression_model_name picks the backbone, and the compression-shaping keys forward through to the same /api/compress/question-specific/ endpoint.

Shared keys (accepted regardless of compression_model_name):

Key | Default | Effect
compression_model_name | "latte_v1" | Backend validates; "latte_v1" and "latte_v2" are both public. See Models.
target_compression_ratio | 0.5 | 0–1 removal strength; >1 = Nx factor (same as compress arg). Ignored on latte_v2 when dynamic=True.
min_tokens | 200 | Tool outputs shorter than this skip compression. Middleware-side gate; not forwarded to the API.
coarse | server default (True) | Paragraph-level vs token-level.
heuristic_chunking | server default (False) | Structure-aware chunker before scoring.
disable_placeholders | server default (False) | Drop the [...] markers between kept spans.
allow_tools | None | Whitelist of tool names to compress.
ignore_tools | None | Blacklist of tool names to leave untouched.
on_error | "passthrough" | "raise" to fail loudly on backend errors instead of returning the original tool output.
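The middleware-side keys (min_tokens, allow_tools, ignore_tools, on_error) describe a gate in front of the compression call. The sketch below illustrates that gate under stated assumptions: the function names are hypothetical, the token count is supplied by the caller, and ignore_tools winning over allow_tools when a tool appears in both is an assumption.

```python
def should_compress(tool_name, token_count, min_tokens=200,
                    allow_tools=None, ignore_tools=None) -> bool:
    """Decide whether a tool output goes through compression."""
    if ignore_tools is not None and tool_name in ignore_tools:
        return False  # assumption: the blacklist takes precedence
    if allow_tools is not None and tool_name not in allow_tools:
        return False
    # Outputs shorter than min_tokens skip compression.
    return token_count >= min_tokens


def compress_or_passthrough(output, compress_fn, on_error="passthrough"):
    """Apply compress_fn, honoring the on_error policy."""
    try:
        return compress_fn(output)
    except Exception:
        if on_error == "raise":
            raise
        return output  # "passthrough": hand the LLM the original output
```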

latte_v2-only keys (set compression_model_name="latte_v2" first; the backend rejects these with 422 on latte_v1):

Key | Default | Effect
dynamic | False | Pick the compression ratio per-tool-output via Kneedle elbow selection. Overrides target_compression_ratio when True.
dynamic_min_ratio | 1.5 | Floor on the chosen Nx ratio when dynamic=True. Must be ≥ 1.0.
dynamic_max_ratio | 10.0 | Ceiling on the chosen Nx ratio when dynamic=True. Must be ≥ 1.0.

7. Async

compress_async and compress_batch_async are the async twins of compress and compress_batch: same params, return awaitables. Streaming is sync-only (no compress_stream_async). Call await client.aclose() when done to release the httpx pool, or use the client as an async context manager (async with CompressionClient(...) as client:). Use these inside event loops (FastAPI handlers, Discord bots, agent runtimes); for scripts the sync methods are simpler.
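A sketch of the async path using the context-manager form, so the httpx pool is released even if the call fails. The function names are hypothetical; client_factory is any zero-argument callable that returns a client, for example lambda: CompressionClient(api_key=...).

```python
import asyncio


async def compress_documents(client_factory, contexts, query):
    """Compress many contexts against one query and return the envelope."""
    # The async context manager closes the client on exit.
    async with client_factory() as client:
        return await client.compress_batch_async(
            contexts=contexts,
            queries=query,  # a single string applies to every context
            compression_model_name="latte_v1",
        )


def compress_documents_sync(client_factory, contexts, query):
    """Convenience wrapper for scripts with no running event loop."""
    return asyncio.run(compress_documents(client_factory, contexts, query))
```

Inside a FastAPI handler or another running loop, await compress_documents directly; asyncio.run is only for the top level of a script.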


8. Errors & types

Every Compresr error inherits from CompresrError. Catch the base for a single handler; catch subclasses when recovery differs.

  • AuthenticationError: 401. Missing, malformed, or revoked key. Rotate it.
  • RateLimitError: 429. Carries a retry_after attribute (seconds). Back off and retry.
  • ValidationError: 400 / 422. Request body failed validation (e.g. target_compression_ratio out of range, missing query). Fix the payload.
  • CompresrError: base class. Network errors, 5xx, or anything unexpected.

Always handle 429

The default tier has tight per-minute limits. A retry loop with exponential backoff (respecting retry_after) is the single most important piece of error handling for production.
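A minimal sketch of such a retry loop. It is generic on purpose: the exception class is passed in, so in real use you would supply the SDK's RateLimitError (its import path is not shown on this page). The function name, the default delays, and the jitter amount are illustrative choices.

```python
import random
import time


def call_with_backoff(call, rate_limit_error, max_attempts=5,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run call(); on rate limiting, wait and retry with growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except rate_limit_error as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            backoff = min(max_delay, base_delay * 2 ** attempt)
            retry_after = getattr(exc, "retry_after", None) or 0
            # Never wait less than the server asked for.
            delay = max(backoff, retry_after)
            # A little jitter keeps concurrent clients from retrying in step.
            sleep(delay + random.uniform(0, 0.1 * delay))


# Usage with a constructed client:
# result = call_with_backoff(
#     lambda: client.compress(context=doc, query=question,
#                             compression_model_name="latte_v1"),
#     RateLimitError,
# )
```

Only the rate-limit error is retried; authentication and validation errors propagate immediately because retrying them cannot succeed.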