Compresr docs

Python SDK

Compress prompts and retrieved context with the official Python client.

The compresr package is the official Python client. It wraps the REST API with typed methods, handles auth, and ships sync and async variants of the compression calls (streaming is sync-only). Requires Python 3.9+.

1. Install

Install from PyPI with pip install compresr. Poetry and uv work as well.

The agent client ships in the base install

As of compresr 2.6.0 the agent client layer (client.messages.create, client.chat.completions.create, client.run, WebSearchTool) is part of the base install — pip install compresr is enough. LangChain + provider chat-model + Tavily/Brave deps are pulled in automatically. Old compresr[agents] / compresr[agents-all] brackets still work as no-op aliases.

2. Initialize the client

Construct CompressionClient once at module scope and reuse it, because the client keeps an internal httpx connection pool. Read the API key from an environment variable; never hardcode it.

The constructor takes api_key (required), plus optional base_url (defaults to https://api.compresr.ai) and timeout (seconds; default uses the SDK's built-in timeout). Override base_url only for regional or self-hosted endpoints.
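A minimal sketch of reading the constructor arguments from the environment. The variable names COMPRESR_API_KEY and COMPRESR_BASE_URL and the import path from compresr import CompressionClient are assumptions made for illustration; check them against your installed version.

```python
import os


def client_kwargs(env=None):
    """Collect CompressionClient constructor arguments from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("COMPRESR_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("COMPRESR_API_KEY is not set")
    kwargs = {"api_key": api_key}
    base_url = env.get("COMPRESR_BASE_URL")  # hypothetical override for regional/self-hosted endpoints
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs


def build_client():
    """Create the client once; call this at module scope and reuse the result."""
    from compresr import CompressionClient  # assumed import path
    return CompressionClient(**client_kwargs())
```

Leaving base_url out of the kwargs keeps the documented default (https://api.compresr.ai) in effect.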

See Authentication for key rotation, budgets, and rules.

3. compress

Synchronous single-request compression. Pass context, query, and compression_model_name="latte_v1"; the model keeps the spans that matter for the query. For many chunks against one query, see compress_batch; for token-by-token output, compress_stream.
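As a rough illustration of the call described above, the helper below wraps compress() and returns only the compressed text. It takes the client as an argument so it can be exercised with a stub; the helper name compress_for_query is hypothetical and not part of the SDK.

```python
def compress_for_query(client, context, query, ratio=None):
    """Compress `context` for `query` and return the compressed text."""
    kwargs = {
        "context": context,
        "query": query,
        "compression_model_name": "latte_v1",
    }
    # Only send target_compression_ratio when the caller asked for one,
    # so the backend default applies otherwise.
    if ratio is not None:
        kwargs["target_compression_ratio"] = ratio
    result = client.compress(**kwargs)
    return result.data.compressed_context
```

With a real client this would be called as compress_for_query(client, long_text, "What changed in Q3?"), where long_text is whatever context you retrieved.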

Parameters

context (string, required)
The long text to compress: RAG chunks, document body, chat history.
query (string, required)
The question the compressed context must still answer. Required for latte_v1.
compression_model_name ("latte_v1", required)
Public model identifier. latte_v1 is the only public model.
target_compression_ratio (number, optional)
Removal strength when 0 < r ≤ 1, or Nx target when r > 1. See the Models reference.
coarse (boolean | None, optional; default: None)
Latte-only. None = backend default (paragraph-level); True locks paragraph-level; False opts into token-level precision.
heuristic_chunking (boolean | None, optional; default: None)
Latte-only. Heuristic splitter (paragraphs, code blocks) instead of fixed-size chunks.
disable_placeholders (boolean | None, optional; default: None)
Latte-only. Skip the [...] placeholders inserted where content was dropped.
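The two ranges of target_compression_ratio can be illustrated with some back-of-the-envelope arithmetic. This sketch assumes one plausible reading of the parameter: a value in (0, 1] is the fraction of tokens to remove, and a value above 1 is an Nx reduction factor. The function is hypothetical and only estimates a target; the Models reference is the authority on the exact semantics.

```python
def expected_kept_tokens(original_tokens, ratio):
    """Estimate the token budget implied by target_compression_ratio.

    Assumption: 0 < ratio <= 1 means "remove this fraction",
    ratio > 1 means "shrink by this factor".
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio <= 1:
        return round(original_tokens * (1 - ratio))
    return round(original_tokens / ratio)
```

Under this reading, a 1,000-token context with ratio 0.5 targets about 500 tokens, and the same context with ratio 4 targets about 250.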

Response

compress() returns a typed object — access fields as attributes (result.data.compressed_context). Response field names stay snake_case across every SDK.

CompressionResponse
  • data (object)
    • compressed_context (string): The compressed text, ready to drop into your prompt.
    • original_tokens (integer): Token count of the input context (tiktoken cl100k).
    • compressed_tokens (integer): Token count of the compressed output.
    • tokens_saved (integer): original_tokens − compressed_tokens.
    • actual_compression_ratio (number): Fraction of input tokens actually removed (0–1). For example, 0.5 means about 50% removed.
    • duration_ms (integer): Server-side wall-clock time for the compression pass.
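The relationship between these fields can be recomputed locally, which is handy for logging or sanity checks. The sketch below derives the savings from original_tokens and compressed_tokens using the definitions listed above; the helper name savings is hypothetical.

```python
def savings(data):
    """Return (tokens_saved, removed_fraction) recomputed from the token counts."""
    saved = data.original_tokens - data.compressed_tokens
    # Guard against an empty input context to avoid dividing by zero.
    fraction = saved / data.original_tokens if data.original_tokens else 0.0
    return saved, fraction
```

Passing result.data from a compress() call should give values that agree with tokens_saved and, up to rounding, actual_compression_ratio.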

4. Stream

client.compress_stream(...) returns an iterator yielding {content, done} chunks as the model produces them; the final chunk has done=True and empty content. Use it anywhere time-to-first-token matters (UIs, agent loops); for one-shot calls stick with compress.

The iterator is a normal generator — wrap it in itertools.islice, push chunks through a queue, or consume from a worker thread. Same context / query / compression_model_name rules as compress().
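A small sketch of draining the stream into a single string. The text above does not say whether chunks are dicts or objects, so the hypothetical _field helper accepts either; the stop condition follows the documented done=True final chunk.

```python
def _field(chunk, name):
    """Read a field from a chunk that may be a dict or an attribute-style object."""
    if isinstance(chunk, dict):
        return chunk[name]
    return getattr(chunk, name)


def collect_stream(chunks):
    """Join streamed content until the chunk marked done arrives."""
    parts = []
    for chunk in chunks:
        if _field(chunk, "done"):
            break
        parts.append(_field(chunk, "content"))
    return "".join(parts)
```

In a UI you would typically render each content piece as it arrives instead of joining; collect_stream(client.compress_stream(...)) is the simplest way to check that streaming and compress() agree.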

5. Batch

client.compress_batch(...) compresses many contexts in one request. Pass contexts: list[str] plus either a single queries: str (applied to every context) or a queries: list[str] matching contexts in length. Cheaper than firing N concurrent compress() calls — ideal for RAG re-ranking or bulk document processing.

queries must take one of those two forms; mixing them raises ValidationError. Per-item results carry the same fields as a single compress() call except target_compression_ratio (request-level only). The envelope also exposes aggregates: result.data.count, total_original_tokens, total_compressed_tokens, total_tokens_saved, average_compression_ratio.
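The pairing rule for queries can be checked locally before sending a batch. This is a sketch of a client-side pre-check, not the SDK's own validation: it raises a plain ValueError, and the helper name pair_queries is hypothetical.

```python
def pair_queries(contexts, queries):
    """Pair each context with its query, mirroring the batch rule described above."""
    if isinstance(queries, str):
        # A single string applies to every context.
        return [(context, queries) for context in contexts]
    queries = list(queries)
    if len(queries) != len(contexts):
        raise ValueError(
            f"got {len(queries)} queries for {len(contexts)} contexts; lengths must match"
        )
    return list(zip(contexts, queries))
```

Running this first turns a length mismatch into an immediate local error instead of a round trip to the API.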

6. Agent client

Construct CompressionClient with llm= and you get an agent surface — three call-shapes (Anthropic-style messages.create, OpenAI-style chat.completions.create, native run) that auto-compress every tool output above min_tokens before the LLM sees it. Behind all three sits LangChain 1.0's create_agent + the SDK's CompresrToolMiddleware. Use it as a drop-in for anthropic.Anthropic() / openai.OpenAI(); for raw (context, query) calls stick with compress.

These surfaces are SDK-shaped and have no direct cURL equivalent. The underlying compression is still the same /api/compress/question-specific/ endpoint — it's what the middleware fires whenever a tool returns.

Construct with llm=

Provider lives on the client; model lives at the call site. Swap providers by changing one string — same tools, same code:

The llm string accepts "anthropic" (provider only — every call must pass model="..."), "anthropic:claude-haiku-4-5" (default model, overridable at call site), or "anthropic/claude-haiku-4-5" (Vercel AI SDK convention; both separators accepted). If neither provides a model, the SDK raises CompresrError("model is required …").
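The resolution rule above can be sketched in a few lines. This is an illustration of the described behaviour, not the SDK's parser: the function names are hypothetical, and a plain ValueError stands in for CompresrError.

```python
def split_llm(spec):
    """Split "provider", "provider:model", or "provider/model" into (provider, model)."""
    for index, char in enumerate(spec):
        if char in ":/":
            return spec[:index], spec[index + 1:] or None
    return spec, None


def resolve_model(spec, call_model=None):
    """Pick the model for a call: the call-site value overrides the client default."""
    provider, default_model = split_llm(spec)
    model = call_model or default_model
    if model is None:
        raise ValueError("model is required")  # the SDK raises CompresrError here
    return provider, model
```

For example, resolve_model("anthropic:claude-haiku-4-5") yields the client default, while passing a second argument overrides it for that call.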

Three call shapes

messages.create duck-types anthropic.types.Message, chat.completions.create duck-types openai.types.chat.ChatCompletion, and run returns a native NormalizedResult (.text, .tool_uses, .citations, .stop_reason, .usage).
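Because the three call shapes return differently shaped objects, a small adapter can be useful when code needs to accept any of them. The sketch below relies only on the public shapes named above (choices[0].message.content for OpenAI-style results, a list of typed content blocks for Anthropic-style results, and .text for NormalizedResult); the helper name extract_text is hypothetical.

```python
def extract_text(result):
    """Return the assistant text from any of the three result shapes."""
    if hasattr(result, "choices"):
        # OpenAI-style ChatCompletion
        return result.choices[0].message.content or ""
    content = getattr(result, "content", None)
    if isinstance(content, list):
        # Anthropic-style Message: keep text blocks, skip tool-use blocks
        return "".join(
            block.text for block in content if getattr(block, "type", None) == "text"
        )
    # Native NormalizedResult
    return result.text
```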

Python also exposes async variants: acreate, arun. TypeScript is async by default.

Web search — WebSearchTool

Backed by Tavily (default) or Brave. The returned object is a real LangChain BaseTool; its output flows through CompresrToolMiddleware automatically.

Why not Anthropic / OpenAI / Gemini server search?

Provider-native server search tools (web_search_20250305, web_search_preview, google_search) execute server-side and return opaque/encrypted content that Compresr cannot read or compress. Use Tavily or Brave so the result is plaintext. See the Web search guide.

Bring your own tool

Any LangChain @tool-decorated function works. The string return value is compressed before the LLM sees it.

Streaming isn't on the agent layer yet: client.messages.stream(...) and client.chat.completions.stream(...) raise CompresrError('streaming not yet implemented'). The compression-API stream (compress_stream) is unaffected.

Per-call LLM knobs

Forwarded to the underlying chat model: temperature, top_p, top_k, max_tokens, max_output_tokens, stop, stop_sequences, presence_penalty, frequency_penalty, seed, logprobs, top_logprobs. Anything else is silently dropped.
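Since unlisted keyword arguments are dropped without warning, it can help to check a call's kwargs against the list above before sending them. This is a local sketch built from that list; the helper name and the example key reasoning_effort are hypothetical.

```python
FORWARDED_KNOBS = frozenset({
    "temperature", "top_p", "top_k", "max_tokens", "max_output_tokens",
    "stop", "stop_sequences", "presence_penalty", "frequency_penalty",
    "seed", "logprobs", "top_logprobs",
})


def split_llm_kwargs(kwargs):
    """Separate kwargs that will be forwarded from those that would be dropped."""
    forwarded = {key: value for key, value in kwargs.items() if key in FORWARDED_KNOBS}
    dropped = sorted(set(kwargs) - FORWARDED_KNOBS)
    return forwarded, dropped
```

Logging the second return value in development makes a misspelled or unsupported knob visible instead of silently ignored.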

Gemini aliasing

When provider == "google_genai", the SDK automatically renames max_tokens → max_output_tokens. Pass max_tokens with any provider and the SDK handles the rename.
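The rename amounts to a small dictionary rewrite, sketched below. The function name is hypothetical, and the behaviour when both keys are passed (the explicit max_output_tokens wins) is an assumption, not something the text above specifies.

```python
def alias_for_provider(provider, kwargs):
    """Return a copy of kwargs with max_tokens renamed for google_genai."""
    out = dict(kwargs)
    if provider == "google_genai" and "max_tokens" in out:
        value = out.pop("max_tokens")
        # Assumption: an explicitly passed max_output_tokens takes precedence.
        out.setdefault("max_output_tokens", value)
    return out
```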

Compression knobs — compression={...}

Set at client construction. Applies to every tool-output compression the middleware fires.

| Key | Default | Effect |
| --- | --- | --- |
| target_compression_ratio | 0.5 | 0–1 removal strength; >1 = Nx factor (same as the compress argument). |
| min_tokens | 200 | Tool outputs shorter than this skip compression. |
| coarse | server default (True) | Paragraph-level vs token-level. |
| compression_model_name | "latte_v1" | Backend validates; latte_v1 is the only public model. |
| allow_tools | None | Whitelist of tool names to compress. |
| ignore_tools | None | Blacklist of tool names to leave untouched. |
| on_error | "passthrough" | Set to "raise" to fail loudly on backend errors instead of returning the original tool output. |
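How min_tokens, allow_tools, and ignore_tools might combine into a per-tool decision is sketched below. This is an interpretation, not the middleware's code: the precedence (ignore list first, then allow list, then the size threshold) and the treatment of an output exactly at min_tokens are assumptions.

```python
def should_compress(tool_name, token_count, *, min_tokens=200,
                    allow_tools=None, ignore_tools=None):
    """Decide whether a tool output would be compressed under the knobs above."""
    if ignore_tools is not None and tool_name in ignore_tools:
        return False  # blacklisted tools are left untouched
    if allow_tools is not None and tool_name not in allow_tools:
        return False  # a whitelist is set and this tool is not on it
    return token_count >= min_tokens  # short outputs skip compression
```

A helper like this is useful in tests to predict which tool outputs reach the compression endpoint for a given configuration.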

7. Async

compress_async and compress_batch_async are the async twins of compress and compress_batch — same params, return awaitables. Streaming is sync-only (no compress_stream_async). Call await client.aclose() when done to release the httpx pool — or use the client as an async context manager (async with CompressionClient(...) as client:). Use these inside event loops (FastAPI handlers, Discord bots, agent runtimes); for scripts the sync methods are simpler.
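A sketch of the async pattern described above: several compress_async calls run concurrently, and the client is closed in a finally block so the pool is released even on failure. The function takes the client as a parameter and its name is hypothetical; for many contexts against one query, compress_batch_async is the cheaper option per the Batch section.

```python
import asyncio


async def compress_then_close(client, jobs):
    """Run compress_async for each (context, query) pair, then release the client."""
    try:
        results = await asyncio.gather(*(
            client.compress_async(
                context=context,
                query=query,
                compression_model_name="latte_v1",
            )
            for context, query in jobs
        ))
        # gather preserves input order, so results line up with jobs.
        return [result.data.compressed_context for result in results]
    finally:
        await client.aclose()
```

Inside a long-lived service you would keep the client open across requests and call aclose() once at shutdown instead.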

8. Errors & types

Every Compresr error inherits from CompresrError. Catch the base for a single handler; catch subclasses when recovery differs.

  • AuthenticationError (401). Missing, malformed, or revoked key. Rotate it.
  • RateLimitError (429). Carries a retry_after attribute (seconds). Back off and retry.
  • ValidationError (400 / 422). Request body failed validation (e.g. target_compression_ratio out of range, missing query). Fix the payload.
  • CompresrError (base class). Network errors, 5xx, or anything unexpected.

Always handle 429

The default tier has tight per-minute limits. A retry loop with exponential backoff (respecting retry_after) is the single most important piece of error handling for production.
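One way to write such a retry loop is sketched below. It is deliberately generic: the exception class to retry on is passed in (RateLimitError in real use), the delay doubles on each attempt, and a retry_after value on the exception raises the delay when it is larger. The function name, defaults, and injectable sleep parameter are choices made for this illustration.

```python
import time


def with_backoff(call, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on `retry_on` with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            retry_after = getattr(exc, "retry_after", None)
            if retry_after:
                # Never retry sooner than the server asked for.
                delay = max(delay, float(retry_after))
            sleep(delay)
```

In production you would typically wrap the compress call in a lambda, pass RateLimitError as retry_on, and consider adding random jitter to the delay so that many workers do not retry in lockstep.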