Batch compression

Compress many contexts in a single round trip - one HTTP call, one billing transaction, friendlier rate-limit footprint.

compress_batch takes a list of contexts and compresses them all in one HTTP call. Use it whenever you have a list of contexts that share the same compression model - typically RAG re-ranking against a single user question, or independent documents each with their own query.

This guide covers when batch beats N parallel single calls, the two shapes the queries field can take, the response envelope, and how batching interacts with concurrency.

Why batch beats N parallel calls

Firing compress() N times in parallel works, but it costs more than it should. A single batch gives you:

One HTTP round trip instead of N - TLS handshake, request parsing, and response framing happen once.
One billing transaction in usage logs instead of N - cleaner audit, lower per-call overhead.
Friendlier rate-limit footprint - a batch of 10 counts as a single request against your per-second limit, not 10.
Simpler retry logic - one retry decision for the whole batch instead of partial state across N in-flight requests.

The tradeoff: a batch is atomic. If any single context fails validation, the whole request fails with 422 Unprocessable Entity. For heterogeneous or untrusted inputs, sanitize first, or fall back to parallel single calls so one bad item doesn't invalidate the rest.
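
The fallback described above can be sketched roughly as follows. This is an illustration, not SDK code: `batch_fn` and `single_fn` are hypothetical callables standing in for your own wrappers around compress_batch and compress, and the single calls run sequentially here for clarity rather than in parallel.

```python
from typing import Callable, List, Optional, Sequence


def compress_with_fallback(
    contexts: Sequence[str],
    query: str,
    batch_fn: Callable[[Sequence[str], str], List[str]],
    single_fn: Callable[[str, str], str],
) -> List[Optional[str]]:
    """Try one atomic batch; if it is rejected, retry item by item.

    batch_fn and single_fn are hypothetical wrappers supplied by the caller.
    """
    try:
        return list(batch_fn(contexts, query))
    except Exception:
        # Assumption: any exception means the whole batch was rejected.
        # In real code, catch only the SDK's validation / 422 error type.
        pass

    results: List[Optional[str]] = []
    for context in contexts:
        try:
            results.append(single_fn(context, query))
        except Exception:
            results.append(None)  # keep the slot so indexes still line up
    return results
```

Failed items come back as None in their original position, so the caller can decide whether to drop, log, or repair them.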

Batches are capped at 100 inputs per request. Chunk larger jobs and dispatch the chunks in parallel (see Scaling beyond a single batch).

Python also exposes compress_batch_async with an identical signature; TypeScript's compressBatch is already async and returns a Promise.

Same query for all contexts

The most common batch shape is RAG re-ranking: a handful of retrieved chunks, a single user question, each chunk filtered against that question. Pass queries as a single string and it's applied to every context.
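
A rough sketch of this shape follows. The method name and the contexts and queries parameters come from this guide, but client construction is not covered here, so the function takes the client as an argument. Dictionary-style access on the response is an assumption, and `StubClient` is a hypothetical stand-in that exists only so the snippet runs offline.

```python
from typing import Any, Dict, List


def compress_for_question(client: Any, chunks: List[str], question: str) -> List[str]:
    """Compress every retrieved chunk against the same user question."""
    # A single string for `queries` is applied to every context.
    response = client.compress_batch(contexts=chunks, queries=question)
    # Assumption: the envelope supports dict-style access.
    return [item["compressed_context"] for item in response["results"]]


class StubClient:
    """Hypothetical stand-in for the real client: no network, no real compression."""

    def compress_batch(self, contexts: List[str], queries: str) -> Dict[str, Any]:
        # Fake "compression": keep only the first sentence of each context.
        return {"results": [{"compressed_context": c.split(".")[0]} for c in contexts]}


chunks = [
    "Paris is the capital of France. It is large.",
    "Berlin is in Germany. It is cold.",
]
compressed = compress_for_question(StubClient(), chunks, "What is the capital of France?")
```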

Different query per context

Pass queries as a list, the same length as contexts, to compress each context against its own question. Useful when you have a batch of independent documents, each with its own intent.
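
One way to keep the two lists aligned is to build them from a single list of records. The sketch below only prepares the keyword arguments; `Doc` and `build_batch_kwargs` are illustrative names, not part of the SDK.

```python
from typing import Dict, List, NamedTuple


class Doc(NamedTuple):
    """Hypothetical record pairing a document with its own question."""
    text: str
    question: str


def build_batch_kwargs(docs: List[Doc]) -> Dict[str, List[str]]:
    """Split records into aligned `contexts` and `queries` lists."""
    return {
        "contexts": [d.text for d in docs],
        "queries": [d.question for d in docs],  # queries[i] pairs with contexts[i]
    }


docs = [
    Doc("Invoice 1042 is due on 1 March.", "When is the invoice due?"),
    Doc("The warranty covers parts for two years.", "How long is the warranty?"),
]
kwargs = build_batch_kwargs(docs)
# These would then be passed along as client.compress_batch(**kwargs).
```

Deriving both lists from the same records means they cannot drift out of step, which avoids the length mismatch that the SDK rejects.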

Pair form: `inputs=[{context, query}, ...]`

The SDK accepts a second, wire-level form: an inputs list of {context, query} pairs. Prefer contexts + queries for a re-ranking shape (many contexts, one query, or aligned lists). Prefer inputs= for heterogeneous per-item queries, or when your data already comes as a list of pairs. The two forms are mutually exclusive - pass one or the other.
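
To illustrate how the two forms relate, the hypothetical helper below turns contexts plus queries (in either shape) into the equivalent list of pairs. The dictionary keys context and query follow the heading above; the helper itself is not an SDK function.

```python
from typing import Dict, List, Sequence, Union


def to_inputs(
    contexts: Sequence[str],
    queries: Union[str, Sequence[str]],
) -> List[Dict[str, str]]:
    """Convert `contexts` + `queries` into a list of {context, query} pairs."""
    if isinstance(queries, str):
        # One shared query: repeat it for every context.
        queries = [queries] * len(contexts)
    if len(queries) != len(contexts):
        raise ValueError("queries must be a string or match contexts in length")
    return [{"context": c, "query": q} for c, q in zip(contexts, queries)]


shared = to_inputs(["chunk A", "chunk B"], "What changed?")
per_item = to_inputs(["doc A", "doc B"], ["Who signed?", "When does it expire?"])
# Either list could then be sent as client.compress_batch(inputs=...).
```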

Client-side validation before any HTTP call

The SDK raises ValidationError client-side, before the request goes out, in two cases: (1) you pass both contexts and inputs (or neither), and (2) queries is a list whose length doesn't match contexts. Because the request never leaves the client, there is no 422 round trip. Retrying will not help; fix the call instead.
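
The two checks can be mirrored in your own code, for example to validate a job before queuing it. This is a sketch of the rules as stated above, not the SDK's implementation; `check_batch_args` is a hypothetical name, and it raises ValueError as a stand-in for the SDK's ValidationError.

```python
from typing import Dict, List, Optional, Sequence, Union


def check_batch_args(
    contexts: Optional[Sequence[str]] = None,
    queries: Union[str, Sequence[str], None] = None,
    inputs: Optional[List[Dict[str, str]]] = None,
) -> None:
    """Raise ValueError if the arguments break either rule described above."""
    # Rule 1: exactly one of `contexts` or `inputs` must be given.
    if (contexts is None) == (inputs is None):
        raise ValueError("pass exactly one of contexts or inputs")

    # Rule 2: a list of queries must have one entry per context.
    if contexts is not None and queries is not None and not isinstance(queries, str):
        if len(queries) != len(contexts):
            raise ValueError(
                f"got {len(queries)} queries for {len(contexts)} contexts"
            )
```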

Response shape

compress_batch returns one envelope with a results array and aggregate totals. Each entry in results carries the same fields as a single compress call, except target_compression_ratio, which is request-level and applies to every item. Per-item fields: original_context, compressed_context, original_tokens, compressed_tokens, actual_compression_ratio, tokens_saved, duration_ms.

The envelope also has count (number of items) and aggregate token counts summed across the batch: total_original_tokens, total_compressed_tokens, total_tokens_saved. average_compression_ratio is the mean of per-item ratios, not a sum.
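
The difference between the summed totals and the averaged ratio can be checked with a short sketch. It recomputes the envelope fields from per-item results, assuming each item is a dict keyed by the field names listed above. The numbers are made up, and the ratio values are placeholders: the sketch does not assume how the API derives them.

```python
from statistics import fmean
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Rebuild the aggregate envelope fields from a non-empty results list."""
    return {
        "count": len(results),
        # Token counts are summed across the batch.
        "total_original_tokens": sum(r["original_tokens"] for r in results),
        "total_compressed_tokens": sum(r["compressed_tokens"] for r in results),
        "total_tokens_saved": sum(r["tokens_saved"] for r in results),
        # The ratio is averaged per item, not summed.
        "average_compression_ratio": fmean(
            [r["actual_compression_ratio"] for r in results]
        ),
    }


sample = [
    {"original_tokens": 100, "compressed_tokens": 50, "tokens_saved": 50,
     "actual_compression_ratio": 0.5},
    {"original_tokens": 400, "compressed_tokens": 100, "tokens_saved": 300,
     "actual_compression_ratio": 0.25},
]
totals = summarize(sample)
```

Because the average is taken over items, a short context and a long one weigh the same in average_compression_ratio, while the token totals are dominated by the long one.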

See the batch endpoint reference for the full response schema, and the pricing estimate endpoint for batch size and cost limits.

Batch-level knobs

Every knob on single compress also works on compress_batch and applies uniformly to every item: target_compression_ratio, coarse, heuristic_chunking, disable_placeholders, plus the latte_v2 adaptive trio: dynamic, dynamic_min_ratio, and dynamic_max_ratio. Semantics match the single-call form; see Single-call compression for details. To vary a knob per item, fall back to parallel single calls.

Scaling beyond a single batch

If you have more contexts than fit in a single batch, dispatch a handful of batch requests in parallel rather than one giant one. The sweet spot is usually a few dozen contexts per batch, run with a small concurrency limit (e.g. 4-8 batches in flight) - that keeps each request fast enough to retry cheaply while still amortizing HTTP overhead.
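
A minimal sketch of that pattern using only the standard library. `batch_fn` is a hypothetical callable that wraps compress_batch for one chunk. The defaults of 32 contexts per batch and 4 batches in flight are illustrative picks from the ranges suggested above, not values prescribed by the API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

MAX_BATCH_SIZE = 100  # per-request cap stated in this guide


def compress_in_batches(
    contexts: Sequence[str],
    query: str,
    batch_fn: Callable[[List[str], str], List[str]],
    batch_size: int = 32,
    max_in_flight: int = 4,
) -> List[str]:
    """Split contexts into chunks and run a few batch calls concurrently."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")

    chunks = [
        list(contexts[start:start + batch_size])
        for start in range(0, len(contexts), batch_size)
    ]

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Executor.map yields results in submission order, so the output
        # stays aligned with the input even if chunks finish out of order.
        per_chunk = list(pool.map(lambda chunk: batch_fn(chunk, query), chunks))

    # Flatten the per-chunk results back into one list.
    return [item for chunk_result in per_chunk for item in chunk_result]
```

If any chunk raises, the exception surfaces when its result is read, so a production version would likely wrap `batch_fn` with per-chunk retry handling.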

When to use batch vs parallel single calls

| Use batch | Use parallel `compress()` calls |
| --- | --- |
| Same model + ratio across all contexts | Per-context model or ratio |
| Want one rate-limit-friendly request | Want independent retry per context |
| OK with all-or-nothing failure | Want partial results if some calls fail |
| Optimizing for cost and latency at scale | Long contexts where total payload would exceed batch size limits |