Compresr docs

Batch compression

Compress many contexts in a single round trip - one HTTP call, one billing transaction, friendlier rate-limit footprint.

compress_batch takes a list of contexts and compresses them all in one HTTP call. Use it whenever you have a list of contexts that share the same compression model - typically RAG re-ranking against a single user question, or independent documents each with their own query.

This guide covers when batch beats N parallel single calls, the two shapes the queries field can take, the response envelope, and how batching interacts with concurrency.

Why batch beats N parallel calls

Firing compress() N times in parallel works, but it costs more than it should. A single batch gives you:

  • One HTTP round trip instead of N - TLS handshake, request parsing, and response framing happen once.
  • One billing transaction in usage logs instead of N - cleaner audit, lower per-call overhead.
  • Friendlier rate-limit footprint - a batch of 10 counts as a single request against your per-second limit, not 10.
  • Simpler retry logic - one retry decision for the whole batch instead of partial state across N in-flight requests.

The tradeoff: a batch is atomic. If any single context fails validation, the whole request fails (422 Unprocessable). For heterogeneous or untrusted inputs, sanitize first, or fall back to parallel single calls so one bad item doesn't invalidate the rest.
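
The fallback described above can be sketched as follows. This is a minimal illustration, not the SDK's own behavior: compress_batch and compress are passed in as plain callables standing in for the real client methods, BatchRejected is a hypothetical placeholder for whatever error the client surfaces on a 422 response, and the per-item fallback runs sequentially for brevity.

```python
from typing import Callable, Sequence


class BatchRejected(Exception):
    """Hypothetical stand-in for the error raised on a 422 response."""


def compress_with_fallback(
    contexts: Sequence[str],
    query: str,
    compress_batch: Callable[[Sequence[str], str], list],
    compress: Callable[[str, str], object],
) -> list:
    """Try one atomic batch; if it is rejected, retry item by item.

    Returns one entry per context, in order: either the result, or the
    exception for an item that still fails on its own.
    """
    try:
        return list(compress_batch(contexts, query))
    except BatchRejected:
        outcomes = []
        for context in contexts:
            try:
                outcomes.append(compress(context, query))
            except BatchRejected as exc:
                # Keep the failure in place so callers can see which item was bad.
                outcomes.append(exc)
        return outcomes
```

The fallback trades the single round trip for partial results, so it is only worth taking when the inputs could not be sanitized up front.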

Same query for all contexts

The most common batch shape is RAG re-ranking: a handful of retrieved chunks, a single user question, each chunk filtered against that question. Pass queries as a single string and it's applied to every context.
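
As a rough sketch of this shape, the request body for a shared query might look like the following. Only the contexts and queries field names come from this guide; the helper name and the sample text are illustrative, and any other fields the endpoint requires (such as the compression model) are left out, so check the batch endpoint reference before relying on it.

```python
import json


def shared_query_payload(chunks, question):
    """Build a batch body where one question applies to every chunk."""
    # "queries" is a single string here, not a list.
    return {"contexts": list(chunks), "queries": question}


# Illustrative retrieved chunks; not real data.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded.",
]
payload = shared_query_payload(chunks, "What is the refund window?")
print(json.dumps(payload, indent=2))
```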

Different query per context

Pass queries as a list, the same length as contexts, to compress each context against its own question. Useful when you have a batch of independent documents, each with its own intent.
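
One way to keep the two lists aligned is to start from (document, question) pairs and split them just before building the request. This is a sketch under the same assumptions as the rest of this guide: contexts and queries are the field names mentioned here, the helper name and sample pairs are made up, and other required fields are omitted.

```python
def paired_payload(pairs):
    """Split (document, question) pairs into two parallel lists."""
    contexts = [document for document, _ in pairs]
    queries = [question for _, question in pairs]
    # Both lists come from the same pairs, so they cannot drift out of step.
    return {"contexts": contexts, "queries": queries}


# Illustrative pairs; not real data.
pairs = [
    ("Q3 revenue grew 12% year over year.", "How did revenue change?"),
    ("The API rate limit is per second.", "How is the rate limit measured?"),
]
payload = paired_payload(pairs)
```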

contexts and queries length must match

When queries is a list, it must be the same length as contexts. A mismatch returns 422 Unprocessable. Mixing types (e.g. partial-list, partial-string) is also a ValidationError. Neither error is transient: fix the request rather than retrying it, because the same payload will fail again.
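
Because these errors are never transient, it can be cheaper to catch them before sending anything. The check below is a client-side sketch only: check_queries is a hypothetical helper, and reading "mixing types" as "a list containing non-string items" is an assumption about the server's rule, not something this guide spells out.

```python
from typing import Sequence, Union


def check_queries(contexts: Sequence[str], queries: Union[str, Sequence[str]]) -> None:
    """Raise ValueError if queries cannot be paired with contexts."""
    if isinstance(queries, str):
        # A single string is applied to every context; nothing to match up.
        return
    if not all(isinstance(query, str) for query in queries):
        raise ValueError("queries must be one string or a list of strings")
    if len(queries) != len(contexts):
        raise ValueError(
            f"got {len(queries)} queries for {len(contexts)} contexts; "
            "the lengths must match"
        )


check_queries(["doc a", "doc b"], "shared question")     # passes
check_queries(["doc a", "doc b"], ["q for a", "q for b"])  # passes
```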

Response shape

compress_batch returns one envelope with a results array and aggregate totals. Each entry in results has the same fields as a single compress call. The aggregate fields (total_original_tokens, total_compressed_tokens, total_tokens_saved) are summed across the batch so you can log one number per batch instead of looping.
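
Logging one number per batch could then look like the sketch below. It assumes the envelope is available as a plain dict keyed by the field names above (an SDK may expose them as attributes instead), and the sample values are invented for illustration, with per-item fields elided.

```python
def summarize_batch(envelope: dict) -> str:
    """Format the aggregate totals of one batch as a single log line."""
    original = envelope["total_original_tokens"]
    compressed = envelope["total_compressed_tokens"]
    saved = envelope["total_tokens_saved"]
    # Guard against an empty batch to avoid dividing by zero.
    percent = 100.0 * saved / original if original else 0.0
    return (
        f"batch of {len(envelope['results'])}: "
        f"{original} -> {compressed} tokens ({saved} saved, {percent:.1f}%)"
    )


# Invented sample envelope; per-item result fields are elided.
sample = {
    "results": [{}, {}, {}],
    "total_original_tokens": 1200,
    "total_compressed_tokens": 450,
    "total_tokens_saved": 750,
}
print(summarize_batch(sample))
# prints: batch of 3: 1200 -> 450 tokens (750 saved, 62.5%)
```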

See the batch endpoint reference for the full response schema, and the pricing estimate endpoint for batch size and cost limits.

Scaling beyond a single batch

If you have more contexts than fit in a single batch, split them into several batch requests and run a handful in parallel. The sweet spot is usually a few dozen contexts per batch with a small concurrency limit (e.g. 4-8 batches in flight): each request stays fast enough to retry cheaply while still amortizing HTTP overhead.
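
A minimal sketch of that pattern, using a thread pool to cap the number of batches in flight. The compress_batch argument is a stand-in callable for the real client method, and the defaults of 32 contexts per batch and 4 workers are illustrative picks from the ranges above, not documented limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive lists of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def compress_many(
    contexts: Sequence[str],
    query: str,
    compress_batch: Callable[[List[str], str], dict],
    batch_size: int = 32,
    max_in_flight: int = 4,
) -> List[dict]:
    """Send contexts as several batches, at most max_in_flight at a time."""
    batches = chunked(contexts, batch_size)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # pool.map yields results in input order, so envelope i matches batch i.
        return list(pool.map(lambda batch: compress_batch(batch, query), batches))
```

Threads are a reasonable fit here because each call spends its time waiting on the network. If one batch raises, pool.map re-raises that exception when its result is reached, so per-batch retry logic would wrap the compress_batch call itself.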

When to use batch vs parallel single calls

Use batch | Use parallel compress() calls
Same model + ratio across all contexts | Per-context model or ratio
Want one rate-limit-friendly request | Want independent retry per context
OK with all-or-nothing failure | Want partial results if some calls fail
Optimizing for cost and latency at scale | Long contexts where total payload would exceed batch size limits