Batch compression
Compress many contexts in a single round trip - one HTTP call, one billing transaction, friendlier rate-limit footprint.
compress_batch takes a list of contexts and compresses them all in one HTTP call. Use it whenever the contexts share the same compression model and ratio - typically RAG re-ranking of retrieved chunks against a single user question, or a set of independent documents, each with its own query.
This guide covers when batch beats N parallel single calls, the two shapes the queries field can take, the response envelope, and how batching interacts with concurrency.
Why batch beats N parallel calls
Firing compress() N times in parallel works, but it costs more than it should. A single batch gives you:
- One HTTP round trip instead of N - TLS handshake, request parsing, and response framing happen once.
- One billing transaction in usage logs instead of N - cleaner audit, lower per-call overhead.
- Friendlier rate-limit footprint - a batch of 10 counts as a single request against your per-second limit, not 10.
- Simpler retry logic - one retry decision for the whole batch instead of partial state across N in-flight requests.
The tradeoff: a batch is atomic. If any single context fails validation, the whole request fails (422 Unprocessable). For heterogeneous or untrusted inputs, sanitize first, or fall back to parallel single calls so one bad item doesn't invalidate the rest.
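The fallback described above can be sketched as follows. This is a minimal illustration, not the SDK's own behavior: `BatchValidationError`, `batch_fn`, and `single_fn` are hypothetical stand-ins for whatever exception and client calls your integration exposes for a 422, for compress_batch, and for compress.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional, Sequence


class BatchValidationError(Exception):
    """Hypothetical stand-in for the error your client raises on a 422."""


def compress_with_fallback(
    contexts: Sequence[str],
    query: str,
    batch_fn: Callable[[Sequence[str], str], List[Any]],
    single_fn: Callable[[str, str], Any],
    max_workers: int = 4,
) -> List[Optional[Any]]:
    """Try one atomic batch; on validation failure, retry item by item.

    Returns one entry per context, in input order. Items that fail
    validation on their own come back as None instead of sinking the rest.
    """
    try:
        return list(batch_fn(contexts, query))
    except BatchValidationError:
        pass  # At least one item is invalid; isolate it below.

    def safe_single(context: str) -> Optional[Any]:
        try:
            return single_fn(context, query)
        except BatchValidationError:
            return None

    # pool.map preserves input order, so results line up with contexts.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_single, contexts))
```

The fallback only triggers on validation errors, which are the case where retrying the same batch cannot succeed. Transient failures such as timeouts are better handled by retrying the batch as a whole.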
Same query for all contexts
The most common batch shape is RAG re-ranking: a handful of retrieved chunks, a single user question, each chunk filtered against that question. Pass queries as a single string and it's applied to every context.
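As a rough sketch of the shared-query shape, the request body might be assembled like this. Only the `contexts` and `queries` field names come from this guide. The helper name and sample text are illustrative, and the real request may need further fields (model, ratio, and so on) that the batch endpoint reference defines.

```python
import json
from typing import Dict, List, Sequence, Union


def build_batch_payload(
    contexts: Sequence[str], queries: Union[str, Sequence[str]]
) -> Dict[str, Union[str, List[str]]]:
    """Assemble the contexts/queries part of a batch request body.

    Hypothetical helper: it only shapes the two fields named in this guide.
    """
    if isinstance(queries, str):
        # A single string is sent as-is and applied to every context.
        return {"contexts": list(contexts), "queries": queries}
    return {"contexts": list(contexts), "queries": list(queries)}


# Illustrative retrieved chunks for one user question.
retrieved_chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded.",
]
payload = build_batch_payload(retrieved_chunks, "What is the refund policy?")
print(json.dumps(payload, indent=2))
```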
Different query per context
Pass queries as a list, the same length as contexts, to compress each context against its own question. Useful when you have a batch of independent documents, each with its own intent.
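One way to keep the two lists aligned is to start from (document, question) pairs and split them just before sending. This is a sketch under assumptions: the `contexts` and `queries` field names are from this guide, while the helper and the sample data are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def payload_from_pairs(pairs: Sequence[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split (context, query) pairs into two parallel lists.

    Building both lists from the same pairs guarantees that
    queries[i] belongs to contexts[i] and that the lengths match.
    """
    contexts = [context for context, _ in pairs]
    queries = [query for _, query in pairs]
    return {"contexts": contexts, "queries": queries}


# Illustrative independent documents, each with its own intent.
jobs = [
    ("Full text of the Q3 earnings call ...", "What guidance was given for Q4?"),
    ("Full text of the incident postmortem ...", "What was the root cause?"),
]
per_context_payload = payload_from_pairs(jobs)
```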
contexts and queries lengths must match
When queries is a list, it must be the same length as contexts. A mismatch returns 422 Unprocessable. Mixing the two shapes (e.g. part list, part string) is also rejected with a ValidationError. Neither failure is transient: fix the request instead of retrying it.
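Because a mismatch can never succeed on retry, it can be worth catching locally before the request goes out. The check below is a client-side sketch of the rule stated above. It raises a plain `ValueError` and does not claim to mirror the server's exact validation.

```python
from typing import Sequence, Union


def check_batch_inputs(
    contexts: Sequence[str], queries: Union[str, Sequence[str]]
) -> None:
    """Raise ValueError if the inputs would violate the length rule."""
    if isinstance(queries, str):
        return  # One shared query: valid for any number of contexts.
    if not all(isinstance(query, str) for query in queries):
        raise ValueError("queries must be a string or a list of strings")
    if len(queries) != len(contexts):
        raise ValueError(
            f"got {len(contexts)} contexts but {len(queries)} queries; "
            "pass one query per context or a single shared string"
        )
```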
Response shape
compress_batch returns one envelope with a results array and aggregate totals. Each entry in results has the same fields as a single compress call. The aggregate fields (total_original_tokens, total_compressed_tokens, total_tokens_saved) are summed across the batch so you can log one number per batch instead of looping.
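To illustrate logging one number per batch, the sketch below reads the aggregate fields from an envelope represented as a plain dict. The aggregate field names and `results` come from this guide. The token counts are invented for the example, and the per-entry fields are left empty because they are defined in the endpoint reference, not here.

```python
from typing import Any, Dict


def summarize_batch(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a batch envelope into a single log-friendly record."""
    original = envelope["total_original_tokens"]
    saved = envelope["total_tokens_saved"]
    return {
        "contexts": len(envelope["results"]),
        "original_tokens": original,
        "compressed_tokens": envelope["total_compressed_tokens"],
        "tokens_saved": saved,
        # Guard against an empty batch to avoid dividing by zero.
        "saved_fraction": round(saved / original, 4) if original else 0.0,
    }


# Made-up envelope; real entries in results carry the single-call fields.
example_envelope = {
    "results": [{}, {}],
    "total_original_tokens": 1000,
    "total_compressed_tokens": 400,
    "total_tokens_saved": 600,
}
summary = summarize_batch(example_envelope)
```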
See the batch endpoint reference for the full response schema, and the pricing estimate endpoint for batch size and cost limits.
Scaling beyond a single batch
If you have more contexts than fit in a single batch, dispatch a handful of batch requests in parallel rather than one giant request. The sweet spot is usually a few dozen contexts per batch, run with a small concurrency limit (e.g. 4-8 batches in flight) - that keeps each request fast enough to retry cheaply while still amortizing HTTP overhead.
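A minimal sketch of that pattern, using a thread pool to cap the number of batches in flight. `batch_fn` is a hypothetical callable standing in for your compress_batch call, and the defaults (32 contexts per batch, 4 in flight) are just one point inside the ranges suggested above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def compress_in_batches(
    contexts: Sequence[str],
    query: str,
    batch_fn: Callable[[Sequence[str], str], Any],
    batch_size: int = 32,
    max_in_flight: int = 4,
) -> List[Any]:
    """Split contexts into batches and run a bounded number concurrently.

    Returns one envelope per batch, in the same order as the input.
    """
    batches = list(chunked(contexts, batch_size))
    # max_workers is the concurrency limit: at most that many
    # batch requests are in flight at any moment.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(lambda batch: batch_fn(batch, query), batches))
```

Because each batch is still atomic, an exception from any one batch propagates when its result is collected. Wrapping `batch_fn` with your own retry logic keeps a single failed batch from discarding the others.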
When to use batch vs parallel single calls
| Use batch | Use parallel compress() calls |
|---|---|
| Same model + ratio across all contexts | Per-context model or ratio |
| Want one rate-limit-friendly request | Want independent retry per context |
| OK with all-or-nothing failure | Want partial results if some calls fail |
| Optimizing for cost and latency at scale | Long contexts where the total payload would exceed batch size limits |