Models
The Compresr compression model surface (`latte_v1`), its parameters, and the canonical meaning of target_compression_ratio.
Compresr exposes one query-specific compression model on the public API: latte_v1. This page is the canonical reference for the model's parameters and for target_compression_ratio. Every endpoint and SDK that accepts a compression ratio follows the value semantics defined below — no other page redefines the bounds.
latte_v1
Query-specific compression. You supply a context and a query; latte_v1 scores spans of the context against the query and keeps only the spans that carry signal for it. Tokens that don't help answer the query are dropped.
Reach for latte_v1 whenever you have a clear intent for the compressed output (a question, a tool description, a routing decision). Without a query, the model has no signal for what to preserve — query is required.
The model is exposed on these endpoints:
- POST /compress/question-specific/ (single compression)
- POST /compress/question-specific/stream (SSE stream)
- POST /compress/question-specific/batch (up to 100 rows per call)
In both official SDKs, the model is selected with compression_model_name="latte_v1" (Python) or compressionModelName: 'latte_v1' (TypeScript).
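The batch endpoint accepts up to 100 rows per call, so larger workloads have to be split on the client. The sketch below shows one way to do that. The helper name iter_batches and the row shape (a dict with context and query) are illustrative assumptions; the batch request schema is not described on this page.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

MAX_BATCH_ROWS = 100  # documented limit for POST /compress/question-specific/batch


def iter_batches(rows: Sequence[T], size: int = MAX_BATCH_ROWS) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical row shape, used here only to have something to split.
rows = [{"context": f"document {i}", "query": "what changed?"} for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(rows)]
print(batch_sizes)  # [100, 100, 50]
```

Each yielded slice can then be sent as one batch call.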
Parameters
These are the parameters accepted by latte_v1 across the Python SDK, the TypeScript SDK, and the raw HTTP API. The wire format is always snake_case; the TypeScript SDK accepts the camelCase form of each name (for example, compressionModelName). Required parameters must be supplied on every request; optional parameters fall back to the model defaults documented below.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| context | string | Required | – | The text to compress against the query. |
| query | string | Required | – | latte_v1 keeps only spans of context that help answer this query, so it cannot be empty. |
| compression_model_name | "latte_v1" | Required | – | Must be "latte_v1"; any other value is rejected with 422 Unprocessable Entity. |
| target_compression_ratio | number | Optional | model default | Interpreted as removal strength when 0 < r ≤ 1 and as an Nx target when r > 1. See target_compression_ratio below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input. |
| coarse | boolean | Optional | true | latte_v1 only. Paragraph-level scoring (the default). Faster and cheaper than the token-level pass. Set to false to opt into token-level precision at the cost of latency. |
| heuristic_chunking | boolean | Optional | false | latte_v1 only. Use a structure-aware splitter (paragraphs, code blocks, markdown sections) instead of the default fixed-size chunker. Helps when input has strong structural boundaries. |
| disable_placeholders | boolean | Optional | false | latte_v1 only. Skip the [...] placeholders the model normally inserts where content was dropped. Useful when you want the output to read as continuous prose. |

target_compression_ratio
target_compression_ratio controls how aggressive the compression is. It is interpreted two different ways depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table.
| Value | Meaning | Example |
|---|---|---|
| 0 < r ≤ 1 | Removal strength | 0.5 removes ~50% of tokens |
| r > 1 | Nx target (max 200) | 4 → ~¼ original |
| omit | Model default | – |
Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
Bounds
r = 0 is rejected with 422 Unprocessable Entity. Values above 200 are rejected with the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.
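Because out-of-range values are rejected rather than clamped, it can be convenient to check the ratio before sending a request. The following is a minimal client-side sketch, not part of either SDK. The function name check_ratio is hypothetical, and treating negative or NaN values as invalid is an assumption: this page only documents r = 0 and r > 200 as rejected.

```python
import math
from typing import Optional

MAX_RATIO = 200  # documented upper bound for the Nx form


def check_ratio(ratio: Optional[float]) -> Optional[float]:
    """Return `ratio` unchanged if it is within the documented bounds.

    None means "omit the field" and is always allowed.
    Raises ValueError for anything the API would be expected to reject.
    """
    if ratio is None:
        return None
    if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
        raise ValueError("target_compression_ratio must be a number")
    if math.isnan(ratio) or ratio <= 0:
        # r = 0 is documented as rejected; negatives and NaN are an assumption.
        raise ValueError("target_compression_ratio must be greater than 0")
    if ratio > MAX_RATIO:
        raise ValueError(f"target_compression_ratio must not exceed {MAX_RATIO}")
    return ratio
```

Calling check_ratio(0.5) or check_ratio(200) returns the value unchanged, while check_ratio(0) and check_ratio(201) raise ValueError before any request is made.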
Not a keep-fraction
target_compression_ratio is removal strength (when 0 < r ≤ 1) or an Nx target (when r > 1) — never a keep-fraction. 0.3 does not mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
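To make the two readings concrete, the sketch below estimates the fraction of tokens that remains for a given ratio, and converts a keep-fraction from another tool into removal strength. Both function names are hypothetical, and the results are approximations of the targets described above, not guarantees about the model's output.

```python
def expected_kept_fraction(ratio: float) -> float:
    """Approximate fraction of tokens that remains for a given ratio.

    0 < r <= 1: removal strength, so about 1 - r remains.
    r > 1:      Nx target, so about 1 / r remains.
    """
    if ratio <= 0 or ratio > 200:
        raise ValueError("ratio is outside the documented bounds")
    if ratio <= 1:
        return 1.0 - ratio
    return 1.0 / ratio


def ratio_from_keep_fraction(keep: float) -> float:
    """Translate a keep-fraction (benchmark convention) into removal strength."""
    if not 0 <= keep < 1:
        raise ValueError("keep must be in [0, 1)")
    return 1.0 - keep


print(expected_kept_fraction(0.3))  # about 0.7: removes ~30%, not "keep 30%"
print(expected_kept_fraction(4))    # 0.25
```

A caller whose existing code thinks in terms of "keep 30%" would pass ratio_from_keep_fraction(0.3), which is about 0.7, as target_compression_ratio.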
Example
A minimal latte_v1 call that uses both query and target_compression_ratio:
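The sketch below builds such a request against the raw HTTP endpoint using only the Python standard library. The host (api.example.com), the API key, and the bearer-token Authorization header are placeholders and assumptions, not taken from this page; the endpoint path and the body fields are the ones documented above. The send step is left commented out because the response schema is not described here.

```python
import json
import urllib.request

# Assumptions: placeholder host and bearer-token auth; substitute real values.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"

payload = {
    "context": (
        "Invoices are due within 30 days. Late payments incur a 2% fee. "
        "Our office is closed on public holidays."
    ),
    "query": "What is the late payment fee?",
    "compression_model_name": "latte_v1",
    "target_compression_ratio": 0.5,  # removal strength: drop ~50% of tokens
}

request = urllib.request.Request(
    url=f"{BASE_URL}/compress/question-specific/",
    data=json.dumps(payload).encode("utf-8"),  # wire format is snake_case JSON
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)

# Uncomment to send the request.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
print(request.get_method(), request.full_url)
```

Passing 2 instead of 0.5 would express roughly the same target in the Nx form (about half the original length).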
When to use latte_v1
latte_v1 shines on query-shaped tasks — workloads where you can name the intent the compressed output has to serve:
- RAG: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
- Agent retrieval: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
- Search-result trimming: collapse a list of long results to the parts relevant to the search query.
For deeper patterns and end-to-end examples, see the query-specific compression guide.