API reference
Models
The Compresr compression model surface — `latte_v1` and `latte_v2`, their shared parameters, the latte_v2-only dynamic mode, and the canonical meaning of target_compression_ratio.
Compresr exposes two query-specific compression models on the public API:
- `latte_v1`: the original GemFilter backbone. Battle-tested, predictable.
- `latte_v2`: the newer Masker backbone. Up to 5x faster than `latte_v1` at the same compression quality, and unlocks a `dynamic` mode that picks the compression ratio per input automatically.
Both consume a context plus a query and return only the spans of context that carry signal for the query. latte_v2's parameter surface is a strict superset of latte_v1's: every knob latte_v1 accepts is also accepted by latte_v2, with the same defaults and semantics. Swapping between them is a single string change to compression_model_name.
This page is the canonical reference for both models' parameters and for target_compression_ratio. Every endpoint and SDK page links back here for the value semantics.
Both models are exposed on the same endpoints:
- `POST /compress/question-specific/`: single compression
- `POST /compress/question-specific/stream`: SSE stream
- `POST /compress/question-specific/batch`: up to 100 rows per call
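Because the batch endpoint caps each call at 100 rows, larger workloads have to be split client-side. The following is a minimal sketch of that split. `chunk_rows` is a hypothetical helper, not part of any SDK, and the batch request body is deliberately left out because its shape is not specified on this page.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Row limit stated for POST /compress/question-specific/batch.
MAX_BATCH_ROWS = 100


def chunk_rows(rows: Sequence[T], size: int = MAX_BATCH_ROWS) -> Iterator[List[T]]:
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if not 1 <= size <= MAX_BATCH_ROWS:
        raise ValueError(f"size must be between 1 and {MAX_BATCH_ROWS}")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


if __name__ == "__main__":
    # 250 placeholder rows split into groups that fit the batch limit.
    groups = list(chunk_rows(list(range(250))))
    print([len(g) for g in groups])
```

Each yielded group can then be sent as one batch call.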
Which model should I use?
| You want… | Pick |
|---|---|
| The simplest possible setup, predictable behavior | latte_v1 with target_compression_ratio |
| Lower latency on the same compression quality | latte_v2 with target_compression_ratio |
| A ratio chosen per-input (mixed-difficulty workloads) | latte_v2 with dynamic=true |
| A hard token budget per request | Either model with a fixed target_compression_ratio |
When in doubt: start with latte_v2 + target_compression_ratio=0.5. It is the fastest path to a working pipeline, and every other knob is incremental.
Supported parameters at a glance
Every parameter, every model. Defaults and shapes match the wire format; the TypeScript SDK accepts the camelCase form (targetCompressionRatio, dynamicMinRatio, …) but everything serializes to snake_case on the wire.
| Parameter | latte_v1 | latte_v2 | Default | Purpose |
|---|---|---|---|---|
| `context` | ✓ required | ✓ required | — | Source text to compress |
| `query` | ✓ required | ✓ required | — | Question/intent that grounds relevance |
| `compression_model_name` | ✓ required | ✓ required | — | `"latte_v1"` or `"latte_v2"` |
| `target_compression_ratio` | ✓ | ✓ (ignored if dynamic=true) | model default | Fixed compression strength |
| `coarse` | ✓ | ✓ | true | Paragraph-level scoring vs token-level |
| `heuristic_chunking` | ✓ | ✓ | false | Structure-aware chunker before scoring |
| `disable_placeholders` | ✓ | ✓ | false | Drop `[...]` markers between kept spans |
| `dynamic` | — | ✓ | false | Kneedle elbow selection (overrides ratio) |
| `dynamic_min_ratio` | — | ✓ (when dynamic=true) | 1.5 | Floor on the chosen Nx ratio |
| `dynamic_max_ratio` | — | ✓ (when dynamic=true) | 10.0 | Ceiling on the chosen Nx ratio |
A — means the parameter is rejected with 422 Unprocessable Entity if you send it to that model.
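The compatibility rules in the table can be mirrored in a client-side pre-flight check, so that a request the API would reject with 422 is caught before it is sent. This is a sketch only: `preflight_errors` is a hypothetical helper, the server remains the authority, and fields that do not appear in the table are left alone because this page does not say how the API treats them.

```python
from typing import Any, Dict, List

REQUIRED_PARAMS = ("context", "query", "compression_model_name")
KNOWN_MODELS = ("latte_v1", "latte_v2")
# Parameters the table marks as rejected on latte_v1.
V2_ONLY_PARAMS = {"dynamic", "dynamic_min_ratio", "dynamic_max_ratio"}


def preflight_errors(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable problems that the table says would trigger a 422."""
    errors = [f"missing required parameter: {name}"
              for name in REQUIRED_PARAMS if name not in payload]

    model = payload.get("compression_model_name")
    if model is None:
        return errors
    if model not in KNOWN_MODELS:
        errors.append(f"unknown compression_model_name: {model!r}")
        return errors

    if model == "latte_v1":
        for name in sorted(V2_ONLY_PARAMS & set(payload)):
            errors.append(f"{name} is not accepted by latte_v1")
    return errors


if __name__ == "__main__":
    bad = {"context": "some text", "query": "a question",
           "compression_model_name": "latte_v1", "dynamic": True}
    print(preflight_errors(bad))
```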
Shared parameters (both models)
These knobs are accepted by latte_v1 and latte_v2 with identical semantics. Reach for them on either model.
- `context` (string, required): Source text to compress.
- `query` (string, required): Question or intent that grounds relevance.
- `compression_model_name` (`"latte_v1"` | `"latte_v2"`, required): `"latte_v1"` → GemFilter backbone, `"latte_v2"` → Masker backbone. Any other value is rejected with 422.
- `target_compression_ratio` (number, optional, default: model default): Removal strength when 0 < r ≤ 1, Nx target when r > 1. See target_compression_ratio below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input. On latte_v2, ignored when dynamic=true.
- `coarse` (boolean, optional, default: true): Paragraph-level scoring. Set to false for token-level precision at the cost of latency.
- `heuristic_chunking` (boolean, optional, default: false): Runs a structure-aware chunker before scoring.
- `disable_placeholders` (boolean, optional, default: false): Drops the `[...]` placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers.

latte_v2-only: dynamic compression ratio
latte_v2 adds a single feature on top of the shared surface: dynamic mode. Instead of pinning a fixed target_compression_ratio, the model inspects the per-span score curve and picks an inflection-point (Kneedle elbow) ratio per input. Short, easy contexts get compressed aggressively; long, dense contexts back off. The chosen ratio is always inside [dynamic_min_ratio, dynamic_max_ratio].
Use it when your inputs vary in difficulty and you would otherwise be guessing one ratio for everything. Stick with a fixed target_compression_ratio when you need a predictable token budget per call.
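To build intuition for elbow selection, the sketch below finds the knee of a descending score curve with a simplified "maximum distance below the chord" rule, then clamps the implied ratio into the configured bounds. It is an illustration of the general idea, not Compresr's implementation: the function name, the equal-length-span assumption used to turn a knee position into an Nx ratio, and the fallback for flat curves are all assumptions made for this example.

```python
from typing import Sequence


def pick_dynamic_ratio(scores: Sequence[float],
                       min_ratio: float = 1.5,
                       max_ratio: float = 10.0) -> float:
    """Pick an Nx ratio from per-span relevance scores via a simplified elbow rule."""
    if min_ratio < 1.0 or max_ratio < min_ratio:
        raise ValueError("need 1.0 <= min_ratio <= max_ratio")

    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    # Too few spans or a flat curve has no usable elbow; fall back to the
    # weakest allowed compression (an assumption of this sketch).
    if n < 3 or ranked[0] == ranked[-1]:
        return min_ratio

    lo, hi = ranked[-1], ranked[0]
    knee, best_gap = 0, float("-inf")
    for i, score in enumerate(ranked):
        x = i / (n - 1)                 # normalized rank in [0, 1]
        y = (score - lo) / (hi - lo)    # normalized score in [0, 1]
        gap = (1.0 - x) - y             # how far the curve sags below the chord
        if gap > best_gap:
            knee, best_gap = i, gap

    kept_spans = max(1, knee)           # spans ranked above the knee are kept
    raw_ratio = n / kept_spans          # assumes spans of roughly equal length
    return min(max(raw_ratio, min_ratio), max_ratio)


if __name__ == "__main__":
    # Three clearly relevant spans followed by a long tail of weak ones.
    scores = [0.9, 0.85, 0.8, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.0]
    print(round(pick_dynamic_ratio(scores), 2))
```

With a sharp drop after a few high-scoring spans the knee lands early and the ratio is high. When the raw value falls outside the bounds, the clamp is what keeps the result inside [dynamic_min_ratio, dynamic_max_ratio].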
- `dynamic` (boolean, optional, default: false): latte_v2 only. When true, the model picks the ratio per-input via Kneedle elbow selection inside [dynamic_min_ratio, dynamic_max_ratio] and target_compression_ratio is ignored. Rejected on latte_v1 with 422.
- `dynamic_min_ratio` (number, optional, default: 1.5): latte_v2 only. Floor on the chosen Nx ratio when dynamic=true: the elbow is never weaker than this. Must be ≥ 1.0. Only consulted when dynamic=true.
- `dynamic_max_ratio` (number, optional, default: 10.0): latte_v2 only. Ceiling on the chosen Nx ratio when dynamic=true: the elbow is never more aggressive than this. Must be ≥ 1.0 and ≥ dynamic_min_ratio. Only consulted when dynamic=true.

Fixed vs dynamic: pick one per call
target_compression_ratio and dynamic are mutually exclusive in effect: sending both is not an error, but dynamic=true always wins and target_compression_ratio is silently ignored for that request. Switch per-call, not per-client.
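The precedence rule can be made explicit with a small resolver that reports which mode a latte_v2 request will actually run in. This is a sketch for reasoning about payloads: `effective_mode` is a hypothetical helper, and the default bounds are the ones listed for dynamic_min_ratio and dynamic_max_ratio above.

```python
from typing import Any, Dict, Optional, Tuple


def effective_mode(payload: Dict[str, Any]) -> Tuple[str, Optional[Any]]:
    """Return (mode, detail) for a latte_v2 payload, applying 'dynamic wins'."""
    if payload.get("dynamic", False):
        # target_compression_ratio, if present, is ignored in this mode.
        bounds = (payload.get("dynamic_min_ratio", 1.5),
                  payload.get("dynamic_max_ratio", 10.0))
        return ("dynamic", bounds)
    if "target_compression_ratio" in payload:
        return ("fixed", payload["target_compression_ratio"])
    return ("model_default", None)


if __name__ == "__main__":
    both = {"dynamic": True, "target_compression_ratio": 0.5}
    print(effective_mode(both))
```

A check like this is useful in logging or tests, where a silently ignored target_compression_ratio would otherwise go unnoticed.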
target_compression_ratio
target_compression_ratio controls how aggressive the compression is. It is interpreted two different ways depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table, and both models share these semantics.
| Value | Meaning | Example |
|---|---|---|
| 0 < r ≤ 1 | Removal strength | 0.5 removes ~50% of tokens |
| r > 1 | Nx target (max 200) | 4 → ~¼ original |
| omit | Model default | – |
Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
Bounds
r = 0 is rejected with 422 Unprocessable Entity. Values above 200 are rejected at the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.
Not a keep-fraction
target_compression_ratio is removal strength (when 0 < r ≤ 1) or an Nx target (when r > 1), never a keep-fraction. 0.3 does not mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
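The two interpretations are easiest to keep straight with a little arithmetic. The sketch below estimates how many tokens survive for a given ratio, following the table and bounds above. `expected_kept_tokens` is a hypothetical helper, the result is approximate because the ratios are targets, and treating negative values as invalid is an assumption, since this page only states that 0 and values above 200 are rejected.

```python
MAX_NX_RATIO = 200  # upper bound stated for the Nx form


def expected_kept_tokens(n_tokens: int, r: float) -> int:
    """Estimate the tokens kept after compressing `n_tokens` with ratio `r`."""
    if r <= 0 or r > MAX_NX_RATIO:
        raise ValueError("target_compression_ratio must satisfy 0 < r <= 200")
    if r <= 1:
        # Removal strength: r is the fraction removed, not the fraction kept.
        return round(n_tokens * (1 - r))
    # Nx target: the output is roughly 1/r of the input.
    return round(n_tokens / r)


if __name__ == "__main__":
    for ratio in (0.3, 0.5, 4, 10):
        print(ratio, expected_kept_tokens(1000, ratio))
```

Note that 0.3 leaves about 700 of 1000 tokens, not 300, which is the distinction the keep-fraction warning is about.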
Examples
Three call shapes, in order of how often you'll reach for them: latte_v2 with a fixed ratio (the default recommendation), latte_v2 with dynamic=true (for mixed-difficulty inputs), and latte_v1 with the shared structural knobs (when you need the older backbone explicitly).
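The three shapes can be sketched as plain request bodies using the snake_case wire names from the parameter table. The sample context, query, and specific ratio values are made up for illustration. Each body would be sent as JSON to `POST /compress/question-specific/`; the base URL, authentication, and response shape are not covered on this page, so they are left out.

```python
import json

# Illustrative inputs only.
CONTEXT = (
    "The warranty covers manufacturing defects for 24 months. "
    "Shipping takes 3 to 5 business days. "
    "Returns are accepted within 30 days of delivery."
)
QUERY = "How long is the warranty?"

# 1. latte_v2 with a fixed ratio (removal strength: remove ~50% of tokens).
fixed_v2 = {
    "context": CONTEXT,
    "query": QUERY,
    "compression_model_name": "latte_v2",
    "target_compression_ratio": 0.5,
}

# 2. latte_v2 with dynamic mode; the bounds are optional and shown for clarity.
dynamic_v2 = {
    "context": CONTEXT,
    "query": QUERY,
    "compression_model_name": "latte_v2",
    "dynamic": True,
    "dynamic_min_ratio": 2.0,
    "dynamic_max_ratio": 8.0,
}

# 3. latte_v1 with the shared structural knobs (Nx form: ~1/4 of the original).
structural_v1 = {
    "context": CONTEXT,
    "query": QUERY,
    "compression_model_name": "latte_v1",
    "target_compression_ratio": 4,
    "coarse": False,
    "heuristic_chunking": True,
    "disable_placeholders": True,
}

if __name__ == "__main__":
    for name, body in (("fixed_v2", fixed_v2),
                       ("dynamic_v2", dynamic_v2),
                       ("structural_v1", structural_v1)):
        print(name, json.dumps(body))
```

Moving from the first shape to the third changes only `compression_model_name` and the optional knobs; `context` and `query` stay the same.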
When to use these models
Both models shine on query-shaped tasks — workloads where you can name the intent the compressed output has to serve:
- RAG: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
- Agent retrieval: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
- Search-result trimming: collapse a list of long results to the parts relevant to the search query.
For deeper patterns and end-to-end examples, see the query-specific compression guide.