Compresr docs

Models

The Compresr compression model surface — `latte_v1` and `latte_v2`, their shared parameters, the latte_v2-only dynamic mode, and the canonical meaning of target_compression_ratio.

Compresr exposes two query-specific compression models on the public API:

  • latte_v1 — the original GemFilter backbone. Battle-tested, predictable.
  • latte_v2 — the newer Masker backbone. Up to 5x faster than latte_v1 at the same compression quality, and unlocks a dynamic mode that picks the compression ratio per-input automatically.

Both consume a context plus a query and return only the spans of context that carry signal for the query. latte_v2 is a strict superset of latte_v1's parameter surface: every knob latte_v1 accepts is also accepted by latte_v2, with the same defaults and semantics. Moving from latte_v1 to latte_v2 is a single string change to compression_model_name; moving the other way also means dropping any latte_v2-only fields (dynamic, dynamic_min_ratio, dynamic_max_ratio), which latte_v1 rejects with 422.

This page is the canonical reference for both models' parameters and for target_compression_ratio. Every endpoint and SDK page links back here for the value semantics.

Both models are exposed on the same endpoints.

Which model should I use?

| You want… | Pick |
| --- | --- |
| The simplest possible setup, predictable behavior | latte_v1 with target_compression_ratio |
| Lower latency on the same compression quality | latte_v2 with target_compression_ratio |
| A ratio chosen per-input (mixed-difficulty workloads) | latte_v2 with dynamic=true |
| A hard token budget per request | Either model with a fixed target_compression_ratio |

When in doubt: start with latte_v2 + target_compression_ratio=0.5. It is the fastest path to a working pipeline, and every other knob is incremental.

Supported parameters at a glance

Every parameter, every model. Defaults and shapes match the wire format; the TypeScript SDK accepts the camelCase form (targetCompressionRatio, dynamicMinRatio, …) but everything serializes to snake_case on the wire.
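The camelCase-to-snake_case mapping can be illustrated in a few lines. This is only a sketch of the naming convention, written in Python; it is not the TypeScript SDK's actual serializer, and `to_wire_key` and `to_wire_payload` are hypothetical helper names.

```python
import re


def to_wire_key(name: str) -> str:
    """Convert a camelCase option name to its snake_case wire form."""
    # Insert "_" before every uppercase letter that is not at the start,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_wire_payload(options: dict) -> dict:
    """Rename every key to snake_case; values are passed through untouched."""
    return {to_wire_key(key): value for key, value in options.items()}


sdk_options = {"targetCompressionRatio": 0.5, "dynamicMinRatio": 1.5}
print(to_wire_payload(sdk_options))
# {'target_compression_ratio': 0.5, 'dynamic_min_ratio': 1.5}
```

Keys that are already lowercase, such as context or query, pass through unchanged.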

| Parameter | latte_v1 | latte_v2 | Default | Purpose |
| --- | --- | --- | --- | --- |
| context | ✓ required | ✓ required | | Source text to compress |
| query | ✓ required | ✓ required | | Question/intent that grounds relevance |
| compression_model_name | ✓ required | ✓ required | | "latte_v1" or "latte_v2" |
| target_compression_ratio | ✓ | ✓ (ignored if dynamic=true) | model default | Fixed compression strength |
| coarse | ✓ | ✓ | true | Paragraph-level scoring vs token-level |
| heuristic_chunking | ✓ | ✓ | false | Structure-aware chunker before scoring |
| disable_placeholders | ✓ | ✓ | false | Drop [...] markers between kept spans |
| dynamic | ✗ | ✓ | false | Kneedle elbow selection (overrides ratio) |
| dynamic_min_ratio | ✗ | ✓ (when dynamic=true) | 1.5 | Floor on the chosen Nx ratio |
| dynamic_max_ratio | ✗ | ✓ (when dynamic=true) | 10.0 | Ceiling on the chosen Nx ratio |

A ✗ means the parameter is rejected with 422 Unprocessable Entity if you send it to that model.
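A client can catch the 422 case before sending a request. The sketch below mirrors the support rules on this page in plain Python; `params_rejected_by_model` is a hypothetical helper, not part of any Compresr SDK, and it only checks the three latte_v2-only fields named here.

```python
# Fields this page documents as latte_v2 only.
V2_ONLY_PARAMS = {"dynamic", "dynamic_min_ratio", "dynamic_max_ratio"}


def params_rejected_by_model(payload: dict) -> list:
    """Return the payload keys that the selected model would reject."""
    model = payload.get("compression_model_name")
    if model not in ("latte_v1", "latte_v2"):
        # Any other model name is itself rejected with 422.
        raise ValueError(f"unknown compression_model_name: {model!r}")
    if model == "latte_v2":
        return []  # latte_v2 accepts the full parameter surface
    return sorted(V2_ONLY_PARAMS.intersection(payload))


payload = {
    "context": "some long text",
    "query": "what changed?",
    "compression_model_name": "latte_v1",
    "dynamic": True,
    "dynamic_max_ratio": 8.0,
}
print(params_rejected_by_model(payload))
# ['dynamic', 'dynamic_max_ratio']
```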

Shared parameters (both models)

These knobs are accepted by latte_v1 and latte_v2 with identical semantics. Reach for them on either model.

context (string, required)
The source text you want compressed: RAG chunks, document body, chat history, tool output, or anything else you would otherwise pay tokens to send to the LLM. Passing an empty string returns an empty result with no billing.

query (string, required)
The user question (or intent) that grounds the relevance signal. Both models keep only spans of context that help answer this query, so it cannot be empty.

compression_model_name ("latte_v1" | "latte_v2", required)
Routes the call. "latte_v1" → GemFilter backbone, "latte_v2" → Masker backbone. Any other value is rejected with 422.

target_compression_ratio (number, optional)
Default: model default
Compression strength. Removal fraction when 0 < r ≤ 1, Nx target when r > 1. See target_compression_ratio below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input. On latte_v2, ignored when dynamic=true.

coarse (boolean, optional)
Default: true
Score and compress at paragraph granularity instead of token granularity. Faster and cheaper on long inputs where fine-grained precision is not needed. Set to false for token-level precision at the cost of latency.

heuristic_chunking (boolean, optional)
Default: false
Chunk the input with a structure-aware splitter (paragraphs, code blocks, markdown sections) before scoring. Helps on structured inputs such as logs, transcripts, and tables, where the default chunker over- or under-splits.

disable_placeholders (boolean, optional)
Default: false
Return only the kept spans, with no [...] placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers.

latte_v2-only: dynamic compression ratio

latte_v2 adds a single feature on top of the shared surface: dynamic mode. Instead of pinning a fixed target_compression_ratio, the model inspects the per-span score curve and picks an inflection-point (Kneedle elbow) ratio per input. Short, easy contexts get compressed aggressively; long, dense contexts back off. The chosen ratio is always inside [dynamic_min_ratio, dynamic_max_ratio].

Use it when your inputs vary in difficulty and you would otherwise be guessing one ratio for everything. Stick with a fixed target_compression_ratio when you need a predictable token budget per call.

dynamic (boolean, optional)
Default: false
latte_v2 only. When true, the model picks the ratio per-input via Kneedle elbow selection inside [dynamic_min_ratio, dynamic_max_ratio] and target_compression_ratio is ignored. Rejected on latte_v1 with 422.

dynamic_min_ratio (number, optional)
Default: 1.5
latte_v2 only. Floor on the chosen Nx ratio when dynamic=true: the elbow is never weaker than this. Must be ≥ 1.0. Only consulted when dynamic=true.

dynamic_max_ratio (number, optional)
Default: 10.0
latte_v2 only. Ceiling on the chosen Nx ratio when dynamic=true: the elbow is never more aggressive than this. Must be ≥ 1.0 and ≥ dynamic_min_ratio. Only consulted when dynamic=true.
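The bound rules for dynamic mode can be checked client-side. The sketch below validates the two bounds as stated above and shows what "always inside [dynamic_min_ratio, dynamic_max_ratio]" means for a candidate ratio. How the service actually arrives at its elbow is not described here; the clamp is only an illustration of the guarantee, and both function names are hypothetical.

```python
def validate_dynamic_bounds(min_ratio: float = 1.5, max_ratio: float = 10.0) -> tuple:
    """Check the documented constraints on the dynamic-mode bounds."""
    if min_ratio < 1.0:
        raise ValueError("dynamic_min_ratio must be >= 1.0")
    if max_ratio < 1.0:
        raise ValueError("dynamic_max_ratio must be >= 1.0")
    if max_ratio < min_ratio:
        raise ValueError("dynamic_max_ratio must be >= dynamic_min_ratio")
    return min_ratio, max_ratio


def clamp_to_bounds(candidate_ratio: float,
                    min_ratio: float = 1.5,
                    max_ratio: float = 10.0) -> float:
    """Illustrative only: force a candidate Nx ratio into the allowed range."""
    low, high = validate_dynamic_bounds(min_ratio, max_ratio)
    return max(low, min(high, candidate_ratio))


print(clamp_to_bounds(25.0))  # 10.0, capped at the default ceiling
print(clamp_to_bounds(1.1))   # 1.5, raised to the default floor
print(clamp_to_bounds(4.0))   # 4.0, already inside the range
```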

Fixed vs dynamic — pick one per call

target_compression_ratio and dynamic are mutually exclusive in effect: sending both is not an error, but dynamic=true always wins and target_compression_ratio is silently ignored for that request. Switch per-call, not per-client.
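The precedence rule can be written down as a small decision function. This is a sketch of the behavior described above, not service code; `resolve_ratio_mode` and its three return labels are hypothetical names chosen for illustration.

```python
def resolve_ratio_mode(payload: dict) -> str:
    """Return which ratio mode a request payload ends up using."""
    model = payload["compression_model_name"]
    if payload.get("dynamic", False):
        if model != "latte_v2":
            # dynamic is rejected on latte_v1 with 422
            raise ValueError("dynamic=true is only accepted by latte_v2")
        # dynamic=true wins; any target_compression_ratio is ignored
        return "dynamic"
    if payload.get("target_compression_ratio") is not None:
        return "fixed"
    return "model_default"


both = {
    "compression_model_name": "latte_v2",
    "target_compression_ratio": 0.5,
    "dynamic": True,
}
print(resolve_ratio_mode(both))  # dynamic
```

Because the choice is made per request, a single client can send fixed-ratio and dynamic requests side by side.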

target_compression_ratio

target_compression_ratio controls how aggressive the compression is. It is interpreted two different ways depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table, and both models share these semantics.

| Value | Meaning | Example |
| --- | --- | --- |
| 0 < r ≤ 1 | Removal strength | 0.5 removes ~50% of tokens |
| r > 1 | Nx target (max 200) | 4 → ~¼ of the original |
| omitted | Model default | |

Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
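The arithmetic behind both forms is short enough to sketch. The helpers below estimate the output size for a given ratio and derive an Nx ratio from a token budget. Both are hypothetical, and since the page describes ratios as approximate targets ("~"), the results are estimates rather than guarantees.

```python
from typing import Optional


def expected_output_tokens(original_tokens: int, r: float) -> int:
    """Estimate how many tokens remain after compression at ratio r."""
    if 0 < r <= 1:
        kept = original_tokens * (1.0 - r)  # removal strength: drop a fraction r
    elif 1 < r <= 200:
        kept = original_tokens / r          # Nx target: shrink by a factor of r
    else:
        raise ValueError("r must satisfy 0 < r <= 200")
    return round(kept)


def ratio_for_budget(original_tokens: int, budget_tokens: int) -> Optional[float]:
    """Nx ratio that brings original_tokens down to roughly budget_tokens."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    if original_tokens <= budget_tokens:
        return None  # already within budget; a value <= 1 would mean removal strength
    r = original_tokens / budget_tokens
    if r > 200:
        raise ValueError("budget needs a ratio above the 200x maximum")
    return r


print(expected_output_tokens(1000, 0.5))  # 500
print(expected_output_tokens(1000, 4))    # 250
print(ratio_for_budget(8000, 2000))       # 4.0
```

Note the jump at r = 1: values at or below 1 are removal fractions, values just above 1 are near-identity Nx targets, so a budget helper should never emit a value in (0, 1].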

Bounds

r = 0 is rejected with 422 Unprocessable Entity. Values above 200 are rejected at the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.
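A pre-flight check for these bounds might look like the sketch below. `validate_ratio` is a hypothetical helper. Treating negative and non-numeric values as invalid is an assumption on top of the two cases stated above (r = 0 and r > 200).

```python
from typing import Optional


def validate_ratio(r: Optional[float]) -> Optional[float]:
    """Return r as a float if it is an acceptable target_compression_ratio."""
    if r is None:
        return None  # omitted: the model picks a ratio for the input
    if isinstance(r, bool) or not isinstance(r, (int, float)):
        raise TypeError("target_compression_ratio must be a number")
    if r <= 0:
        # r = 0 is documented as rejected; negatives are assumed to be as well
        raise ValueError("target_compression_ratio must be > 0")
    if r > 200:
        # no silent clamping: values above 200 are rejected
        raise ValueError("target_compression_ratio must be <= 200")
    return float(r)


print(validate_ratio(0.5))   # 0.5
print(validate_ratio(200))   # 200.0
print(validate_ratio(None))  # None
```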

Not a keep-fraction

target_compression_ratio is removal strength (when 0 < r ≤ 1) or an Nx target (when r > 1), never a keep-fraction. 0.3 does not mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
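If a benchmark wrapper hands you a keep-fraction k, the equivalent removal strength is 1 - k. A minimal conversion sketch follows; the helper name is hypothetical.

```python
def removal_strength_from_keep_fraction(keep: float) -> float:
    """Convert a keep-fraction (share of tokens kept) to removal strength."""
    if not 0 <= keep < 1:
        # keep = 1 would map to r = 0, which the API rejects
        raise ValueError("keep fraction must satisfy 0 <= keep < 1")
    return 1.0 - keep


# "Keep 30%" in keep-fraction terms is "remove ~70%" here.
print(round(removal_strength_from_keep_fraction(0.3), 2))  # 0.7
```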

Examples

Three call shapes, in order of how often you'll reach for them: latte_v2 with a fixed ratio (the default recommendation), latte_v2 with dynamic=true (for mixed-difficulty inputs), and latte_v1 with the shared structural knobs (when you need the older backbone explicitly).
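The three call shapes can be sketched as wire-format request bodies. Only the field names and values documented on this page are used. The endpoint URL, authentication, and SDK method names are not covered here, so the sketch stops at building and serializing the payloads. The sample context, query, and the specific knob values chosen for latte_v1 are illustrative assumptions.

```python
import json

# Illustrative inputs, not real data.
CONTEXT = "Section 1: billing terms. Section 2: refund policy. Section 3: support hours."
QUERY = "What is the refund policy?"

base = {"context": CONTEXT, "query": QUERY}

# 1. latte_v2 with a fixed ratio (removal strength: drop ~50% of tokens).
fixed_v2 = {
    **base,
    "compression_model_name": "latte_v2",
    "target_compression_ratio": 0.5,
}

# 2. latte_v2 in dynamic mode, with the documented default bounds spelled out.
dynamic_v2 = {
    **base,
    "compression_model_name": "latte_v2",
    "dynamic": True,
    "dynamic_min_ratio": 1.5,
    "dynamic_max_ratio": 10.0,
}

# 3. latte_v1 with the shared structural knobs (Nx form: ~1/4 of the original).
structural_v1 = {
    **base,
    "compression_model_name": "latte_v1",
    "target_compression_ratio": 4,
    "coarse": False,
    "heuristic_chunking": True,
    "disable_placeholders": True,
}

for name, payload in [("fixed_v2", fixed_v2),
                      ("dynamic_v2", dynamic_v2),
                      ("structural_v1", structural_v1)]:
    print(name, json.dumps(payload))
```

Switching the first shape to latte_v1 only requires changing compression_model_name; the second shape has no latte_v1 equivalent.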

When to use these models

Both models shine on query-shaped tasks — workloads where you can name the intent the compressed output has to serve:

  • RAG: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
  • Agent retrieval: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
  • Search-result trimming: collapse a list of long results to the parts relevant to the search query.

For deeper patterns and end-to-end examples, see the query-specific compression guide.