API reference

Models

The Compresr compression model surface — `latte_v1` and `latte_v2`, their shared parameters, the latte_v2-only dynamic mode, and the canonical meaning of target_compression_ratio.

Compresr exposes two query-specific compression models on the public API:

latte_v1 — the original GemFilter backbone. Battle-tested, predictable.
latte_v2 — the newer Masker backbone. Up to 5x faster than latte_v1 at the same compression quality, and unlocks a dynamic mode that picks the compression ratio per-input automatically.

Both consume a context plus a query and return only the spans of context that carry signal for the query. latte_v2 is a strict superset of latte_v1's parameter surface — every knob latte_v1 accepts is also accepted by latte_v2, with the same defaults and semantics. Swapping between them is a single string change to compression_model_name.

This page is the canonical reference for both models' parameters and for target_compression_ratio. Every endpoint and SDK page links back here for the value semantics.

Both models are exposed on the same endpoints:

POST /compress/question-specific/: single compression
POST /compress/question-specific/stream: SSE stream
POST /compress/question-specific/batch: up to 100 rows per call
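As a rough illustration of the single-compression endpoint, the sketch below builds (but does not send) a request with the Python standard library. The host `api.example.com`, the bearer-token header, and the helper name `build_compress_request` are placeholders assumed for this example; only the path and the body field names come from this page.

```python
import json
import urllib.request

# Assumption: placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_compress_request(context, query, model="latte_v2",
                           api_key="YOUR_API_KEY", **options):
    """Build a POST request for the single-compression endpoint without sending it."""
    payload = {
        "context": context,
        "query": query,
        "compression_model_name": model,
        **options,  # e.g. target_compression_ratio=0.5
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/compress/question-specific/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check the authentication docs.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_compress_request(
    "long document text",
    "What changed in v2?",
    target_compression_ratio=0.5,
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access or credentials.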

Which model should I use?

You want…	Pick
The simplest possible setup, predictable behavior	`latte_v1` with `target_compression_ratio`
Lower latency at the same compression quality	`latte_v2` with `target_compression_ratio`
A ratio chosen per-input (mixed-difficulty workloads)	`latte_v2` with `dynamic=true`
A hard token budget per request	Either model with a fixed `target_compression_ratio`

When in doubt: start with latte_v2 + target_compression_ratio=0.5. It is the fastest path to a working pipeline, and every other knob is incremental.

Supported parameters at a glance

Every parameter, every model. Defaults and shapes match the wire format; the TypeScript SDK accepts the camelCase form (targetCompressionRatio, dynamicMinRatio, …) but everything serializes to snake_case on the wire.
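The camelCase to snake_case mapping can be pictured with a small converter. This is only an illustration of the naming rule described above, not the SDK's own code; `to_wire_name` and `to_wire_payload` are hypothetical helper names.

```python
import re


def to_wire_name(camel: str) -> str:
    """Convert a camelCase option name to its snake_case wire form."""
    # Insert "_" before every uppercase letter except a leading one, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", camel).lower()


def to_wire_payload(options: dict) -> dict:
    """Rename every key of an options dict to its wire form; values are untouched."""
    return {to_wire_name(key): value for key, value in options.items()}


print(to_wire_payload({"targetCompressionRatio": 0.5, "dynamicMinRatio": 2.0}))
```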

Parameter	`latte_v1`	`latte_v2`	Default	Purpose
`context`	✓ required	✓ required	—	Source text to compress
`query`	✓ required	✓ required	—	Question/intent that grounds relevance
`compression_model_name`	✓ required	✓ required	—	`"latte_v1"` or `"latte_v2"`
`target_compression_ratio`	✓	✓ (ignored if `dynamic=true`)	model default	Fixed compression strength
`coarse`	✓	✓	`true`	Paragraph-level scoring vs token-level
`heuristic_chunking`	✓	✓	`false`	Structure-aware chunker before scoring
`disable_placeholders`	✓	✓	`false`	Drop `[...]` markers between kept spans
`dynamic`	—	✓	`false`	Kneedle elbow selection (overrides ratio)
`dynamic_min_ratio`	—	✓ (when `dynamic=true`)	`1.5`	Floor on the chosen Nx ratio
`dynamic_max_ratio`	—	✓ (when `dynamic=true`)	`10.0`	Ceiling on the chosen Nx ratio

A — means the parameter is rejected with 422 Unprocessable Entity if you send it to that model.
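A client-side pre-flight check can catch these 422 cases before a request leaves your process. The sketch below mirrors only what the table states (required fields, the two model names, and the latte_v2-only parameters); `check_parameters` is a hypothetical helper, and the server remains the source of truth.

```python
REQUIRED = ("context", "query", "compression_model_name")
MODELS = ("latte_v1", "latte_v2")
V2_ONLY = ("dynamic", "dynamic_min_ratio", "dynamic_max_ratio")


def check_parameters(payload: dict) -> list:
    """Return a list of problems that the table above says would be rejected."""
    problems = [
        f"missing required parameter: {name}"
        for name in REQUIRED
        if name not in payload
    ]
    model = payload.get("compression_model_name")
    if model is not None and model not in MODELS:
        problems.append(f"unknown model: {model!r}")
    if model == "latte_v1":
        # latte_v2-only parameters are rejected on latte_v1, not ignored.
        problems += [
            f"{name} is not accepted by latte_v1"
            for name in V2_ONLY
            if name in payload
        ]
    return problems


print(check_parameters({
    "context": "some text",
    "query": "a question",
    "compression_model_name": "latte_v1",
    "dynamic": True,
}))
```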

Shared parameters (both models)

These knobs are accepted by latte_v1 and latte_v2 with identical semantics. Reach for them on either model.

`context` (string, required)

The source text you want compressed: RAG chunks, document body, chat history, tool output — anything you would otherwise pay tokens to send to the LLM. Passing an empty string returns an empty result with no billing.

`query` (string, required)

The user question (or intent) that grounds the relevance signal. Both models keep only spans of context that help answer this query, so it cannot be empty.

`compression_model_name` (`"latte_v1"` | `"latte_v2"`, required)

Routes the call. "latte_v1" → GemFilter backbone, "latte_v2" → Masker backbone. Any other value is rejected with 422.

`target_compression_ratio` (number, optional)

Default: model default

Compression strength. Removal fraction when 0 < r ≤ 1, Nx target when r > 1. See target_compression_ratio below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input. On latte_v2, ignored when dynamic=true.

`coarse` (boolean, optional)

Default: true

Skip fine-grained (token-level) scoring and compress at a coarser, paragraph-level granularity. Faster and cheaper on long inputs where that finer precision is not needed. Set to false for token-level precision at the cost of latency.

`heuristic_chunking` (boolean, optional)

Default: false

Chunk the input with a structure-aware splitter (paragraphs, code blocks, markdown sections) before scoring. Helps on structured inputs — logs, transcripts, tables — where the default chunker over- or under-splits.

`disable_placeholders` (boolean, optional)

Default: false

Return only the kept spans, with no [...] placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers.
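To see what the markers mean for downstream code, here is a minimal sketch that splits a compressed string back into its kept spans. It assumes the response text contains the literal marker `[...]` between kept spans, as described above; the sample string and the helper name `kept_spans` are made up for illustration.

```python
PLACEHOLDER = "[...]"


def kept_spans(compressed: str) -> list:
    """Split compressed text on the gap marker and drop empty pieces."""
    return [part.strip() for part in compressed.split(PLACEHOLDER) if part.strip()]


# Made-up sample of what a default (placeholders enabled) result might look like.
sample = "Invoices are due in 30 days. [...] Late fees are 2% per month."
print(kept_spans(sample))
```

With `disable_placeholders=true` this post-processing step should be unnecessary, since the markers are not emitted in the first place.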

latte_v2-only: dynamic compression ratio

latte_v2 adds a single feature on top of the shared surface: dynamic mode. Instead of pinning a fixed target_compression_ratio, the model inspects the per-span score curve and picks an inflection-point (Kneedle elbow) ratio per input. Short, easy contexts get compressed aggressively; long, dense contexts back off. The chosen ratio is always inside [dynamic_min_ratio, dynamic_max_ratio].

Use it when your inputs vary in difficulty and you would otherwise be guessing one ratio for everything. Stick with a fixed target_compression_ratio when you need a predictable token budget per call.
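The idea of picking a ratio from a score curve can be illustrated with a toy example. The service's actual scoring and elbow detection are not specified on this page, so everything below is an assumption: it ranks made-up per-span scores, finds the point furthest below the chord of the normalized curve (the core idea behind Kneedle, without its smoothing or sensitivity parameter), treats all spans as equal length, and clamps the result to the configured bounds.

```python
def pick_dynamic_ratio(scores, min_ratio=1.5, max_ratio=10.0):
    """Toy elbow pick: return an Nx ratio clamped to [min_ratio, max_ratio]."""
    n = len(scores)
    ranked = sorted(scores, reverse=True)
    spread = ranked[0] - ranked[-1] if n else 0.0
    if n < 3 or spread == 0:
        # No usable curve: fall back to the gentlest ratio (a choice of this sketch).
        return min_ratio
    best_index, best_gap = 0, 0.0
    for i, score in enumerate(ranked):
        x = i / (n - 1)                    # normalized rank in [0, 1]
        y = (score - ranked[-1]) / spread  # normalized score in [0, 1]
        gap = (1.0 - x) - y                # distance below the chord (0, 1) -> (1, 0)
        if gap > best_gap:
            best_index, best_gap = i, gap
    if best_gap <= 0:
        return min_ratio                   # curve never dips below the chord: no elbow
    kept = max(1, best_index)              # spans ranked above the elbow
    return min(max_ratio, max(min_ratio, n / kept))


# Three clearly relevant spans out of eight: the elbow sits after the third.
print(pick_dynamic_ratio([0.95, 0.9, 0.85, 0.2, 0.15, 0.1, 0.05, 0.05]))
# One relevant span out of thirty: the raw 30x pick is clamped to the ceiling.
print(pick_dynamic_ratio([1.0] + [0.0] * 29))
```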

`dynamic` (boolean, optional)

Default: false

latte_v2 only. When true, the model picks the ratio per-input via Kneedle elbow selection inside [dynamic_min_ratio, dynamic_max_ratio] and target_compression_ratio is ignored. Rejected on latte_v1 with 422.

`dynamic_min_ratio` (number, optional)

Default: 1.5

latte_v2 only. Floor on the chosen Nx ratio when dynamic=true: the elbow is never weaker than this. Must be ≥ 1.0. Only consulted when dynamic=true.

`dynamic_max_ratio` (number, optional)

Default: 10.0

latte_v2 only. Ceiling on the chosen Nx ratio when dynamic=true: the elbow is never more aggressive than this. Must be ≥ 1.0 and ≥ dynamic_min_ratio. Only consulted when dynamic=true.
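The two bound constraints are easy to check before sending a request. A minimal sketch, using the documented defaults and a hypothetical helper name `check_dynamic_bounds`:

```python
def check_dynamic_bounds(dynamic_min_ratio=1.5, dynamic_max_ratio=10.0):
    """Validate the dynamic-mode bounds and return them as a (min, max) pair."""
    if dynamic_min_ratio < 1.0:
        raise ValueError("dynamic_min_ratio must be >= 1.0")
    if dynamic_max_ratio < 1.0:
        raise ValueError("dynamic_max_ratio must be >= 1.0")
    if dynamic_max_ratio < dynamic_min_ratio:
        raise ValueError("dynamic_max_ratio must be >= dynamic_min_ratio")
    return dynamic_min_ratio, dynamic_max_ratio


print(check_dynamic_bounds())          # documented defaults
print(check_dynamic_bounds(2.0, 6.0))  # a narrower window
```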

Fixed vs dynamic — pick one per call

target_compression_ratio and dynamic are mutually exclusive in effect: sending both is not an error, but dynamic=true always wins and target_compression_ratio is silently ignored for that request. Switch per-call, not per-client.
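Because the override is silent, a small lint on the outgoing payload can surface fields that will have no effect. The sketch below assumes a latte_v2 payload (on latte_v1 the dynamic fields are rejected rather than ignored); `ignored_fields` is a hypothetical helper.

```python
def ignored_fields(payload: dict) -> list:
    """List fields in a latte_v2 payload that the documented rules say are ignored."""
    ignored = []
    if payload.get("dynamic", False):
        # dynamic=true always wins over a fixed ratio.
        if "target_compression_ratio" in payload:
            ignored.append("target_compression_ratio")
    else:
        # The bounds are only consulted when dynamic=true.
        ignored += [
            name
            for name in ("dynamic_min_ratio", "dynamic_max_ratio")
            if name in payload
        ]
    return ignored


print(ignored_fields({"dynamic": True, "target_compression_ratio": 0.5}))
```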

target_compression_ratio

target_compression_ratio controls how aggressive the compression is. It is interpreted two different ways depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table, and both models share these semantics.

Value	Meaning	Example
`0 < r ≤ 1`	Removal strength	`0.5` removes ~50% of tokens
`r > 1`	Nx target (max `200`)	`4` → ~¼ original
omit	Model default	–

Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
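The two readings can be reduced to one number, the approximate fraction of tokens that survives. The sketch below encodes the table as written; the helper names are hypothetical, and because the documented targets are approximate ("~"), the token estimate is a planning figure rather than a guarantee.

```python
def expected_keep_fraction(r: float) -> float:
    """Approximate fraction of tokens kept for a given target_compression_ratio."""
    if 0 < r <= 1:
        return 1.0 - r  # removal strength: 0.5 removes about half
    if 1 < r <= 200:
        return 1.0 / r  # Nx target: 4 keeps about a quarter
    raise ValueError("r must satisfy 0 < r <= 200")


def estimate_output_tokens(input_tokens: int, r: float) -> int:
    """Rough output size for an input of the given token count."""
    return round(input_tokens * expected_keep_fraction(r))


print(estimate_output_tokens(1000, 0.5))  # removal-strength form
print(estimate_output_tokens(1000, 4))    # Nx form
```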

Bounds

r = 0 is rejected with 422 Unprocessable Entity. Values above 200 are rejected with the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.

Not a keep-fraction

target_compression_ratio is removal strength (when 0 < r ≤ 1) or an Nx target (when r > 1), never a keep-fraction. 0.3 does not mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
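If you are porting settings from a tool that uses the keep-fraction convention, the conversion is a one-liner in either form. A minimal sketch with a hypothetical helper name, assuming the keep-fraction is strictly between 0 and 1:

```python
def ratio_from_keep_fraction(keep: float, form: str = "removal") -> float:
    """Translate a keep-fraction into a target_compression_ratio value."""
    if not 0 < keep < 1:
        raise ValueError("keep must be strictly between 0 and 1")
    if form == "removal":
        return 1.0 - keep       # keep 30%  ->  remove ~70%  ->  r = 0.7
    if form == "nx":
        ratio = 1.0 / keep      # keep 25%  ->  4x
        if ratio > 200:
            raise ValueError("Nx ratio would exceed the documented maximum of 200")
        return ratio
    raise ValueError("form must be 'removal' or 'nx'")


print(ratio_from_keep_fraction(0.3))         # removal-strength form
print(ratio_from_keep_fraction(0.25, "nx"))  # Nx form
```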

Examples

Three call shapes, in order of how often you'll reach for them: latte_v2 with a fixed ratio (the default recommendation), latte_v2 with dynamic=true (for mixed-difficulty inputs), and latte_v1 with the shared structural knobs (when you need the older backbone explicitly).
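As a hedged sketch, the three shapes are written below as raw request bodies using the snake_case wire names from this page, rather than through any SDK call whose exact signature is not shown here. The context and query strings are placeholders.

```python
import json

CONTEXT = "...retrieved chunks or document text..."
QUERY = "What is the refund window?"

# 1. latte_v2 with a fixed ratio (the default recommendation).
fixed_v2 = {
    "context": CONTEXT,
    "query": QUERY,
    "compression_model_name": "latte_v2",
    "target_compression_ratio": 0.5,  # remove ~50% of tokens
}

# 2. latte_v2 with dynamic mode: the ratio is picked per input within the bounds.
dynamic_v2 = {
    "context": CONTEXT,
    "query": QUERY,
    "compression_model_name": "latte_v2",
    "dynamic": True,
    "dynamic_min_ratio": 2.0,
    "dynamic_max_ratio": 6.0,
}

# 3. latte_v1 with the shared structural knobs.
structured_v1 = {
    "context": CONTEXT,
    "query": QUERY,
    "compression_model_name": "latte_v1",
    "target_compression_ratio": 4,   # Nx form: about a quarter of the original
    "coarse": False,                 # token-level precision
    "heuristic_chunking": True,      # structure-aware chunking before scoring
    "disable_placeholders": True,    # no [...] markers in the output
}

for body in (fixed_v2, dynamic_v2, structured_v1):
    print(json.dumps(body, indent=2))
```

Each body can be sent as the JSON payload of POST /compress/question-specific/.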


When to use these models

Both models shine on query-shaped tasks — workloads where you can name the intent the compressed output has to serve:

RAG: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
Agent retrieval: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
Search-result trimming: collapse a list of long results to the parts relevant to the search query.
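For the RAG case with many retrieved chunks, the batch endpoint's limit of 100 rows per call means larger result sets need to be split. The sketch below groups rows into batches of at most 100; the per-row shape (one `context` and `query` per row) is an assumption carried over from the single-call parameters, and `batched_rows` is a hypothetical helper.

```python
from itertools import islice

MAX_ROWS_PER_CALL = 100  # documented limit of the batch endpoint


def batched_rows(chunks, query, model="latte_v2", ratio=0.5):
    """Yield lists of at most MAX_ROWS_PER_CALL rows, one row per retrieved chunk."""
    rows = (
        {
            "context": chunk,
            "query": query,
            "compression_model_name": model,
            "target_compression_ratio": ratio,
        }
        for chunk in chunks
    )
    while True:
        batch = list(islice(rows, MAX_ROWS_PER_CALL))
        if not batch:
            return
        yield batch


retrieved = [f"chunk {i}" for i in range(250)]
batches = list(batched_rows(retrieved, "What is the refund window?"))
print([len(batch) for batch in batches])
```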

For deeper patterns and end-to-end examples, see the query-specific compression guide.