# https://compresr.ai/docs/api-reference/models

> Human-readable page: https://compresr.ai/docs/api-reference/models

Compresr exposes two query-specific compression models on the public API:

- **`latte_v1`** — the original GemFilter backbone. Battle-tested, predictable.
- **`latte_v2`** — the newer Masker backbone. **Up to 5x faster** than `latte_v1` at the same compression quality, and unlocks a `dynamic` mode that picks the compression ratio per-input automatically.

Both consume a `context` plus a `query` and return only the spans of `context` that carry signal for the query. **`latte_v2`'s parameter surface is a strict superset of `latte_v1`'s**: every knob `latte_v1` accepts is also accepted by `latte_v2`, with the same defaults and semantics. Swapping between them is a single string change to `compression_model_name`, as long as the request sends no `latte_v2`-only parameters to `latte_v1`.

This page is the **canonical reference** for both models' parameters and for `target_compression_ratio`. Every endpoint and SDK page links back here for the value semantics.

Both models are exposed on the same endpoints:

- [`POST /compress/question-specific/`](/docs/api-reference/compress-qs): single compression
- [`POST /compress/question-specific/stream`](/docs/api-reference/compress-qs-stream): SSE stream
- [`POST /compress/question-specific/batch`](/docs/api-reference/compress-qs-batch): up to 100 rows per call
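
Because the batch endpoint accepts at most 100 rows per call, larger workloads have to be split client-side. A minimal sketch of that split; the helper name `chunk_rows` and the row shape (a dict with `context` and `query`) are illustrative assumptions, not part of the SDK:

```python
from typing import Iterator, Sequence

MAX_BATCH_ROWS = 100  # limit stated for POST /compress/question-specific/batch


def chunk_rows(rows: Sequence[dict], size: int = MAX_BATCH_ROWS) -> Iterator[list]:
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if not 1 <= size <= MAX_BATCH_ROWS:
        raise ValueError(f"size must be between 1 and {MAX_BATCH_ROWS}")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


# Hypothetical rows; check the batch endpoint page for the real row schema.
rows = [{"context": f"doc {i}", "query": "What changed?"} for i in range(250)]
batches = list(chunk_rows(rows))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each yielded list can then be sent as one batch call.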

## Which model should I use?

| You want… | Pick |
|---|---|
| The simplest possible setup, predictable behavior | **`latte_v1`** with `target_compression_ratio` |
| Lower latency on the same compression quality | **`latte_v2`** with `target_compression_ratio` |
| A ratio chosen per-input (mixed-difficulty workloads) | **`latte_v2`** with `dynamic=true` |
| A hard token budget per request | Either model with a fixed `target_compression_ratio` |

When in doubt: start with **`latte_v2` + `target_compression_ratio=0.5`**. It is the fastest path to a working pipeline, and every other knob is incremental.

## Supported parameters at a glance

Every parameter, every model. Defaults and shapes match the wire format; the TypeScript SDK accepts the camelCase form (`targetCompressionRatio`, `dynamicMinRatio`, …) but everything serializes to snake_case on the wire.

| Parameter | `latte_v1` | `latte_v2` | Default | Purpose |
|---|---|---|---|---|
| `context` | ✓ required | ✓ required | — | Source text to compress |
| `query` | ✓ required | ✓ required | — | Question/intent that grounds relevance |
| `compression_model_name` | ✓ required | ✓ required | — | `"latte_v1"` or `"latte_v2"` |
| `target_compression_ratio` | ✓ | ✓ (ignored if `dynamic=true`) | model default | Fixed compression strength |
| `coarse` | ✓ | ✓ | `true` | Paragraph-level scoring vs token-level |
| `heuristic_chunking` | ✓ | ✓ | `false` | Structure-aware chunker before scoring |
| `disable_placeholders` | ✓ | ✓ | `false` | Drop `[...]` markers between kept spans |
| `dynamic` | — | ✓ | `false` | Kneedle elbow selection (overrides ratio) |
| `dynamic_min_ratio` | — | ✓ (when `dynamic=true`) | `1.5` | Floor on the chosen Nx ratio |
| `dynamic_max_ratio` | — | ✓ (when `dynamic=true`) | `10.0` | Ceiling on the chosen Nx ratio |

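A `—` in a model column means the parameter is **rejected with `422 Unprocessable Entity`** if you send it to that model. In the Default column it means the parameter has no default because it is required.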
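
The support matrix above can be mirrored in a small pre-flight check so that obvious mistakes are caught before a request is sent. This is a client-side sketch only: `check_payload` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
# Parameter names transcribed from the table above.
REQUIRED = {"context", "query", "compression_model_name"}
V2_ONLY = {"dynamic", "dynamic_min_ratio", "dynamic_max_ratio"}
MODELS = {"latte_v1", "latte_v2"}


def check_payload(payload: dict) -> list:
    """Return human-readable problems found in `payload`; an empty list means none."""
    problems = []
    missing = REQUIRED - set(payload)
    if missing:
        problems.append(f"missing required: {sorted(missing)}")
    model = payload.get("compression_model_name")
    if model is not None and model not in MODELS:
        problems.append(f"unknown model: {model!r}")
    if model == "latte_v1":
        # latte_v2-only parameters are rejected with 422 when sent to latte_v1.
        rejected = V2_ONLY & set(payload)
        if rejected:
            problems.append(f"not accepted by latte_v1: {sorted(rejected)}")
    return problems


print(check_payload({
    "context": "...",
    "query": "What changed?",
    "compression_model_name": "latte_v1",
    "dynamic": True,
}))
```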

## Shared parameters (both models)

These knobs are accepted by `latte_v1` **and** `latte_v2` with identical semantics. Reach for them on either model.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `context` | string | yes | The source text you want compressed: RAG chunks, document body, chat history, tool output — anything you would otherwise pay tokens to send to the LLM. Passing an empty string returns an empty result with no billing. |
| `query` | string | yes | The user question (or intent) that grounds the relevance signal. Both models keep only spans of context that help answer this query, so it cannot be empty. |
| `compression_model_name` | "latte_v1" \| "latte_v2" | yes | Routes the call. `"latte_v1"` → GemFilter backbone, `"latte_v2"` → Masker backbone. Any other value is rejected with `422`. |
| `target_compression_ratio` | number | no | Compression strength. Removal fraction when `0 < r ≤ 1`, Nx target when `r > 1`. See [target_compression_ratio](#target_compression_ratio) below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input. On `latte_v2`, ignored when `dynamic=true`. |
| `coarse` | boolean | no | Score and compress at paragraph-level granularity instead of token level. Faster and cheaper on long inputs where fine-grained precision is not needed. Defaults to `true`; set to `false` for token-level precision at the cost of latency. |
| `heuristic_chunking` | boolean | no | Chunk the input with a structure-aware splitter (paragraphs, code blocks, markdown sections) before scoring. Helps on structured inputs — logs, transcripts, tables — where the default chunker over- or under-splits. |
| `disable_placeholders` | boolean | no | Return only the kept spans, with no `[...]` placeholder markers between dropped regions. Useful when the downstream LLM is sensitive to gap markers. |
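
To make the effect of `disable_placeholders` concrete: with the default (`false`), dropped regions are marked with `[...]`, and a consumer that wants the kept spans individually can split on that marker. The sketch below assumes the marker appears literally as `[...]`; the sample string is made up for illustration and is not real API output.

```python
PLACEHOLDER = "[...]"  # marker named in the table; surrounding whitespace is an assumption


def kept_spans(compressed_context: str) -> list:
    """Split compressed output on gap markers and return the non-empty kept spans."""
    parts = compressed_context.split(PLACEHOLDER)
    return [part.strip() for part in parts if part.strip()]


# Made-up example of output with placeholders left enabled.
sample = "The primary mirror is 6.5 m across. [...] It has 18 hexagonal segments."
print(kept_spans(sample))
```

With `disable_placeholders=true` there is no marker to split on, so downstream code should not rely on span boundaries being recoverable.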

## latte_v2-only: dynamic compression ratio

`latte_v2` adds a single feature on top of the shared surface: **dynamic mode**. Instead of pinning a fixed `target_compression_ratio`, the model inspects the per-span score curve and picks an inflection-point (Kneedle elbow) ratio per input. Short, easy contexts get compressed aggressively; long, dense contexts back off. The chosen ratio is always inside `[dynamic_min_ratio, dynamic_max_ratio]`.

Use it when your inputs vary in difficulty and you would otherwise be guessing one ratio for everything. Stick with a fixed `target_compression_ratio` when you need a predictable token budget per call.
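
For intuition only, here is a simplified Kneedle-style elbow pick on a list of per-span scores. It is not the service's implementation, which is not documented on this page: the function name, the sorted cumulative-score curve, and the mapping from elbow position to an Nx ratio are all assumptions made for the illustration.

```python
from typing import Sequence


def elbow_ratio(scores: Sequence[float], min_ratio: float = 1.5, max_ratio: float = 10.0) -> float:
    """Pick an Nx ratio from per-span scores via a simplified elbow rule."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    if total <= 0:
        return max_ratio  # no span carries signal: compress as hard as allowed
    best_k, best_gap, running = 1, float("-inf"), 0.0
    for k, score in enumerate(ranked, start=1):
        running += score
        # Gap between the normalized cumulative-score curve and the diagonal y = x.
        gap = running / total - k / n
        if gap > best_gap:
            best_k, best_gap = k, gap
    ratio = n / best_k  # keeping best_k of n spans is roughly an (n / best_k)x ratio
    # Clamp into [min_ratio, max_ratio], as dynamic mode is documented to do.
    return min(max(ratio, min_ratio), max_ratio)


# Two high-scoring spans out of eight: the elbow lands after the second span.
print(elbow_ratio([0.9, 0.8, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]))  # 4.0
```

The clamp at the end is the part that corresponds directly to the documented behavior; everything before it is one of many ways an elbow could be located.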

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `dynamic` | boolean | no | **`latte_v2` only.** When `true`, the model picks the ratio per-input via Kneedle elbow selection inside `[dynamic_min_ratio, dynamic_max_ratio]` and `target_compression_ratio` is ignored. Rejected on `latte_v1` with `422`. |
| `dynamic_min_ratio` | number | no | **`latte_v2` only.** Floor on the chosen Nx ratio when `dynamic=true`: the elbow is never weaker than this. Must be `≥ 1.0`. Only consulted when `dynamic=true`. |
| `dynamic_max_ratio` | number | no | **`latte_v2` only.** Ceiling on the chosen Nx ratio when `dynamic=true`: the elbow is never more aggressive than this. Must be `≥ 1.0` and `≥ dynamic_min_ratio`. Only consulted when `dynamic=true`. |

> **Fixed vs dynamic — pick one per call**
> `target_compression_ratio` and `dynamic` are mutually exclusive *in effect*: sending both is not an error, but `dynamic=true` always wins and `target_compression_ratio` is silently ignored for that request. Switch per-call, not per-client.
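
Since a fixed ratio sent alongside `dynamic=true` is ignored without an error, it can help to make the effective mode explicit in client code. The sketch below restates the rules from this section for a `latte_v2` payload; `resolve_ratio_mode` is a hypothetical helper, and the fallback values are the documented defaults.

```python
def resolve_ratio_mode(payload: dict) -> dict:
    """Describe which ratio setting takes effect for a latte_v2 payload."""
    if payload.get("dynamic", False):
        lo = payload.get("dynamic_min_ratio", 1.5)   # documented default
        hi = payload.get("dynamic_max_ratio", 10.0)  # documented default
        if lo < 1.0 or hi < 1.0:
            raise ValueError("dynamic ratios must be >= 1.0")
        if hi < lo:
            raise ValueError("dynamic_max_ratio must be >= dynamic_min_ratio")
        # A fixed ratio sent together with dynamic=true is ignored by the API.
        ignored = ["target_compression_ratio"] if "target_compression_ratio" in payload else []
        return {"mode": "dynamic", "bounds": (lo, hi), "ignored": ignored}
    return {"mode": "fixed", "ratio": payload.get("target_compression_ratio"), "ignored": []}


print(resolve_ratio_mode({"dynamic": True, "target_compression_ratio": 0.5}))
```

Logging the `ignored` list is a cheap way to notice a client-level default ratio leaking into dynamic calls.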

## target_compression_ratio

`target_compression_ratio` controls how aggressive the compression is. It is interpreted **two different ways** depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table, and **both models share these semantics**.

| Value | Meaning | Example |
|---|---|---|
| `0 < r ≤ 1` | Removal strength | `0.5` removes ~50% of tokens |
| `r > 1` | Nx target (max `200`) | `4` → ~¼ original |
| omit | Model default | – |

Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
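
When you start from a token budget, the Nx form falls out of simple division. A minimal sketch, assuming you already have token counts from your own tokenizer; `nx_ratio_for_budget` is a hypothetical helper, and the result is a target, so the actual output size is approximate.

```python
from typing import Optional

MAX_NX_RATIO = 200  # documented upper bound for the Nx form


def nx_ratio_for_budget(original_tokens: int, budget_tokens: int) -> Optional[float]:
    """Return the Nx ratio that shrinks `original_tokens` to about `budget_tokens`.

    Returns None when the input already fits and no compression is needed.
    """
    if original_tokens <= 0 or budget_tokens <= 0:
        raise ValueError("token counts must be positive")
    if original_tokens <= budget_tokens:
        return None
    ratio = original_tokens / budget_tokens  # always > 1 here, so it reads as Nx
    if ratio > MAX_NX_RATIO:
        raise ValueError(f"budget needs {ratio:.0f}x, above the {MAX_NX_RATIO}x maximum")
    return ratio


print(nx_ratio_for_budget(8000, 2000))  # 4.0
```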

> **Bounds**
> `r = 0` is rejected with `422 Unprocessable Entity`. Values above `200` are rejected with the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.

> **Not a keep-fraction**
> `target_compression_ratio` is **removal strength** (when `0 < r ≤ 1`) or an **Nx target** (when `r > 1`), never a keep-fraction. `0.3` does **not** mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
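
The two interpretations can be written down as one function that maps a ratio to the approximate fraction of tokens kept. This is a sketch of the semantics described in this section, not SDK code; `expected_kept_fraction` is a hypothetical name, and rejecting negative values is an assumption, since only `0` and values above `200` are documented as rejected.

```python
def expected_kept_fraction(r: float) -> float:
    """Approximate fraction of tokens that survive for a given target_compression_ratio."""
    if r <= 0 or r > 200:
        raise ValueError("target_compression_ratio must satisfy 0 < r <= 200")
    if r <= 1:
        return 1.0 - r  # removal strength: 0.3 removes ~30%, so ~70% is kept
    return 1.0 / r      # Nx target: 4 keeps ~1/4 of the original


print(expected_kept_fraction(0.5))  # 0.5
print(expected_kept_fraction(4))    # 0.25
```

Note the discontinuity around `1`: a value just below it removes almost everything, while a value just above it removes almost nothing, so it is worth validating which form a caller intended.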

## Examples

Three call shapes, in order of how often you'll reach for them: `latte_v2` with a fixed ratio (the default recommendation), `latte_v2` with `dynamic=true` (for mixed-difficulty inputs), and `latte_v1` with the shared structural knobs (when you need the older backbone explicitly).

**Python**

```python
from compresr import CompressionClient

client = CompressionClient(api_key="cmp_...")
long_document = "..."  # the source text to compress

# 1. latte_v2 — fixed ratio. Start here.
result = client.compress(
    context=long_document,
    query="What is JWST's primary mirror diameter?",
    compression_model_name="latte_v2",
    target_compression_ratio=0.5,
)

# 2. latte_v2 — dynamic ratio (Kneedle elbow, bounded by min/max).
#    Use when inputs vary a lot in difficulty.
result_dyn = client.compress(
    context=long_document,
    query="What is JWST's primary mirror diameter?",
    compression_model_name="latte_v2",
    dynamic=True,
    dynamic_min_ratio=1.5,
    dynamic_max_ratio=10.0,
)

# 3. latte_v1 — shared structural knobs are also accepted on latte_v2.
result_v1 = client.compress(
    context=long_document,
    query="What is JWST's primary mirror diameter?",
    compression_model_name="latte_v1",
    target_compression_ratio=0.5,
    coarse=False,
    heuristic_chunking=True,
    disable_placeholders=True,
)

print(result.data.compressed_context)
```

**TypeScript**

```typescript
import { CompressionClient } from '@compresr/sdk';

const client = new CompressionClient({ apiKey: 'cmp_...' });
const longDocument = '...'; // the source text to compress

// 1. latte_v2 — fixed ratio. Start here.
const result = await client.compress({
  context: longDocument,
  query: "What is JWST's primary mirror diameter?",
  compressionModelName: 'latte_v2',
  targetCompressionRatio: 0.5,
});

// 2. latte_v2 — dynamic ratio (Kneedle elbow, bounded by min/max).
//    Use when inputs vary a lot in difficulty.
const resultDyn = await client.compress({
  context: longDocument,
  query: "What is JWST's primary mirror diameter?",
  compressionModelName: 'latte_v2',
  dynamic: true,
  dynamicMinRatio: 1.5,
  dynamicMaxRatio: 10.0,
});

// 3. latte_v1 — shared structural knobs are also accepted on latte_v2.
const resultV1 = await client.compress({
  context: longDocument,
  query: "What is JWST's primary mirror diameter?",
  compressionModelName: 'latte_v1',
  targetCompressionRatio: 0.5,
  coarse: false,
  heuristicChunking: true,
  disablePlaceholders: true,
});

console.log(result.data.compressed_context);
```

**cURL**

```bash
# 1. latte_v2 — fixed ratio. Start here.
curl -X POST https://api.compresr.ai/api/compress/question-specific/ \
  -H "X-API-Key: cmp_..." \
  -H "Content-Type: application/json" \
  -d '{
    "context": "...",
    "query": "What is JWST'"'"'s primary mirror diameter?",
    "compression_model_name": "latte_v2",
    "target_compression_ratio": 0.5
  }'

# 2. latte_v2 — dynamic ratio (Kneedle elbow, bounded by min/max).
curl -X POST https://api.compresr.ai/api/compress/question-specific/ \
  -H "X-API-Key: cmp_..." \
  -H "Content-Type: application/json" \
  -d '{
    "context": "...",
    "query": "What is JWST'"'"'s primary mirror diameter?",
    "compression_model_name": "latte_v2",
    "dynamic": true,
    "dynamic_min_ratio": 1.5,
    "dynamic_max_ratio": 10.0
  }'

# 3. latte_v1 — shared structural knobs are also accepted on latte_v2.
curl -X POST https://api.compresr.ai/api/compress/question-specific/ \
  -H "X-API-Key: cmp_..." \
  -H "Content-Type: application/json" \
  -d '{
    "context": "...",
    "query": "What is JWST'"'"'s primary mirror diameter?",
    "compression_model_name": "latte_v1",
    "target_compression_ratio": 0.5,
    "coarse": false,
    "heuristic_chunking": true,
    "disable_placeholders": true
  }'
```
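
If you are not using an SDK, the same request can be made with the Python standard library. The sketch below reuses the URL and `X-API-Key` header from the cURL examples; the helper names are hypothetical, and the shape of the `422` response body is not assumed, so it is surfaced as raw text.

```python
import json
import urllib.error
import urllib.request

API_URL = "https://api.compresr.ai/api/compress/question-specific/"  # from the cURL examples


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the same POST the cURL examples send, without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def compress(api_key: str, payload: dict) -> dict:
    """Send the request and turn a 422 into a ValueError carrying the response body."""
    try:
        with urllib.request.urlopen(build_request(api_key, payload), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 422:
            body = err.read().decode("utf-8", "replace")
            raise ValueError(f"rejected payload: {body}") from err
        raise  # other HTTP errors propagate unchanged
```

Calling `compress("cmp_...", {...})` with one of the JSON bodies above performs the network call; `build_request` alone is useful for inspecting what would be sent.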

## When to use these models

Both models shine on **query-shaped tasks** — workloads where you can name the intent the compressed output has to serve:

- **RAG**: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
- **Agent retrieval**: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
- **Search-result trimming**: collapse a list of long results to the parts relevant to the search query.
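
As a rough illustration of the RAG case, retrieved chunks can be joined into one `context` and compressed against the user's question before prompting the LLM. The sketch takes the compression call as a parameter so it runs offline; `compress_retrieved` and `echo_compress` are hypothetical names, and joining chunks with blank lines is an assumption rather than a requirement.

```python
from typing import Callable, Sequence


def compress_retrieved(
    chunks: Sequence[str],
    query: str,
    compress: Callable[..., str],
    ratio: float = 0.5,
) -> str:
    """Join retrieved chunks into one context and compress it against the query."""
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    if not context:
        return ""  # nothing retrieved, nothing to compress
    return compress(
        context=context,
        query=query,
        compression_model_name="latte_v2",
        target_compression_ratio=ratio,
    )


def echo_compress(**kwargs) -> str:
    # Stand-in for a real API call so the sketch runs offline; returns the context unchanged.
    return kwargs["context"]


print(compress_retrieved(["  chunk A ", "", "chunk B"], "What is A?", echo_compress))
```

With the Python SDK shown in the Examples section, `compress` could be a thin wrapper such as `lambda **kw: client.compress(**kw).data.compressed_context`.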

For deeper patterns and end-to-end examples, see the [query-specific compression guide](/docs/guides/query-specific).