Guides

Query-specific compression

How the latte models use your query to decide what to keep - and how to write queries that get you the right output.

Query-specific compression takes your long context plus a short query describing what the downstream LLM is being asked to do, and returns a shorter version of the context with the spans that answer that query kept and the rest dropped. Both latte_v1 and latte_v2 support this. latte_v1 requires query. latte_v2 treats it as optional and falls back to adaptive dynamic selection when no query is provided.

SDK default is latte_v1

The SDK's implicit compression_model_name is latte_v1. Every sample on this page passes compression_model_name="latte_v2" explicitly to showcase the newer backbone. Drop that argument and you're back on latte_v1 - which means query becomes required.

This guide covers what "query-specific" actually means, what makes a good query string, a working RAG-shaped example, and when query-specific compression is the wrong tool.

What query-specific means

A query-agnostic compressor has to guess what the downstream task is and keep "generally interesting" sentences. A query-specific compressor doesn't have to guess - you tell it. The latte models score spans of the context against the query and keep the ones with the highest signal. Spans that don't contribute to answering the query are dropped or shortened.

The query is not a prompt the model answers. It's a steering signal. The output is still your original context - just shorter, and biased toward what the query asked about.

For most LLM workloads this beats query-agnostic compression by a wide margin: you almost always know what the next step is going to do with the context (answer a user question, extract a field, summarize against a goal), and that knowledge is exactly what the compressor is missing without it.

What makes a good query

Treat the query like a single-line task description for the next step. Concrete beats vague. Specific beats abstract. Both are fine; what matters is that it tells the model what to preserve.

Good: "What was the project's Q3 churn rate?"
Good: "Find the renewal date in the customer contract."
Good: "Summarize the customer's stated reasons for canceling."
Weak: "churn" - single keyword, no intent
Weak: "important info" - no signal at all
Rejected: "" - both SDK schemas enforce min_length=1 and raise ValidationError before the request goes out. To opt out of query-aware behavior on latte_v2, omit the field entirely (Python: don't pass query=; TS: don't set query:). On latte_v1, omitting query fails server-side with a 422 that the SDK maps to ValidationError.

The model does not require a question mark. The last user message in a chat, the current sub-goal in an agent loop, or the task summary in a job spec all work as query values.

A working example

The input below is a multi-paragraph description of JWST. The query asks for one specific fact (the primary mirror's diameter). The compressor keeps the mirror clause and drops the orbit, sunshield, and operator details.

Response nullability

On error, result.data is None in Python / null in TypeScript. Check it before dereferencing compressed_context. Samples below omit the guard for readability.

python

A typical result on this paragraph is substantially smaller than the input. The mirror-diameter clause stays and the rest is cut. Exact numbers vary by run; what matters is that the kept span is the one that answers the query.

Tuning compression strength

target_compression_ratio controls how aggressively the model cuts. Values in 0 < r <= 1 are removal-strength (fraction of tokens to drop). Values > 1 are Nx-factor mode - e.g. 60 targets a 60× smaller output, and the server caps the factor at 200. Canonical bounds live in the Models reference. Two ends of that range, on the same JWST paragraph, look like this.

Gentle (0.5, remove about half):

python

At 0.5, supporting context (that JWST is a telescope, that it does infrared astronomy) tends to survive alongside the mirror clause. Good for answers that need a little surrounding context to feel grounded.

Aggressive (4, target roughly a quarter of the original):

python

At 4, the output collapses to the mirror clause itself. The LLM still has enough to answer, but nothing else. Good for extractive Q&A with tight token budgets.

Compared to raw truncation, query-specific compression keeps the coherent spans that answer the query instead of slicing off whatever happened to be at the end. That matters when the relevant sentence isn't conveniently in the first half of the document.

Building a RAG pipeline?

The RAG guide shows the full retrieval, compression, and LLM shape with LangChain, LlamaIndex, and direct vector-DB examples.