Skip to content
Compresr docs

Guides

Query-specific compression

How latte_v1 uses your query to decide what to keep - and how to write queries that get you the right output.

latte_v1 is a query-specific compressor: it takes your long context plus a short query describing what the downstream LLM is being asked to do, and it returns a shorter version of the context with the spans that answer that query kept and the rest dropped.

This guide covers what "query-specific" actually means, what makes a good query string, a working RAG-shaped example, and when query-specific compression is the wrong tool.

What query-specific means

A query-agnostic compressor has to guess what the downstream task is and keep "generally interesting" sentences. A query-specific compressor doesn't have to guess - you tell it. latte_v1 scores spans of the context against the query and keeps the ones with the highest signal. Spans that don't contribute to answering the query are dropped or shortened.

The query is not a prompt the model answers. It's a steering signal. The output is still your original context - just shorter, and biased toward what the query asked about.

For most LLM workloads this beats query-agnostic compression by a wide margin: you almost always know what the next step is going to do with the context (answer a user question, extract a field, summarize against a goal), and that knowledge is exactly what the compressor is missing without it.

What makes a good query

Treat the query like a single-line task description for the next step. Concrete beats vague. Specific beats abstract. Both are fine; what matters is that it tells the model what to preserve.

  • Good: "What was the project's Q3 churn rate?"
  • Good: "Find the renewal date in the customer contract."
  • Good: "Summarize the customer's stated reasons for canceling."
  • Weak: "churn" - single keyword, no intent
  • Weak: "important info" - no signal at all
  • Weak: "" - degrades to query-agnostic behavior

The model does not require a question mark. The last user message in a chat, the current sub-goal in an agent loop, or the task summary in a job spec all work as query values.

A working example

The input below is a multi-paragraph description of JWST. The query asks for one specific fact (the primary mirror's diameter). latte_v1 keeps the mirror clause and drops the orbit, sunshield, and operator details.

python

A typical result on this paragraph: roughly 110 input tokens shrink to around 30 output tokens. The mirror-diameter clause stays and the rest is cut. Exact numbers vary by run; what matters is that the kept span is the one that answers the query.

Tuning compression strength

target_compression_ratio controls how aggressively the model cuts. The canonical bounds - the 0 < r <= 1 removal-strength range and the r > 1 Nx range - live in the Models reference. Two ends of that range, on the same JWST paragraph, look like this.

Gentle (0.5, remove about half):

python

At 0.5, supporting context (that JWST is a telescope, that it does infrared astronomy) tends to survive alongside the mirror clause. Good for answers that need a little surrounding context to feel grounded.

Aggressive (4, target roughly a quarter of the original):

python

At 4, the output collapses to the mirror clause itself. The LLM still has enough to answer, but nothing else. Good for extractive Q&A with tight token budgets.

Compared to raw truncation, query-specific compression keeps the coherent spans that answer the query instead of slicing off whatever happened to be at the end. That matters when the relevant sentence isn't conveniently in the first half of the document.

Building a RAG pipeline?

The RAG guide shows the full retrieval, compression, and LLM shape with LangChain, LlamaIndex, and direct vector-DB examples.