Public, dated, methodology-complete

LLM context compression benchmarks

Compresr compresses long context against your query, keeping the answer-bearing tokens and dropping the rest. On public long-document QA, light compression (~2x) matches or beats full-context accuracy. Past ~2x, accuracy slips below baseline and compression becomes a deliberate cost-and-latency trade.

Two separate claims, never welded together: high compression ratios (~10x) save cost and latency; the accuracy win is a light-compression result. We publish the high-ratio degradation on purpose. All numbers below are from the public model latte_v1 (query-specific).

FinanceBench: SEC-filing QA

n = 128 / single-shot long-document QA / gpt-5.2 judge / 2026-04

Compresr FinanceBench results: accuracy, compression ratio, and evidence retrieval across compression levels.
| Config | Accuracy | Compression ratio | Evidence retrieval | Note |
|---|---|---|---|---|
| Full context (baseline) | 73% | 1.0x | 91% | |
| Light compression | 77% | 1.9x | 91% | Best |
| Medium compression | 70% | 4.6x | n/a | Below baseline. Cost / latency regime, not an accuracy win |
| High compression | 65% | 8.9x | n/a | Below baseline. Cost / latency regime, not an accuracy win |

Light compression (~1.9x) is the best configuration: 77% accuracy versus a 73% full-context baseline, with evidence retrieval held at 91%. Beyond ~2x, accuracy drops below baseline, and that regime is for squeezing cost and latency, not for maximizing answer quality.
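The trade in the table above can be made concrete with a little arithmetic. The sketch below copies the four FinanceBench rows and computes each configuration's accuracy delta against the baseline, plus the share of context tokens removed. It assumes the compression ratio means original tokens divided by compressed tokens, which the page does not state explicitly; `ROWS` and `summarize` are illustrative names, not part of any Compresr API.

```python
# Rows copied from the FinanceBench table: (config, accuracy %, compression ratio).
ROWS = [
    ("Full context (baseline)", 73.0, 1.0),
    ("Light compression", 77.0, 1.9),
    ("Medium compression", 70.0, 4.6),
    ("High compression", 65.0, 8.9),
]


def summarize(rows):
    """Return (config, accuracy delta in pp, fraction of context tokens removed)."""
    baseline_acc = rows[0][1]  # the first row is the full-context baseline
    out = []
    for name, acc, ratio in rows:
        delta_pp = acc - baseline_acc
        # ASSUMPTION: ratio = original tokens / compressed tokens, so the
        # compressed context is 1/ratio of the original size.
        tokens_removed = 1.0 - 1.0 / ratio
        out.append((name, delta_pp, tokens_removed))
    return out


for name, delta_pp, removed in summarize(ROWS):
    print(f"{name:<26} {delta_pp:+5.1f} pp  {removed:6.1%} of context tokens removed")
```

Under that assumption, light compression at 1.9x removes roughly 47% of context tokens while sitting 4pp above baseline, and high compression at 8.9x removes roughly 89% at 8pp below it.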

QMSum: meeting-transcript QA

n = 272 / single-shot long-document QA / gpt-5.4-mini answerer / gpt-5.4 judge / 2026-04

Compresr QMSum results: accuracy and compression ratio across compression levels.
| Config | Accuracy | Compression ratio | Note |
|---|---|---|---|
| Full context (baseline) | 55.9% | 1.0x | |
| Query-specific (light) | 59.6% | 1.87x | Best |
| High compression | 42.6% | 8.76x | Below baseline. Cost / latency regime, not an accuracy win |

Query-specific compression at ~1.87x reaches 59.6% versus a 55.9% baseline. At ~8.76x, accuracy falls to 42.6%, published as the honest cost-regime number.

vs. other methods at matched ~2x

QMSum accuracy at comparable ratios: ~2x for every method except LongLLMLingua, which is listed at ~3x. On this comparison, Compresr tops scaledown, LongLLMLingua, LLMLingua-2, and Token Company.

QMSum accuracy by compression method at a matched ~2x ratio.
| Method | QMSum accuracy | Ratio | Note |
|---|---|---|---|
| Compresr (latte_v1) | 59.6% | ~1.87x | Us |
| scaledown | 57.4% | ~2x | |
| LongLLMLingua | 53.7% | ~3x | |
| LLMLingua-2 | 50.7% | ~2x | |
| Token Company (ttc) | 48.2% | ~2x | |

Methodology

FinanceBench
Question answering over SEC filings. 128 samples, judged by gpt-5.2, run 2026-04. Measures whether the compressed context still supports a correct answer and retains the evidence span.
QMSum
Query-focused QA over meeting transcripts. 272 samples, answered by gpt-5.4-mini and judged by gpt-5.4, run 2026-04.
Single-shot long-document QA
The whole filing or transcript is compressed first, then sent to the model, versus sending it in full. These are not RAG comparisons; there is no retriever in the loop.
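The single-shot setup can be sketched in a few lines: build one prompt from the full document and one from the compressed document, with no retriever in between. Everything named here is an assumption for illustration. `toy_compress` is a naive word-overlap filter standing in for a query-specific compressor (it is not latte_v1), `count_tokens` is a crude whitespace count rather than a model tokenizer, and the four-sentence filing is invented toy text.

```python
import re


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real run would use the model's tokenizer."""
    return len(text.split())


def toy_compress(document: str, query: str, keep: int) -> str:
    """Hypothetical stand-in for a query-specific compressor (NOT latte_v1).

    Keeps the `keep` sentences that share the most words with the query,
    emitted in their original order.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: len(query_words & set(re.findall(r"\w+", sentences[i].lower()))),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # restore document order
    return " ".join(sentences[i] for i in kept)


def build_prompt(context: str, query: str) -> str:
    """Single-shot prompt: the whole context goes in at once, no retrieval step."""
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Invented toy text, for illustration only.
document = (
    "Revenue for fiscal 2023 was 4.2 billion dollars. "
    "The company opened three new offices. "
    "Net income for fiscal 2023 was 610 million dollars. "
    "The board met four times during the year."
)
query = "What was net income in fiscal 2023?"

full_prompt = build_prompt(document, query)        # baseline arm
compressed = toy_compress(document, query, keep=2)
compressed_prompt = build_prompt(compressed, query)  # compressed arm

ratio = count_tokens(document) / count_tokens(compressed)
print(f"compression ratio: {ratio:.2f}x")
print(compressed_prompt)
```

In an evaluation like the one described, both prompts would go to the same answerer model and the same judge, so the only variable between the two arms is the context.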
Statistical rigor
Single-run deltas under 2pp sit within noise (5-run std ~0.9–1.5pp), so we frame light-compression results as “matched or beat” rather than a hard win. Publishing the high-ratio degradation is deliberate: it signals that these numbers are measured, not spun.
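As a rough illustration of that framing, the sketch below labels a single-run delta using the 2pp noise threshold stated above. It also includes a textbook binomial standard error for an accuracy measured over n questions. That second helper is our addition, not part of the published methodology, and it measures sampling error over the question set, which is a different quantity from the 5-run standard deviation. The names `frame_delta` and `binomial_se_pp` are hypothetical.

```python
import math

# From the text: single-run deltas under 2pp sit within noise.
NOISE_THRESHOLD_PP = 2.0


def frame_delta(baseline_pct: float, compressed_pct: float) -> str:
    """Label a single-run accuracy delta relative to the noise threshold."""
    delta = compressed_pct - baseline_pct
    if abs(delta) < NOISE_THRESHOLD_PP:
        return "matched (within noise)"
    return "beat baseline in this run" if delta > 0 else "below baseline"


def binomial_se_pp(accuracy_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    ASSUMPTION: questions are treated as independent pass/fail trials.
    This is not the run-to-run std quoted in the methodology.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


print(frame_delta(73.0, 77.0))   # FinanceBench light compression
print(frame_delta(55.9, 42.6))   # QMSum high compression
print(f"FinanceBench baseline SE: {binomial_se_pp(73.0, 128):.1f} pp")
print(f"QMSum baseline SE: {binomial_se_pp(55.9, 272):.1f} pp")
```

With sample sizes of 128 and 272, the sampling error on each accuracy figure is a few percentage points, which is one more reason to read the light-compression results as "matched or beat" rather than as a decisive margin.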

Read deeper, try it

The public model is latte_v1 (query-specific), evaluated with the methodology and setup described above.