LLM context compression benchmarks
Compresr compresses long context against your query, keeping the answer-bearing tokens and dropping the rest. On public long-document QA, light compression (~2x) matches or beats full-context accuracy. Past ~2x, compression becomes a deliberate cost-and-latency trade in which accuracy slips below baseline.
Two separate claims, never welded together: high compression ratios (~10x) save cost and latency; the accuracy win is a light-compression result. We publish the high-ratio degradation on purpose. All numbers below are from the public model latte_v1 (query-specific).
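To make the ratio figures concrete, the sketch below converts a compression ratio into the share of input tokens removed. It assumes the ratio means original tokens divided by compressed tokens, which this page does not state explicitly. `tokens_removed_fraction` is a hypothetical helper name, not part of any Compresr API.

```python
def tokens_removed_fraction(ratio: float) -> float:
    """Fraction of input tokens dropped at a given compression ratio.

    Assumption (not stated in the source):
    ratio = original_tokens / compressed_tokens.
    """
    if ratio < 1.0:
        raise ValueError("compression ratio must be >= 1.0")
    return 1.0 - 1.0 / ratio


# Ratios taken from the FinanceBench table on this page.
for r in (1.0, 1.9, 4.6, 8.9):
    print(f"{r}x -> {tokens_removed_fraction(r):.1%} of input tokens removed")
```

Under that assumption, 1.9x removes roughly 47% of input tokens and 8.9x roughly 89%. How token reduction translates into cost or latency depends on pricing and serving setup, which this sketch does not model.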
FinanceBench: SEC-filing QA
n = 128 / single-shot long-document QA / gpt-5.2 judge / 2026-04
| Config | Accuracy | Compression ratio | Evidence retrieval |
|---|---|---|---|
| Full context (baseline) | 73% | 1.0x | 91% |
| Light compression (best) | 77% | 1.9x | 91% |
| Medium compression (below baseline; cost/latency regime, not an accuracy win) | 70% | 4.6x | n/a |
| High compression (below baseline; cost/latency regime, not an accuracy win) | 65% | 8.9x | n/a |
Light compression (~1.9x) is the best configuration: 77% accuracy versus a 73% full-context baseline, with evidence retrieval held at 91%. Beyond ~2x, accuracy drops below baseline, and that regime is for squeezing cost and latency, not for maximizing answer quality.
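The comparison against baseline is simple arithmetic. A minimal sketch using the FinanceBench figures from the table follows; the function names and regime labels are illustrative choices, not part of any published tooling.

```python
BASELINE_PCT = 73  # full-context FinanceBench accuracy, in percent

# (config, accuracy %, compression ratio), copied from the table above
CONFIGS = [
    ("Light", 77, 1.9),
    ("Medium", 70, 4.6),
    ("High", 65, 8.9),
]


def delta_vs_baseline(accuracy_pct: int, baseline_pct: int = BASELINE_PCT) -> int:
    """Accuracy change relative to full context, in percentage points."""
    return accuracy_pct - baseline_pct


def regime(accuracy_pct: int, baseline_pct: int = BASELINE_PCT) -> str:
    """Label a config by whether it lands above the baseline."""
    if delta_vs_baseline(accuracy_pct, baseline_pct) > 0:
        return "above baseline"
    return "at or below baseline (cost/latency regime)"


for name, acc, ratio in CONFIGS:
    print(f"{name:<6} {ratio}x  {delta_vs_baseline(acc):+d} pp  {regime(acc)}")
```

This gives +4 pp for light compression, and -3 pp and -8 pp for the medium and high settings.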
QMSum: meeting-transcript QA
n = 272 / single-shot long-document QA / gpt-5.4-mini answerer / gpt-5.4 judge / 2026-04
| Config | Accuracy | Compression ratio |
|---|---|---|
| Full context (baseline) | 55.9% | 1.0x |
| Query-specific, light (best) | 59.6% | 1.87x |
| High compression (below baseline; cost/latency regime, not an accuracy win) | 42.6% | 8.76x |
Query-specific compression at ~1.87x reaches 59.6% versus a 55.9% baseline. At ~8.76x, accuracy falls to 42.6%, published as the honest cost-regime number.
vs. other methods at matched ~2x
QMSum accuracy at comparable ratios: ~2x for every method except LongLLMLingua, which is listed at ~3x. Compresr scores highest, ahead of scaledown, LongLLMLingua, LLMLingua-2, and Token Company.
| Method | QMSum accuracy | Ratio |
|---|---|---|
| Compresr (latte_v1, ours) | 59.6% | ~1.87x |
| scaledown | 57.4% | ~2x |
| LongLLMLingua | 53.7% | ~3x |
| LLMLingua-2 | 50.7% | ~2x |
| Token Company (ttc) | 48.2% | ~2x |
Methodology
- FinanceBench: Question answering over SEC filings. 128 samples, judged by gpt-5.2, run 2026-04. Measures whether the compressed context still supports a correct answer and retains the evidence span.
- QMSum: Query-focused QA over meeting transcripts. 272 samples, answered by gpt-5.4-mini and judged by gpt-5.4, run 2026-04.
- Single-shot long-document QA: The whole filing or transcript is compressed first, then sent to the model, versus sending it in full. These are not RAG comparisons; there is no retriever in the loop.
- Statistical rigor: Single-run deltas under 2pp sit within noise (5-run std ~0.9–1.5pp), so we frame light-compression results as “matched or beat” rather than a hard win. Publishing the high-ratio degradation is deliberate: it signals that these numbers are measured, not spun.
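The 2pp noise rule can be applied mechanically when reading any single-run comparison. The sketch below is one way to do that. The threshold comes from the text above, while `classify_delta` and its labels are hypothetical names chosen for illustration.

```python
NOISE_THRESHOLD_PP = 2.0  # from the text: single-run deltas under 2pp count as noise


def classify_delta(accuracy_pct: float, baseline_pct: float,
                   threshold_pp: float = NOISE_THRESHOLD_PP) -> tuple[float, str]:
    """Return (delta in percentage points, label) for a single-run result."""
    delta = accuracy_pct - baseline_pct
    if abs(delta) < threshold_pp:
        label = "within noise"
    elif delta > 0:
        label = "above baseline"
    else:
        label = "below baseline"
    return delta, label


# QMSum figures from this page: light (59.6) and high (42.6) vs. full context (55.9).
for name, acc in (("light", 59.6), ("high", 42.6)):
    delta, label = classify_delta(acc, 55.9)
    print(f"{name}: {delta:+.1f} pp -> {label}")
```

A fixed threshold is a coarse screen rather than a significance test. It only flags which single-run differences are too small to read anything into.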
Read deeper, try it
The public model is latte_v1 (query-specific), evaluated with the methodology and setup described above.
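For readers who want to reproduce the shape of the setup, the single-shot protocol described under Methodology can be sketched as a small evaluation loop. Everything here is an assumption made for illustration: `Sample`, `evaluate`, and the `toy_*` stand-ins are hypothetical and do not correspond to a real Compresr, answerer, or judge API. In practice you would swap in your own compressor, model call, and judging function.

```python
import re
from typing import Callable, Iterable, NamedTuple, Optional


class Sample(NamedTuple):
    document: str   # the full filing or transcript
    query: str
    reference: str  # gold answer used by the judge


def evaluate(samples: Iterable[Sample],
             answer: Callable[[str, str], str],
             judge: Callable[[str, str], bool],
             compress: Optional[Callable[[str, str], str]] = None) -> float:
    """Single-shot QA accuracy: optionally compress, then answer, then judge."""
    correct = total = 0
    for s in samples:
        # No retriever in the loop: the whole document is compressed or sent as-is.
        context = compress(s.document, s.query) if compress else s.document
        correct += bool(judge(answer(context, s.query), s.reference))
        total += 1
    if total == 0:
        raise ValueError("no samples to evaluate")
    return correct / total


# Toy stand-ins so the sketch runs without any external service.
def toy_compress(document: str, query: str) -> str:
    """Keep only sentences that share at least one word with the query."""
    q_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=\.)\s+", document)
    kept = [s for s in sentences if q_words & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)


def toy_answer(context: str, query: str) -> str:
    return context  # placeholder for a model call


def toy_judge(prediction: str, reference: str) -> bool:
    return reference.lower() in prediction.lower()


samples = [Sample("Revenue was 10 million. The office moved to Austin.",
                  "What was revenue?", "10 million")]
print("full context:", evaluate(samples, toy_answer, toy_judge))
print("compressed:  ", evaluate(samples, toy_answer, toy_judge, compress=toy_compress))
```

Running the same `evaluate` call with and without `compress` keeps the answerer and judge fixed, so the only variable is the context the model sees.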