Public, dated, methodology-complete

LLM context compression benchmarks

Name: Compresr context compression: FinanceBench results
Creator: Compresr
Published: 2026-04

Compresr compresses long context against your query, keeping the answer-bearing tokens and dropping the rest. On public long-document QA, light compression (~2x) matches or beats full-context accuracy, and past ~2x it becomes a deliberate cost-and-latency trade, where accuracy slips below baseline.

Two separate claims, never welded together: high compression ratios (~10x) save cost and latency; the accuracy win is a light-compression result. We publish the high-ratio degradation on purpose. All numbers below are from the public model latte_v1 (query-specific).

FinanceBench: SEC-filing QA

n = 128 / single-shot long-document QA / gpt-5.2 judge / 2026-04

Compresr FinanceBench results: accuracy, compression ratio, and evidence retrieval across compression levels.
Config	Accuracy	Compression ratio	Evidence retrieval	Notes
Full context	73%	1.0x	91%	Baseline
Light compression	77%	1.9x	91%	Best
Medium compression	70%	4.6x	n/a	Below baseline; cost/latency regime, not an accuracy win
High compression	65%	8.9x	n/a	Below baseline; cost/latency regime, not an accuracy win

Light compression (~1.9x) is the best configuration: 77% accuracy versus a 73% full-context baseline, with evidence retrieval held at 91%. Beyond ~2x, accuracy drops below baseline, and that regime is for squeezing cost and latency, not for maximizing answer quality.
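To make the cost side of that trade concrete, here is a minimal sketch that converts a compression ratio into the share of input tokens removed. It assumes the ratio is defined as original tokens divided by compressed tokens, which the page does not state explicitly; the function name `fraction_removed` is illustrative only.

```python
def fraction_removed(ratio: float) -> float:
    """Share of input tokens dropped.

    Assumption: ratio = original_tokens / compressed_tokens.
    """
    if ratio < 1.0:
        raise ValueError("compression ratio must be >= 1.0")
    return 1.0 - 1.0 / ratio


# FinanceBench ratios from the table above.
for name, ratio in [("light", 1.9), ("medium", 4.6), ("high", 8.9)]:
    print(f"{name}: {ratio}x -> {fraction_removed(ratio):.1%} of tokens removed")
```

Under that assumption, 1.9x removes roughly 47% of the input tokens, 4.6x roughly 78%, and 8.9x roughly 89%, so most of the token savings are already in hand well before the highest ratio.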

QMSum: meeting-transcript QA

n = 272 / single-shot long-document QA / gpt-5.4-mini answerer / gpt-5.4 judge / 2026-04

Compresr QMSum results: accuracy and compression ratio across compression levels.
Config	Accuracy	Compression ratio	Notes
Full context	55.9%	1.0x	Baseline
Query-specific (light)	59.6%	1.87x	Best
High compression	42.6%	8.76x	Below baseline; cost/latency regime, not an accuracy win

Query-specific compression at ~1.87x reaches 59.6% versus a 55.9% baseline. At ~8.76x, accuracy falls to 42.6%, which we publish as the cost-regime number rather than as an accuracy result.
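The deltas against baseline can be read against the 2pp single-run noise band given in the Methodology section. The sketch below does that arithmetic for the reported FinanceBench and QMSum numbers; the `classify` helper and its labels are illustrative, not part of any Compresr tooling.

```python
NOISE_BAND_PP = 2.0  # threshold from the Methodology note on single-run deltas


def classify(accuracy_pct: float, baseline_pct: float,
             band_pp: float = NOISE_BAND_PP):
    """Return (delta in percentage points, label) for one configuration."""
    delta = round(accuracy_pct - baseline_pct, 1)
    if abs(delta) < band_pp:
        return delta, "within noise"
    return delta, ("above baseline" if delta > 0 else "below baseline")


# (name, accuracy %, baseline %) as reported in the tables on this page.
runs = [
    ("FinanceBench light", 77.0, 73.0),
    ("FinanceBench medium", 70.0, 73.0),
    ("FinanceBench high", 65.0, 73.0),
    ("QMSum light", 59.6, 55.9),
    ("QMSum high", 42.6, 55.9),
]

for name, acc, base in runs:
    delta, label = classify(acc, base)
    print(f"{name}: {delta:+.1f}pp ({label})")
```

A band check like this only says whether a single-run delta clears the stated threshold; it is not a significance test.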

vs. other methods at matched ~2x

QMSum accuracy at comparable ratios. At ~2x, Compresr tops scaledown, LLMLingua-2, and Token Company. It also tops LongLLMLingua, which is reported at ~3x rather than at a matched ratio.

QMSum accuracy by compression method at comparable ratios (~2x, except LongLLMLingua at ~3x).
Method	QMSum accuracy	Ratio
Compresr (latte_v1, ours)	59.6%	~1.87x
scaledown	57.4%	~2x
LongLLMLingua	53.7%	~3x
LLMLingua-2	50.7%	~2x
Token Company (ttc)	48.2%	~2x

Methodology

FinanceBench: Question answering over SEC filings. 128 samples, judged by gpt-5.2, run 2026-04. Measures whether the compressed context still supports a correct answer and retains the evidence span.
QMSum: Query-focused QA over meeting transcripts. 272 samples, answered by gpt-5.4-mini and judged by gpt-5.4, run 2026-04.
Single-shot long-document QA: The whole filing or transcript is compressed first, then sent to the model, versus sending it in full. These are not RAG comparisons; there is no retriever in the loop.
Statistical rigor: Single-run deltas under 2pp sit within noise (5-run std ~0.9–1.5pp), so we frame light-compression results as “matched or beat” rather than a hard win. Publishing the high-ratio degradation is deliberate: it signals that these numbers are measured, not spun.
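
The single-shot setup described above (compress the whole document against the query, answer once, judge once, no retriever) can be sketched as a small evaluation loop. Everything here is hypothetical scaffolding: `Sample`, `evaluate`, and the three callables stand in for the compressor, the answering model, and the judge model, and none of them reflect Compresr's actual API or harness.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Sample(NamedTuple):
    document: str   # full filing or transcript
    question: str
    reference: str  # gold answer


def evaluate(
    samples: Iterable[Sample],
    answer_fn: Callable[[str, str], str],        # (context, question) -> answer
    judge_fn: Callable[[str, str, str], bool],   # (question, answer, reference) -> correct?
    compress_fn: Optional[Callable[[str, str], str]] = None,  # (document, question) -> context
) -> float:
    """Return accuracy in [0, 1]. compress_fn=None is the full-context baseline."""
    correct = 0
    total = 0
    for s in samples:
        # Single-shot: the whole document is compressed once, then sent as-is.
        if compress_fn is None:
            context = s.document
        else:
            context = compress_fn(s.document, s.question)
        answer = answer_fn(context, s.question)
        correct += bool(judge_fn(s.question, answer, s.reference))
        total += 1
    if total == 0:
        raise ValueError("no samples to evaluate")
    return correct / total
```

Running the same `answer_fn` and `judge_fn` with and without `compress_fn` gives the baseline and compressed accuracies on identical samples, which is the comparison the tables report.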

Read deeper, try it

The public model is latte_v1 (query-specific), evaluated with the methodology and setup described above.
