Compression ratio

Compression ratio is the factor by which a context is shortened. For example, a 4x ratio means the compressed context has roughly one quarter of the original tokens.

Compression ratio measures input reduction. It is commonly expressed as a multiple (2x, 4x, 9x) or as a percentage of tokens removed (50%, 75%, ~90%). A higher ratio means fewer tokens sent to the model, which directly lowers cost and latency.

Ratio and answer quality are different axes and must not be conflated. A high ratio (such as ~10x, i.e. removing about 90% of tokens) is a cost-and-latency claim; it does not by itself imply the answers stay as good. In published Compresr benchmarks the best answer quality appears at light compression (around 2x), while very aggressive ratios can fall below the full-context baseline.

Choosing a ratio is a trade-off. Light compression is the safe default when accuracy is paramount; heavier compression suits cost-sensitive or latency-critical workloads where some quality headroom exists. The right setting depends on the task, the model, and how answer-bearing the source content is.

Compresr lets you target a compression ratio per request, so you can dial between maximum savings and maximum fidelity for a given workload.

Related terms