Context compression

Context compression is the practice of shortening the text fed into a large language model (prompts, retrieved documents, chat history, or tool output) while preserving the information the model needs to produce the same answer.

Context compression operates on the input side of a language model. Rather than changing the model or its weights, it rewrites or prunes the context window so that fewer tokens carry the same task-relevant signal. The goal is to keep the answer-bearing content and drop redundant, boilerplate, or off-topic tokens before the model ever reads them.
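As a rough illustration of pruning context before the model reads it, the sketch below keeps only the sentences that share the most words with the query. It is a toy heuristic under stated assumptions, not how any particular compression product works: the names `tokenize` and `compress_context` and the `keep` parameter are hypothetical, and word overlap is a crude stand-in for task relevance.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compress_context(context, query, keep=2):
    """Keep the `keep` sentences that share the most words with the query.

    Hypothetical helper for illustration only: real systems use far
    stronger relevance signals than raw word overlap.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    query_terms = set(tokenize(query))

    # Score each sentence by how many distinct query words it contains.
    scored = [
        (len(query_terms & set(tokenize(sentence))), index)
        for index, sentence in enumerate(sentences)
    ]

    # Take the highest-scoring sentences; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:keep]

    # Restore original document order so the kept text still reads naturally.
    kept_indices = sorted(index for _, index in top)
    return " ".join(sentences[i] for i in kept_indices)


context = (
    "The Eiffel Tower is in Paris. "
    "Click here to subscribe to our newsletter. "
    "The tower is 330 metres tall. "
    "Cookies help us improve the site."
)
compressed = compress_context(context, "How tall is the Eiffel Tower?", keep=2)
```

Here the two boilerplate sentences are dropped and the two sentences about the tower survive in their original order. The model and its weights are untouched; only the input text changes.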

It is distinct from output-side techniques. Context compression does not summarize the model’s response; it reduces what goes in. Because most LLM cost and latency scale with input length, compressing context cuts spend and speeds up responses on long-document, retrieval, and agentic workloads.

Compression is effectively lossless at low ratios and lossy at high ones. Light compression (roughly 2x) often preserves, and can even improve, answer quality by removing distracting tokens, while aggressive compression trades quality for cost and latency. These two effects, the change in answer quality and the savings in cost and latency, should be evaluated separately.
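To make a ratio such as 2x concrete, the sketch below assumes the common convention that the compression ratio is the original token count divided by the compressed token count. The function names are hypothetical. The savings figure covers input tokens only; it says nothing about answer quality, which has to be measured on its own.

```python
def compression_ratio(original_tokens, compressed_tokens):
    """Return original / compressed, so 8000 -> 4000 tokens is a ratio of 2.0."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return original_tokens / compressed_tokens


def input_token_savings(ratio):
    """Fraction of input tokens removed at a given ratio: 1 - 1/ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


ratio = compression_ratio(8000, 4000)   # 2.0
savings = input_token_savings(ratio)    # 0.5, i.e. half the input tokens removed
```

Because the savings follow 1 - 1/ratio, returns diminish quickly: going from 2x to 4x removes only a further quarter of the original tokens, which is one reason to weigh any quality loss at high ratios against a comparatively small extra saving.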

Compresr is a context-compression API: you send long context plus your query, and get back a shorter context that keeps the answer-bearing tokens. It composes with prompt caching, long-context models, vector databases, and rerankers rather than replacing them.

Related terms