Glossary

Context compression, defined

A working vocabulary for shortening what large language models read. Rigorous, vendor-neutral definitions of the terms behind cheaper, faster, more accurate LLM context.

Context compression: Context compression is the practice of shortening the text fed into a large language model (prompts, retrieved documents, chat history, or tool output) while preserving the information the model needs to produce the same answer.
Prompt compression: Prompt compression is a form of context compression that reduces the number of tokens in the prompt sent to a language model while keeping the instructions and content the model needs to respond correctly.
Compression ratio: Compression ratio is the factor by which a context is shortened, measured as the original token count divided by the compressed token count. For example, a 4x ratio means the compressed context has roughly one quarter of the original tokens.
Query-specific compression: Query-specific compression is context compression that conditions on the question being asked, keeping the spans relevant to that query and dropping the rest.
Context rot: Context rot is the degradation in a language model’s answer quality as its context window fills with long, noisy, or irrelevant content, causing it to lose track of the information that actually matters.
Token (LLM): A token is the unit of text a language model reads and generates, typically a word, sub-word, or character fragment produced by the model’s tokenizer.
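
Worked example: the compression ratio and query-specific compression entries above can be made concrete with a small Python sketch. It is illustrative only and rests on two loud assumptions: whitespace-separated words stand in for real tokenizer tokens, and "relevance" is reduced to naive keyword overlap with the query. The function names (count_tokens, compression_ratio, compress_for_query) are hypothetical and do not come from any particular library; a production compressor would use the model's own tokenizer and a far better relevance signal.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # A real tokenizer would usually produce a different count.
    return len(text.split())


def compression_ratio(original: str, compressed: str) -> float:
    # Ratio = original token count / compressed token count.
    compressed_tokens = count_tokens(compressed)
    if compressed_tokens == 0:
        raise ValueError("compressed context is empty")
    return count_tokens(original) / compressed_tokens


def compress_for_query(context: str, query: str) -> str:
    # Naive query-specific compression (hypothetical heuristic):
    # keep only sentences that share a word longer than three
    # characters with the query, and drop the rest.
    query_words = {
        w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 3
    }
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    kept = [
        s for s in sentences
        if query_words & set(re.findall(r"[a-z0-9]+", s.lower()))
    ]
    return " ".join(kept)


context = (
    "The invoice total is 420 dollars. "
    "The office has a new coffee machine. "
    "Parking is free on weekends. "
    "Lunch is served daily at noon."
)
query = "What is the invoice total?"

compressed = compress_for_query(context, query)
print(compressed)                              # The invoice total is 420 dollars.
print(compression_ratio(context, compressed))  # 4.0
```

Here the original context has 24 whitespace tokens and the compressed context has 6, giving a 4x ratio: the compressed context is one quarter of the original length while still containing the sentence needed to answer the query.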