Glossary

Context compression, defined

A working vocabulary for shortening what large language models read. Rigorous, vendor-neutral definitions of the terms behind cheaper, faster, and more accurate use of LLM context.

Context compression
Context compression is the practice of shortening the text fed into a large language model (prompts, retrieved documents, chat history, or tool output) while preserving the information the model needs to produce the same answer.
Prompt compression
Prompt compression is a form of context compression that reduces the number of tokens in the prompt sent to a language model while keeping the instructions and content the model needs to respond correctly.
Compression ratio
Compression ratio is the factor by which a context is shortened: the original token count divided by the compressed token count. For example, a 4x ratio means the compressed context has roughly one quarter of the original tokens.
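The 4x example can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the ratio is computed as original tokens divided by compressed tokens; the function names are illustrative and do not come from any library.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Return how many times shorter the compressed context is (4.0 means 4x)."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return original_tokens / compressed_tokens


def fraction_removed(original_tokens: int, compressed_tokens: int) -> float:
    """Return the share of the original tokens that compression dropped."""
    if original_tokens <= 0 or compressed_tokens < 0:
        raise ValueError("token counts must be non-negative, original positive")
    return 1 - compressed_tokens / original_tokens


# An 8,000-token context compressed to 2,000 tokens.
ratio = compression_ratio(8000, 2000)    # 4.0, i.e. a 4x ratio
removed = fraction_removed(8000, 2000)   # 0.75, i.e. one quarter remains
```

Reporting both numbers can help avoid ambiguity, since "4x" and "75% fewer tokens" describe the same compression.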
Query-specific compression
Query-specific compression is context compression that conditions on the question being asked, keeping the spans relevant to that query and dropping the rest.
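As a rough illustration of conditioning on the query, the sketch below scores each span by how many words it shares with the question and keeps the best-scoring spans in their original order. The word-overlap score is a deliberately naive stand-in chosen for this example (real systems typically use a learned relevance model), and `compress_for_query` is a hypothetical name, not an existing API.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_for_query(spans: list[str], query: str, keep: int) -> list[str]:
    """Keep up to `keep` spans that share the most words with the query."""
    query_words = _words(query)
    # Score each span by its word overlap with the query (toy heuristic).
    scored = [(len(_words(span) & query_words), index)
              for index, span in enumerate(spans)]
    # Highest overlap first; ties go to the earlier span.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:keep]
    # Drop spans with no overlap, then restore document order.
    kept = sorted(index for score, index in top if score > 0)
    return [spans[index] for index in kept]


spans = [
    "The invoice is due on March 3.",
    "Our office has a new coffee machine.",
    "Late invoice payments incur a 2% fee.",
    "Team offsite photos are online.",
]
compressed = compress_for_query(spans, "When is the invoice due?", keep=2)
# compressed holds the two invoice-related spans, in their original order
```

The same spans would compress differently for a different question, which is the point of making compression query-specific: relevance is judged against the query rather than in isolation.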
Context rot
Context rot is the degradation in a language model’s answer quality as its context window fills with long, noisy, or irrelevant content, causing it to lose track of the information that actually matters.
Token (LLM)
A token is the unit of text a language model reads and generates, typically a word, sub-word, or character fragment produced by the model’s tokenizer.