Glossary
Context compression, defined
A working vocabulary for shortening what large language models read. Rigorous, vendor-neutral definitions of the terms used when trimming LLM context for lower cost, lower latency, and better accuracy.
- Context compression
- Context compression is the practice of shortening the text fed into a large language model (prompts, retrieved documents, chat history, or tool output) while preserving the information the model needs to produce the same answer it would give from the full text.
- Prompt compression
- Prompt compression is a form of context compression that reduces the number of tokens in the prompt sent to a language model while keeping the instructions and content the model needs to respond correctly.
- Compression ratio
- Compression ratio is the factor by which a context is shortened: the original token count divided by the compressed token count. For example, a 4x ratio means the compressed context has roughly one quarter of the original tokens.
- Query-specific compression
- Query-specific compression is context compression that conditions on the question being asked, keeping the spans relevant to that query and dropping the rest.
- Context rot
- Context rot is the degradation in a language model’s answer quality as its context window fills with long, noisy, or irrelevant content, causing it to lose track of the information that actually matters.
- Token (LLM)
- A token is the unit of text a language model reads and generates, typically a word, sub-word, or character fragment produced by the model’s tokenizer.
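To make the definitions of compression ratio and query-specific compression concrete, here is a minimal Python sketch. It rests on loud assumptions: whitespace-separated words stand in for tokens (a real tokenizer would give different counts), and the keyword-overlap filter in the hypothetical `compress_for_query` is only an illustrative heuristic, not how any particular compressor works. The function names and sample text are invented for this example.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # A model's own tokenizer would usually produce a different count.
    return len(text.split())


def compression_ratio(original: str, compressed: str) -> float:
    # Ratio = original token count / compressed token count.
    compressed_tokens = count_tokens(compressed)
    if compressed_tokens == 0:
        raise ValueError("compressed context is empty")
    return count_tokens(original) / compressed_tokens


def compress_for_query(context: str, query: str) -> str:
    # Hypothetical heuristic: keep only sentences that share at least
    # one word with the query, and drop the rest.
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    kept = [
        sentence
        for sentence in sentences
        if query_words & set(re.findall(r"\w+", sentence.lower()))
    ]
    return " ".join(kept)


context = (
    "Paris is the capital of France. "
    "Bananas are very rich in potassium. "
    "The Eiffel Tower opened in 1889. "
    "Cats sleep most of the day."
)
query = "capital France"

compressed = compress_for_query(context, query)
ratio = compression_ratio(context, compressed)
print(compressed)
print(f"{ratio:.1f}x")
```

Under these assumptions the 24-word context shrinks to the single 6-word sentence about Paris, which is a 4x ratio. A practical system would replace both the word count and the overlap filter with a real tokenizer and a relevance model.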