Prompt caching vs context compression: use both
They are not rivals: cache the static prefix you keep resending, compress the unique bulk that changes every request, and together they cut more cost and latency than either alone.
They solve different problems
Prompt caching is about not re-paying for tokens you already sent. Context compression is about sending fewer tokens in the first place. Teams often conflate the two and pick one when they should use both.
Compression reduces what you send
Pass in a long context plus your query; get back a shorter context that keeps the answer-bearing tokens. Compression works on content that is unique per request, lowers cost and latency, and at light compression (around 2x) can improve accuracy by cutting noise.
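To make the input and output shape concrete, here is a deliberately naive sketch of query-aware compression. It is not how Compresr or any production compressor works: the function `compress_context`, its `keep_ratio` parameter, and the word-overlap scoring are all assumptions made for illustration. It only shows the contract described above: context and query go in, a shorter context comes out.

```python
import re

def compress_context(context: str, query: str, keep_ratio: float = 0.5) -> str:
    """Toy extractive compressor (illustrative only).

    Keeps the sentences that share the most words with the query,
    then returns them in their original document order.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    if not sentences:
        return ""

    query_words = set(re.findall(r"\w+", query.lower()))

    def overlap(sentence: str) -> int:
        # Number of distinct query words that also appear in the sentence.
        return len(query_words & set(re.findall(r"\w+", sentence.lower())))

    keep_n = max(1, int(len(sentences) * keep_ratio))
    # Rank by overlap (ties go to the earlier sentence), then restore order.
    ranked = sorted(range(len(sentences)), key=lambda i: (-overlap(sentences[i]), i))
    kept = sorted(ranked[:keep_n])
    return " ".join(sentences[i] for i in kept)


context = (
    "Revenue rose 12% in 2023. The office moved to Austin. "
    "Net income was $4M in 2023. The CEO likes golf."
)
shorter = compress_context(context, "What was net income in 2023?", keep_ratio=0.5)
print(shorter)
```

With `keep_ratio=0.5` the two sentences unrelated to the query are dropped and the sentence carrying the answer survives. A real compressor would score relevance with a model rather than word overlap.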
Caching reuses identical prefixes
When the same long prefix repeats across requests, the provider skips re-processing it. Three limits apply: caching helps only when the prefix actually repeats, cached tokens are still billed (at a discount), and caching does nothing when the context is unique per request.
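The "prefix actually repeats" condition can be sketched in a few lines. The helper below, `shared_prefix_len`, is a hypothetical name, and whitespace-split words stand in for real tokens. Actual providers apply their own eligibility rules on top of this, which the sketch ignores; it only shows that reuse stops at the first token that differs.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading tokens that two requests have in common."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break  # reuse ends at the first difference
        n += 1
    return n


system = "You are a contracts analyst. Answer only from the document.".split()
doc_a = "Lease A: term is five years.".split()
doc_b = "Lease B: term is two years.".split()

# Static prefix first, unique document last: the whole system prompt is reusable.
static_first = shared_prefix_len(system + doc_a, system + doc_b)

# Unique document first: the very first tokens differ only later, so little is reusable.
unique_first = shared_prefix_len(doc_a + system, doc_b + system)

print(static_first, len(system), unique_first)
```

In this toy case the static-first layout shares the full system prompt plus the one word both documents happen to start with, while the unique-first layout shares only that single word. That is why the static part belongs at the front of the prompt.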
Side-by-side comparison
The same dimensions, head to head. Compression is horizontal, meaning it sits in front of any model or provider; caching is implemented separately by each provider.
| Dimension | Context compression (Compresr) | Prompt caching |
|---|---|---|
| What it optimizes | How much you send: fewer tokens in the prompt | What you re-pay for: identical tokens you already sent |
| When it helps | On every request that carries bulk: the prompt gets smaller | Only when a long prefix repeats across requests |
| Unique-per-request content | Compresses it: that is the point | No benefit: nothing to reuse |
| Latency | Lower: fewer tokens to process | Lower on the cached prefix only |
| Accuracy effect | At ~2x, cutting noise can match or beat full context | None: identical tokens, identical answer |
| Cost on cached tokens | Tokens are removed, so they cost nothing | Cached tokens are discounted but still billed |
| Provider lock-in | Horizontal: works in front of any model/provider | Per-provider: each implements its own caching |
Figures were measured under our harness on single-shot long-document QA (FinanceBench, QMSum), where the full document is compressed before the answer model sees it; this is not a RAG pipeline. Dated 2026-04. Competitor numbers were measured at a matched compression ratio. Single-run accuracy deltas under ~2 points are within noise.
When prompt caching alone is enough
In a few real cases caching is the only tool you need, and we would not add compression there.
- A large, fixed prefix repeats verbatim. A long system prompt or fixed tool schema reused on every call: cache it and pay the discounted rate instead of full price. There is no unique bulk to compress.
- The unique part of each request is already tiny. If only a short user message changes and the rest is identical, compression has little to remove, and caching carries the win.
- You must keep every token verbatim. When the answer depends on the full text being present unchanged, caching reduces cost without touching content.
The moment a long, unique document, transcript, or retrieved context rides along on each request, caching stops helping, and that is the part compression shrinks. The best setup is usually caching for the static prefix and compression for the unique bulk.
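A back-of-the-envelope cost model shows why the two stack. Every number here is an assumption chosen for easy arithmetic, not a measured figure or any provider's price list: a 4,000-token static prefix, a 20,000-token unique document, $1 per million input tokens, cached tokens billed at 25% of the normal rate, and 2x compression. The function `input_cost` is a hypothetical helper; it assumes a warm cache hit and ignores output tokens and any cache-write charges.

```python
def input_cost(
    static_tokens: int,
    unique_tokens: int,
    price_per_token: float,
    *,
    cached_price_factor: float = 1.0,  # fraction of full price paid on the cached prefix
    compression_ratio: float = 1.0,    # unique bulk is divided by this
) -> float:
    """Input cost of one request under the stated assumptions."""
    prefix_cost = static_tokens * price_per_token * cached_price_factor
    bulk_cost = (unique_tokens / compression_ratio) * price_per_token
    return prefix_cost + bulk_cost


STATIC, UNIQUE = 4_000, 20_000   # hypothetical request shape
PRICE = 1.0 / 1_000_000          # hypothetical: $1 per million input tokens

scenarios = {
    "neither": input_cost(STATIC, UNIQUE, PRICE),
    "caching only": input_cost(STATIC, UNIQUE, PRICE, cached_price_factor=0.25),
    "compression only": input_cost(STATIC, UNIQUE, PRICE, compression_ratio=2.0),
    "both": input_cost(STATIC, UNIQUE, PRICE, cached_price_factor=0.25, compression_ratio=2.0),
}

for name, cost in scenarios.items():
    print(f"{name:>16}: ${cost:.3f} per request")
```

Under these assumptions the per-request input cost is $0.024 with neither, $0.021 with caching only, $0.014 with compression only, and $0.011 with both. Caching trims the small repeated part, compression trims the large unique part, and the savings add up because they apply to different tokens. Shift the ratio toward a large static prefix and a tiny unique part, and caching alone captures nearly all of the saving.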