Cache candidates
Repeated system prompts, stable RAG context, agent tool preambles, and long instruction blocks are natural candidates, because the same tokens recur unchanged across many requests.
KV cache layer
Design a KV cache layer by first settling five things: token-reuse assumptions, a cache-key strategy, freshness windows, storage tiers, and miss-pattern monitoring.
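One way to test a token-reuse assumption before building anything is to replay a sample of tokenized requests and measure how many tokens sit in a prefix already seen in an earlier request. The sketch below is illustrative only: `shared_prefix_len` and `reuse_ratio` are hypothetical helper names, and it assumes reuse happens only on exact token prefixes, which may not match how a given serving stack reuses cache entries.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_ratio(requests: list[list[int]]) -> float:
    """Fraction of all tokens covered by the longest prefix shared
    with any earlier request (assumption: prefix-only reuse)."""
    total = 0
    reused = 0
    for i, req in enumerate(requests):
        total += len(req)
        if i > 0:
            reused += max(shared_prefix_len(req, prev) for prev in requests[:i])
    return reused / total if total else 0.0


# Toy token ids standing in for a shared system prompt plus varying user input.
sample = [[1, 2, 3, 4], [1, 2, 3, 9], [1, 2, 7]]
print(f"estimated reuse ratio: {reuse_ratio(sample):.2f}")
```

A low ratio on real traffic suggests the candidates listed earlier are less stable than assumed, which is cheaper to learn from a replay than from production misses.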
Use GPU memory for hot reuse, CPU RAM for short-term reuse, SSD for lower-cost warm cache, and remote storage for durable reuse.
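The tiering above can be sketched as a chain of LRU stores in which an entry evicted from a hotter tier is demoted to the next colder one, and a hit in a colder tier is promoted back to the hottest. This is a minimal in-memory model, not a real GPU or SSD backend: the `TieredCache` class, the tier names, and the per-tier capacities counted in entries (rather than bytes) are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Optional

# Hottest to coldest; names mirror the tiers described in the text.
TIERS = ["gpu", "cpu", "ssd", "remote"]


class TieredCache:
    def __init__(self, capacities: dict[str, int]):
        # Capacity is counted in entries here; a real layer would budget bytes.
        self.capacities = capacities
        self.tiers: dict[str, OrderedDict] = {name: OrderedDict() for name in TIERS}

    def put(self, key: str, value: Any, tier: str = "gpu") -> None:
        # Keep at most one copy of a key across all tiers.
        for store in self.tiers.values():
            store.pop(key, None)
        store = self.tiers[tier]
        store[key] = value
        if len(store) > self.capacities[tier]:
            # Evict the least recently used entry and demote it one tier.
            old_key, old_value = store.popitem(last=False)
            nxt = TIERS.index(tier) + 1
            if nxt < len(TIERS):
                self.put(old_key, old_value, TIERS[nxt])
            # Entries evicted from the coldest tier are dropped.

    def get(self, key: str) -> Optional[tuple[Any, str]]:
        """Return (value, tier it was found in), or None on a miss."""
        for name in TIERS:
            store = self.tiers[name]
            if key in store:
                value = store.pop(key)
                self.put(key, value, "gpu")  # promote on hit
                return value, name
        return None


cache = TieredCache({"gpu": 1, "cpu": 1, "ssd": 1, "remote": 1})
cache.put("system-prompt", "kv-blocks-a")
cache.put("tool-preamble", "kv-blocks-b")  # demotes "system-prompt" to cpu
print(cache.get("system-prompt"))
```

Returning the tier alongside the value makes it easy to count hits per tier, which is one input for deciding whether the capacity split is right.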
Define invalidation rules, tenant boundaries, and observability before routing live requests.
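As a rough illustration of those three concerns together, the sketch below folds the tenant id into the cache key so entries cannot be shared across tenants, expires entries after a freshness window, and counts hits, misses, and stale reads. The names `cache_key` and `ScopedCache`, the choice of SHA-256 over a JSON payload, and the time-based invalidation rule are assumptions for the example, not a prescribed design.

```python
import hashlib
import json
import time
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional


def cache_key(tenant_id: str, model_id: str, token_ids: list[int]) -> str:
    """Key scoped by tenant and model, so identical prompts from
    different tenants never resolve to the same entry."""
    payload = json.dumps([tenant_id, model_id, token_ids]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Entry:
    value: Any
    stored_at: float
    ttl_seconds: float


class ScopedCache:
    def __init__(self) -> None:
        self.entries: dict[str, Entry] = {}
        self.stats: Counter = Counter()  # hit / miss / stale counts

    def put(self, key: str, value: Any, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.entries[key] = Entry(value, now, ttl_seconds)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.time() if now is None else now
        entry = self.entries.get(key)
        if entry is None:
            self.stats["miss"] += 1
            return None
        if now - entry.stored_at > entry.ttl_seconds:
            # Freshness window elapsed: invalidate and report as stale.
            del self.entries[key]
            self.stats["stale"] += 1
            return None
        self.stats["hit"] += 1
        return entry.value


cache = ScopedCache()
key = cache_key("tenant-a", "model-x", [1, 2, 3])
cache.put(key, "kv-blocks", ttl_seconds=60)
print(cache.get(key), dict(cache.stats))
```

Separating stale reads from plain misses in the counters helps tell a freshness window that is too short apart from traffic that simply does not repeat.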