AI Summary
Traditional exact-match caching fails for large language models when prompts are structurally similar yet semantically variant, while semantic caching suffers from erroneous matches. To address this, we propose a generative caching method, the first to explicitly identify and reuse response patterns. Our approach integrates prompt structure analysis, discrepancy-aware response pattern extraction, and generative response synthesis, enabling fine-grained modeling of minor prompt variations and on-demand generation of tailored responses. Evaluated in agent workflows, our method achieves roughly 20% higher cache hit rate and 34% lower end-to-end latency than state-of-the-art baselines, while reducing false hits to near zero. This demonstrates a principled trade-off between high hit efficiency and strong robustness against semantic drift.
Abstract
Large Language Models (LLMs) are increasingly used to plan, reason, and execute tasks across diverse scenarios. In use cases such as repeatable workflows and agentic settings, prompts for recurring tasks are often reused with minor variations while sharing a similar structure. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce a generative cache that produces variation-aware responses for structurally similar prompts. Our method identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that it achieves an 83% cache hit rate while producing minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves cache hit rate by ~20% and reduces end-to-end execution latency by ~34% compared to standard prompt matching.
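To make the idea concrete, here is a minimal sketch of a template-keyed generative cache. It is not the paper's implementation: all names are hypothetical, and it assumes, for illustration only, that the variable parts of a prompt are numeric spans, whereas the actual method performs learned prompt structure analysis and discrepancy-aware pattern extraction.

```python
import re
from typing import Optional

# Illustrative assumption: variable spans are runs of digits.
SLOT = re.compile(r"\d+")


class TemplateCache:
    """Toy cache keyed by prompt structure rather than exact text."""

    def __init__(self):
        # template -> response pattern with {0}, {1}, ... placeholders
        self._patterns = {}

    def _split(self, prompt: str):
        """Abstract a prompt into a structural template plus its variable parts."""
        slots = SLOT.findall(prompt)
        template = SLOT.sub("<slot>", prompt)
        return template, slots

    def store(self, prompt: str, response: str) -> None:
        """Extract a reusable response pattern from one (prompt, response) pair."""
        template, slots = self._split(prompt)
        pattern = response
        for i, value in enumerate(slots):
            # Replace each prompt-specific value in the response with a placeholder.
            pattern = pattern.replace(value, "{%d}" % i)
        self._patterns[template] = pattern

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a tailored response for a structurally similar prompt, or None."""
        template, slots = self._split(prompt)
        pattern = self._patterns.get(template)
        if pattern is None:
            return None  # structural miss -> fall back to calling the LLM
        # "Generative" step: specialize the cached pattern to this prompt's values.
        return pattern.format(*slots)


# Usage: a second, structurally similar prompt hits the cache and gets a
# customized response, even though the exact text was never seen before.
cache = TemplateCache()
cache.store("Summarize ticket 101", "Summary for ticket 101: done")
print(cache.lookup("Summarize ticket 202"))
```

The sketch captures only the structural-hit behavior; it has none of the method's safeguards against false hits when prompts differ in semantically critical (non-slot) ways.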