AI Summary
Traditional exact-match caching fails for large language models when prompts are structurally similar yet semantically variant, while semantic caching suffers from erroneous matches. To address this, we propose a generative caching method, the first to explicitly identify and reuse response patterns. Our approach integrates prompt structure analysis, discrepancy-aware response pattern extraction, and generative response synthesis, enabling fine-grained modeling of minor prompt variations and on-demand generation of tailored responses. Evaluated in agent workflows, our method achieves roughly 20% higher cache hit rate and 34% lower end-to-end latency than state-of-the-art baselines, while reducing false hits to near zero. This demonstrates a principled trade-off between high hit efficiency and strong robustness against semantic drift.
Abstract
Large Language Models (LLMs) are increasingly used to plan, reason, and execute tasks across diverse scenarios. In use cases such as repeatable workflows and agentic settings, prompts for recurring tasks are often reused with minor variations while sharing a similar structure. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce a generative cache that produces variation-aware responses for structurally similar prompts. Our method identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that it achieves an 83% cache hit rate while producing minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves cache hit rate by ~20% and reduces end-to-end execution latency by ~34% compared to standard prompt matching.
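To make the idea concrete, here is a minimal sketch of a template-keyed generative cache. It is not the paper's implementation: all names are hypothetical, and it assumes, for illustration only, that the variable parts of a prompt are numeric spans, whereas the actual method performs learned prompt structure analysis and discrepancy-aware pattern extraction.

```python
import re
from typing import Optional

# Illustrative assumption: variable spans are runs of digits.
SLOT = re.compile(r"\d+")


class TemplateCache:
    """Toy cache keyed by prompt structure rather than exact text."""

    def __init__(self):
        # template -> response pattern with {0}, {1}, ... placeholders
        self._patterns = {}

    def _split(self, prompt: str):
        """Abstract a prompt into a structural template plus its variable parts."""
        slots = SLOT.findall(prompt)
        template = SLOT.sub("<slot>", prompt)
        return template, slots

    def store(self, prompt: str, response: str) -> None:
        """Extract a reusable response pattern from one (prompt, response) pair."""
        template, slots = self._split(prompt)
        pattern = response
        for i, value in enumerate(slots):
            # Replace each prompt-specific value in the response with a placeholder.
            pattern = pattern.replace(value, "{%d}" % i)
        self._patterns[template] = pattern

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a tailored response for a structurally similar prompt, or None."""
        template, slots = self._split(prompt)
        pattern = self._patterns.get(template)
        if pattern is None:
            return None  # structural miss -> fall back to calling the LLM
        # "Generative" step: specialize the cached pattern to this prompt's values.
        return pattern.format(*slots)


# Usage: a second, structurally similar prompt hits the cache and gets a
# customized response, even though the exact text was never seen before.
cache = TemplateCache()
cache.store("Summarize ticket 101", "Summary for ticket 101: done")
print(cache.lookup("Summarize ticket 202"))
```

The sketch captures only the structural-hit behavior; it has none of the method's safeguards against false hits when prompts differ in semantically critical (non-slot) ways.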