Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the underexplored challenge of efficiently leveraging the prompt caching mechanisms offered by large language model (LLM) providers to reduce API costs and first-token latency in long-horizon, multi-turn agent tasks. We systematically evaluate the caching strategies of OpenAI, Anthropic, and Google on the DeepResearch Bench benchmark, providing the first quantitative analysis of prompt caching benefits in complex agent workflows. To mitigate the performance degradation caused by full-context caching, we propose dynamic content layout and cache block control methods. Through experiments on over 500 sessions, each featuring a 10,000-token system prompt, we compare three caching strategies and demonstrate that judicious caching can reduce API costs by 41–80% and improve first-token response speed by 13–31%. Our findings also reveal significant differences in caching behavior across platforms, offering practical guidance for real-world deployment.

📝 Abstract
Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies: full-context caching, system-prompt-only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearch Bench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 41–80% and improves time to first token by 13–31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. An ablation study across prompt sizes (500–50,000 tokens) and tool call counts (3–50) demonstrates consistent linear cost and TTFT benefits once the provider's minimum cacheable token count is exceeded, and reveals provider-specific strategy discrepancies across variants. We provide nuanced discussion and guidance for implementing prompt caching in production agentic systems.
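The cache block control the abstract recommends, keeping the stable prefix first and pushing dynamic content past the cache breakpoint, can be sketched with Anthropic-style `cache_control` breakpoints. This is an illustrative layout only: the function name, example strings, and payload shapes here are ours, not the paper's.

```python
# Sketch of cache-friendly prompt layout: the static system prompt gets a
# cache breakpoint, and dynamic content (dates, session state) is appended
# AFTER it so the cached prefix never changes between turns.
# Anthropic's Messages API accepts content blocks of this shape; the
# specific prompts below are made up for illustration.

def build_system_blocks(static_prompt: str, dynamic_suffix: str = "") -> list[dict]:
    """Return system content blocks with the cache breakpoint placed
    at the end of the static portion only."""
    blocks = [
        {
            "type": "text",
            "text": static_prompt,
            # Everything up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    if dynamic_suffix:
        # Dynamic content sits outside the cached prefix, so changing it
        # per turn does not invalidate the cache.
        blocks.append({"type": "text", "text": dynamic_suffix})
    return blocks


blocks = build_system_blocks(
    "You are a research agent with web search tools...",  # stable, ~10k tokens in the paper
    "Today's date is 2026-01-09.",                        # changes across sessions
)
```

Naive full-context caching would instead put the breakpoint after the latest tool result; because tool results change every turn, the cached prefix is rewritten each time, which is the pattern the paper finds can paradoxically increase latency.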
Problem

Research questions and friction points this paper is trying to address.

prompt caching
agentic tasks
LLM agents
cost reduction
latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt caching
agentic tasks
cost optimization
latency reduction
LLM evaluation
Elias Lumer
PricewaterhouseCoopers, U.S.
Faheem Nizar
PricewaterhouseCoopers, U.S.
Akshaya Jangiti
PricewaterhouseCoopers, U.S.
Kevin Frank
PricewaterhouseCoopers, U.S.
Anmol Gulati
Researcher, Google DeepMind
M. Phadate
PricewaterhouseCoopers, U.S.
V. K. Subbiah
PricewaterhouseCoopers, U.S.