Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

📅 2026-05-07

🤖 AI Summary
Existing KV-cache management approaches struggle with the context-dependent invocation sequences of dynamic agent workflows, which leads to inefficient cache reuse. This work proposes PBKV, a system that applies prediction to dynamic agent workflows. PBKV fuses historical workflow patterns with the current task context to forecast agent calls several steps ahead, then uses these forecasts to estimate the reuse value of cached entries and keep high-value entries in GPU memory. To stay robust to prediction errors, the system uses the predictions conservatively during both cache eviction and prefetching. On three workflow benchmarks, PBKV achieves up to 1.85× speedup over LRU on dynamic workflows and up to 1.26× speedup over the state-of-the-art baseline KVFlow on the static workflow.
📝 Abstract
LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, so KV-Cache reuse can save computation. Existing approaches either manage the KV-Cache at the agent level and fail to exploit reuse opportunities within workflows, or manage the cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic: the sequence of invoked agents, and thus the cache reuse opportunities it induces, depends on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (Prediction-Based KV-Cache Management). For each workflow, PBKV predicts the agent invocations over several future steps by fusing guidance from historical workflows with the context of the target workflow. Based on these predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV uses the predictions conservatively during both cache eviction and prefetching. Experiments on three workflow benchmarks show that PBKV achieves up to 1.85× speedup over LRU on dynamic workflows and up to 1.26× speedup over the SOTA baseline KVFlow on the static workflow.
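The pipeline described in the abstract (fuse two prediction sources, score cache entries by reuse potential, evict conservatively) can be illustrated with a small sketch. This is not PBKV's actual algorithm, which the abstract does not specify. The convex-combination fusion rule, the per-step discount, the score margin with LRU fallback, and every function and parameter name below are assumptions chosen only for illustration.

```python
from collections import defaultdict


def fuse_predictions(history_probs, context_probs, weight=0.5):
    """Blend two next-agent distributions with a convex combination.

    Assumption: the fusion rule and `weight` are hypothetical.
    """
    agents = set(history_probs) | set(context_probs)
    return {
        a: weight * history_probs.get(a, 0.0)
        + (1.0 - weight) * context_probs.get(a, 0.0)
        for a in agents
    }


def reuse_potential(step_predictions, discount=0.5):
    """Score each agent's cache entry over several predicted future steps.

    step_predictions[k] maps agent -> probability of being invoked at
    step k. Farther steps count less (assumed geometric discount).
    """
    scores = defaultdict(float)
    for step, probs in enumerate(step_predictions):
        for agent, p in probs.items():
            scores[agent] += (discount ** step) * p
    return dict(scores)


def choose_victim(cache_last_used, scores, margin=0.2):
    """Pick a cache entry to evict, using predictions conservatively.

    The prediction is trusted only when the lowest-scored entry is below
    the runner-up by at least `margin`; otherwise fall back to plain LRU.
    cache_last_used maps agent -> last-use timestamp.
    """
    ranked = sorted(cache_last_used, key=lambda a: scores.get(a, 0.0))
    if len(ranked) >= 2:
        gap = scores.get(ranked[1], 0.0) - scores.get(ranked[0], 0.0)
        if gap >= margin:
            return ranked[0]
    # Prediction is not decisive: evict the least recently used entry.
    return min(cache_last_used, key=cache_last_used.get)


if __name__ == "__main__":
    step0 = fuse_predictions(
        {"coder": 0.6, "tester": 0.4},      # from historical workflows
        {"coder": 0.2, "reviewer": 0.8},    # from the current task context
    )
    scores = reuse_potential([step0, {"tester": 1.0}])
    last_used = {"coder": 3, "tester": 1, "reviewer": 2}
    print(choose_victim(last_used, scores))
```

The fallback in `choose_victim` is one simple way to make a policy degrade to LRU when predictions are ambiguous; a prefetching path could apply a similar confidence threshold before loading an entry back into GPU memory.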
Problem

Research questions and friction points this paper is trying to address.

Dynamic Agent Workflows
KV-Cache Management
Cache Reuse
LLM-based Workflows
Prediction-based Serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-Cache Management
Dynamic Agent Workflows
Prediction-based Caching
LLM Serving
Cache Reuse