Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing KV-cache management approaches struggle with dynamic agent workflows, where the sequence of invoked agents depends on each task's context, so cache reuse is inefficient. This work proposes PBKV, a serving system that brings prediction to this setting. PBKV fuses patterns from historical workflows with the context of the current task to forecast agent calls several steps ahead, uses those forecasts to estimate the reuse potential of cached entries, and keeps high-potential entries in GPU memory. To stay robust to prediction errors, it applies the predictions conservatively during both cache eviction and prefetching. On three workflow benchmarks, PBKV achieves up to 1.85× speedup over LRU on dynamic workflows and up to 1.26× speedup over the state-of-the-art baseline KVFlow on the static workflow.
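The summary does not say how the historical and contextual signals are combined, so the following is only a minimal sketch of one possible reading. It assumes a first-order transition table counted from past traces and a per-agent context score supplied by the caller, blended with a hypothetical weight `alpha`. The names `build_transitions`, `predict_steps`, `context_scores`, and `alpha` are all illustrative and not taken from PBKV.

```python
from collections import Counter, defaultdict


def build_transitions(histories):
    """Count agent-to-agent transitions observed in past workflow traces."""
    counts = defaultdict(Counter)
    for trace in histories:
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1
    return counts


def normalize(scores):
    """Scale non-negative scores so they sum to 1 (empty dict if all zero)."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total > 0 else {}


def predict_steps(counts, current_agent, context_scores, steps=3, alpha=0.5):
    """Return one probability distribution over agents per future step.

    alpha weights the historical signal and (1 - alpha) the context signal.
    Both the blend and the Markov assumption are illustrative choices.
    """
    context = normalize(context_scores)
    dist = {current_agent: 1.0}
    predictions = []
    for _ in range(steps):
        # Propagate the current distribution through historical transitions.
        hist = defaultdict(float)
        for agent, p in dist.items():
            for nxt, q in normalize(counts.get(agent, {})).items():
                hist[nxt] += p * q
        # Blend the historical estimate with the context-derived scores.
        agents = set(hist) | set(context)
        fused = normalize({
            a: alpha * hist.get(a, 0.0) + (1 - alpha) * context.get(a, 0.0)
            for a in agents
        })
        predictions.append(fused)
        dist = fused
    return predictions


histories = [
    ["planner", "coder", "tester"],
    ["planner", "coder", "reviewer"],
    ["planner", "searcher", "coder"],
]
counts = build_transitions(histories)
preds = predict_steps(counts, "planner", {"coder": 1.0}, steps=2)
```

Here `preds[0]` ranks `coder` highest because both the history and the context score point to it. A real predictor would likely use a richer model of the task context than a fixed score vector.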

📝 Abstract

LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at the agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic: the sequence of invoked agents, and thus the induced cache reuse opportunities, depend on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (Prediction-Based KV-Cache Management). For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and the context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching. Experiments on three workflow benchmarks show that PBKV achieves up to 1.85× speedup over LRU on dynamic workflows, and up to 1.26× speedup over the SOTA baseline KVFlow on the static workflow.
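To make the eviction idea concrete, here is a hedged sketch of prediction-guided eviction with an LRU fallback. The abstract does not define "conservatively", so this sketch assumes one plausible interpretation: a predicted reuse score only protects an entry when it clears a threshold, and all other entries are evicted in plain LRU order. `CacheEntry`, `reuse_score`, `choose_victims`, `discount`, and `min_score` are hypothetical names, not PBKV's API.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    agent: str      # agent whose prefix this KV-cache entry serves
    size: int       # memory footprint, in arbitrary units
    last_used: int  # logical timestamp of the most recent access


def reuse_score(agent, step_dists, discount=0.5):
    """Discounted probability that `agent` is invoked in the predicted steps."""
    return sum((discount ** k) * d.get(agent, 0.0)
               for k, d in enumerate(step_dists))


def choose_victims(entries, step_dists, needed, min_score=0.2):
    """Pick entries to evict until `needed` units of memory are freed.

    Scores below `min_score` are treated as zero (assumed conservative rule),
    so weakly predicted entries fall back to least-recently-used ordering.
    """
    def key(entry):
        score = reuse_score(entry.agent, step_dists)
        trusted = score if score >= min_score else 0.0
        return (trusted, entry.last_used)

    victims, freed = [], 0
    for entry in sorted(entries, key=key):
        if freed >= needed:
            break
        victims.append(entry)
        freed += entry.size
    return victims


entries = [
    CacheEntry("coder", 4, last_used=1),
    CacheEntry("tester", 4, last_used=5),
    CacheEntry("searcher", 4, last_used=3),
]
# Per-step predicted distributions over agents (illustrative values).
step_dists = [{"coder": 0.8, "searcher": 0.1}, {"tester": 0.3}]
victims = choose_victims(entries, step_dists, needed=8)
```

With these inputs, plain LRU would evict the `coder` entry first because it is the oldest, whereas the sketch keeps it since its predicted reuse is high, and evicts `searcher` and then `tester` instead.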

Problem

Research questions and friction points this paper is trying to address.

Dynamic Agent Workflows

KV-Cache Management

Cache Reuse

LLM-based Workflows

Prediction-based Serving

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-Cache Management

Dynamic Agent Workflows

Prediction-based Caching