Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

๐Ÿ“… 2025-11-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
In multi-turn LLM agents, tool invocations disrupt KV cache continuity, causing redundant prefilling and scheduling bubbles. Method: This paper proposes Continuum, a tool-aware, program-level scheduling system. Its core innovations are (i) a KV cache time-to-live (TTL) mechanism driven by predicted tool-execution time, enabling fine-grained cache-lifetime management, and (ii) program-level FIFO scheduling coupled with dynamic GPU memory residency control to keep multi-turn execution uninterrupted. Continuum also supports DRAM offloading and heterogeneous hardware deployment. Results: Evaluated on the real-world SWE-Bench and BFCL benchmarks using Llama-3.1 (8B/70B), Continuum significantly reduces average job completion time while sustaining high throughput and low latency across diverse memory configurations.

๐Ÿ“ Abstract
Agentic LLM applications interleave LLM generation requests with tool calls. These tool calls break the continuity of the workflow by creating pauses between LLM requests, bringing many challenges for the serving system, especially under multi-turn scenarios. Each pause potentially causes KV cache eviction and extra waiting time before entering the continuous batch for the following LLM request. Since these pauses happen on every call, the problem becomes increasingly severe as the number of turns grows for agentic programs. Previous works either fail to incorporate information from the tool call, evicting KV cache and causing repetitive prefill or loading, or ignore the continuity of a multi-turn program, creating waiting time between turns that increases per-request latency. We present Continuum, a serving system that optimizes job completion time for multi-turn agent workloads by combining tool-aware KV cache timeouts with program-level scheduling. By predicting tool call durations in agentic workflows, Continuum selectively pins the KV cache in GPU memory with a time-to-live value based on the total turn number. Combined with program-level first-come-first-served scheduling, Continuum prevents scheduling bubbles, preserves multi-turn continuity, and optimizes throughput for complex agentic workflows. By modeling tool-call variability and agent-program continuity, Continuum outperforms state-of-the-art baselines. Our evaluation on real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models shows that Continuum significantly improves average job completion times and remains performant across different hardware setups and DRAM offloading schemes. Preview code is available at: https://github.com/Hanchenli/vllm-continuum
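To illustrate the program-level first-come-first-served idea described above, here is a minimal sketch: later turns of a program re-enter the scheduler with the program's original arrival order rather than joining the back of the queue, which is how turn-to-turn waiting bubbles are avoided. All class and method names here are illustrative assumptions, not the Continuum codebase's actual API.

```python
import heapq
import itertools

class ProgramFIFO:
    """Program-level FCFS: every turn of an agentic program is scheduled
    using the program's ORIGINAL arrival order, preserving continuity
    across tool-call pauses (sketch; names are assumptions)."""

    def __init__(self):
        self._counter = itertools.count()   # global arrival sequence
        self._arrival = {}                  # program_id -> arrival number
        self._ready = []                    # min-heap of (arrival, program_id)

    def admit(self, program_id):
        """First turn of a new program: record and enqueue its arrival order."""
        self._arrival[program_id] = next(self._counter)
        heapq.heappush(self._ready, (self._arrival[program_id], program_id))

    def resume(self, program_id):
        """A later turn (after a tool call) re-enters with its original
        priority instead of the queue tail, so no scheduling bubble forms."""
        heapq.heappush(self._ready, (self._arrival[program_id], program_id))

    def next(self):
        """Pop the program whose turn should run next, or None if idle."""
        return heapq.heappop(self._ready)[1] if self._ready else None
```

Under plain request-level FCFS, a resumed turn would queue behind every request that arrived during its tool call; here it keeps its place, which is the continuity property the abstract emphasizes.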
Problem

Research questions and friction points this paper is trying to address.

Optimizes KV cache management for multi-turn LLM agents
Reduces scheduling delays between tool calls and LLM requests
Improves job completion time for complex agentic workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses tool-aware KV cache timeout for scheduling
Implements program-level first-come-first-serve scheduling
Predicts tool call durations to optimize cache retention
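The cache-retention bullet above can be sketched as a simple TTL policy: when a request pauses for a tool call, pin its KV cache only if the predicted tool duration is short enough to be worth the GPU memory, and evict it once the TTL lapses without the agent returning. The threshold, names, and eviction-vs-offload choice are assumptions for illustration, not Continuum's actual policy code.

```python
import heapq

class KVCacheTTL:
    """Tool-aware KV cache time-to-live (sketch). Short predicted tool
    calls keep their cache pinned in GPU memory; long ones release it
    (evict or, in a fuller system, offload to DRAM)."""

    def __init__(self, pin_budget_s=10.0):
        self.pin_budget_s = pin_budget_s    # assumed cutoff, in seconds
        self._pinned = []                   # min-heap of (expiry_time, request_id)

    def on_tool_call(self, request_id, now, predicted_tool_s):
        """Decide cache fate when a request pauses for a tool call."""
        if predicted_tool_s <= self.pin_budget_s:
            # Worth pinning: set a TTL equal to the predicted duration.
            heapq.heappush(self._pinned, (now + predicted_tool_s, request_id))
            return "pin"
        return "release"    # long tool call: free GPU memory immediately

    def expired(self, now):
        """Return request_ids whose TTL elapsed without the agent resuming;
        their cache can now be evicted."""
        out = []
        while self._pinned and self._pinned[0][0] <= now:
            out.append(heapq.heappop(self._pinned)[1])
        return out
```

The payoff of pinning is avoiding the repeated prefill the abstract describes; the TTL bounds how long a mispredicted or stalled tool call can hold GPU memory hostage.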
Hanchen Li, UC Berkeley
Qiuyang Mang, UC Berkeley
Runyuan He, UC Berkeley
Qizheng Zhang, Stanford University
Huanzhi Mao, UC Berkeley
Xiaokun Chen, TensorMesh, Inc.
Alvin Cheung, UC Berkeley
Joseph Gonzalez, UC Berkeley
Ion Stoica, UC Berkeley