Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two critical bottlenecks in LLM-driven multi-agent systems, namely KV cache spatial contention (where high-priority agents' caches are frequently evicted) and low temporal utilization (where prolonged tool-call latencies leave caches idle), this paper proposes a KV-cache-centric serving framework. The method introduces: (1) an agent-behavior-aware dynamic memory partitioning mechanism that enforces cache priority for mission-critical agents; and (2) a proactive cache offloading and prefetching strategy guided by predicted function-call waiting times, enabling spatiotemporal coordination of cache resource allocation. Evaluated on representative multi-agent benchmarks, the framework reduces end-to-end latency by 47.06% and improves effective GPU memory utilization by 16.9% over vLLM.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in complex multi-agent applications that use external function calls. This workload creates severe performance challenges for the KV cache: space contention leads to the eviction of critical agents' caches, and time underutilization leaves the caches of agents stalled on long-running tool calls idle in GPU memory. We present Tokencake, a KV-cache-centric serving framework that co-optimizes scheduling and memory management with an agent-aware design. Tokencake's Space Scheduler uses dynamic memory partitioning to shield critical agents from contention, while its Time Scheduler employs a proactive offload and predictive upload mechanism to repurpose GPU memory during function-call stalls. Our evaluation on representative multi-agent benchmarks shows that Tokencake can reduce end-to-end latency by over 47.06% and improve effective GPU memory utilization by up to 16.9% compared to vLLM.
Problem

Research questions and friction points this paper is trying to address.

Addresses KV cache space contention in multi-agent LLM applications
Solves GPU memory underutilization during agent tool call stalls
Optimizes scheduling and memory management for LLM-based agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic memory partitioning shields critical agents from contention
Proactive offload and upload repurposes GPU memory during stalls
Co-optimizes scheduling and memory management for multi-agent applications
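The offload-during-stall idea above can be sketched in a few lines. This is a hypothetical illustration, not Tokencake's actual implementation: the function name, the predicted-wait input, and the PCIe bandwidth figure are all assumptions. The core decision is that offloading a stalled agent's KV cache only pays off when the predicted tool-call wait exceeds the round-trip cost of copying the cache off the GPU and back.

```python
# Hypothetical sketch of a Time-Scheduler-style offload decision.
# Assumption: the serving framework can predict how long a function
# call will stall an agent, and knows the host<->GPU copy bandwidth.

def should_offload(kv_bytes: int,
                   predicted_wait_s: float,
                   copy_bw_bytes_per_s: float = 16e9) -> bool:
    """Offload pays off only if the stall outlasts the transfer round trip."""
    transfer_s = kv_bytes / copy_bw_bytes_per_s  # one-way copy time
    return predicted_wait_s > 2 * transfer_s     # offload now + upload later

# Example: a 4 GiB KV cache, tool call predicted to take 2 seconds.
print(should_offload(4 * 2**30, 2.0))  # → True: the stall dwarfs the copy
```

In a real scheduler the upload would also be started early enough (based on the same wait-time prediction) that the cache is back on the GPU before the tool call returns, hiding the transfer latency entirely.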