KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

📅 2025-10-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In multi-agent large language model (LLM) systems, inconsistent context prefixes across agents prevent cross-agent KV-cache reuse, forcing redundant recomputation of overlapping content and incurring high inference overhead. This paper proposes a training-free, efficient inference framework that, for the first time, explicitly models and resolves the KV-cache misalignment problem in multi-agent settings. The approach dynamically identifies shared subsequences via an online anchor pool, enabling prefix-agnostic, cross-agent KV-cache alignment and reuse. It requires no architectural modification or model fine-tuning, and supports real-time cache updates and dynamic context expansion. Experiments on multi-agent dialogue and collaborative reasoning tasks demonstrate a cache reuse rate of over 70%; in a five-agent scenario, the method achieves up to a 7.8x speedup over the standard prefill pipeline, with time-to-first-token (TTFT) dropping from ~430 ms to ~55 ms, without any accuracy degradation.
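The reported TTFT figures are consistent with the quoted speedup, which can be checked directly (a trivial sanity check of the summary's own numbers, not new data):

```python
# Numbers reported in the summary: TTFT drops from ~430 ms to ~55 ms.
ttft_baseline_ms = 430
ttft_kvcomm_ms = 55

speedup = ttft_baseline_ms / ttft_kvcomm_ms
print(f"TTFT speedup: {speedup:.1f}x")  # ~7.8x, matching the reported figure
```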

πŸ“ Abstract
Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
Problem

Research questions and friction points this paper is trying to address.

Addresses redundant computation in multi-agent LLM systems
Solves KV-cache offset variance across diverging agent contexts
Enables efficient KV-cache reuse for overlapping content processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

KVCOMM enables KV-cache reuse across multi-agent systems
It aligns cache offsets using anchor-based deviation adjustments
Framework achieves dynamic adaptation through online anchor pool updates
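Keeping the anchor pool usable online implies some bounded maintenance policy; the abstract only says the pool is "maintained and updated online" without naming one, so this sketch assumes a simple LRU-bounded pool (the class name, capacity, and eviction choice are hypothetical):

```python
from collections import OrderedDict

class OnlineAnchorPool:
    """LRU-bounded anchor store: new anchors are added as new prefix
    structures are observed, and the least recently used entry is
    evicted at capacity (an assumed policy, not the paper's)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # anchor key -> deviation payload

    def put(self, key, deviation):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on update
        self.entries[key] = deviation
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key not in self.entries:
            return None  # cache miss: caller falls back to full prefill
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

pool = OnlineAnchorPool(capacity=2)
pool.put("prefix-a", 1.0)
pool.put("prefix-b", 2.0)
pool.get("prefix-a")          # touch "prefix-a" so it survives eviction
pool.put("prefix-c", 3.0)     # capacity exceeded: "prefix-b" is evicted
print(sorted(pool.entries))   # remaining anchor keys
```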