Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-turn dialogue, large language models (LLMs) suffer from high first-token latency and excessive KV cache storage overhead, primarily due to static reloading of the full historical KV cache. Existing cross-layer compression methods employ fixed layer pairs, ignoring variations in attention patterns across conversations, which leads to accuracy degradation. This paper proposes a **dynamic cross-layer KV sharing mechanism**: (1) a token-wise heterogeneous similarity estimator for conversation-adaptive inter-layer attention similarity modeling; (2) a preemptive strategy selector and a bubble-free restoration scheduler that jointly optimize cache reconstruction timing. By integrating dynamic compression, recomputation–loading pipelining, and cross-layer KV reuse, the approach achieves a 1.5×–2.68× speedup in first-token latency and a 1.33×–2.35× reduction in KV cache storage, while preserving generation quality.

📝 Abstract
Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention-pattern similarity across conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for each conversation; 2) a token-wise heterogeneous attention similarity estimator that mitigates the attention-similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler that reduces pipeline bubbles caused by the imbalance between the recomputation and loading streams under compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
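The recomputation-loading pipeline described above can create bubbles when one stream finishes long before the other. A minimal sketch of the balancing idea, assuming per-layer cost estimates are available (the function name `schedule_restoration`, the cost inputs, and the greedy rule are illustrative assumptions, not Krul's actual scheduler):

```python
def schedule_restoration(recompute_cost, load_cost):
    """Greedily assign each layer's KV restoration to either the
    recomputation stream or the loading stream so the two streams'
    total times stay balanced, reducing pipeline bubbles.

    recompute_cost / load_cost: hypothetical per-layer time estimates.
    Returns (recompute_layers, load_layers).
    """
    t_rec, t_load = 0.0, 0.0
    rec, load = [], []
    for layer, (rc, lc) in enumerate(zip(recompute_cost, load_cost)):
        # Place the layer on whichever stream would finish earlier with it.
        if t_rec + rc <= t_load + lc:
            rec.append(layer)
            t_rec += rc
        else:
            load.append(layer)
            t_load += lc
    return rec, load
```

With uniform costs this simply alternates layers between the two streams; in practice the estimates would differ per layer once some caches are compressed, which is exactly the imbalance the paper's bubble-free scheduler targets.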
Problem

Research questions and friction points this paper is trying to address.

Efficient KV cache restoration in multi-turn LLM conversations
Dynamic compression strategy for conversation-specific attention patterns
Reducing TTFT and KV cache storage without quality loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic compression strategy based on attention similarity
Token-wise heterogeneous attention similarity estimator
Bubble-free restoration scheduler for KV cache
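The core selection step, choosing which adjacent layer pairs are similar enough to share a KV cache, can be sketched with a plain cosine similarity over attention maps. This is a simplified illustration under assumed inputs (`attn_maps`, `threshold`, and head-averaging are all hypothetical choices), not the paper's token-wise heterogeneous estimator:

```python
import numpy as np

def select_sharing_pairs(attn_maps, threshold=0.9):
    """Pick adjacent layer pairs whose attention maps are similar
    enough that the later layer can reuse the earlier layer's KV cache.

    attn_maps: list of per-layer attention tensors, each shaped
    (num_heads, seq_len, seq_len); heads are averaged here for brevity.
    """
    pairs = []
    sharing = set()
    for l in range(len(attn_maps) - 1):
        if l in sharing:
            continue  # a layer that already reuses KV is not a donor
        a = attn_maps[l].mean(axis=0).ravel()
        b = attn_maps[l + 1].mean(axis=0).ravel()
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim >= threshold:
            pairs.append((l, l + 1))  # layer l+1 reuses layer l's KV
            sharing.add(l + 1)
    return pairs
```

A conversation whose layers attend very differently would yield few or no pairs, which is why a fixed, conversation-agnostic pairing can hurt accuracy.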
Junyi Wen
Sun Yat-sen University, Zhuhai, China
Junyuan Liang
Sun Yat-sen University, Guangzhou, China
Zicong Hong
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Blockchain · ML System · Edge/Cloud Computing
Wuhui Chen
Sun Yat-sen University, Zhuhai, China; Peng Cheng Laboratory, Shenzhen, China
Zibin Zheng
IEEE Fellow, Highly Cited Researcher, Sun Yat-sen University, China
Blockchain · Smart Contract · Services Computing · Software Reliability