🤖 AI Summary
This work addresses the challenge of the linearly growing key-value (KV) cache in multi-turn dialogues, which severely hampers the deployment efficiency of large language models. Existing compression methods often disregard dialogue structure, leading to the loss of critical contextual information. To overcome this limitation, the authors propose SONIC, a structure-aware KV cache compression framework that introduces, for the first time, a segmented compression mechanism coupled with learnable semantic Nexus tokens to compactly represent historical dialogue content. SONIC employs dynamic budget training, enabling flexible adaptation to varying memory constraints without retraining the model. Experimental results demonstrate that SONIC consistently outperforms H2O and StreamingLLM across four multi-turn dialogue benchmarks under 80% and 50% compression rates, achieving a 35.55% average score improvement on MT-Bench-101 and a 50.1% acceleration in inference speed.
📝 Abstract
The linear growth of the Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose \textbf{SONIC}, a learning-based framework that compresses historical segments into compact and semantically rich \textbf{Nexus} tokens. By integrating dynamic budget training, SONIC adapts flexibly to varying memory constraints without retraining. Experiments show that at compression ratios of 80\% and 50\%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MT-Bench-101 benchmark, SONIC achieves an average score improvement of 35.55\% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1\% compared to full-context generation.
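The core idea of segment-wise compression can be sketched as follows: each finished dialogue segment's cached KV entries are pooled into a small fixed number of Nexus slots, so the cache grows per segment rather than per token. This is a minimal illustrative sketch only; the attention-pooling form, the `compress_segment` helper, and the random (rather than trained) queries are assumptions for exposition, not the paper's actual architecture or training procedure.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_segment(keys, values, nexus_queries):
    """Pool one segment's KV entries into a few nexus slots.

    keys, values: (seg_len, d) cached tensors for one historical segment.
    nexus_queries: (m, d) queries (learned in practice; random here).
    Returns (m, d) compressed keys and values, with m << seg_len.
    """
    # Each nexus slot attends over the whole segment and keeps a weighted summary.
    attn = softmax(nexus_queries @ keys.T / np.sqrt(keys.shape[1]))  # (m, seg_len)
    return attn @ keys, attn @ values

rng = np.random.default_rng(0)
d, seg_len, m = 64, 40, 4            # e.g. 4 nexus tokens per 40-token segment
segments = [rng.normal(size=(seg_len, d)) for _ in range(3)]
queries = rng.normal(size=(m, d))

cache_k, cache_v = [], []
for seg in segments:                  # compress each finished turn independently
    k, v = compress_segment(seg, seg, queries)
    cache_k.append(k)
    cache_v.append(v)

compressed_keys = np.concatenate(cache_k)  # 12 cached entries instead of 120
print(compressed_keys.shape)               # (12, 64)
```

Under this sketch, varying `m` per segment is the knob a dynamic memory budget would turn: a tighter budget keeps fewer nexus slots per segment without changing the pooling mechanism itself.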