🤖 AI Summary
This work addresses the high memory overhead of key-value (KV) caching that limits large language models in long-context reasoning. To this end, the authors propose EchoKV, a lightweight and reversible KV cache compression and reconstruction method that leverages inter-layer and intra-layer similarities among attention heads, enabling on-demand switching between standard and compressed modes. The core innovations include a similarity-driven residual reconstruction network, a partial KV subset sampling strategy, and a two-stage low-resource fine-tuning procedure, which collectively reduce training costs and overcome the limitations of conventional irreversible compression. Experimental results demonstrate that EchoKV consistently outperforms existing methods across various compression ratios on the LongBench and RULER benchmarks while maintaining high throughput in short-context scenarios.
📝 Abstract
The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.
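The core idea in the abstract — cache only a partial KV subset and use a lightweight network to reconstruct the residual components on demand — can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the authors' architecture: the choice of which channels to keep, the single linear map standing in for the trained reconstruction network, and all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(kv, keep):
    # Cache only the first `keep` channels of each head.
    # (Illustrative subset choice; EchoKV's actual sampling strategy may differ.)
    return kv[..., :keep]

def reconstruct(kv_partial, W):
    # A lightweight map predicts the dropped residual channels from the
    # retained subset; W stands in for the trained reconstruction network.
    residual = kv_partial @ W
    return np.concatenate([kv_partial, residual], axis=-1)

head_dim, keep = 64, 32
W = rng.standard_normal((keep, head_dim - keep)) * 0.01   # placeholder weights
kv = rng.standard_normal((2, 8, 128, head_dim))           # (batch, heads, seq, head_dim)

kv_small = compress(kv, keep)        # 50% of the KV-cache memory
kv_full = reconstruct(kv_small, W)   # on-demand switch back to full shape
print(kv_small.shape, kv_full.shape)
```

The retained channels pass through unchanged, so the scheme is reversible in the sense the abstract describes: when memory is abundant, inference can simply skip `compress` and run on the full cache.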