🤖 AI Summary
This work addresses the substantial memory overhead of KV caching in long-context large language model inference, a challenge exacerbated by existing compression methods that employ fixed segmentation strategies and often disrupt semantic boundaries, leading to significant accuracy degradation. To overcome this limitation, the authors propose a dynamic semantic segmentation approach that leverages an importance-aware mechanism to adaptively select delimiters, thereby preserving semantic integrity. Variable-length semantic blocks are then uniformly mapped into a fixed-length format, enabling efficient compression without compromising semantic alignment. Experimental results demonstrate that, compared to FlashAttention, the proposed method achieves a 2.2× speedup in inference, reduces peak memory consumption by 2.6×, and improves accuracy by up to 49.9% in long-context scenarios, effectively balancing computational efficiency and model fidelity.
📝 Abstract
Although the Key-Value (KV) cache is essential for efficient large language model (LLM) inference, its growing memory footprint in long-context scenarios poses a significant bottleneck, making KVCache compression crucial. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. We observe that rigid splitting suffers significant accuracy degradation (ranging from 5.5% to 55.1%) across scenarios, owing to the scenario-dependent nature of semantic boundaries. This highlights the necessity of dynamic semantic splitting. Achieving it poses two challenges: (1) improper delimiter selection misaligns semantics with the KVCache, causing a 28.6% accuracy loss; (2) the variable-length blocks produced by splitting introduce over 73.1% additional inference overhead. To address these challenges, we propose DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. It comprises: (1) a dynamic, importance-aware delimiter selection strategy that improves accuracy by 49.9%; and (2) a uniform mapping strategy that transforms variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves the highest accuracy, a 2.2x speedup over FlashAttention, and a 2.6x peak-memory reduction in long-context scenarios.
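To make the two ideas in the abstract concrete, here is a minimal, illustrative Python sketch of the pipeline shape (not the paper's implementation): tokens are split at delimiters chosen by an importance score, and the resulting variable-length semantic blocks are then mapped to a uniform length by chunking and padding. The candidate delimiter set, the threshold, and all function names here are hypothetical assumptions for illustration only.

```python
# Hypothetical candidate delimiter set; the paper selects among such
# candidates dynamically rather than using a fixed list.
CANDIDATE_DELIMITERS = {".", "\n", ";"}

def select_delimiters(tokens, importance, threshold=0.5):
    """Importance-aware selection (sketch): keep a candidate delimiter
    as a split point only when its importance score exceeds the threshold."""
    return [i for i, tok in enumerate(tokens)
            if tok in CANDIDATE_DELIMITERS and importance[i] > threshold]

def split_into_blocks(tokens, split_points):
    """Cut the sequence into variable-length semantic blocks,
    each ending at a selected delimiter."""
    blocks, start = [], 0
    for p in split_points:
        blocks.append(tokens[start:p + 1])
        start = p + 1
    if start < len(tokens):
        blocks.append(tokens[start:])
    return blocks

def uniform_map(blocks, block_len, pad="<pad>"):
    """Uniform mapping (sketch): chunk each variable-length block and
    pad the tail so every block has a fixed length, giving downstream
    attention kernels uniform shapes to work on."""
    fixed = []
    for block in blocks:
        for i in range(0, len(block), block_len):
            chunk = block[i:i + block_len]
            fixed.append(chunk + [pad] * (block_len - len(chunk)))
    return fixed
```

For example, with tokens `["a", "b", ".", "c", ".", "d"]` and importance scores that exceed the threshold only at the first period, the sequence splits into two semantic blocks, which `uniform_map` then reshapes into fixed-length blocks of two tokens each.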