🤖 AI Summary
To address the memory bottleneck imposed by Key-Value caching (KVC) in large language model (LLM) inference, this paper proposes ZACK, the first zero-overhead KV dimensionality compression method. The approach integrates low-rank KV cache compression, adaptive rate allocation across heads and layers, and load-balanced attention kernels, enabling compression and acceleration without introducing additional latency. The method is modular: it can be deployed standalone or combined with eviction- or quantization-based techniques. Experiments show that pairing ZACK with state-of-the-art eviction- and quantization-based methods further reduces KV cache size by up to 68%, time-to-first-token (TTFT) by up to 44%, and time-between-tokens (TBT) by up to 55%, and delivers up to 1.72× throughput at the same latency, all while preserving 99% of the baseline accuracy. The core innovations are (i) a compression mechanism whose compression and decompression steps add no runtime cost, and (ii) a load-balanced attention kernel that evens out the per-head workloads created by adaptive compression, improving hardware utilization across heads and layers.
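To make the low-rank idea concrete, below is a minimal PyTorch sketch (not the paper's code) of per-head truncated-SVD compression with adaptive ranks. The function name `low_rank_compress`, the rank values, and the random data are illustrative stand-ins; ZACK's actual compression and rate-allocation procedures are described in the paper.

```python
import torch

def low_rank_compress(kv: torch.Tensor, rank: int):
    """Compress one head's cached slab [seq_len, head_dim] to `rank` dims via truncated SVD.

    Returns C = U_r * S_r ([seq_len, rank]) to store in place of the full slab,
    plus the small basis V_r ([head_dim, rank]) so that kv ~= C @ V_r.T.
    """
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    C = U[:, :rank] * S[:rank]   # compressed cache entries
    V = Vh[:rank, :].T           # fixed per-head decompression basis
    return C, V

torch.manual_seed(0)
seq_len, head_dim = 256, 64
ranks = [48, 32, 16, 8]  # hypothetical adaptive allocation across four heads

for h, r in enumerate(ranks):
    # Random stand-in data; real KV activations are far closer to low rank,
    # which is what makes aggressive rank reduction viable in practice.
    kv = torch.randn(seq_len, head_dim)
    C, V = low_rank_compress(kv, r)
    rel_err = torch.linalg.norm(kv - C @ V.T) / torch.linalg.norm(kv)
    saved = 1 - (C.numel() + V.numel()) / kv.numel()
    print(f"head {h}: rank {r:2d}, memory saved {saved:.0%}, rel. error {rel_err:.3f}")
```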
📝 Abstract
In large language models (LLMs), memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose ZACK, the first KV dimensionality compression system that achieves zero-overhead compression and decompression while also reducing attention computation time. It complements eviction-based and quantization-based methods and can be combined with them to further enhance KV compression. Moreover, ZACK employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference, so as to maximize overall compression while satisfying an accuracy-loss constraint. Additionally, ZACK enhances the self-attention kernel to balance the uneven workloads caused by adaptive compression, further reducing attention computation latency. Comprehensive experiments demonstrate that, when combined with ZACK, state-of-the-art eviction-based and quantization-based KV compression methods further reduce KV size by up to 68%, Time-To-First-Token (TTFT) by up to 44%, and Time-Between-Tokens (TBT) by up to 55%, and achieve up to 1.72× throughput at the same latency, while maintaining 99% of the baseline accuracy. We have open-sourced the code.
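The abstract does not spell out how compression can be free at runtime. One common trick in low-rank KV schemes, assumed here purely for illustration and not confirmed as ZACK's mechanism, is to fold a rank-r basis into the existing query/key projection weights offline, so the model emits already-compressed keys from its ordinary projection matmul and attention runs directly in the reduced space, with no extra compression or decompression step on the inference path. A minimal PyTorch sketch under that assumption:

```python
import torch

torch.manual_seed(0)
d_model, head_dim, rank, seq = 512, 64, 16, 8

W_q = torch.randn(d_model, head_dim) / d_model**0.5
W_k = torch.randn(d_model, head_dim) / d_model**0.5
A = torch.linalg.qr(torch.randn(head_dim, rank)).Q  # orthonormal rank-r basis (illustrative)

# Fold the basis into the projection weights offline: keys come out of the
# usual projection already compressed, and queries land in the same subspace,
# so no additional matmuls are needed at inference time.
W_k_c = W_k @ A  # [d_model, rank]: cached keys are rank-r vectors
W_q_c = W_q @ A  # [d_model, rank]

x = torch.randn(seq, d_model)
scores_c = (x @ W_q_c) @ (x @ W_k_c).T  # attention logits from the compressed cache

# Identical (up to float error) to projecting full-dim keys/queries onto A @ A.T:
scores_ref = (x @ W_q) @ (A @ A.T) @ (x @ W_k).T
print(torch.allclose(scores_c, scores_ref, atol=1e-4))
```

Because the folded weights differ per head and layer under adaptive rank allocation, attention workloads become uneven across heads, which is exactly the imbalance the abstract says ZACK's enhanced self-attention kernel is designed to absorb.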