🤖 AI Summary
Large language models (LLMs) incur excessive KV cache memory overhead during inference, and uniform quantization of the cache causes severe accuracy degradation. Method: We propose an information-aware adaptive mixed-precision quantization framework. We first empirically show that key matrices exhibit higher spectral norms and greater quantization sensitivity than value matrices; leveraging this insight, we introduce a "more bits for keys, fewer for values" paradigm (e.g., 4-bit keys / 2-bit values). We model quantization-error propagation across layers using singular value distributions, spectral norms, and Frobenius norms to jointly preserve layer-wise stability. Contribution/Results: Evaluated on LLMs ranging from 1B to 70B parameters, our method achieves up to 75.2% task accuracy, surpassing the reverse bit assignment by 20.5 percentage points, while delivering substantial memory savings and overcoming the limitations of conventional uniform quantization.
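The "more bits for keys" idea can be illustrated with a small numerical sketch. The snippet below uses toy random matrices (the larger "key" norm is simulated by scaling, and the symmetric round-to-nearest quantizer is a common baseline, not necessarily the paper's exact quantizer) to show that, under a fixed total bit budget, assigning the higher precision to the higher-norm matrix yields a smaller combined quantization error:

```python
import numpy as np

def quantize(x, bits):
    # Symmetric round-to-nearest uniform quantization (a common baseline,
    # not necessarily the quantizer used in the paper).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

def quant_error(m, bits):
    # Absolute Frobenius-norm error introduced by quantizing m.
    return np.linalg.norm(m - quantize(m, bits), "fro")

rng = np.random.default_rng(0)
# Toy stand-ins: the paper reports that key caches have larger norms than
# value caches, which we mimic here by scaling the "key" matrix.
K = 3.0 * rng.normal(size=(128, 64))  # hypothetical high-norm keys
V = rng.normal(size=(128, 64))        # hypothetical lower-norm values

print("spectral norm K:", np.linalg.norm(K, 2))
print("spectral norm V:", np.linalg.norm(V, 2))

# Same total budget (6 bits per key/value pair), two allocations:
err_k4v2 = quant_error(K, 4) + quant_error(V, 2)  # "more bits for keys"
err_k2v4 = quant_error(K, 2) + quant_error(V, 4)  # reversed assignment
print(f"4-bit K / 2-bit V total error: {err_k4v2:.2f}")
print(f"2-bit K / 4-bit V total error: {err_k2v4:.2f}")
```

Because the quantization error of each matrix scales with its magnitude, spending the extra bits on the larger-norm matrix reduces the dominant error term, mirroring the 4-bit-key / 2-bit-value result reported above.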
📝 Abstract
This paper introduces an information-aware quantization framework that adaptively compresses the key-value (KV) cache in large language models (LLMs). Although prior work has underscored the distinct roles of the key and value caches during inference, our systematic analysis -- examining singular value distributions, spectral norms, and Frobenius norms -- reveals, for the first time, that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. Furthermore, our theoretical analysis shows that matrices with higher spectral norms amplify quantization errors more strongly. Motivated by these insights, we propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates a higher bit-width to keys and a lower one to values. Under the same total KV bit budget, this approach effectively mitigates error propagation across transformer layers while achieving significant memory savings. Extensive experiments on multiple LLMs (1B--70B) demonstrate that our mixed-precision scheme maintains high model accuracy even under aggressive compression. For instance, using 4-bit keys and 2-bit values achieves 75.2% accuracy, whereas reversing the assignment (2-bit keys and 4-bit values) yields only 54.7%. The code is available at https://tinyurl.com/kv-adaquant.
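The amplification claim in the abstract can be captured by a standard operator-norm bound (our sketch of the likely argument, not necessarily the paper's exact statement): if a quantization error $\delta$ on an upstream activation is propagated through a matrix $M$, the resulting output error satisfies

```latex
\|M\delta\|_2 \;\le\; \|M\|_2 \,\|\delta\|_2,
```

with equality when $\delta$ aligns with the top right singular vector of $M$. Matrices with larger spectral norm $\|M\|_2$ can therefore magnify quantization noise more, which is why the higher-norm key matrices warrant the larger bit-width.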