KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction

📅 2025-10-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address attention error propagation and degraded generation quality caused by 2-bit KV cache quantization in large language model (LLM) inference, this paper proposes KVLinC. The method introduces three key components: (i) Hadamard rotation preprocessing of the value (V) tensor to reduce quantization sensitivity; (ii) a lightweight low-rank linear correction adapter that explicitly compensates for quantization errors in the key (K) tensor; and (iii) a customized attention kernel enabling efficient decompression and computation. Evaluated on LLaMA, Qwen2.5, and Qwen3 models, KVLinC preserves near-full-precision generation quality under 2-bit KV quantization while achieving up to 2.55× inference speedup over FlashAttention. It significantly outperforms existing quantization baselines in both accuracy and efficiency, offering a practical solution for memory-constrained, high-throughput LLM deployment.
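The first component above — rotating the value tensor with a Hadamard matrix before low-bit quantization — can be illustrated with a small NumPy sketch. This is not the paper's implementation; the function names, the per-row min-max quantizer, and the synthetic outlier channels are all assumptions made for illustration. The idea it demonstrates is standard: an orthonormal Hadamard rotation spreads outlier energy across dimensions, flattening the distribution so that a 2-bit (4-level) quantizer loses less information.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_2bit(x):
    """Per-row asymmetric 2-bit min-max quantization, followed by dequantization."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)   # 4 levels -> 3 steps
    q = np.clip(np.round((x - lo) / scale), 0, 3)
    return q * scale + lo

rng = np.random.default_rng(0)
d = 64
V = rng.standard_normal((128, d))
V[:, :2] *= 20.0                      # synthetic outlier channels (an assumption)

H = hadamard(d)
direct_err  = np.abs(quantize_2bit(V) - V).mean()
# Rotate, quantize in the rotated basis, then rotate back before measuring error.
rotated_err = np.abs(quantize_2bit(V @ H.T) @ H - V).mean()
print(f"direct error = {direct_err:.3f}, rotated error = {rotated_err:.3f}")
```

Because the rotation is orthonormal it is exactly invertible, so the only loss comes from quantization itself; with outlier-heavy inputs the rotated-basis error is substantially lower than quantizing directly.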

📝 Abstract
Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that delivers up to 2.55× faster inference compared to the FlashAttention baseline, enabling efficient long-context LLM inference.
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache quantization errors in low-precision LLM inference
Compensates attention errors using Hadamard rotation and linear adapters
Enables faster long-context inference while maintaining generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hadamard rotation reduces quantization error in values
Lightweight linear adapters correct errors in quantized keys
Custom attention kernel accelerates inference by up to 2.55×
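The second innovation — low-rank linear correction for quantized keys — can be sketched as follows. This is one plausible reading, not the paper's method: here the correction factors are fit post hoc by a truncated SVD of the quantization residual, whereas KVLinC's adapters are learned components; the quantizer, shapes, and rank are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d, rank = 256, 64, 8

def quantize_2bit(x):
    """Per-row asymmetric 2-bit min-max quantization, followed by dequantization."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    return np.clip(np.round((x - lo) / scale), 0, 3) * scale + lo

K = rng.standard_normal((seq, d))
K_hat = quantize_2bit(K)
E = K - K_hat                          # quantization residual on the keys

# Low-rank "adapter": best rank-r approximation of the residual via truncated SVD.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :rank] * S[:rank]             # (seq, rank)
B = Vt[:rank]                          # (rank, d)

q = rng.standard_normal((1, d))        # a single query row
scores_exact     = q @ K.T
scores_quant     = q @ K_hat.T
scores_corrected = q @ (K_hat + A @ B).T   # quantized keys + low-rank correction

print("uncorrected score error:", np.abs(scores_quant - scores_exact).mean())
print("corrected score error:  ", np.abs(scores_corrected - scores_exact).mean())
```

The point of the low-rank form is cost: storing and applying `A` and `B` adds only `rank * (seq + d)` values on top of the 2-bit keys, while the Eckart–Young theorem guarantees the rank-r correction removes as much of the residual (in Frobenius norm) as any rank-r term can.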
Utkarsh Saxena
Student, Purdue University
Machine Learning · In-Memory Computing · Non Volatile Memories
Kaushik Roy
Department of Electrical and Computer Engineering, Purdue University