🤖 AI Summary
This work addresses the critical challenge of severe accuracy degradation in low-bit quantization of large language models (LLMs) deployed on edge devices, primarily caused by activation outliers. We propose a progressive binary search quantization framework and theoretically prove that Hadamard transforms yield superior outlier suppression compared to random rotations. To our knowledge, this is the first method enabling full-model 3-bit quantization—including weights, activations, and KV cache—for LLMs with non-power-of-two embedding dimensions (e.g., Qwen). By extending the Paley construction and introducing a coordinated quantization strategy, we achieve end-to-end 3-bit quantization across mainstream models including Mistral, LLaMA, and Qwen. On standard benchmarks, our approach improves accuracy by 40% over state-of-the-art methods, significantly advancing the practical deployment of ultra-low-bit LLMs on resource-constrained edge platforms.
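To illustrate why an orthonormal Hadamard rotation suppresses activation outliers (a minimal numerical sketch, not the paper's implementation; the function name is ours): multiplying by H/√n spreads a single large coordinate evenly across all n coordinates, shrinking the maximum magnitude by a factor of √n.

```python
import numpy as np

def sylvester_hadamard(n):
    """Hadamard matrix of power-of-two order n via Sylvester's recursive construction."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 64
H = sylvester_hadamard(n) / np.sqrt(n)   # orthonormal rotation: H @ H.T == I
x = np.zeros(n)
x[0] = 100.0                             # a single activation outlier
y = H @ x                                # rotated activations
print(np.abs(x).max(), np.abs(y).max())  # 100.0 -> 12.5 (= 100 / sqrt(64))
```

Because the rotation is orthonormal it preserves the vector's energy, so quantization after the rotation sees a much flatter dynamic range.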
📝 Abstract
Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generation. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used technique to reduce memory usage and inference time, but LLMs pose unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs, demonstrating theoretically that Hadamard matrices are more effective at reducing outliers, a significant obstacle to low-bit quantization. Our method, based on a gradual binary search, enables 3-bit quantization of weights, activations, and key-value (KV) caches, yielding a 40% increase in accuracy on common benchmarks over state-of-the-art methods. Using the Paley algorithm, we extend rotation matrices to support non-power-of-two embedding dimensions, such as those found in the Qwen architecture. Experimental results across multiple model families, including Mistral, LLaMA, and Qwen, demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.
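As a sketch of how the Paley algorithm yields Hadamard matrices of non-power-of-two order (a minimal illustration of the classical Paley-I construction for primes q ≡ 3 mod 4, not the paper's extended variant; the function names are ours):

```python
import numpy as np

def paley_hadamard(q):
    """Paley-I Hadamard matrix of order q + 1 for a prime q with q % 4 == 3."""
    def chi(a):
        # Quadratic-residue character mod q via Euler's criterion.
        a %= q
        if a == 0:
            return 0
        return 1 if pow(a, (q - 1) // 2, q) == 1 else -1

    # Jacobsthal matrix Q with Q[i, j] = chi(i - j); skew since q % 4 == 3.
    Q = np.array([[chi(i - j) for j in range(q)] for i in range(q)])
    # Border Q into a skew-symmetric S; then H = I + S is Hadamard.
    S = np.zeros((q + 1, q + 1), dtype=int)
    S[0, 1:] = 1
    S[1:, 0] = -1
    S[1:, 1:] = Q
    return np.eye(q + 1, dtype=int) + S

H = paley_hadamard(11)   # order-12 Hadamard matrix; 12 is not a power of two
print(np.array_equal(H @ H.T, 12 * np.eye(12, dtype=int)))  # True
```

Constructions like this cover orders q + 1 that Sylvester's power-of-two recursion cannot reach, which is what makes rotation-based quantization applicable to embedding dimensions such as Qwen's.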