🤖 AI Summary
4-bit quantization of large language models (LLMs) often incurs substantial accuracy degradation, necessitating complex fine-tuning or auxiliary overheads. Method: This paper proposes a tuning-free, full-stack 4-bit quantization framework that uniformly quantizes weights, activations, and KV caches. Its core innovation is significant data razoring (SDR)—a technique retaining only the four most salient bits per value—coupled with a decompression-free integer arithmetic unit enabling end-to-end native 4-bit computation. The framework first quantizes to wider 8- or 16-bit integers using absolute max scaling, then compresses to 4 bits with SDR. Contribution/Results: The method matches or exceeds the accuracy of state-of-the-art 4-bit approaches while reducing hardware area and power consumption by 61.2% and 57.8%, respectively. It significantly improves energy efficiency and practical deployability without sacrificing model fidelity.
📝 Abstract
Large language models (LLMs) have demonstrated outstanding performance in language processing tasks, yet their deployment is often hindered by high memory demands and computational complexity. Although low-bit quantization techniques, such as 4-bit quantization, present a potential solution, they frequently lead to significant accuracy degradation or demand substantial effort to preserve accuracy under such aggressive quantization. To overcome these challenges, we introduce QRazor, a reliable and effortless quantization scheme designed to enable 4-bit quantization of weights, activations, and the KV cache in transformer-based LLMs. The scheme involves two main stages: quantization and compression. During the quantization stage, weights, activations, and KV cache values are quantized to wider 8- or 16-bit integers using absolute max scaling, achieving nearly identical accuracy to the original full-precision LLMs. Subsequently, all data are compressed to 4 bits using our proposed significant data razoring (SDR) technique, which retains only the four most salient bits while discarding the others. Furthermore, we present an integer-based arithmetic unit dedicated to QRazor, enabling direct low-precision arithmetic operations without decompressing the SDR data. Despite the reduced quantization effort, QRazor achieves accuracy better than or comparable to state-of-the-art 4-bit methods. We also validate hardware efficiency: our decompression-free arithmetic unit achieves 61.2% and 57.8% reductions in area and power consumption, respectively.
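To make the "four most salient bits" idea concrete, here is a minimal sketch of SDR-style compression, not the paper's reference implementation. It assumes (our reading of the abstract) that each wider quantized magnitude keeps only the 4 bits starting at its leading non-zero bit, together with the bit position at which they were taken, so each value is stored as a (4-bit mantissa, shift) pair; the names `sdr_compress`/`sdr_decompress` are hypothetical.

```python
# Hypothetical sketch of significant data razoring (SDR) on unsigned magnitudes.
# Assumption: keep the 4 bits starting at the leading non-zero bit and record
# the shift; the real QRazor unit operates on this form without decompressing.

def sdr_compress(value: int, keep_bits: int = 4, src_bits: int = 8):
    """Razor an unsigned src_bits integer down to its keep_bits most salient bits."""
    assert 0 <= value < (1 << src_bits)
    if value < (1 << keep_bits):         # already fits in keep_bits; no razoring needed
        return value, 0
    shift = value.bit_length() - keep_bits
    return value >> shift, shift         # discard the `shift` least salient bits

def sdr_decompress(mantissa: int, shift: int) -> int:
    """Approximate reconstruction, shown only to illustrate the rounding loss."""
    return mantissa << shift

m, s = sdr_compress(0b10110110)          # 182 -> mantissa 0b1011, shift 4
print(m, s, sdr_decompress(m, s))        # 11 4 176 (182 approximated as 176)
```

The decompression step exists here only to show the quantization error; per the abstract, the dedicated arithmetic unit computes directly on the razored (mantissa, shift) representation, which is what enables the reported area and power savings.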