🤖 AI Summary
To address the low dequantization efficiency and significant accuracy degradation of low-bit (2–3 bit) power-of-two (PoT) quantization for large language models (LLMs) on GPUs, this paper proposes a two-stage post-training PoT quantization method. It decouples the sign bit, optimizes scale initialization and calibration strategies, and employs fixed-point addition for efficient dequantization—thereby mitigating GPU bit-operation overhead and sign-bit entanglement. Using only a small calibration dataset, the method restores high accuracy at 2–3 bits, achieving 3.67× and 1.63× dequantization speedup on NVIDIA V100 and RTX 4090, respectively, while outperforming existing integer-only quantization schemes in accuracy. The core contribution is the first practical, efficient low-bit PoT quantization deployment for LLMs on GPUs, uniquely balancing both inference speed and model fidelity.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although previous work on PoT quantization can be dequantized efficiently on CPUs using fixed-point addition, it is less effective on GPUs, because the sign bit becomes entangled with the other bits and dequantization requires sequential bit manipulations. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. Our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for floating-point inference, leading to a $3.67\times$ speedup on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090, compared to uniform integer dequantization.
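The fixed-point-addition trick behind PoT dequantization can be illustrated with a minimal sketch. Because a PoT code represents $\pm\,\text{scale} \cdot 2^{-k}$, multiplying by $2^{-k}$ amounts to subtracting $k$ from the exponent field of the scale's float32 bit pattern, i.e., one integer addition instead of a floating-point multiply. The function name, bit layout (1 sign bit plus a 2-bit exponent code), and scale value below are illustrative assumptions, not the paper's actual kernel:

```python
import struct

def pot_dequantize(sign_bit: int, exp_code: int, scale: float) -> float:
    """Dequantize one PoT-coded weight: value = +/- scale * 2**(-exp_code).

    Instead of a floating-point multiply, the power-of-two factor is
    applied by subtracting exp_code from the biased exponent field of
    the float32 bit pattern of `scale` -- a single fixed-point addition.
    (Assumes the result stays within the normal float32 range.)
    """
    bits = struct.unpack("<I", struct.pack("<f", scale))[0]
    bits -= exp_code << 23              # exponent field starts at bit 23
    val = struct.unpack("<f", struct.pack("<I", bits))[0]
    return -val if sign_bit else val

# Hypothetical 3-bit PoT format: 1 sign bit + 2-bit exponent code, scale = 0.5
assert pot_dequantize(0, 0, 0.5) == 0.5      # +0.5 * 2**0
assert pot_dequantize(0, 2, 0.5) == 0.125    # +0.5 * 2**-2
assert pot_dequantize(1, 1, 0.5) == -0.25    # -0.5 * 2**-1
```

Keeping the sign bit decoupled, as the paper proposes, is what lets this exponent arithmetic run without the sequential bit manipulations that make entangled encodings slow on GPUs.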