PoTPTQ: A Two-step Power-of-Two Post-training for LLMs

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low dequantization efficiency and significant accuracy degradation of low-bit (2–3 bit) power-of-two (PoT) quantization for large language models (LLMs) on GPUs, this paper proposes a two-step post-training PoT quantization method. It decouples the sign bit, optimizes scale initialization and calibration, and employs fixed-point addition for efficient dequantization, thereby mitigating GPU bit-operation overhead and sign-bit entanglement. Using only a small calibration dataset, the method restores high accuracy at 2–3 bits, achieving 3.67× and 1.63× dequantization speedups on an NVIDIA V100 and RTX 4090, respectively, while outperforming existing integer-only quantization schemes in accuracy. The core contribution is the first practical, efficient low-bit PoT quantization deployment for LLMs on GPUs, balancing both inference speed and model fidelity.
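The fixed-point-addition dequantization mentioned above can be modeled in a few lines. This is an illustrative Python sketch, not the paper's GPU kernel: the function name `pot_dequantize` and the bit layout are assumptions. The idea is that multiplying a scale by a power of two amounts to adding the exponent code directly to the exponent field of the scale's IEEE-754 float32 representation, i.e. a single integer addition instead of a multiply, with the sign handled separately (mirroring the paper's sign-bit decoupling).

```python
import struct


def pot_dequantize(exp_code: int, sign: int, scale: float) -> float:
    """Dequantize a PoT-coded weight: value = (-1)^sign * scale * 2^exp_code.

    Instead of computing 2**exp_code and multiplying, add exp_code to the
    8-bit exponent field (bits 23..30) of the float32 scale's bit pattern.
    """
    bits = struct.unpack("<I", struct.pack("<f", scale))[0]
    bits += exp_code << 23  # fixed-point addition on the exponent field
    val = struct.unpack("<f", struct.pack("<I", bits))[0]
    return -val if sign else val
```

For example, `pot_dequantize(2, 0, 0.5)` yields `2.0` and `pot_dequantize(0, 1, 1.0)` yields `-1.0`; the exponent addition replaces a floating-point multiply, which is what makes PoT dequantization cheap.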

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although weights from previous PoT quantization works can be efficiently dequantized on CPUs using fixed-point addition, PoT quantization has shown less effectiveness on GPUs, owing to the entanglement of the sign bit and the sequential bit manipulations needed for dequantization. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. The performance of our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for floating-point inference, leading to a $3.67\times$ speedup on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090, compared to uniform integer dequantization.
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM deployment via efficient PoT quantization
Improve GPU dequantization speed for PoT quantization
Maintain accuracy in low-precision 2–3 bit formats
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step post-training algorithm for accuracy
Power-of-two quantization for faster inference
Efficient dequantization on GPUs
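To make the PoT weight format above concrete, here is a minimal NumPy sketch of a signed power-of-two quantizer. The function names, the exponent range, and the symmetric codebook (levels $\pm\, s \cdot 2^{e}$ with $e \in \{0, -1, \dots\}$ and the sign stored in a separate bit) are illustrative assumptions, not the paper's exact scheme or its scale-calibration step.

```python
import numpy as np


def pot_quantize(w: np.ndarray, scale: float, bits: int = 3):
    """Round weights to signed power-of-two levels: w ~ (-1)^sign * scale * 2^e.

    The exponent e uses (bits - 1) bits; the sign is kept in a separate
    bit, mirroring the sign-bit decoupling described in the paper.
    """
    levels = 2 ** (bits - 1)  # number of exponent codes
    sign = np.signbit(w).astype(np.int8)
    mag = np.maximum(np.abs(w) / scale, 2.0 ** -levels)  # avoid log2(0)
    e = np.clip(np.round(np.log2(mag)), -(levels - 1), 0).astype(np.int8)
    return sign, e


def pot_reconstruct(sign: np.ndarray, e: np.ndarray, scale: float) -> np.ndarray:
    """Rebuild approximate weights from (sign, exponent) codes."""
    return np.where(sign == 1, -1.0, 1.0) * scale * 2.0 ** e.astype(np.float64)
```

For weights that happen to be exact powers of two times the scale, such as `[0.5, -0.25, 1.0]` with `scale=1.0`, the round trip is lossless; in general the quantization error depends on how well the calibrated scale matches the weight distribution, which is what the paper's two-step scale initialization and refinement targets.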
Xinyu Wang
McGill University, Canada
Vahid Partovi Nia
Huawei Noah's Ark Lab and Ecole Polytechnique de Montreal
high-dimensional data, statistical learning, deep learning, edge intelligence
Peng Lu
Université de Montréal, Canada
Jerry Huang
Université de Montréal, Canada; Mila – Quebec AI Institute, Canada
Xiao-Wen Chang
McGill University, Canada
Boxing Chen
Huawei Technologies Canada
Natural Language Processing, Artificial Intelligence
Yufei Cui
McGill University, MILA
Medical AI, RAG, LLM Agent, Predictive Uncertainty