🤖 AI Summary
To address the low dequantization efficiency and significant accuracy degradation of low-bit (2–3 bit) power-of-two (PoT) quantization for large language models (LLMs) on GPUs, this paper proposes a two-stage post-training PoT quantization method. It decouples the sign bit, optimizes scale initialization and calibration strategies, and employs fixed-point addition for efficient dequantization—thereby mitigating GPU bit-operation overhead and sign-bit entanglement. Using only a small calibration dataset, the method restores high accuracy at 2–3 bits, achieving 3.67× and 1.63× dequantization speedup on NVIDIA V100 and RTX 4090, respectively, while outperforming existing integer-only quantization schemes in accuracy. The core contribution is the first practical, efficient low-bit PoT quantization deployment for LLMs on GPUs, uniquely balancing both inference speed and model fidelity.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although previous work on PoT quantization can be dequantized efficiently on CPUs using fixed-point addition, it is less effective on GPUs, because the sign bit becomes entangled with the other bits and dequantization requires sequential bit manipulations. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. Our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for floating-point inference, leading to a $3.67\times$ speedup on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090, compared to uniform integer dequantization.
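The fixed-point-addition trick behind PoT dequantization can be illustrated with a minimal sketch. Because a PoT code represents $\pm\,\text{scale} \cdot 2^{-k}$, multiplying by $2^{-k}$ amounts to subtracting $k$ from the exponent field of the scale's float32 bit pattern, i.e., one integer addition instead of a floating-point multiply. The function name, bit layout (1 sign bit plus a 2-bit exponent code), and scale value below are illustrative assumptions, not the paper's actual kernel:

```python
import struct

def pot_dequantize(sign_bit: int, exp_code: int, scale: float) -> float:
    """Dequantize one PoT-coded weight: value = +/- scale * 2**(-exp_code).

    Instead of a floating-point multiply, the power-of-two factor is
    applied by subtracting exp_code from the biased exponent field of
    the float32 bit pattern of `scale` -- a single fixed-point addition.
    (Assumes the result stays within the normal float32 range.)
    """
    bits = struct.unpack("<I", struct.pack("<f", scale))[0]
    bits -= exp_code << 23              # exponent field starts at bit 23
    val = struct.unpack("<f", struct.pack("<I", bits))[0]
    return -val if sign_bit else val

# Hypothetical 3-bit PoT format: 1 sign bit + 2-bit exponent code, scale = 0.5
assert pot_dequantize(0, 0, 0.5) == 0.5      # +0.5 * 2**0
assert pot_dequantize(0, 2, 0.5) == 0.125    # +0.5 * 2**-2
assert pot_dequantize(1, 1, 0.5) == -0.25    # -0.5 * 2**-1
```

Keeping the sign bit decoupled, as the paper proposes, is what lets this exponent arithmetic run without the sequential bit manipulations that make entangled encodings slow on GPUs.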