BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

πŸ“… 2026-02-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of deploying large language models on resource-constrained devices, where high memory consumption and bandwidth requirements pose significant barriers and existing post-training quantization methods degrade severely at 2–3 bits. The authors propose Bit-Plane Decomposition Quantization (BPDQ), which constructs adaptive quantization grids from bit planes and scalar coefficients, iteratively optimized with approximate second-order information so that the quantization process aligns with output-error minimization in a Hessian-induced geometric space. This strategy expands the feasible solution space and enables efficient error compensation. Experiments demonstrate that BPDQ deploys Qwen2.5-72B on a single RTX 3090 GPU at 2 bits, achieving 83.85% accuracy on GSM8K, compared to 90.83% with 16-bit precision.

πŸ“ Abstract
Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.
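The variable grid described in the abstract can be illustrated with a minimal sketch: a weight group is approximated as a sum of binary planes, each scaled by its own coefficient, so the representable levels are spaced per group rather than at the fixed uniform intervals of UINT2. The code below is an illustrative stand-in, not the paper's algorithm: it fits planes greedily by least squares on the residual, whereas BPDQ refines planes and coefficients with approximate second-order information; the function names are hypothetical.

```python
import numpy as np

def bitplane_quantize(w, num_planes=2):
    """Approximate w β‰ˆ sum_k alpha_k * B_k with B_k in {-1, +1}.

    Greedy residual fit: each plane is the sign of the current
    residual, and alpha_k = mean(|residual|) is the least-squares
    optimal coefficient for that sign plane. (Illustrative only;
    BPDQ itself optimizes with approximate second-order information.)
    """
    w = np.asarray(w, dtype=np.float64)
    residual = w.copy()
    alphas, planes = [], []
    for _ in range(num_planes):
        B = np.sign(residual)
        B[B == 0] = 1.0                 # break ties toward +1
        alpha = np.abs(residual).mean() # argmin_a ||residual - a*B||^2
        alphas.append(alpha)
        planes.append(B)
        residual = residual - alpha * B
    return np.array(alphas), np.array(planes)

def dequantize(alphas, planes):
    # Reconstruct: sum over planes of alpha_k * B_k.
    return np.tensordot(alphas, planes, axes=1)
```

With two sign planes the reconstructed values land on the four levels {±α₁±α₂}, whose spacing adapts to each group; this is the sense in which a bit-plane grid is "variable" compared with the evenly spaced levels of a fixed UINT2 grid.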
Problem

Research questions and friction points this paper is trying to address.

large language models
post-training quantization
low-bit quantization
memory-constrained inference
quantization grid
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bit-Plane Decomposition
Variable Quantization Grid
Post-Training Quantization
Second-Order Optimization
Low-Bit LLM Inference
Junyu Chen
Southwestern University of Finance and Economics
Jungang Li
The Hong Kong University of Science and Technology (Guangzhou)
Jing Xiong
The University of Hong Kong
Natural Language Processing · Automated Theorem Proving
Wenjie Wang
Southwestern University of Finance and Economics
Qingyao Yang
The University of Hong Kong
He Xiao
The University of Hong Kong
Zhen Li
The Hong Kong Polytechnic University
Taiqiang Wu
University of Hong Kong | Tsinghua University
Model Compression · Efficient Methods
Mengzhao Chen
The University of Hong Kong
Zhen Peng
Sun Yat-sen University
Chaofan Tao
The University of Hong Kong
Efficient ML · Natural Language Processing · Multimodal
Long Shi
Southwestern University of Finance and Economics
Hongxia Yang
Professor, HK Polytechnic University
Machine Learning · Generative AI · Cognitive Intelligence · Statistical Modeling
Ngai Wong
The University of Hong Kong