Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models

πŸ“… 2025-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address severe weight distortion and drastic performance degradation in low-bit (<3-bit) post-training quantization (PTQ), this paper proposes the first graph neural network (GNN)-based mixed-precision PTQ framework. The method models intra-layer weight dependencies to enable importance-aware, adaptive bit-width allocation, thereby overcoming the limitations of conventional uniform quantization. On WikiText2 and C4, the 3-bit quantized models achieve up to an 18.7% reduction in perplexity compared to GPTQ, setting a new state of the art for low-bit PTQ. Key contributions include: (i) the first application of GNNs in PTQ to explicitly capture structural weight dependencies; and (ii) a differentiable, importance-driven optimization mechanism for mixed-precision quantization. This work establishes a novel paradigm for efficient large-model deployment under stringent resource constraints.

πŸ“ Abstract
Post-Training Quantization (PTQ) is pivotal for deploying large language models (LLMs) within resource-limited settings by significantly reducing resource demands. However, existing PTQ strategies underperform at low bit levels (<3 bits) due to the significant difference between the quantized and original weights. To enhance quantization performance at low bit widths, we introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a graph neural network (GNN) module to capture dependencies among weights and adaptively assign quantization bit-widths. Through the information propagation of the GNN module, our method more effectively captures dependencies among target weights, leading to a more accurate assessment of weight importance and optimized allocation of quantization strategies. Extensive experiments on the WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms the previous state-of-the-art PTQ method, GPTQ, setting new benchmarks for quantization performance under low-bit conditions.
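The core idea in the abstract, scoring weight groups by importance and then spending more bits on the important ones while staying at a low average bit budget, can be sketched in a few lines. This is a minimal illustration, not the paper's method: it replaces the GNN-based importance scores with a simple mean-squared-magnitude proxy, and all function names and the grouping scheme are hypothetical.

```python
import numpy as np

def quantize_group(w, bits):
    # Symmetric uniform quantization of one weight group at the given bit-width.
    levels = 2 ** (bits - 1) - 1
    peak = np.max(np.abs(w))
    scale = peak / levels if peak > 0 else 1.0
    return np.round(w / scale).clip(-levels, levels) * scale

def mixed_precision_quantize(W, group_size=4, avg_bits=3):
    # Split the weight matrix into small groups, rank them by a crude
    # importance proxy (mean squared magnitude, standing in for the paper's
    # GNN-derived scores), then move a quarter of the groups up one bit and
    # a quarter down one bit so the average stays at the target budget.
    groups = W.reshape(-1, group_size)
    importance = (groups ** 2).mean(axis=1)
    order = np.argsort(importance)
    n = len(groups)
    bits = np.full(n, avg_bits)
    k = n // 4
    bits[order[:k]] = avg_bits - 1   # least important groups: fewer bits
    bits[order[-k:]] = avg_bits + 1  # most important groups: more bits
    q = np.stack([quantize_group(g, b) for g, b in zip(groups, bits)])
    return q.reshape(W.shape), bits

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)).astype(np.float32)
Wq, bits = mixed_precision_quantize(W)
print("avg bits:", bits.mean())  # 3.0: the low-bit budget is preserved
print("mse:", float(((W - Wq) ** 2).mean()))
```

The point of the sketch is the allocation step: quantization error is concentrated in the least important groups, which is what a uniform 3-bit scheme cannot do.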
Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization
Low-bit Quantization
Performance Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-Precision Quantization
Graph Neural Networks
Post-Training Quantization
Wanlong Liu
University of Electronic Science and Technology of China
LLM Reasoning · RAG · Medical LLM · Information Extraction
Yichen Xiao
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Dingyi Zeng
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Hongyang Zhao
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Wenyu Chen
Massachusetts Institute of Technology
optimization · statistical learning
Malu Zhang
School of Computer Science and Engineering, University of Electronic Science and Technology of China