MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation

πŸ“… 2025-07-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Diffusion models suffer from high computational overhead, hindering deployment on edge devices; meanwhile, existing ultra-low-bit (2–4 bit) quantization methods are vulnerable to outliers, suffer from suboptimal initialization, and fail to preserve temporal consistency in sequential generation. To address these challenges, we propose a flexible mixed-precision quantization framework comprising: (1) flexible Z-order residual quantization to mitigate outlier sensitivity; (2) object-aware low-rank initialization for enhanced training stability; and (3) memory-augmented temporal relation distillation to improve long-sequence generation consistency. Leveraging binary residual branches, LoRA-driven module-wise analysis, and an online pixel queue mechanism, our method significantly outperforms state-of-the-art approaches across diverse diffusion architectures and generation tasks. It maintains high-fidelity synthesis and efficient inference at 2–4 bits, achieving a synergistic optimization of compression ratio and stability.
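The binary residual branch is the core mechanism here: a plain uniform quantizer spends most of its levels covering a few salient outliers, while a cheap 1-bit residual branch can absorb the error they leave behind. The following is a minimal NumPy sketch of that effect, not the paper's implementation; the Z-order mixed-precision assignment is omitted and all function names are illustrative.

```python
import numpy as np

def uniform_quantize(x, n_bits):
    """Plain uniform affine quantization followed by dequantization."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()) / qmax
    zero = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero, 0, qmax)
    return (q - zero) * scale

def residual_binary_quantize(x, n_bits):
    """Uniform base quantizer plus a 1-bit residual branch that
    absorbs the remaining (outlier-dominated) quantization error."""
    base = uniform_quantize(x, n_bits)
    residual = x - base
    alpha = np.abs(residual).mean()  # MSE-optimal scale for a sign code
    return base + alpha * np.sign(residual)

# Toy weights with a heavy tail: a few salient outliers stretch the range.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0.0, 0.02, 1000), rng.normal(0.0, 0.5, 10)])
mse = lambda a, b: np.mean((a - b) ** 2)
print("2-bit uniform MSE:          ", mse(w, uniform_quantize(w, 2)))
print("2-bit + binary residual MSE:", mse(w, residual_binary_quantize(w, 2)))
```

Running the sketch shows the residual branch cutting the quantization MSE on the heavy-tailed weights, which is the intuition behind pairing an ultra-low-bit base quantizer with a binary correction.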

πŸ“ Abstract
Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders wide deployment on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction, but existing quantization methods do not generalize well under extremely low-bit (2-4 bit) settings, and applying them directly causes severe performance degradation. We identify that existing quantization frameworks suffer from outlier-unfriendly quantizer design, suboptimal initialization, and a weak optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is unfriendly to uniform quantizers. We propose Flexible Z-Order Residual Mixed Quantization, which uses an efficient binary residual branch with flexible quantization steps to handle salient errors. For the optimization framework, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization, which uses the prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation, which constructs an online time-aware pixel queue for distilling long-term denoising temporal information, ensuring overall temporal consistency between the quantized and full-precision models. Comprehensive experiments on various generation tasks show that MPQ-DMv2 surpasses current SOTA methods by a large margin across different architectures, especially at extremely low bit widths.
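The abstract says Object-Oriented Low-Rank Initialization uses the prior quantization error to initialize the LoRA module. One natural way to instantiate that idea, sketched below under that assumption, is to initialize the low-rank factors from the top singular components of the error matrix; the paper's actual derivation may differ, and `lora_init_from_quant_error` is an illustrative name.

```python
import numpy as np

def lora_init_from_quant_error(w_fp, w_q, rank=4):
    """Initialize LoRA factors from the prior quantization error.

    Rather than the usual zero/random init, take the top-`rank` SVD
    components of E = w_fp - w_q so the low-rank branch starts out
    compensating the dominant quantization error.
    """
    err = w_fp - w_q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    b = u[:, :rank] * np.sqrt(s[:rank])         # (out_dim, rank)
    a = np.sqrt(s[:rank])[:, None] * vt[:rank]  # (rank, in_dim)
    return a, b

# Sanity check: the initialized branch removes the best rank-4 part
# of the error (Eckart-Young), so the residual norm strictly shrinks.
rng = np.random.default_rng(1)
w_fp = rng.normal(size=(64, 64))
w_q = np.round(w_fp * 2) / 2  # toy "quantized" weights
a, b = lora_init_from_quant_error(w_fp, w_q, rank=4)
print(np.linalg.norm(w_fp - (w_q + b @ a)), "<=", np.linalg.norm(w_fp - w_q))
```

The design intuition is that the LoRA branch begins training already pointed at the error it is supposed to repair, rather than starting from zero, which matches the abstract's claim of more informative initialization and better training stability.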
Problem

Research questions and friction points this paper is trying to address.

Handles severe performance degradation in low-bit diffusion models
Addresses outlier-unfriendly quantizer design and suboptimal initialization
Ensures temporal consistency in quantized diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible Z-Order Residual Mixed Quantization
Object-Oriented Low-Rank Initialization
Memory-based Temporal Relation Distillation (see the sketch after this list)
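For the memory-based distillation, here is a minimal sketch of how an online pixel queue and a relation-matching loss could fit together. It assumes a simple FIFO memory and a KL loss over cosine-similarity relations; the paper's time-aware sampling, temperatures, and loss weighting are omitted, and the names `PixelQueue` and `temporal_relation_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F
from collections import deque

class PixelQueue:
    """FIFO memory of teacher pixel features gathered across denoising steps."""
    def __init__(self, maxlen=4096):
        self.memory = deque(maxlen=maxlen)

    def enqueue(self, feats):  # feats: (N, C) pixel features
        for f in feats.detach():
            self.memory.append(f)

    def tensor(self):
        return torch.stack(list(self.memory))  # (M, C)

def temporal_relation_loss(student_px, teacher_px, queue):
    """KL between student and teacher similarity relations to the queue."""
    mem = F.normalize(queue.tensor(), dim=1)
    rel_s = F.normalize(student_px, dim=1) @ mem.T  # (N, M) relations
    rel_t = F.normalize(teacher_px, dim=1) @ mem.T
    return F.kl_div(F.log_softmax(rel_s, dim=1),
                    F.softmax(rel_t, dim=1), reduction="batchmean")

# Per denoising step: push teacher pixels, then align relational structure
# between the quantized (student) and full-precision (teacher) features.
queue = PixelQueue()
teacher_px = torch.randn(32, 128)  # toy (pixels, channels)
student_px = teacher_px + 0.1 * torch.randn(32, 128)
queue.enqueue(teacher_px)
print(temporal_relation_loss(student_px, teacher_px, queue))
```

Because the queue persists across timesteps, the loss compares each step's features against a memory spanning the whole denoising trajectory, which is how a relation-based distillation can encourage the long-term temporal consistency the abstract describes.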
Weilun Feng
Institute of Computing Technology, Chinese Academy of Sciences
Model Compression, Machine Learning
Chuanguang Yang
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Knowledge Distillation, Representation Learning
Haotong Qin
ETH Zürich
TinyML, Model Compression, Computer Vision, Deep Learning
Yuqi Li
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Xiangqi Li
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China and University of Chinese Academy of Sciences, Beijing 100049, China
Zhulin An
Institute of Computing Technology, Chinese Academy of Sciences
Automatic Deep Learning, Lifelong Learning
Libo Huang
Institute of Computing Technology, Chinese Academy of Sciences
Continual Learning, Neural Data Analysis
Boyu Diao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Fuzhen Zhuang
Institute of Artificial Intelligence, Beihang University, Beijing, China and Zhongguancun Laboratory, Beijing, China
Michele Magno
ETH Zürich
Wireless Sensor Networks, Smart Sensors and Internet of Things, Wake-up Radio, Power Management, Energy Harvesters
Yongjun Xu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Yingli Tian
Distinguished Professor, EE of The City College and CS of the Graduate Center, CUNY
Computer Vision, Machine Learning, Medical Imaging Analysis
Tingwen Huang
Shenzhen University of Advanced Technology, China
Dynamics of Nonlinear Systems, Computational Intelligence, Smart Grid, Control and Optimization