DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation

📅 2024-09-22
🏛️ arXiv.org
📈 Citations: 5
Influential: 2
📄 PDF
🤖 AI Summary
Diffusion models face significant challenges in low-bit quantization: post-training quantization (PTQ) fails, while quantization-aware training (QAT) suffers from low efficiency—primarily due to time-varying, large-magnitude activations that induce high quantization error and distort weight distributions. To address this, we propose Weight Dilation (WD), a mathematically equivalent scaling of unsaturated channel weights that losslessly compresses activation ranges and implicitly incorporates activation quantization error into weight quantization. We further design a Temporal Parallel Quantizer (TPQ) with timestep-adaptive parameters to enhance temporal modeling stability, and introduce Block-level Knowledge Distillation (BKD) for joint, cooperative optimization of weights and activations at ultra-low bit-widths. Our method achieves INT4 quantization across multiple diffusion models, yielding only +0.3 FID degradation, 2.1× inference speedup, and 58% and 42% reductions in memory and training memory usage, respectively—setting a new state-of-the-art in accuracy-efficiency trade-off.

📝 Abstract
Diffusion models have shown excellent performance on various image generation tasks, but their substantial computational cost and large memory footprint hinder low-latency applications in real-world scenarios. Quantization is a promising way to compress and accelerate models. Nevertheless, due to the wide-ranging and time-varying activations in diffusion models, existing methods cannot maintain both accuracy and efficiency for low-bit quantization. To tackle this issue, we propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency. Specifically, we observe numerous unsaturated in-channel weights, which can be cleverly exploited to reduce the range of activations without additional computational cost. Based on this insight, we propose Weight Dilation (WD), which maximally dilates the unsaturated in-channel weights to a constrained range through a mathematically equivalent scaling. WD costlessly absorbs the activation quantization errors into weight quantization: the range of activations decreases, which makes activation quantization easy, while the range of weights remains constant, which makes the model easy to converge in the training stage. Considering that the temporal network leads to time-varying activations, we design a Temporal Parallel Quantizer (TPQ), which sets quantization parameters per time step and supports parallel quantization across different time steps, significantly improving performance and reducing time cost. To further enhance performance while preserving efficiency, we introduce Block-wise Knowledge Distillation (BKD), which aligns the quantized model with the full-precision model at a block level. The simultaneous training of time-step quantization parameters and weights minimizes the time required, and the shorter backpropagation paths decrease the memory footprint of the quantization process.
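The equivalence the abstract relies on can be illustrated with a minimal plain-Python sketch (not the paper's implementation): scaling activation channel k by 1/s_k while scaling the matching in-channel weight row by s_k leaves the layer output Y = X @ W unchanged, and choosing s_k so each unsaturated row is dilated up to the per-tensor weight maximum shrinks the activation range for free. The per-tensor-max dilation bound here is an assumed simplification of the paper's constrained range.

```python
def weight_dilation(W, X, eps=1e-8):
    """Equivalent per-channel scaling: Y = X @ W == (X / s) @ (s * W).
    Each in-channel row k of W is dilated by s_k up to the per-tensor
    weight maximum (a simplified stand-in for the paper's constrained
    range), so the matching activation channel shrinks by 1 / s_k."""
    w_max = max(abs(v) for row in W for v in row)
    scales = []
    for row in W:                         # one row per input channel
        row_max = max(abs(v) for v in row)
        scales.append(w_max / row_max if row_max > eps else 1.0)
    W_dilated = [[s * v for v in row] for s, row in zip(scales, W)]
    X_shrunk = [[v / s for v, s in zip(x_row, scales)] for x_row in X]
    return W_dilated, X_shrunk

def matmul(A, B):
    """Plain-Python matrix product, used to check the equivalence."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

After dilation the product is numerically unchanged while the activation magnitudes drop and the overall weight range stays within its original bound, which is the sense in which the scaling "costlessly absorbs" activation quantization error into the weights.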
Problem

Research questions and friction points this paper is trying to address.

Reduces quantization error in diffusion models
Preserves original weight distribution during QAT
Addresses time-varying activations efficiently
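The time-varying-activation point can be made concrete with a small sketch of a timestep-aware quantizer (assumed interface; the `calibrate`/`quantize` names and min-max calibration are illustrative, not the paper's code): each diffusion time step gets its own scale and zero point, so the quantization grid tracks the activation range at that step.

```python
class TemporalParallelQuantizer:
    """One asymmetric uniform quantizer parameterization per time step.
    In the paper the per-step parameters are trained in parallel; the
    min-max calibration here is a simple stand-in."""
    def __init__(self, n_steps, n_bits=4):
        self.qmax = 2 ** n_bits - 1
        self.scale = [1.0] * n_steps
        self.zero = [0.0] * n_steps

    def calibrate(self, t, acts):
        """Fit scale/zero-point to the activation range seen at step t."""
        lo, hi = min(acts), max(acts)
        self.scale[t] = max((hi - lo) / self.qmax, 1e-8)
        self.zero[t] = lo

    def quantize(self, t, acts):
        """Fake-quantize: map to integer codes for step t, then dequantize."""
        s, z = self.scale[t], self.zero[t]
        codes = [min(self.qmax, max(0, round((a - z) / s))) for a in acts]
        return [z + s * c for c in codes]
```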
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight Dilation preserves original weight range
Temporal Parallel Quantizer handles time-varying activations
Block-wise Knowledge Distillation reduces training resources
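A toy sketch of the block-level distillation idea (plain Python; the 2-bit scalar fake quantizer and grid search stand in for the gradient-based training of block weights and quantization parameters that the paper actually uses): the quantized block is tuned so its output matches the full-precision block's output, so gradients never need to travel beyond a single block.

```python
def block_distill_loss(fp_out, q_out):
    """MSE between full-precision and quantized block outputs."""
    return sum((f - q) ** 2 for f, q in zip(fp_out, q_out)) / len(fp_out)

def fake_quant(w, n_bits=2):
    """Uniform fake quantization of a scalar weight onto a grid in [-1, 1]."""
    step = 2.0 / (2 ** n_bits - 1)
    return max(-1.0, min(1.0, round(w / step) * step))

# Toy "block": y = w * x. Search for the weight whose quantized version
# best reproduces the full-precision teacher output -- a stand-in for
# one block's short-backprop training loop.
x = [0.5, -1.0, 2.0]
w_fp = 0.8
teacher = [w_fp * v for v in x]
best_w = min((fake_quant(w_fp + d * 0.01) for d in range(-50, 51)),
             key=lambda wq: block_distill_loss(teacher, [wq * v for v in x]))
```

Because the loss compares outputs of one block only, the memory cost scales with a single block's activations rather than the full network, matching the abstract's point about shorter backpropagation paths.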
Xuewen Liu
Institute of Automation, Chinese Academy of Sciences
Model compression
Zhikai Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qingyi Gu
Institute of Automation, Chinese Academy of Sciences
High-speed vision; cell analysis