🤖 AI Summary
Diffusion models are hard to deploy under low-bit quantization because of three problems: imbalanced activation distributions, imprecise temporal information, and the sensitivity of specific modules to perturbations. As a result, existing methods fail to preserve generation quality under 4-bit weight/activation (W4A4) quantization. To address this, we propose a selective finetuning paradigm: we identify the quantized layers that hold vital temporal information and the layers most sensitive to reduced bit-width, and finetune only those so that the quantized model adapts to the activation distribution and retains meaningful temporal information. This keeps the number of updated parameters small. Our approach is the first to generate readable images with fully 4-bit (W4A4) Stable Diffusion, and it attains state-of-the-art performance on three high-resolution image generation tasks under various bit-width settings. This work offers an efficient and practical route to deploying diffusion models on resource-constrained devices.
📝 Abstract
The practical deployment of diffusion models still suffers from high memory and time overhead. While quantization paves the way for compression and acceleration, existing methods fail when models are quantized to low bit-widths. In this paper, we empirically identify three properties of quantized diffusion models that compromise the efficacy of current methods: imbalanced activation distributions, imprecise temporal information, and the vulnerability of specific modules to perturbations. To alleviate the added difficulty that distribution imbalance creates for low-bit quantization, we propose finetuning the quantized model so that it better adapts to the activation distribution. Building on this idea, we identify two critical types of quantized layers, those holding vital temporal information and those sensitive to reduced bit-width, and finetune them to mitigate performance degradation efficiently. We empirically verify that our approach modifies the activation distribution and provides meaningful temporal information, which makes quantization easier and more accurate. Our method is evaluated on three high-resolution image generation tasks and achieves state-of-the-art performance under various bit-width settings. It is also the first method to generate readable images with fully 4-bit (i.e., W4A4) Stable Diffusion. Code is available at https://github.com/hatchetProject/QuEST.
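To illustrate why imbalanced activation distributions make low-bit quantization difficult, the following is a minimal sketch using a generic uniform min-max quantizer. It is not the method from the paper or its repository. The function names `fake_quantize` and `mean_abs_error` and the toy data are assumptions made only for this illustration.

```python
def fake_quantize(values, n_bits=4):
    """Uniform min-max quantization followed by dequantization.

    Generic illustrative quantizer (an assumption, not the paper's scheme).
    """
    lo, hi = min(values), max(values)
    levels = 2 ** n_bits - 1                  # 15 steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Snap each value to the nearest grid point, then map back to floats.
    return [round((v - lo) / scale) * scale + lo for v in values]


def mean_abs_error(values, n_bits=4):
    """Average absolute error introduced by quantizing `values`."""
    dequantized = fake_quantize(values, n_bits)
    return sum(abs(a - b) for a, b in zip(values, dequantized)) / len(values)


# Toy "activations": evenly spread in [-1, 1], with and without one outlier.
balanced = [i / 100.0 for i in range(-100, 101)]
imbalanced = balanced + [50.0]

err_balanced = mean_abs_error(balanced)
err_imbalanced = mean_abs_error(imbalanced)

print(f"4-bit error, balanced:   {err_balanced:.4f}")
print(f"4-bit error, imbalanced: {err_imbalanced:.4f}")
print(f"8-bit error, imbalanced: {mean_abs_error(imbalanced, n_bits=8):.4f}")
```

In this toy setting a single outlier stretches the quantization range, so most of the 16 available levels are spent on values that never occur and the bulk of the distribution collapses onto one or two grid points. With 8 bits the same outlier does much less damage, which is consistent with the abstract's observation that the imbalance becomes a more serious problem at low bit-widths.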