SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

📅 2024-11-07
🏛️ arXiv.org
📈 Citations: 10
Influential: 0
🤖 AI Summary
Diffusion model deployment faces a fundamental trade-off between high memory consumption and low-latency inference, and accuracy degrades severely under aggressive 4-bit quantization. This work presents SVDQuant, a post-training framework that quantizes both the weights and activations of diffusion models to 4 bits. Instead of smoothing-based calibration, which merely redistributes outliers between weights and activations, SVDQuant first migrates activation outliers into the weights and then absorbs the resulting weight outliers into a high-precision low-rank branch obtained via singular value decomposition (SVD), easing quantization on both sides. To keep the low-rank branch from negating the speedup, the co-designed Nunchaku inference engine fuses its kernels into those of the low-bit branch and additionally supports off-the-shelf LoRAs without re-quantization. On the 12B FLUX.1 model, the method reduces GPU memory usage by 3.5× and achieves a 3.0× speedup over a 4-bit weight-only baseline on a 16GB laptop RTX 4090 while maintaining image fidelity comparable to the 16-bit model. Both the quantization library and the Nunchaku inference engine are open-sourced.

📝 Abstract
Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.
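The two-step decomposition described in the abstract can be sketched in NumPy. This is an illustrative toy, not the authors' implementation: the function name, the per-channel max smoothing scale, and the symmetric per-tensor quantizer are all assumptions made for clarity.

```python
import numpy as np

def svdquant_sketch(W, X, rank=32, n_bits=4):
    """Toy sketch of the SVDQuant idea: migrate activation outliers
    into the weights, peel the weight outliers off with a rank-r SVD
    branch kept in high precision, and 4-bit quantize only the residual.
    W: (in_features, out_features), X: (batch, in_features)."""
    # 1) Smoothing: per-channel scaling shifts activation outliers
    #    into the weights (smoothing-style migration; scale choice here
    #    is a simple per-channel max, an assumption of this sketch).
    s = np.abs(X).max(axis=0) + 1e-8
    X_hat = X / s                           # smoothed activations
    W_hat = W * s[:, None]                  # outliers consolidated in weights

    # 2) Low-rank branch: SVD absorbs the dominant (outlier-heavy)
    #    components of the weights; this branch stays high precision.
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]             # (in, r)
    L2 = Vt[:rank]                          # (r, out)
    R = W_hat - L1 @ L2                     # residual is easier to quantize

    # 3) Symmetric 4-bit quantization of the residual only.
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.abs(R).max(), 1e-12) / qmax
    R_q = np.round(R / scale).clip(-qmax, qmax) * scale

    # Output sums the high-precision low-rank path and the 4-bit path.
    return X_hat @ (L1 @ L2) + X_hat @ R_q
```

Note that running the two paths as separate matmuls, as above, is exactly the naïve scheme the abstract warns about: the activations are read twice, which is why Nunchaku fuses the low-rank kernels into the low-bit ones.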
Problem

Research questions and friction points this paper is trying to address.

Large diffusion models demand excessive memory and suffer high latency, hindering deployment.
4-bit quantization of both weights and activations could cut memory and latency, but accuracy degrades severely at this precision.
Outliers in weights and activations make conventional smoothing-based post-training quantization insufficient.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SVDQuant absorbs outliers into a high-precision low-rank branch obtained via SVD.
The Nunchaku engine fuses the low-rank-branch kernels into the low-bit kernels to cut redundant memory access.
Off-the-shelf LoRAs run without re-quantization, enhancing efficiency.
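The LoRA claim above follows from the architecture: since the quantized layer already carries a high-precision low-rank branch L1 @ L2, a LoRA update α·A·B of the same form can be appended to that branch by block concatenation, leaving the 4-bit weights untouched. A minimal sketch (illustrative names, not Nunchaku's API):

```python
import numpy as np

def fuse_lora_into_lowrank(L1, L2, A, B, alpha=1.0):
    """Fold a LoRA update alpha * (A @ B) into an existing low-rank
    branch L1 @ L2 by concatenating along the rank dimension, so
    (L1|aA) @ (L2;B) = L1 @ L2 + alpha * A @ B. The quantized
    weights are never touched, hence no re-quantization."""
    L1_fused = np.concatenate([L1, alpha * A], axis=1)  # (in,  r + r_lora)
    L2_fused = np.concatenate([L2, B], axis=0)          # (r + r_lora, out)
    return L1_fused, L2_fused
```

Swapping adapters thus amounts to replacing the appended columns and rows, which is why the engine can run off-the-shelf LoRAs plug-and-play.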