🤖 AI Summary
Diffusion models suffer from high computational overhead that hinders deployment, and 4-bit post-training quantization (PTQ) degrades generation fidelity because larger step sizes amplify rounding errors on small-magnitude activations, causing texture loss. This work first identifies the critical role of low-magnitude activations in preserving generative fidelity, then proposes QuaRTZ, an efficient, floating-point-free 4-bit PTQ method: outliers are handled by 8-bit min-max quantization, leading-zero suppression compresses the result into compact 4-bit storage, and residual truncation together with zero suppression preserves least-significant-bit (LSB) information at ultra-low bitwidth. Evaluated on FLUX.1-schnell, the method achieves an FID of 6.98, surpassing SVDQuant (which relies on FP16 auxiliary branches), demonstrating strong generalizability across diverse activation distributions and setting a new state of the art for float-free 4-bit diffusion-model quantization.
📝 Abstract
Diffusion models achieve high-quality image generation but face deployment challenges due to their high computational requirements. Although 8-bit outlier-aware post-training quantization (PTQ) matches full-precision performance, extending PTQ to 4 bits remains challenging. Larger step sizes in 4-bit quantization amplify rounding errors in dense, low-magnitude activations, leading to the loss of fine-grained textures. We hypothesize that not only outliers but also small activations are critical for texture fidelity. To this end, we propose Quantization via Residual Truncation and Zero Suppression (QuaRTZ), a 4-bit PTQ scheme for diffusion models. QuaRTZ applies 8-bit min-max quantization for outlier handling and compresses to 4 bits via leading-zero suppression to retain LSBs, thereby preserving texture details. Our approach reduces rounding errors and improves quantization efficiency by balancing outlier preservation and LSB precision. Both theoretical derivations and empirical evaluations demonstrate the generalizability of QuaRTZ across diverse activation distributions. Notably, 4-bit QuaRTZ achieves an FID of 6.98 on FLUX.1-schnell, outperforming SVDQuant, which requires auxiliary FP16 branches.
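To make the two-stage idea concrete, here is a toy numerical sketch of the general scheme the abstract describes: activations are first quantized to 8 bits with min-max scaling (so outliers set the range), then each small group of values is shifted left to discard shared leading zeros before truncating to 4 bits, so low-magnitude groups keep their least-significant bits. This is an illustrative assumption-laden sketch, not the authors' implementation; the group size, the per-group shift, and the function names (`quartz_like_quantize`, `quartz_like_dequantize`) are all hypothetical.

```python
import numpy as np

def quartz_like_quantize(x, group_size=16):
    """Toy sketch: 8-bit min-max quantization, then per-group
    leading-zero suppression and residual truncation to 4 bits.
    Group size and rounding details are assumptions, not the paper's spec."""
    # Stage 1: asymmetric 8-bit min-max quantization; outliers define the range.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    q8 = np.round((x - lo) / scale).astype(np.uint16)  # values in [0, 255]

    q4 = np.empty_like(q8)
    shifts = []
    for start in range(0, len(q8), group_size):
        g = q8[start:start + group_size]
        # Stage 2: leading-zero suppression. Shift the group left until its
        # maximum fills 8 bits; small-magnitude groups get large shifts,
        # so their LSBs survive the subsequent truncation.
        gmax = int(g.max())
        shift = 8 - gmax.bit_length() if gmax > 0 else 8
        # Residual truncation: keep only the top 4 bits after the shift.
        q4[start:start + group_size] = (g << shift) >> 4
        shifts.append(shift)
    return q4, shifts, scale, lo

def quartz_like_dequantize(q4, shifts, scale, lo, group_size=16):
    x_hat = np.empty(len(q4), dtype=np.float64)
    for i, start in enumerate(range(0, len(q4), group_size)):
        g = q4[start:start + group_size].astype(np.int64)
        # Undo the truncation (<< 4) and the per-group shift.
        x_hat[start:start + group_size] = (g << 4) >> shifts[i]
    return x_hat * scale + lo
```

In this sketch the per-group shift plays the role of zero suppression: a group of small activations is stored with the same 4-bit budget as an outlier group, but its codes represent finer steps, which is the intuition behind preserving texture detail at 4 bits.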