Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in W4A4 quantization of large video diffusion Transformers: sparse, large-magnitude activation outliers; activation distributions that shift significantly across denoising timesteps; and differing quantization sensitivity between the experts of a Mixture-of-Experts (MoE) architecture. To tackle these issues, the authors propose an expert- and timestep-aware post-training quantization framework that integrates SVDQuant for low-rank outlier compensation, GPTQ for reconstruction-aware quantization of the residual weights, and a per-layer activation clipping-ratio search run separately for each expert and each timestep bin. Evaluated on the OpenS2V-Eval benchmark, the method reduces peak GPU memory by 59.3% relative to the BF16 baseline, with only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, supporting high-fidelity W4A4 inference for MoE-based video DiT models.
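The low-rank compensation idea can be illustrated with a toy sketch. This is not the authors' implementation: it is a pure-Python illustration that assumes a tiny dense weight matrix, extracts the dominant singular components by power iteration with deflation, keeps them in full precision, and quantizes only the residual. Plain round-to-nearest stands in for GPTQ here, and SVDQuant's activation-smoothing step is omitted. All function names (`top_singular_triplet`, `split_low_rank`, `fake_quant_int4`, `frobenius_error`) are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def top_singular_triplet(matrix, iters=50):
    """Approximate the largest singular value and vectors by power iteration."""
    transposed = [list(col) for col in zip(*matrix)]
    v = [1.0] * len(matrix[0])  # assumes the start vector is not orthogonal to the top direction
    for _ in range(iters):
        w = matvec(transposed, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = matvec(matrix, v)
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def split_low_rank(weight, rank):
    """Split weight into (low_rank, residual) with weight = low_rank + residual."""
    residual = [row[:] for row in weight]
    low_rank = [[0.0] * len(weight[0]) for _ in weight]
    for _ in range(rank):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(len(weight)):
            for j in range(len(weight[0])):
                component = sigma * u[i] * v[j]
                low_rank[i][j] += component
                residual[i][j] -= component
    return low_rank, residual


def fake_quant_int4(matrix):
    """Per-tensor symmetric 4-bit round-to-nearest (quantize, then dequantize)."""
    max_abs = max(abs(x) for row in matrix for x in row)
    if max_abs == 0.0:
        return [row[:] for row in matrix]
    scale = max_abs / 7.0  # symmetric integer grid -7..7
    return [[max(-7, min(7, round(x / scale))) * scale for x in row] for row in matrix]


def frobenius_error(reference, approx):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((r - a) ** 2
                         for ref_row, app_row in zip(reference, approx)
                         for r, a in zip(ref_row, app_row)))
```

When the large-magnitude entries are concentrated in a few singular directions, the residual has a much smaller dynamic range than the original matrix, so the 4-bit grid covers it with a finer step. The full-precision low-rank branch adds some compute and memory in exchange.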

📝 Abstract

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
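The timestep-bin-wise clipping-ratio search can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes calibration activations for one layer have already been grouped by a hypothetical `(expert_name, timestep_bin)` key, uses mean squared error of symmetric 4-bit fake quantization as the selection criterion, and searches over a small hand-picked candidate grid. The abstract does not specify the objective, the grid, or the number of bins.

```python
def fake_quant_act_int4(values, clip_ratio):
    """Symmetric 4-bit fake quantization with the range shrunk by clip_ratio."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = clip_ratio * max_abs / 7.0  # symmetric integer grid -7..7
    # Values beyond the clipped range saturate at +/-7 * scale.
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


def mean_squared_error(reference, approx):
    """Mean squared difference between two equal-length sequences."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


def search_clip_ratio(values, candidates):
    """Return the candidate ratio with the lowest quantization MSE."""
    return min(
        candidates,
        key=lambda ratio: mean_squared_error(values, fake_quant_act_int4(values, ratio)),
    )


def calibrate_layer(acts_by_group, candidates):
    """Run an independent search for every (expert, timestep_bin) group.

    acts_by_group maps a hypothetical (expert_name, timestep_bin) key to a
    flat list of calibration activations collected for one layer.
    """
    return {
        key: search_clip_ratio(values, candidates)
        for key, values in acts_by_group.items()
    }


# Toy calibration data: one group dominated by small values with a single
# outlier, and one group that already fills the 4-bit grid evenly.
calibration = {
    ("high_noise", 0): [0.4] * 1000 + [7.0],
    ("low_noise", 0): [float(v) for v in range(-7, 8)],
}
clip_ratios = calibrate_layer(calibration, candidates=[0.5, 0.75, 1.0])
```

In this toy data the outlier-heavy group prefers an aggressive clip because saturating one rare outlier costs less than coarsening the step size for the thousand small values, while the evenly spread group keeps the full range. A single global ratio would be a poor fit for at least one of the two groups, which is the motivation for searching per expert and per timestep bin.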

Problem

Research questions and friction points this paper is trying to address.

W4A4 quantization

activation outliers

timestep-dependent activation

Mixture-of-Experts

video diffusion Transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

SVDQuant

GPTQ

timestep-aware quantization