🤖 AI Summary
Video diffusion Transformers suffer from large calibration variance, low quantization accuracy, and high inference overhead due to long spatiotemporal token sequences. To address these challenges, we propose a post-training quantization framework that jointly leverages Hessian-aware salient data selection and attention-guided sparse token distillation—precisely identifying calibration-critical samples and information-dense tokens within sequences to alleviate optimization difficulties inherent in long-sequence modeling. By incorporating the dynamic characteristics of the diffusion process and self-attention distributions, our method optimizes a W4A6 quantization scheme. It achieves lossless generation quality while delivering 3.9× model compression and 1.3× inference speedup. To the best of our knowledge, this is the first work to synergistically exploit second-order curvature (Hessian) information and attention mechanisms for quantizing video diffusion models, significantly enhancing their feasibility for efficient deployment.
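The calibration-critical sample selection described above can be sketched with a simple diagonal-Hessian proxy: score each candidate sample by an empirical-Fisher term weighted by parameter magnitude, then keep the top-k. This is a minimal NumPy illustration under assumed shapes, not the paper's actual method; the function names and the `g² · w²` proxy are our own simplifying assumptions.

```python
import numpy as np

def hessian_saliency_scores(grads, weights):
    """Diagonal-Hessian proxy: score each candidate calibration sample
    by sum_i g_i^2 * w_i^2, i.e. the empirical Fisher diagonal weighted
    by squared parameter magnitude (an assumed stand-in for true
    second-order sensitivity).

    grads:   (num_samples, num_params) per-sample gradients
    weights: (num_params,) flattened model weights
    returns: (num_samples,) saliency score per sample
    """
    return (grads ** 2 * weights ** 2).sum(axis=1)

def select_calibration_set(grads, weights, k):
    """Keep the k samples with the largest saliency scores."""
    scores = hessian_saliency_scores(grads, weights)
    return np.argsort(scores)[::-1][:k]

# Toy usage: 16 candidate samples, 8 parameters, keep the 4 most salient.
rng = np.random.default_rng(0)
grads = rng.normal(size=(16, 8))
weights = rng.normal(size=8)
idx = select_calibration_set(grads, weights, k=4)
print(idx)
```

In practice the scores would be computed per diffusion timestep, so the selected set covers the dynamics of the denoising trajectory rather than a single noise level.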
📝 Abstract
Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose **S²Q-VDiT**, a post-training quantization framework for V-DMs that leverages **S**alient data and **S**parse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce *Hessian-aware Salient Data Selection*, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose *Attention-guided Sparse Token Distillation*, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, S²Q-VDiT achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration. Code will be available at [https://github.com/wlfeng0509/s2q-vdit](https://github.com/wlfeng0509/s2q-vdit).
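The token-wise distillation idea can likewise be sketched: rank tokens by how much attention they receive (column sums of the attention map, averaged over heads), then distill the quantized model toward the full-precision one only on the top fraction of tokens, weighting each kept token by its normalized importance. This is a minimal NumPy sketch under assumed tensor shapes; the paper's actual loss and sparsity schedule may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_token_distill_loss(attn_logits, fp_out, q_out, keep_ratio=0.25):
    """Attention-guided sparse token distillation (illustrative).

    attn_logits: (heads, T, T) pre-softmax attention scores
    fp_out:      (T, d) full-precision token outputs (teacher)
    q_out:       (T, d) quantized token outputs (student)
    keep_ratio:  fraction of tokens to distill on
    """
    attn = softmax(attn_logits, axis=-1)           # (heads, T, T)
    # Importance of token j = total attention it receives, averaged over heads.
    importance = attn.mean(axis=0).sum(axis=0)     # (T,)
    k = max(1, int(keep_ratio * importance.size))
    top = np.argsort(importance)[::-1][:k]         # indices of salient tokens
    w = importance[top] / importance[top].sum()    # normalized token weights
    per_tok = ((fp_out[top] - q_out[top]) ** 2).mean(axis=-1)  # (k,) MSE
    return float((w * per_tok).sum())

# Toy usage: 12 tokens, 4-dim features, 2 heads; student is a perturbed teacher.
rng = np.random.default_rng(1)
attn_logits = rng.normal(size=(2, 12, 12))
fp = rng.normal(size=(12, 4))
q = fp + 0.01 * rng.normal(size=(12, 4))
loss = sparse_token_distill_loss(attn_logits, fp, q)
print(loss)
```

Restricting the loss to high-attention tokens is what eases the long-sequence optimization: gradients concentrate on the information-dense tokens instead of being diluted across the full spatiotemporal sequence.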