$\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video diffusion transformers suffer from high calibration variance, low quantization accuracy, and high inference overhead due to their extremely long spatiotemporal token sequences. To address these challenges, we propose a post-training quantization framework that jointly leverages Hessian-aware salient data selection and attention-guided sparse token distillation, precisely identifying calibration-critical samples and information-dense tokens within long sequences to alleviate the optimization difficulties inherent in long-sequence modeling. By incorporating the dynamic characteristics of the diffusion process and self-attention distributions, our method realizes a W4A6 quantization scheme that achieves lossless generation quality while delivering 3.9× model compression and a 1.3× inference speedup. To the best of our knowledge, this is the first work to synergistically exploit second-order curvature (Hessian) information and attention mechanisms for quantizing video diffusion models, significantly improving their feasibility for efficient deployment.
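
The paper's exact selection criterion is not reproduced here, but as a minimal sketch, Hessian-aware salient data selection could score candidate calibration samples with a diagonal-Fisher proxy for the Hessian trace (sum of squared parameter gradients) and keep the top scorers; the function names, the proxy, and the simplified forward signature below are all illustrative assumptions:

```python
import torch

def hessian_salience_score(model, sample, loss_fn):
    # Diagonal-Fisher proxy for Hessian-trace sensitivity: sum of squared
    # parameter gradients on this sample. (Hypothetical proxy, not the
    # paper's exact criterion; a real diffusion forward pass would also
    # take timesteps and conditioning.)
    model.zero_grad()
    loss = loss_fn(model(sample))
    loss.backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            score += p.grad.detach().pow(2).sum().item()
    return score

def select_salient_calibration_set(model, candidates, loss_fn, k=32):
    # Keep the k highest-scoring candidates as the calibration set.
    scores = [hessian_salience_score(model, s, loss_fn) for s in candidates]
    top = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)[:k]
    return [candidates[i] for i in top]
```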

📝 Abstract
Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose \textbf{$\text{S}^2$Q-VDiT}, a post-training quantization framework for V-DMs that leverages \textbf{S}alient data and \textbf{S}parse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce \textit{Hessian-aware Salient Data Selection}, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose \textit{Attention-guided Sparse Token Distillation}, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, $\text{S}^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at \href{https://github.com/wlfeng0509/s2q-vdit}{https://github.com/wlfeng0509/s2q-vdit}.
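
As a hedged illustration of attention-guided sparse token distillation, the sketch below weights a per-token distillation loss between the full-precision teacher and the quantized student by the attention mass each token receives, keeping only the most-attended tokens; the tensor shapes, the top_ratio parameter, and the weighting scheme are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def sparse_token_distillation_loss(student_tokens, teacher_tokens,
                                   attn_probs, top_ratio=0.1):
    # student_tokens / teacher_tokens: (batch, seq_len, dim)
    # attn_probs: (batch, heads, query, key) from the FP teacher.
    # Attention mass each (key) token receives, pooled over heads/queries.
    importance = attn_probs.mean(dim=1).sum(dim=1)          # (batch, seq_len)
    # Sparse selection: keep only the top-k most-attended tokens.
    k = max(1, int(top_ratio * importance.shape[-1]))
    vals, idx = importance.topk(k, dim=-1)
    weights = torch.zeros_like(importance).scatter(-1, idx, vals)
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    # Weighted per-token MSE between quantized student and FP teacher.
    per_token = F.mse_loss(student_tokens, teacher_tokens,
                           reduction="none").mean(dim=-1)   # (batch, seq_len)
    return (weights * per_token).sum(dim=-1).mean()
```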
Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs in video diffusion models
Addresses high calibration variance in long token sequences
Improves quantization performance with salient data selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hessian-aware Salient Data Selection
Attention-guided Sparse Token Distillation
Post-training W4A6 quantization for video diffusion (see the quantization sketch after this list)
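
For concreteness, here is a minimal simulated-W4A6 sketch using uniform symmetric fake quantization (4-bit per-output-channel weights, 6-bit per-tensor activations); real PTQ pipelines calibrate scales from activation statistics offline rather than per call, and this is not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def fake_quantize(x, n_bits, per_channel_dim=None):
    # Uniform symmetric quantize-dequantize, so tensors stay in float
    # for simulation. Scales are taken from the current tensor here;
    # a real pipeline would use calibrated statistics instead.
    qmax = 2 ** (n_bits - 1) - 1
    if per_channel_dim is not None:
        dims = [d for d in range(x.dim()) if d != per_channel_dim]
        scale = x.abs().amax(dim=dims, keepdim=True).clamp_min(1e-8) / qmax
    else:
        scale = x.abs().max().clamp_min(1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def w4a6_linear(x, weight, bias=None):
    # Simulate one W4A6 linear layer: 4-bit weights, 6-bit activations.
    w_q = fake_quantize(weight, n_bits=4, per_channel_dim=0)
    x_q = fake_quantize(x, n_bits=6)
    return F.linear(x_q, w_q, bias)
```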