🤖 AI Summary
Video diffusion transformers incur prohibitive computational and memory overhead, hindering practical deployment. To address this, we propose a synergistic compression framework integrating quantization and attention sparsification. Our method combines multi-scale saliency-guided attention distillation with second-order sparse attention reparameterization, dynamically balancing quantization noise mitigation and sparse information preservation. We further enhance the robustness and fidelity of the compressed attention mechanism via temporal-stability-driven residual reparameterization, global structural guidance, and local saliency supervision. Evaluated on HunyuanVideo-13B, our approach achieves 20.88 PSNR (surpassing the best quantization baseline by +4.03 dB) while reducing model storage by 3.68× and accelerating end-to-end inference by 1.88×. These results significantly outperform existing single-dimension compression methods.
📝 Abstract
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective: the sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose **QuantSparse**, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce *Multi-Scale Salient Attention Distillation*, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop *Second-Order Sparse Attention Reparameterization*, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a **3.68×** reduction in storage and **1.88×** acceleration in end-to-end inference. Our code will be released at https://github.com/wlfeng0509/QuantSparse.
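To make the second-order residual idea concrete, here is a minimal NumPy sketch. All details below are illustrative assumptions, not the paper's implementation: the top-k sparsification is a toy stand-in for the actual sparse pattern, the linearly drifting inputs mimic slowly varying activations across denoising timesteps, and `attention`, `sparse_attention`, and `second_order` are hypothetical names. The point is only the mechanism: calibrate the residual between dense and sparse attention at two timesteps, cache its first difference (the "second-order residual"), and at later steps extrapolate the residual onto the cheap sparse output instead of running dense attention.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v, mask=None):
    """Plain softmax attention; `mask` marks score positions to keep."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def sparse_attention(q, k, v, keep_ratio=0.5):
    """Toy per-row top-k sparsification (stand-in for the real pattern)."""
    scores = q @ k.T
    kk = max(1, int(keep_ratio * scores.shape[-1]))
    thresh = np.sort(scores, axis=-1)[:, -kk][:, None]
    return attention(q, k, v, mask=scores >= thresh)

# Inputs at three consecutive denoising "timesteps", drifting slowly.
q0, k0, v0 = (rng.standard_normal((8, 16)) for _ in range(3))
dq, dk, dv = (0.05 * rng.standard_normal((8, 16)) for _ in range(3))
inputs = [(q0 + t * dq, k0 + t * dk, v0 + t * dv) for t in range(3)]

# Calibration on steps 0 and 1: residual between dense and sparse outputs,
# and its first difference -- the second-order term assumed to be
# temporally stable across timesteps.
r = [attention(*x) - sparse_attention(*x) for x in inputs[:2]]
second_order = r[1] - r[0]

# Inference at step 2: extrapolate the residual (no dense attention needed)
# and add it back to the sparse output to recover lost information.
q2, k2, v2 = inputs[2]
corrected = sparse_attention(q2, k2, v2) + (r[1] + second_order)

# Reference comparison against full dense attention at step 2.
dense_ref = attention(q2, k2, v2)
err_sparse = np.abs(dense_ref - sparse_attention(q2, k2, v2)).mean()
err_corrected = np.abs(dense_ref - corrected).mean()
```

Under this toy linear drift, the cached second-order term extrapolates the residual cheaply; the paper's contribution is showing that such residuals are stable enough across real denoising timesteps for this to hold at scale.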