DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Diffusion Transformers (DiTs) suffer from quadratic computational complexity and severe communication bottlenecks due to dense 3D full attention, which can account for up to 95% of end-to-end training time. To address this, we propose DSV, a dynamic sparse acceleration framework. DSV introduces: (1) the first dynamic attention-sparsification model, coupled with a two-stage sparse training algorithm that enables fine-grained, adaptive sparsity patterns; (2) a hybrid, sparsity-aware context parallelism scheme that handles the heterogeneity of sparsity across attention heads and blocks; and (3) customized sparse attention kernels integrated with distributed sparse communication optimizations. Evaluated on high-resolution, long-duration video training, DSV achieves up to 3.02× higher training throughput with near-lossless generation quality, significantly reducing the cost of large-scale video DiT training.
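
For intuition, here is a minimal plain-PyTorch sketch of dynamic block-level attention sparsification. It is an illustrative construction, not DSV's actual estimator or kernels; the shapes, the `keep_ratio` parameter, and the pooling-based importance estimate are assumptions. Per head, it estimates which key blocks matter for each query block from mean-pooled Q/K scores, keeps the top-k, and masks the rest.

```python
# Illustrative sketch only -- not DSV's implementation.
import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """q, k, v: [batch, heads, seq, head_dim]; seq divisible by `block`."""
    B, H, S, D = q.shape
    nb = S // block
    # Cheap importance estimate: scores between mean-pooled Q/K blocks.
    q_blk = q.view(B, H, nb, block, D).mean(dim=3)            # [B, H, nb, D]
    k_blk = k.view(B, H, nb, block, D).mean(dim=3)
    est = q_blk @ k_blk.transpose(-1, -2) / D ** 0.5          # [B, H, nb, nb]
    k_keep = max(1, int(keep_ratio * nb))
    keep = est.topk(k_keep, dim=-1).indices                   # kept key blocks
    block_mask = torch.zeros(B, H, nb, nb, dtype=torch.bool, device=q.device)
    block_mask.scatter_(-1, keep, True)
    # Expand the block mask to token resolution and run masked attention.
    mask = block_mask.repeat_interleave(block, 2).repeat_interleave(block, 3)
    attn = (q @ k.transpose(-1, -2)) / D ** 0.5
    attn = attn.masked_fill(~mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v
```

A production sparse kernel would skip the masked blocks entirely rather than materialize the full score matrix; the masking here only marks which computation such a kernel would avoid.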

📝 Abstract
Diffusion Transformers (DiTs) have shown remarkable performance in modeling and generating high-quality videos. However, the quadratic computational complexity of the 3D full attention mechanism presents significant challenges in scaling video DiT training, especially for high-definition and lengthy videos, where attention can dominate up to 95% of the end-to-end time and necessitate specialized communication paradigms to handle large input sizes. This paper introduces DSV, a novel framework designed to accelerate and scale the training of video DiTs by leveraging the inherent dynamic attention sparsity throughout the training process. DSV employs a two-stage training algorithm that exploits sparsity patterns, focusing computation on critical elements, supported by efficient, tailored kernels. To accommodate the new sparsity dimension, we develop a hybrid sparsity-aware context parallelism that effectively scales to large inputs by addressing the heterogeneity of sparsity across attention heads and blocks, resulting in optimized sparse computation and communication. Extensive evaluations demonstrate that DSV achieves up to a 3.02x gain in training throughput with nearly no quality degradation.
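
The heterogeneity point in the abstract is at heart a load-balancing problem: under sparsity, attention heads no longer cost the same, so an even head split across context-parallel ranks leaves some GPUs idle. The sketch below is an assumed formulation with hypothetical per-head densities, not the paper's scheduler; it shows the classic greedy longest-processing-time packing that such a scheme could build on.

```python
# Assumed formulation with hypothetical costs -- not the paper's scheduler.
import heapq

def assign_heads(head_density, num_ranks):
    """Greedy longest-processing-time packing of heads onto ranks.

    head_density[h]: fraction of attention blocks head h actually computes,
    used as a proxy for its cost under sparsity.
    """
    heap = [(0.0, r, []) for r in range(num_ranks)]   # (load, rank, heads)
    heapq.heapify(heap)
    for h in sorted(range(len(head_density)),
                    key=lambda i: head_density[i], reverse=True):
        load, r, heads = heapq.heappop(heap)          # least-loaded rank
        heads.append(h)
        heapq.heappush(heap, (load + head_density[h], r, heads))
    return {r: heads for _, r, heads in heap}

# Hypothetical per-head densities for 8 heads, split over 4 ranks.
print(assign_heads([0.9, 0.1, 0.5, 0.4, 0.8, 0.2, 0.6, 0.3], 4))
```

With the densities above, the heavy head (0.9) lands alone while light heads are grouped, keeping per-rank work near the 0.95 average instead of the 1.3 a naive contiguous split would produce.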
Problem

Research questions and friction points this paper is trying to address.

Accelerate video DiT training, where 3D full attention can consume up to 95% of end-to-end time
Handle large video inputs (high-definition, lengthy clips) that force specialized communication paradigms
Optimize sparse computation and communication jointly, despite sparsity that varies across heads and blocks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Sparsity Exploitation: tracks attention sparsity patterns that shift throughout training
Two-stage Training Algorithm: dense training with sparsity estimation, then sparse training on the critical elements (illustrated in the sketch after this list)
Hybrid Sparsity-aware Parallelism: balances heterogeneous per-head and per-block sparsity across devices for both computation and communication
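
As referenced above, here is a self-contained toy of a two-stage schedule. The structure is assumed, not the paper's algorithm: stage 1 trains with dense attention while accumulating per-head block-importance statistics; stage 2 freezes those statistics into top-k block masks, with masked attention standing in for true sparse kernels. All sizes and the objective are toy placeholders.

```python
# Toy two-stage schedule -- assumed structure, not the paper's algorithm.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, S, D, BLK = 2, 4, 128, 32, 32           # toy sizes; model dim = H * D
nb = S // BLK
proj = torch.nn.Linear(H * D, 3 * H * D)      # fused QKV projection
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
block_stats = torch.zeros(H, nb, nb)          # running per-head block mass

def attention(x, mask=None):
    q, k, v = proj(x).view(B, S, 3, H, D).permute(2, 0, 3, 1, 4)
    attn = (q @ k.transpose(-1, -2)) / D ** 0.5
    if mask is not None:                      # stage 2: sparse pattern
        attn = attn.masked_fill(~mask, float("-inf"))
    p = F.softmax(attn, dim=-1)
    return p, (p @ v).transpose(1, 2).reshape(B, S, H * D)

for step in range(200):
    x = torch.randn(B, S, H * D)
    if step < 100:                            # stage 1: dense + profiling
        p, out = attention(x)
        with torch.no_grad():                 # accumulate block-level mass
            blk = p.view(B, H, nb, BLK, nb, BLK).sum(dim=(3, 5)).mean(dim=0)
            block_stats.mul_(0.99).add_(0.01 * blk)
    else:                                     # stage 2: masked (sparse)
        keep = block_stats.topk(max(1, nb // 4), dim=-1).indices
        bmask = torch.zeros(H, nb, nb, dtype=torch.bool)
        bmask.scatter_(-1, keep, True)
        mask = bmask.repeat_interleave(BLK, 1).repeat_interleave(BLK, 2)
        _, out = attention(x, mask)
    loss = out.pow(2).mean()                  # placeholder objective
    opt.zero_grad(); loss.backward(); opt.step()
```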