Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe computational bottleneck imposed by the quadratic complexity of self-attention in video diffusion models, which hinders generation efficiency. Existing sparse attention methods struggle to balance quality and speed due to either static designs or high-overhead sampling procedures. To overcome this limitation, we propose MOD-DiT, a sampling-free dynamic sparse attention framework that introduces, for the first time, a distribution-mixture-based dynamic sparsification mechanism. Our approach models evolving attention patterns through a two-stage process: it first constructs a linear approximation from early denoising steps to predict attention masks, then dynamically preserves historical sparse structures via an online block masking strategy. Extensive experiments across multiple benchmarks and architectures demonstrate that MOD-DiT substantially improves computational efficiency while maintaining or even enhancing generation quality, confirming its effectiveness for high-quality, efficient video synthesis.

📝 Abstract
While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose the **M**ixture-**O**f-**D**istribution **DiT** (**MOD-DiT**), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a distribution-mixing approach to fit an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT's effectiveness for efficient, high-quality video generation.
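To make the second stage concrete, the sketch below shows how a predicted block mask can gate attention so that masked-out key blocks are skipped entirely, which is where the compute savings come from. This is a minimal NumPy illustration under assumptions of our own: the function name `block_sparse_attention`, the single-head layout, and the boolean block mask are hypothetical simplifications, not the paper's actual implementation.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_mask, block_size):
    """Attention restricted to key blocks enabled in block_mask.

    q, k, v: arrays of shape (seq_len, dim), seq_len divisible by block_size.
    block_mask: bool array (n_blocks, n_blocks); block_mask[i, j] means
    query block i may attend to key block j. Disabled blocks are never
    materialized, so their QK^T cost is skipped.
    """
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    out = np.zeros_like(q)
    for i in range(n_blocks):
        qi = q[i * block_size:(i + 1) * block_size]
        # gather only the active key/value blocks for this query block
        active = [j for j in range(n_blocks) if block_mask[i, j]]
        ks = np.concatenate([k[j * block_size:(j + 1) * block_size] for j in active])
        vs = np.concatenate([v[j * block_size:(j + 1) * block_size] for j in active])
        scores = qi @ ks.T / np.sqrt(dim)
        # numerically stable softmax over the gathered keys only
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[i * block_size:(i + 1) * block_size] = w @ vs
    return out
```

With an all-True mask this reduces to dense softmax attention; the speedup comes from how many blocks the first-stage predictor can safely disable per denoising interval.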
Problem

Research questions and friction points this paper is trying to address.

video diffusion
sparse attention
quadratic complexity
dynamic sparsity
generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Sparse Attention
Mixture of Distributions
Diffusion Transformers
Video Generation
Sampling-Free