DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

📅 2026-04-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work targets long-video diffusion models, where static sparse attention masks struggle to combine computational efficiency with faithful modeling of long-range dynamics. The authors propose DynamicRad, a content-adaptive sparse attention framework that performs adaptive selection within a radial locality prior. It offers two modes: a static-ratio mode tuned for speed and a dynamic-threshold mode that prioritizes quality. To avoid online search overhead, an offline Bayesian Optimization pipeline minimizes attention reconstruction error (MSE) on a physics-based proxy task, and a lightweight semantic motion router maps prompt embeddings to sparsity regimes at inference time. Mask-aware LoRA fine-tuning further improves long-horizon coherence. On HunyuanVideo and Wan2.1-14B, DynamicRad achieves 1.7×–2.5× inference speedups at over 80% effective sparsity, and in some long-sequence settings the dynamic mode matches or exceeds the dense baseline.

📝 Abstract
Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose DynamicRad, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a dual-mode strategy: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a semantic motion router. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with minimal runtime overhead. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency-quality Pareto frontier, achieving 1.7×–2.5× inference speedups with over 80% effective sparsity. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.
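
The difference between the two modes named in the abstract can be illustrated with a toy example. The sketch below is not taken from the DynamicRad code; it is a minimal illustration under stated assumptions: attention weights for a single query are modeled as decaying exponentially with frame distance (a stand-in for the radial locality prior), "static-ratio" is interpreted as keeping a fixed top-k fraction of keys, and "dynamic-threshold" is interpreted as keeping every key whose weight is at least `tau` times the row maximum. All function names and parameters here are hypothetical.

```python
import math


def toy_attention_row(num_frames, tokens_per_frame, decay):
    """Softmax weights for one query located in frame 0.

    Assumption: logits fall off linearly with frame distance, so the
    weights decay exponentially (a toy model of spatiotemporal energy decay).
    """
    logits = []
    for key in range(num_frames * tokens_per_frame):
        frame_distance = key // tokens_per_frame
        logits.append(-decay * frame_distance)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def static_ratio_mask(weights, keep_ratio):
    """Hypothetical static-ratio mode: keep a fixed fraction of keys (top-k)."""
    k = max(1, math.ceil(keep_ratio * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept = set(ranked[:k])
    return [i in kept for i in range(len(weights))]


def dynamic_threshold_mask(weights, tau):
    """Hypothetical dynamic-threshold mode: keep keys with weight >= tau * max."""
    cutoff = tau * max(weights)
    return [w >= cutoff for w in weights]


def sparsity(mask):
    """Fraction of keys that are skipped."""
    return 1.0 - sum(mask) / len(mask)


def dropped_mass(weights, mask):
    """Total attention weight assigned to skipped keys."""
    return sum(w for w, keep in zip(weights, mask) if not keep)


# Two toy rows: steep decay (little long-range energy) and shallow decay
# (more long-range energy, as one might expect in high-motion content).
rows = {
    "steep": toy_attention_row(10, 4, decay=3.0),
    "shallow": toy_attention_row(10, 4, decay=0.5),
}
for name, row in rows.items():
    static = static_ratio_mask(row, keep_ratio=0.2)
    dynamic = dynamic_threshold_mask(row, tau=0.1)
    print(name,
          "static kept:", sum(static),
          "dynamic kept:", sum(dynamic),
          "static dropped mass: %.4f" % dropped_mass(row, static),
          "dynamic dropped mass: %.4f" % dropped_mass(row, dynamic))
```

In this toy setting the static-ratio mask spends the same key budget on both rows, which gives predictable cost, while the threshold mask keeps fewer keys when the row is concentrated and more when energy spreads across frames. A practical implementation would presumably operate on blocks of tokens inside a GPU kernel rather than on individual keys in Python.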
Problem

Research questions and friction points this paper is trying to address.

long video diffusion
sparse attention
spatiotemporal energy decay
long-range information
attention sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention
video diffusion
content-adaptive
Bayesian optimization
radial locality