VMonarch: Efficient Video Diffusion Transformers with Structured Attention

📅 2026-01-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the quadratic computational complexity of attention in video diffusion Transformers, which hinders long-video modeling. The authors propose VMonarch, the first framework to integrate structured Monarch matrices into video diffusion models. By combining spatio-temporal attention decomposition, alternating-minimization optimization, and online entropy-driven sparse updates, VMonarch achieves efficient dynamic sparse attention with sub-quadratic complexity. Together with a FlashAttention fusion strategy, VMonarch matches or even surpasses the generation quality of full attention on VBench, while reducing attention FLOPs by 17.5× and accelerating long-video generation by over 5×, significantly outperforming existing methods at 90% sparsity.

๐Ÿ“ Abstract
The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.
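The abstract's key structural idea is that a Monarch matrix factors into two block-diagonal matrices interleaved with a fixed permutation, so a matrix-vector product costs O(n^1.5) instead of O(n^2). A minimal NumPy sketch of that factorization (not the authors' code; the function name and block layout are illustrative, following the standard Monarch definition M = Pᵀ·blockdiag(R)·P·blockdiag(L)):

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Multiply a Monarch matrix M = P^T @ blockdiag(R) @ P @ blockdiag(L)
    by a vector x of length n = m*m in O(n^1.5) time.

    L_blocks, R_blocks: shape (m, m, m), i.e. m dense m-by-m blocks each.
    P is the 'transpose' permutation: viewing the vector as an (m, m)
    grid, P swaps rows and columns (and is its own inverse)."""
    m = L_blocks.shape[0]
    # blockdiag(L) @ x: each block acts on its own length-m chunk of x
    y = np.einsum('bij,bj->bi', L_blocks, x.reshape(m, m))
    # apply P: transpose the (m, m) view
    y = y.T
    # blockdiag(R) @ (P y)
    y = np.einsum('bij,bj->bi', R_blocks, y)
    # apply P^T (= P) and flatten back to length n
    return y.T.reshape(m * m)
```

Because each stage touches only m blocks of size m×m, total work is O(m·m²) = O(n^1.5); the flexible per-block sparsity is what the paper exploits to represent sparse spatio-temporal attention patterns.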
Problem

Research questions and friction points this paper is trying to address.

Video Diffusion Transformers
attention mechanism
quadratic complexity
context scalability
spatio-temporal attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monarch matrix
structured attention
video diffusion transformer
sub-quadratic complexity
sparse attention