MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-video generation, full attention is computationally prohibitive due to its quadratic complexity in sequence length, and existing sparse attention methods rely on coarse block-level approximations that compromise both accuracy and efficiency. To address this, we propose Mixture-of-Groups Attention (MoGA), a semantic-aware, fine-grained grouping mechanism enabled by a lightweight, learnable token router. It eliminates the need for block-wise approximation while precisely identifying critical token pairs, and is fully compatible with modern acceleration techniques, including FlashAttention and sequence parallelism. Experiments demonstrate that MoGA enables end-to-end generation of multi-shot videos up to one minute in duration at 480p and 24 fps, with context lengths reaching 580k tokens. It consistently outperforms state-of-the-art sparse attention baselines across multiple video generation benchmarks, effectively breaking the traditional accuracy-efficiency trade-off.

📝 Abstract
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces minute-level, multi-shot, 480p videos at 24 fps end-to-end, with a context length of approximately 580k tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Addresses quadratic scaling of attention in long video generation
Replaces blockwise estimation with learnable token routing
Enables end-to-end generation of minute-level, multi-shot 480p videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Groups Attention replaces full attention
Lightweight token router enables semantic-aware token matching
Kernel-free design integrates with FlashAttention stacks
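The grouping idea behind these contributions can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: the linear router, argmax group assignment, and per-group softmax attention are assumptions for exposition, and the actual routing and normalization details may differ.

```python
import numpy as np

def moga_attention(q, k, v, router_weight):
    """Hypothetical sketch of Mixture-of-Groups Attention (MoGA).

    q, k, v: (seq_len, dim) arrays; router_weight: (dim, num_groups),
    standing in for the paper's lightweight learnable token router.
    """
    seq_len, dim = q.shape
    # Route each token to exactly one group (argmax over router logits).
    group_ids = (q @ router_weight).argmax(axis=-1)
    out = np.zeros_like(v)
    for g in range(router_weight.shape[1]):
        idx = np.where(group_ids == g)[0]
        if idx.size == 0:
            continue
        qg, kg, vg = q[idx], k[idx], v[idx]
        # Dense attention restricted to tokens sharing a group: the cost is
        # the sum of |group|^2 over groups, far below the full seq_len^2,
        # and each per-group block is plain dense attention, so it remains
        # compatible with FlashAttention-style kernels.
        scores = qg @ kg.T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ vg
    return out
```

With a single group, every token attends to every other token, so the sketch reduces to ordinary full attention; adding groups trades that global view for the semantic, router-driven sparsity described above.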
👥 Authors
Weinan Jia
University of Science and Technology of China
Yuning Lu
FanqieAI, ByteDance China
Mengqi Huang
University of Science and Technology of China
Hualiang Wang
Hong Kong University of Science and Technology
Binyuan Huang
Wuhan University
Nan Chen
University of Science and Technology of China
Mu Liu
FanqieAI, ByteDance China
Jidong Jiang
FanqieAI, ByteDance China
Zhendong Mao
University of Science and Technology of China