MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-video generation, full attention is computationally prohibitive due to its quadratic complexity in sequence length, and existing sparse attention methods rely on coarse block-level approximations that compromise both accuracy and efficiency. To address this, we propose Mixture-of-Groups Attention (MoGA), a semantic-aware, fine-grained grouping mechanism enabled by a lightweight, learnable token router. It eliminates the need for block-wise approximation while precisely identifying critical token pairs, and is fully compatible with modern acceleration techniques, including FlashAttention and sequence parallelism. Experiments demonstrate that MoGA enables end-to-end generation of multi-shot videos up to one minute in duration at 480p and 24 fps, with context lengths reaching 580k tokens. It consistently outperforms state-of-the-art sparse attention baselines across multiple video generation benchmarks, effectively breaking the traditional accuracy-efficiency trade-off.

📝 Abstract
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces minute-level, multi-shot, 480p videos at 24 fps end-to-end, with a context length of approximately 580k tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Addresses quadratic scaling of attention in long video generation
Replaces blockwise estimation with learnable token routing
Enables end-to-end generation of minute-level, multi-shot 480p videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Groups Attention replaces full attention
Lightweight token router enables semantic-aware token matching
Kernel-free design integrates with FlashAttention stacks
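The grouping idea behind these contributions can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: the linear router, argmax group assignment, and per-group softmax attention are assumptions for exposition, and the actual routing and normalization details may differ.

```python
import numpy as np

def moga_attention(q, k, v, router_weight):
    """Hypothetical sketch of Mixture-of-Groups Attention (MoGA).

    q, k, v: (seq_len, dim) arrays; router_weight: (dim, num_groups),
    standing in for the paper's lightweight learnable token router.
    """
    seq_len, dim = q.shape
    # Route each token to exactly one group (argmax over router logits).
    group_ids = (q @ router_weight).argmax(axis=-1)
    out = np.zeros_like(v)
    for g in range(router_weight.shape[1]):
        idx = np.where(group_ids == g)[0]
        if idx.size == 0:
            continue
        qg, kg, vg = q[idx], k[idx], v[idx]
        # Dense attention restricted to tokens sharing a group: the cost is
        # the sum of |group|^2 over groups, far below the full seq_len^2,
        # and each per-group block is plain dense attention, so it remains
        # compatible with FlashAttention-style kernels.
        scores = qg @ kg.T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ vg
    return out
```

With a single group, every token attends to every other token, so the sketch reduces to ordinary full attention; adding groups trades that global view for the semantic, router-driven sparsity described above.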
👥 Authors
Weinan Jia
University of Science and Technology of China
Yuning Lu
FanqieAI, ByteDance China
Mengqi Huang
University of Science and Technology of China
Hualiang Wang
Hong Kong University of Science and Technology
Binyuan Huang
Wuhan University
Nan Chen
University of Science and Technology of China
Mu Liu
FanqieAI, ByteDance China
Jidong Jiang
FanqieAI, ByteDance China
Zhendong Mao
University of Science and Technology of China