🤖 AI Summary
Existing DiT-based text-to-video models incur computational cost that scales quadratically with pixel count, making minute-long, high-resolution video generation prohibitively expensive. To address this, the authors propose LinGen, a linear-complexity framework that replaces self-attention with a MATE block: the MA-branch combines a bidirectional Mamba2 with Rotary Major Scan token reordering and review tokens designed for long videos, while the TE-branch introduces Temporal Swin Attention; together they mitigate Mamba's adjacency-preservation limitation and improve inter-frame consistency. Compared to DiT, LinGen reduces FLOPs by up to 15× and inference latency by up to 11.5×, enabling, for the first time, high-resolution minute-length (68-second) video generation on a single GPU. Human evaluation yields a 75.6% win rate over DiT; moreover, LinGen-4B achieves quality on par with Gen-3, LumaLabs, and Kling, with pairwise win rates near 50% (50.5%, 52.1%, and 49.1%, respectively).
📝 Abstract
Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10–20 seconds. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15× (11.5×) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with 50.5%, 52.1%, and 49.1% win rates with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples on our project website: https://lineargen.github.io/.
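To make the two-branch structure concrete, here is a minimal NumPy sketch of a MATE-style block. It is an illustration of the linear-complexity layout only, not the paper's implementation: the MA-branch is stood in for by a simple bidirectional exponential-moving-average scan (the real block uses a learned Mamba2 state-space model with Rotary Major Scan reordering and review tokens), the TE-branch by plain windowed attention over neighboring frames at the same spatial position (a stand-in for Temporal Swin Attention), and summing the two branch outputs is an assumed combination rule. All function names and the `decay`/`window` parameters are hypothetical.

```python
import numpy as np

def bidirectional_scan(x, decay=0.9):
    """Stand-in for the MA-branch's bidirectional Mamba2 scan: a forward
    and a backward recurrence over the flattened token sequence, so cost
    is linear in the number of tokens (i.e., pixels)."""
    fwd, bwd = np.zeros_like(x), np.zeros_like(x)
    state = np.zeros(x.shape[-1])
    for i in range(x.shape[0]):                 # forward pass
        state = decay * state + (1 - decay) * x[i]
        fwd[i] = state
    state = np.zeros(x.shape[-1])
    for i in reversed(range(x.shape[0])):       # backward pass
        state = decay * state + (1 - decay) * x[i]
        bwd[i] = state
    return fwd + bwd

def temporal_window_attention(x, n_frames, window=2):
    """Stand-in for the TE-branch: each token attends only to tokens at
    the same spatial location within a small window of adjacent frames,
    keeping the cost linear in pixel count."""
    t = n_frames
    s = x.shape[0] // t                          # tokens per frame
    d = x.shape[-1]
    xf = x.reshape(t, s, d)                      # (frame, spatial, channel)
    out = np.zeros_like(xf)
    for i in range(t):
        lo, hi = max(0, i - window), min(t, i + window + 1)
        keys = xf[lo:hi]                         # (w, s, d)
        scores = np.einsum('sd,wsd->ws', xf[i], keys) / np.sqrt(d)
        attn = np.exp(scores - scores.max(axis=0))
        attn /= attn.sum(axis=0)                 # softmax over the window
        out[i] = np.einsum('ws,wsd->sd', attn, keys)
    return out.reshape(t * s, d)

def mate_block(x, n_frames):
    """Sum of MA-branch and TE-branch outputs (assumed combination)."""
    return bidirectional_scan(x) + temporal_window_attention(x, n_frames)
```

Both branches touch each token a constant number of times, so doubling the number of pixels doubles the work, in contrast to full self-attention, where every token attends to every other token and cost grows quadratically.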