MonarchRT: Efficient Attention for Real-Time Video Generation

📅 2026-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the efficiency bottleneck in real-time autoregressive video generation caused by the quadratic complexity of 3D self-attention, which existing sparse methods struggle to mitigate under few-step inference settings. We propose MonarchRT, the first efficient sparse attention mechanism tailored for real-time autoregressive video generation. By leveraging structured sparsity through Monarch matrix decomposition, combined with block-aligned and expanded tiling designs, our approach transcends conventional top-k sparsity assumptions while preserving high representational capacity. It effectively integrates periodic spatiotemporal structure with dynamic semantic correspondence. Coupled with a custom Triton kernel and fine-tuning strategy, MonarchRT achieves 95% attention sparsity without quality degradation on the Self-Forcing model, delivering 16 FPS on a single RTX 5090 GPU and outperforming FlashAttention variants by 1.4–11.8× in kernel speed.
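To make the "structured sparsity through Monarch matrix decomposition" concrete, here is a minimal NumPy sketch of a Monarch matrix-vector product in the style of Dao et al.'s Monarch factorization (two block-diagonal factors interleaved with a perfect-shuffle permutation). This is an illustrative toy, not MonarchRT's actual Triton kernel or its exact block-aligned/tiled parameterization; all names and shapes are assumptions.

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Multiply x by a Monarch matrix M = P^T L P R.

    L_blocks, R_blocks: (m, m, m) stacks of diagonal blocks (toy case n = m^2),
    P: the perfect-shuffle permutation (an m x m reshape-transpose).
    Illustrative sketch only; the paper's parameterization may differ.
    """
    m = R_blocks.shape[0]
    y = x.reshape(m, -1)                       # split x into m contiguous blocks
    y = np.einsum('bij,bj->bi', R_blocks, y)   # block-diagonal R, applied per block
    y = y.T                                    # permutation P (perfect shuffle)
    y = np.einsum('bij,bj->bi', L_blocks, y)   # block-diagonal L
    return y.T.reshape(-1)                     # P^T undoes the shuffle

# For n = m^2 tokens this costs O(n^1.5) instead of O(n^2) for a dense matvec.
m = 4
rng = np.random.default_rng(0)
L_blocks = rng.standard_normal((m, m, m))
R_blocks = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)
y = monarch_matvec(L_blocks, R_blocks, x)
print(y.shape)  # (16,)
```

The appeal over top-k sparsity, as the summary notes, is that the permutation mixes every block with every other block, so the factorized matrix stays globally expressive while each factor remains cheap and block-structured.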

📝 Abstract
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of the parameterization through fine-tuning and custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs, respectively, providing kernel speedups in the range of 1.4–11.8×. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
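As a back-of-envelope illustration of what the claimed 95% attention sparsity buys: keeping only 5% of the attention interactions cuts the dominant attention FLOPs by roughly 20×. The latent-grid and head-dimension values below are illustrative assumptions, not numbers from the paper.

```python
# Illustrative attention-cost arithmetic at 95% sparsity.
# The latent grid and head dimension are assumed values, not from the paper.
frames, height, width = 16, 30, 52   # hypothetical video latent grid
n = frames * height * width          # tokens attending jointly in 3D attention
d = 128                              # hypothetical attention head dimension

dense_flops = 4 * n * n * d          # QK^T and AV: two n x n x d matmuls
sparse_flops = dense_flops * (1 - 0.95)
print(f"{n} tokens, {dense_flops / sparse_flops:.0f}x fewer attention FLOPs")
```

Note that this counts arithmetic only; the reported 1.4–11.8× kernel speedups are smaller than 20× because real kernels are also bound by memory traffic and the structured-factor overhead.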
Problem

Research questions and friction points this paper is trying to address.

real-time video generation
diffusion transformers
3D self-attention
sparse attention
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monarch matrices
structured attention
real-time video generation
diffusion transformers
sparse attention