SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost, excessive memory consumption, and limited context modeling of high-resolution, long-duration text-to-video generation, this paper proposes Linear DiT, a diffusion Transformer that uses linear attention as its core operation, coupled with a constant-memory KV cache for block linear attention. Together these enable linear-complexity global spatiotemporal modeling and block-wise autoregressive video synthesis. The method further incorporates NVFP4 quantization for deployment and an efficient data-filtering and training strategy. Trained in only 12 days on 64 H100 GPUs, the model generates minute-long videos at 720×1280 resolution. At inference, it is 16× faster in measured latency than comparable small models, accelerates 5-second 720p video generation by 2.4× (from 71 s to 29 s) with NVFP4 precision, and significantly reduces GPU memory usage, while maintaining text-video alignment quality on par with leading compact models.

📝 Abstract
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective, and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
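The "cumulative properties of linear attention" behind the constant-memory KV cache can be illustrated with a minimal sketch. This is not the paper's implementation: the feature map `phi` (ReLU here) and the shapes are assumptions for illustration. The key point is that causal linear attention only needs two running sums, `S` and `z`, whose size is independent of sequence length — which is why a fixed-memory state can replace a traditional, ever-growing KV cache.

```python
# Minimal sketch of causal linear attention with a constant-size running
# state. NOT SANA-Video's actual kernel; phi and shapes are illustrative.
import numpy as np

def phi(x):
    # Positive feature map (ReLU + epsilon); the paper's kernel may differ.
    return np.maximum(x, 0.0) + 1e-6

def linear_attention_stream(qs, ks, vs):
    """qs, ks: (T, d); vs: (T, dv). Returns causal outputs of shape (T, dv)."""
    d, dv = ks.shape[1], vs.shape[1]
    S = np.zeros((d, dv))   # running sum of outer(phi(k), v) -- constant memory
    z = np.zeros(d)         # running sum of phi(k)           -- constant memory
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S += np.outer(fk, v)    # fold the new token into the state
        z += fk
        fq = phi(q)
        outs.append(fq @ S / (fq @ z))  # attend to all past tokens via (S, z)
    return np.stack(outs)
```

For any number of tokens T, the state stays O(d*dv), and the per-token output equals the quadratic causal attention `sum_i (phi(q_t)·phi(k_i)) v_i / sum_i phi(q_t)·phi(k_i)`.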
Problem

Research questions and friction points this paper is trying to address.

How to generate high-resolution, minute-long videos efficiently with diffusion models
How to achieve fast video synthesis while preserving strong text-video alignment
How to reduce training and inference cost while remaining competitive with state-of-the-art small models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear DiT: linear attention as the core operation for efficient processing of long video token sequences
Constant-memory KV cache enabling block-wise autoregressive long-video generation
Data filtering and training strategies that cut training cost to 12 days on 64 H100 GPUs
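The second innovation above — block-wise autoregressive generation with a constant-memory state — can be sketched as the loop below. This is a hedged illustration, not the real pipeline: `denoise_block` stands in for the Linear DiT denoiser and simply passes values through, and `feat` is an assumed positive feature map. What the sketch does show correctly is that folding keys/values into the state block by block yields the same state as one pass over the full sequence, so memory stays fixed no matter how many blocks (i.e., how long a video) are generated.

```python
# Hedged sketch of block-wise autoregressive generation with a
# constant-memory linear-attention state. `denoise_block` and `feat`
# are illustrative stand-ins, not SANA-Video's actual components.
import numpy as np

def feat(x):
    # Assumed positive feature map (ReLU + epsilon).
    return np.maximum(x, 0.0) + 1e-6

def fold_block(S, z, K, V):
    """Absorb one block's keys (B, d) and values (B, dv) into the state."""
    F = feat(K)                       # (B, d)
    return S + F.T @ V, z + F.sum(axis=0)

def generate(num_blocks, block_len, d, dv, denoise_block, rng):
    S = np.zeros((d, dv))             # constant-size "KV cache"
    z = np.zeros(d)
    video = []
    for _ in range(num_blocks):
        K = rng.normal(size=(block_len, d))
        V = rng.normal(size=(block_len, dv))
        block = denoise_block(S, z, K, V)   # sees global context via (S, z)
        video.append(block)
        S, z = fold_block(S, z, K, V)       # memory stays O(d*dv), not O(T)
    return np.concatenate(video)
```

Because `fold_block` is a plain cumulative sum, processing two blocks sequentially produces exactly the same state as processing their concatenation at once — the property that makes block-wise autoregression with fixed memory possible.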