SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes SANA-WM, an open-source world model with 2.6 billion parameters, designed to generate high-fidelity 720p videos up to one minute long under precise 6-DoF camera control. It introduces a hybrid attention design that combines frame-wise Gated DeltaNet (GDN) linear attention with softmax attention to model long sequences at low memory cost, alongside a dual-branch camera control module, a two-stage generation pipeline with a long-video refiner, and an annotation pipeline that derives metric-scale 6-DoF camera poses from publicly available videos. Trained on only about 213K video clips in 15 days on 64 H100 GPUs, the model generates each 60-second clip on a single GPU. Its distilled variant, with NVFP4 quantization, denoises a 60-second 720p clip in 34 seconds on a single RTX 5090. On the authors' one-minute benchmark, it follows actions more accurately than prior open-source baselines and reaches comparable visual quality at 36× higher throughput, while its visual quality is reported as comparable to industrial baselines such as LingBot-World and HY-WorldPlay.
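As a quick sanity check on the figures quoted above, the following minimal sketch derives two secondary numbers from them: how the 34-second denoising time compares with the 60-second clip length, and the total training budget in GPU-days. The variable names are illustrative, and the real-time factor covers only the denoising step as reported; any other stage of the pipeline is not accounted for here.

```python
# Figures taken from the summary; variable names are illustrative only.
clip_seconds = 60.0      # length of the generated clip
denoise_seconds = 34.0   # reported denoising time on one RTX 5090 (distilled, NVFP4)
num_gpus = 64            # H100 GPUs used for training
training_days = 15       # reported training duration

# Seconds of video produced per second of denoising compute.
realtime_factor = clip_seconds / denoise_seconds

# Total training budget expressed in GPU-days.
gpu_days = num_gpus * training_days

print(f"denoising runs at {realtime_factor:.2f}x real time")  # 1.76x
print(f"training budget: {gpu_days} H100 GPU-days")           # 960
```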

📝 Abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.
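To illustrate why a Gated DeltaNet layer is memory-efficient for long contexts, here is a minimal NumPy sketch of the generic token-level gated delta rule, in which a fixed-size state matrix is decayed, corrected, and read at every step. This is not SANA-WM's implementation: the abstract describes a frame-wise variant combined with softmax attention, whose details are not given here. The function names, shapes, and the scalar gates `alpha` and `beta` are assumptions made for illustration.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a generic gated delta rule (illustrative, not the paper's code).

    S     : (d_v, d_k) recurrent state mapping keys to values
    q, k  : (d_k,) query and key (k is assumed to be L2-normalised)
    v     : (d_v,) value
    alpha : scalar decay gate in [0, 1]
    beta  : scalar write strength in [0, 1]
    """
    # Erase what the state currently stores for key k, decay the remainder,
    # then write the new value: S <- alpha * S (I - beta k k^T) + beta v k^T
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q

def gated_delta_scan(Q, K, V, alpha, beta):
    """Run the recurrence over T steps; the state size does not depend on T."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    outputs = np.empty((T, d_v))
    for t in range(T):
        S, outputs[t] = gated_delta_step(S, Q[t], K[t], V[t], alpha[t], beta[t])
    return outputs, S

# Tiny demo: write two values under the same key; the second overwrites the first.
key = np.array([1.0, 0.0])
K = np.stack([key, key])
Q = np.stack([key, key])
V = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
outputs, final_state = gated_delta_scan(Q, K, V, alpha=np.ones(2), beta=np.ones(2))
```

The state has shape `(d_v, d_k)` regardless of sequence length, whereas softmax attention must keep keys and values for every past token. That contrast is the usual motivation for mixing the two kinds of layer.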

Problem

Research questions and friction points this paper is trying to address.

world modeling

minute-scale video generation

camera control

6-DoF trajectory

high-fidelity video synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Linear Attention

6-DoF Camera Control

Two-Stage Generation Pipeline
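For readers unfamiliar with the term, a 6-DoF camera pose has three rotational and three translational degrees of freedom. The sketch below shows one common way to represent such a pose as a 4×4 rigid transform and to compute the relative motion between two frames. It is a generic illustration, not the paper's camera-control module or annotation pipeline; the Euler-angle convention, the camera-to-world direction, and the function names are assumptions.

```python
import numpy as np

def pose_matrix(rx, ry, rz, translation):
    """Build a 4x4 camera-to-world transform (assumed convention).

    rx, ry, rz  : rotations about the x, y, z axes in radians
    translation : (3,) camera position, in metres for a metric-scale pose
    """
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # 3 rotational degrees of freedom
    T[:3, 3] = translation     # 3 translational degrees of freedom
    return T

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in the coordinates of frame a."""
    return np.linalg.inv(T_a) @ T_b

# Demo: the camera moves 0.5 m along its z axis and yaws by 10 degrees.
T_first = pose_matrix(0.0, 0.0, 0.0, [0.0, 0.0, 0.0])
T_second = pose_matrix(0.0, np.deg2rad(10.0), 0.0, [0.0, 0.0, 0.5])
step = relative_pose(T_first, T_second)
```

A trajectory is then a sequence of such transforms, one per frame. Metric scale means the translation part is expressed in real-world units rather than up to an unknown scale factor.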