🤖 AI Summary
Existing diffusion models struggle with high computational costs and the difficulty of simultaneously preserving global temporal coherence and fine spatial details when generating ultra-high-definition long videos. This work proposes a decoupled global-local modeling framework that first generates a low-resolution, low-frame-rate global semantic proxy, which then guides a high-resolution detail branch during joint denoising. The approach achieves resolution-agnostic training for the first time, enabling direct generation of 4K+ long videos using only 720p training data. By integrating temporally scaled RoPE, hierarchical local-preserving attention, and an asymmetric global-local attention mechanism—combined with lightweight LoRA adaptation—the method significantly reduces training costs while achieving a 60.9× faster inference speed compared to native 4K generators and producing superior visual quality.
📝 Abstract
Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation where a continuous scene must preserve global temporal coherence, and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlaVid, a decoupled global-local framework for efficient UHR long video generation. AtlaVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality and asymmetric global-local attention injects aligned semantic guidance and preserves the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlaVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality UHR long video generation with 60.9x speed up and significantly less training cost and even better performance than native 4K video generators.