WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-term video generation suffers from structural degradation and temporal drift, primarily because existing methods rely solely on RGB signals and thus struggle to maintain long-horizon geometric consistency. To address this, we propose a depth-aware multimodal joint modeling framework: (1) a unified representation space that jointly predicts RGB frames and depth maps; (2) a drift-resistant depth memory bank that explicitly preserves inter-frame geometric constraints; and (3) a segment-wise noise scheduling strategy that improves training dynamics in both diffusion- and rectified flow-based frameworks. Evaluated on multiple long-video benchmarks, our method significantly suppresses temporal drift, enhances structural stability and motion coherence, and achieves state-of-the-art visual fidelity and dynamic consistency. These results empirically validate the critical role of depth guidance and memory mechanisms in long-sequence video synthesis.

📝 Abstract
Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.
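The segmented noise scheduling mentioned in the abstract can be sketched roughly as follows: frames are split into contiguous prediction groups, and each group shares one noise level rather than sampling a separate timestep per frame. This is a minimal NumPy sketch; `segmented_noise_levels`, the segment count, and the log-spaced levels are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def segmented_noise_levels(num_frames: int, num_segments: int,
                           sigma_min: float = 0.02, sigma_max: float = 1.0):
    """Hypothetical sketch: one shared noise level per segment of frames.

    Later segments (further into the future) get progressively higher
    noise, so each training step denoises a whole group of frames at a
    common level instead of drawing an independent timestep per frame.
    """
    # Split frame indices into contiguous segments.
    segments = np.array_split(np.arange(num_frames), num_segments)
    # One log-spaced noise level per segment, increasing toward the future.
    sigmas = np.geomspace(sigma_min, sigma_max, num_segments)
    levels = np.empty(num_frames)
    for seg, sigma in zip(segments, sigmas):
        levels[seg] = sigma
    return levels

levels = segmented_noise_levels(num_frames=12, num_segments=3)
# Frames within the same segment share a noise level.
```

Sharing a level within a group is also what reduces cost here: the denoiser conditions on one timestep embedding per segment instead of one per frame.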
Problem

Research questions and friction points this paper is trying to address.

Ensuring structural and temporal consistency in long video sequences
Addressing accumulated errors from RGB-only signal reliance
Mitigating object structure and motion drift over extended durations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly models RGB frames and perceptual conditions from a unified representation
Leverages drift-resistant depth cues to construct a memory bank of contextual information
Employs segmented noise scheduling to mitigate drift and reduce training cost
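The depth memory bank above can be pictured as a bounded buffer of recent depth maps that serves as geometric context for the next prediction group. The class below is a toy sketch under that assumption; `DepthMemoryBank`, its capacity, and its read/write interface are invented for illustration and are not the paper's API.

```python
from collections import deque

import numpy as np

class DepthMemoryBank:
    """Hypothetical sketch of a drift-resistant depth memory.

    Stores the depth maps of the most recent frames; because depth is
    observed to drift less than RGB, these entries provide cleaner
    contextual information when conditioning the next prediction group.
    """

    def __init__(self, capacity: int = 8):
        # deque with maxlen evicts the oldest depth map automatically.
        self.frames = deque(maxlen=capacity)

    def write(self, depth_map: np.ndarray) -> None:
        self.frames.append(depth_map)

    def read_context(self) -> np.ndarray:
        # Stack stored depth maps into a (T, H, W) conditioning tensor.
        return np.stack(list(self.frames))

bank = DepthMemoryBank(capacity=4)
for t in range(6):
    bank.write(np.full((2, 2), float(t)))  # toy 2x2 depth maps
context = bank.read_context()  # keeps only the 4 most recent maps
```

The fixed capacity is the point of the design: long-horizon generation cannot condition on every past frame, so only a bounded window of (more drift-stable) depth context is retained.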