🤖 AI Summary
This work addresses the challenge of general spatiotemporal intelligence by proposing the first controllable video world model capable of generating 5-minute high-definition videos. To tackle long-range visual degradation, temporal inconsistency, and coarse-grained control, we design a three-stage progressive training framework: (1) multimodal-guided enhancement for fine-grained controllability; (2) input-frame degradation modeling to preserve long-range visual fidelity; and (3) cross-segment historical context alignment to improve temporal coherence. Key innovations include end-to-end autoregressive modeling, dense-sparse multimodal (text/image/trajectory) control fusion, and a novel cross-segment context alignment mechanism. We further introduce LongVGenBench—the first benchmark for minute-scale HD video generation evaluation. Experiments demonstrate state-of-the-art performance across long-range controllability, temporal coherence, and visual fidelity.
📝 Abstract
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) degradation-aware training on the input frame, which bridges the gap between training and long-term inference to maintain high visual quality; and (3) history-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
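The degradation-aware training stage can be illustrated with a minimal sketch: during training, the conditioning frame is deliberately corrupted so that the model sees inputs resembling the imperfect frames it will condition on during long autoregressive rollout. The abstract does not specify the degradation operator, so the blur-plus-noise corruption, the function name `degrade_frame`, and all parameters below are hypothetical stand-ins for illustration only.

```python
import numpy as np

def degrade_frame(frame, noise_std=0.05, blur_kernel=3, seed=None):
    """Corrupt a conditioning frame (H, W, C float array in [0, 1]) with a
    box blur and Gaussian noise, mimicking the visual drift that accumulates
    over long autoregressive generation. Hypothetical stand-in for the
    unspecified degradation operator in the paper."""
    rng = np.random.default_rng(seed)
    k = np.ones(blur_kernel, dtype=np.float32) / blur_kernel
    blurred = frame.astype(np.float32)
    # Separable box blur: 1-D moving average along height, then width.
    for axis in (0, 1):
        blurred = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, blurred
        )
    noisy = blurred + rng.normal(0.0, noise_std, size=frame.shape)
    return np.clip(noisy, 0.0, 1.0)

# At training time the model would be conditioned on `cond` rather than the
# clean `frame`, aligning the training and inference input distributions.
frame = np.random.default_rng(0).random((64, 64, 3)).astype(np.float32)
cond = degrade_frame(frame, noise_std=0.05, blur_kernel=3, seed=1)
```

At inference no degradation is applied; the point is only that the model has learned to tolerate the kind of corrupted inputs it produces itself over a five-minute rollout.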