MagicWorld: Interactive Geometry-driven Video World Exploration

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing interactive video world models face two key challenges: (1) they neglect the coupling between instruction-driven motion and 3D geometry, which leads to structural instability under viewpoint changes; and (2) they forget historical interaction states in multi-step sequences, which causes cumulative semantic and geometric drift. This paper introduces MagicWorld, an interactive video world model that addresses these issues with two components: (i) an Action-Guided 3D Geometry (AG3D) module, which builds a point cloud from the first frame of each interaction and the corresponding action to supply explicit geometric constraints for viewpoint transitions; and (ii) a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals. Starting from a single scene image, MagicWorld uses user actions to drive scene evolution and synthesizes continuous scenes autoregressively. Experiments show notable improvements in scene stability and continuity across interaction iterations.

📝 Abstract

Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
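The abstract does not say how AG3D builds or uses its point cloud, so the following is only a minimal sketch of the general idea of action-driven reprojection. It assumes a per-pixel depth map for the first frame, a pinhole camera with intrinsics `fx, fy, cx, cy`, and a hypothetical discrete action vocabulary (`ACTIONS`) that maps each action to a rigid camera motion. None of these names or values come from the paper.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]

# Hypothetical action vocabulary (not from the paper): yaw in degrees and a
# translation in the old camera frame. Assumed axes: x right, y down, z forward.
ACTIONS: Dict[str, Tuple[float, Point]] = {
    "forward": (0.0, (0.0, 0.0, 0.5)),
    "backward": (0.0, (0.0, 0.0, -0.5)),
    "turn_left": (-10.0, (0.0, 0.0, 0.0)),
    "turn_right": (10.0, (0.0, 0.0, 0.0)),
}


def unproject(depth: List[List[float]], fx: float, fy: float,
              cx: float, cy: float) -> List[Point]:
    """Lift every pixel (u, v) with depth z to a 3D point in the camera frame."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def move_camera(points: List[Point], action: str) -> List[Point]:
    """Express the points in the frame of the camera after it performs `action`."""
    yaw_deg, (tx, ty, tz) = ACTIONS[action]
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    moved = []
    for x, y, z in points:
        dx, dy, dz = x - tx, y - ty, z - tz
        # Apply the inverse of the camera's yaw rotation: p' = R^T (p - t).
        moved.append((c * dx - s * dz, dy, s * dx + c * dz))
    return moved


def project(points: List[Point], fx: float, fy: float,
            cx: float, cy: float) -> List[Optional[Tuple[float, float]]]:
    """Project points to pixel coordinates; points behind the camera map to None."""
    pixels = []
    for x, y, z in points:
        pixels.append((fx * x / z + cx, fy * y / z + cy) if z > 0 else None)
    return pixels
```

In a sketch like this, the reprojected pixel positions indicate where the content of the first frame should appear after the viewpoint change, which is one plausible form of geometric constraint. How MagicWorld actually encodes and injects that signal is not specified in the text above.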

Problem

Research questions and friction points this paper is trying to address.

Address structural instability in video scenes under viewpoint changes

Mitigate error accumulation during multi-step interactive scene evolution

Prevent progressive drift in scene semantics and structure over time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 3D geometric priors for structural consistency

Uses the Action-Guided 3D Geometry Module (AG3D), which builds a point cloud from the first frame of each interaction and the corresponding action

Implements History Cache Retrieval to mitigate error accumulation
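The retrieval step behind History Cache Retrieval can be illustrated with a small sketch. The text above only says that relevant historical frames are retrieved and injected as conditioning. The cosine-similarity ranking, the fixed-capacity first-in-first-out cache, and the `HistoryCache` class below are all assumptions made for illustration, not the paper's design.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class HistoryCache:
    """Hypothetical cache of (frame_id, feature) pairs with oldest-first eviction."""

    def __init__(self, capacity: int = 64) -> None:
        self.capacity = capacity
        self.entries: List[Tuple[int, List[float]]] = []

    def add(self, frame_id: int, feature: Sequence[float]) -> None:
        self.entries.append((frame_id, list(feature)))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # drop the oldest frame once the cache is full

    def retrieve(self, query: Sequence[float], k: int = 2) -> List[int]:
        """Return the ids of the k cached frames most similar to the query."""
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[1]),
                        reverse=True)
        return [frame_id for frame_id, _ in ranked[:k]]


# Usage: rank cached frames against the feature of the current view.
cache = HistoryCache(capacity=3)
cache.add(0, [1.0, 0.0])
cache.add(1, [0.0, 1.0])
cache.add(2, [0.9, 0.1])
nearest = cache.retrieve([1.0, 0.0], k=2)
```

The retrieved frames would then be passed to the generator as extra conditioning. A learned relevance score or a pose-based overlap measure could replace cosine similarity without changing the overall structure.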
