FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video foundation models lack explicit 3D modeling capabilities, resulting in spatial inconsistencies and poor support for downstream 3D reasoning tasks. To address this, we propose FantasyWorld, a geometry-enhanced world model framework that augments a frozen video backbone with a trainable geometric branch. This branch jointly models video latent representations and an implicit 3D field in a single forward pass via cross-modal feature alignment, multi-view consistency constraints, and bidirectional cross-branch supervision, in which geometry cues guide video generation and video priors regularize 3D prediction. The method requires no per-scene optimization or fine-tuning, and the latents from the geometric branch can potentially serve downstream 3D tasks such as novel view synthesis and navigation. Experiments show that FantasyWorld outperforms recent geometry-consistent baselines in multi-view coherence and style consistency, and ablation studies attribute these gains to the unified backbone and cross-branch information exchange.

📝 Abstract
High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite their strong imaginative priors, current video foundation models lack explicit 3D grounding, which limits both their spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
Problem

Research questions and friction points this paper is trying to address.

Bridging video imagination with explicit 3D geometric grounding
Enhancing spatial consistency for downstream 3D reasoning tasks
Unifying video and 3D prediction without per-scene optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-enhanced framework with trainable branch
Joint modeling of video latents and 3D field
Cross-branch supervision between geometry and video
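
As a rough illustration of the "frozen backbone plus trainable branch" idea listed above, the toy sketch below uses two scalar weights in place of real networks. It is not FantasyWorld's architecture: the names `ToyModel`, `w_video`, `w_geo`, and `train_step` are hypothetical, the loss is a plain squared error, and only one direction of supervision (a geometry loss on a branch that reads backbone features) is shown. It only demonstrates that a single forward pass can return both outputs while gradient updates touch the branch alone.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyModel:
    """Scalar stand-ins for the two components (hypothetical names)."""
    w_video: float  # frozen "video backbone" weight, never updated
    w_geo: float    # trainable "geometry branch" weight

    def forward(self, x: float) -> Tuple[float, float]:
        # One forward pass produces both the video latent and the geometry output.
        video_latent = self.w_video * x       # frozen path
        geometry = self.w_geo * video_latent  # branch reads backbone features
        return video_latent, geometry


def train_step(model: ToyModel, x: float, geo_target: float, lr: float = 0.1) -> float:
    """One gradient step on a squared-error geometry loss; returns the loss."""
    video_latent, geometry = model.forward(x)
    error = geometry - geo_target
    loss = error ** 2
    # d(loss)/d(w_geo) = 2 * error * video_latent.
    # No update is applied to w_video, which is what "frozen" means here.
    grad_w_geo = 2.0 * error * video_latent
    model.w_geo -= lr * grad_w_geo
    return loss


model = ToyModel(w_video=2.0, w_geo=0.0)
losses: List[float] = [train_step(model, x=1.0, geo_target=3.0) for _ in range(50)]
```

After the loop, `model.w_video` is unchanged while `model.w_geo` has moved toward the value that fits the geometry target. In a real system the same pattern would be expressed by disabling gradients for the backbone parameters.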
Yixiang Dai
AMAP, Alibaba Group
Fan Jiang
AMAP, Alibaba Group
Chiyu Wang
AMAP, Alibaba Group
Mu Xu
AMAP, Alibaba Group
Yonggang Qi
Associate Professor, Beijing University of Posts and Telecommunications
computer vision, sketch-based vision learning algorithms and applications