AI Summary
This work addresses the challenge of modeling structured uncertainty for embodied agents in partially observable environments, a limitation of current vision-based generative models that prioritize photorealism. The paper formulates world modeling as embodied belief inference in 3D space and introduces the first generative world model that explicitly represents uncertainty directly in 3D. This approach enables spatially consistent scene memory, multi-hypothesis belief sampling, temporal belief updates, and semantics-guided prediction of unobserved regions. By integrating multi-view geometry, probabilistic reasoning, and semantic priors, the method supports online inference and updating of 3D beliefs. It outperforms existing approaches in both 2D/3D scene reconstruction quality and downstream embodied tasks such as object navigation, with validation in both simulated and real-world environments.
Abstract
Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.
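To make the notions of sequential belief updating and multi-hypothesis sampling concrete, the following is a minimal toy sketch. It is not the 3D-Belief model, whose architecture the abstract does not describe. It substitutes a classic log-odds occupancy grid, and it assumes independent voxels and a symmetric sensor model. The names `VoxelBelief`, `update`, `probability`, `sample_hypotheses`, and the `sensor_accuracy` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


class VoxelBelief:
    """Toy 3D belief: an independent occupancy probability per voxel.

    Assumption: voxels are independent, which a learned generative
    model would not require. This is only an illustration.
    """

    def __init__(self, shape, prior=0.5):
        self.shape = shape
        self.prior_log_odds = logit(prior)
        self.log_odds = {}  # only voxels that have been observed are stored

    def voxels(self):
        nx, ny, nz = self.shape
        return [(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)]

    def update(self, voxel, observed_occupied, sensor_accuracy=0.9):
        """Sequential Bayesian update in log-odds form.

        Assumed sensor model: the reading is correct with probability
        sensor_accuracy, whether the voxel is occupied or free.
        """
        step = logit(sensor_accuracy)  # log-likelihood ratio of one reading
        current = self.log_odds.get(voxel, self.prior_log_odds)
        self.log_odds[voxel] = current + (step if observed_occupied else -step)

    def probability(self, voxel):
        """Current belief that the voxel is occupied."""
        return sigmoid(self.log_odds.get(voxel, self.prior_log_odds))

    def sample_hypotheses(self, n, seed=None):
        """Draw n scene completions, each a set of occupied voxels."""
        rng = random.Random(seed)
        return [
            {v for v in self.voxels() if rng.random() < self.probability(v)}
            for _ in range(n)
        ]


belief = VoxelBelief(shape=(2, 2, 2))
belief.update((0, 0, 0), observed_occupied=True)
belief.update((0, 0, 0), observed_occupied=True)   # repeated evidence sharpens the belief
belief.update((1, 1, 1), observed_occupied=False)
hypotheses = belief.sample_hypotheses(n=3, seed=0)
```

In this sketch, unobserved voxels stay at the prior, so the sampled hypotheses differ mainly in the regions the agent has not yet seen. A learned model with semantic priors would replace the uniform prior with predictions conditioned on the observed scene.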