Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from “visual illiteracy”: they rely on textual descriptions for spatial reasoning and lack any implicit model of 3D visual structure. This work proposes MILO, a framework that builds an implicit spatial world model, using generative feedback to implicitly align symbolic spatial reasoning with visual perceptual experience. Key contributions: (1) RePE, a relative positional encoding scheme that models relative camera-pose transformations and geometric relationships; (2) an end-to-end implicit spatial modeling mechanism driven by a visual generator; and (3) GeoGen, a large-scale geometry-aware dataset comprising 2,241 videos and 67,827 observation-action-outcome triplets. MILO achieves significant gains over state-of-the-art methods across multiple spatial reasoning benchmarks, with improved accuracy, robustness, and generalization in 3D structural understanding.
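
As a rough illustration of what one GeoGen observation-action-outcome triplet might look like, here is a minimal Python sketch. The field names, shapes, and the choice of a 4x4 relative pose for the action are assumptions for illustration; the paper's actual schema is not reproduced here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GeoGenTriplet:
    """One observation-action-outcome sample (hypothetical schema).

    observation: the source video frame the model conditions on.
    action:      the camera motion applied to the viewpoint, stored
                 here as a 4x4 relative pose (an assumption).
    outcome:     the frame actually observed after the motion; the
                 visual generator's target during feedback training.
    """
    observation: np.ndarray  # (H, W, 3) RGB frame
    action: np.ndarray       # (4, 4) relative camera pose
    outcome: np.ndarray      # (H, W, 3) RGB frame after the motion
```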

📝 Abstract
Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of any connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations and offers superior performance over absolute coordinate systems. To support training, we construct GeoGen, a large-scale Geometry-aware Generative dataset comprising 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
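
The abstract's claim that relative camera-pose transformations outperform absolute coordinates has a simple geometric intuition: a relative pose cancels the world frame and is therefore invariant to where the origin is placed. A minimal sketch of that computation follows; the function names and the sinusoidal embedding are illustrative assumptions, not the paper's actual RePE parameterization.

```python
import numpy as np


def relative_pose(T_a: np.ndarray, T_b: np.ndarray) -> np.ndarray:
    """Relative rigid transform taking camera frame a to camera frame b.

    T_a and T_b are 4x4 world-to-camera extrinsics. T_rel = T_b @ inv(T_a)
    cancels the shared world frame, so the result does not depend on the
    choice of world origin -- the property that motivates encoding
    relative rather than absolute camera coordinates.
    """
    return T_b @ np.linalg.inv(T_a)


def encode_relative_pose(T_rel: np.ndarray, num_freqs: int = 4) -> np.ndarray:
    """Toy Fourier-feature embedding of a 4x4 relative pose.

    Flattens the rotation block and translation vector, then applies
    sin/cos features at geometrically spaced frequencies. This is a
    stand-in for the paper's (unspecified here) RePE parameterization.
    """
    feats = np.concatenate([T_rel[:3, :3].ravel(), T_rel[:3, 3]])  # 12 values
    freqs = 2.0 ** np.arange(num_freqs)                            # 1, 2, 4, 8
    angles = feats[:, None] * freqs[None, :]                       # (12, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()
```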
Problem

Research questions and friction points this paper is trying to address.

Enhance spatial reasoning in Multimodal Large Language Models
Overcome visual illiteracy by grounding symbolic reasoning in visual perception
Improve 3D understanding with implicit spatial world modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit spatial world modeling with geometry-aware generative feedback (see the training-step sketch after this list)
Relative Positional Encoding for camera-pose transformations
Large-scale geometry-aware generative dataset for training
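
To make the generative-feedback idea concrete, a hypothetical single training step might combine the usual language-modeling loss with a reconstruction loss from the visual generator. All interfaces below (`mllm`, `generator`, the loss weight `alpha`) are assumptions for illustration; the paper's exact objective and wiring are not specified here.

```python
import torch
import torch.nn.functional as F


def milo_style_step(mllm, generator, batch, alpha: float = 0.5):
    """One hypothetical training step with generative feedback.

    `mllm` and `generator` are placeholder modules; `alpha` is an
    assumed loss weight. Nothing here is the paper's exact recipe.
    """
    obs, action, outcome, text_targets = batch

    # Symbolic path: the MLLM answers spatial questions as usual and
    # exposes a latent state describing the imagined scene.
    hidden, text_loss = mllm(obs, action, targets=text_targets)

    # Generative feedback: decode the latent into the predicted
    # post-action view and compare it with the real outcome, so the
    # symbolic reasoning is implicitly grounded in perception.
    predicted_view = generator(hidden, action)
    feedback_loss = F.mse_loss(predicted_view, outcome)

    # Gradients from the feedback loss flow back into the MLLM.
    return text_loss + alpha * feedback_loss
```

The point of this design is that gradients from the feedback loss flow back into the MLLM's latent state, so learning spatial language is shaped by whether the imagined post-action view matches what is actually observed.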