Spatial Mental Modeling from Limited Views

📅 2025-06-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) struggle to construct complete spatial mental models from limited viewpoints, performing near chance level on spatial mental modeling tasks. Method: The paper proposes a "map-then-reason" paradigm that introduces structured cognitive maps as trainable intermediate representations. The framework combines multi-step natural-language reasoning chains, synthesis of unseen intermediate views, and reward-based fine-tuning for spatial reasoning. Contribution/Results: The approach enables explicit modeling and end-to-end optimization of spatial reasoning. On the MindCube benchmark, it improves accuracy from 37.8% to 70.7% (+32.9 percentage points), substantially narrowing the gap between VLMs and human-level spatial mental modeling.

πŸ“ Abstract
Can Vision Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap, where existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of "what-if" movements). We then explore three approaches to help VLMs approximate spatial mental models: unseen intermediate views, natural language reasoning chains, and cognitive maps. The most significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
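As a rough illustration of the "map-then-reason" paradigm described in the abstract, the sketch below shows a two-stage pipeline: the model first emits a structured cognitive map from the input views, then answers the question conditioned on that map. Everything here is hypothetical, not the paper's code; `query_vlm` is a deterministic stub standing in for a real vision-language model call.

```python
def query_vlm(prompt, images=None):
    # Stub standing in for a real VLM call; returns canned outputs
    # so the two-stage flow below can be run end to end.
    if "cognitive map" in prompt:
        return '{"objects": {"chair": [0, 1], "table": [2, 1]}, "viewpoint": [0, 0]}'
    return "the table is to the right of the chair"

def map_then_reason(images, question):
    # Stage 1: ask the model to draw a structured cognitive map
    # (a top-down JSON layout of object positions) from the views.
    map_prompt = "From these views, output a cognitive map as JSON."
    cognitive_map = query_vlm(map_prompt, images)
    # Stage 2: reason over the generated map plus the original question.
    reason_prompt = (
        f"Map: {cognitive_map}\n"
        f"Question: {question}\n"
        "Answer using the map."
    )
    return query_vlm(reason_prompt, images)

answer = map_then_reason(images=["view1.png", "view2.png"],
                         question="Is the table left or right of the chair?")
print(answer)
```

The key design point the abstract emphasizes is that the map is an explicit intermediate output the model is trained to produce, not a hidden state, which makes the spatial representation both inspectable and optimizable end to end.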
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' ability to imagine full scenes from limited views
Evaluating VLMs' spatial mental modeling via cognitive mapping and simulation
Improving VLMs' spatial reasoning with cognitive maps and reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating cognitive maps from limited views
Combining map-then-reason training approach
Enhancing accuracy with reinforcement learning