MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

📅 2025-07-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) are constrained by 2D image inputs and lack an intrinsic capability for 3D dynamic scene modeling, leading to poor performance on spatial reasoning tasks such as predicting scene changes after egocentric motion. To address this, we propose a test-time scaling framework that requires no fine-tuning, introducing the first integration of a controllable, video-diffusion-driven world model with VLMs. Our method iteratively plans motion trajectories and synthesizes virtual multi-view observations, dynamically generating post-motion scene sequences to enable explicit modeling of, and reasoning about, 3D spatial change. As a plug-and-play module, it enhances VLMs without architectural modification. Evaluated on the SAT benchmark, our approach achieves an average improvement of over 8%, significantly outperforming VLMs trained for test-time inference with reinforcement learning. These results demonstrate the world model's efficacy as a general-purpose reasoning augmentation module and highlight its strong generalization potential across spatial reasoning tasks.

๐Ÿ“ Abstract
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time inference VLMs trained through reinforcement learning, further demonstrating the potential of world models for test-time scaling.
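The exploration loop described in the abstract (plan a camera motion, synthesize the resulting view, then reason over all accumulated views) can be sketched as follows. This is a minimal illustration only; `vlm_propose_action`, `world_model_render`, and `mindjourney_views` are hypothetical stand-ins, not the paper's actual API.

```python
def vlm_propose_action(question, views):
    # Hypothetical stand-in: a real system would prompt the VLM to pick
    # the next camera motion (e.g. "turn_left", "move_forward").
    return "move_forward"

def world_model_render(view, action):
    # Hypothetical stand-in: a real system would run the controllable
    # video-diffusion world model to synthesize the post-motion view.
    return f"{view} -> {action}"

def mindjourney_views(question, image, steps=3):
    """Gather multi-view evidence by interleaving planning and synthesis."""
    views = [image]
    for _ in range(steps):
        action = vlm_propose_action(question, views)
        views.append(world_model_render(views[-1], action))
    # A real system would now feed `views` back to the VLM,
    # which answers `question` from the multi-view evidence.
    return views
```

Because the loop only calls the VLM and world model as black boxes, it plugs into an existing VLM without any architectural change, matching the plug-and-play claim above.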
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D spatial reasoning in vision-language models
Addressing limitations of VLMs in dynamic scene anticipation
Improving test-time scaling with world models for robust reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Couples a VLM with a controllable video-diffusion world model
Iteratively sketches camera trajectories for multi-view reasoning
Boosts spatial reasoning without fine-tuning via test-time scaling