🤖 AI Summary
Vision-language models (VLMs) exhibit systematic limitations in multi-view spatial reasoning and embodied viewpoint transformation tasks. Existing test-time verifiers suffer from bias and insufficient reliability, undermining world model efficacy in such settings.
Method: We propose ViSA—a verification-aware world modeling framework that grounds high-confidence reward signals in *verifiable spatial assertions*. ViSA employs frame-anchored micro-claims, uncertainty-aware action-conditioned trajectory simulation, and rule-driven validation to correct exploration biases and enable balanced trajectory selection.
Contribution/Results: ViSA significantly improves spatial reasoning performance on SAT-Real. However, inconsistent gains on MMSI-Bench reveal an information bottleneck in current world models. This work is the first to ground test-time verification for world-model-based reasoning in verifiable spatial assertions, establishing a principled paradigm for embodied spatial reasoning.
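To make the idea of rule-driven validation concrete, here is a minimal, hypothetical sketch of how frame-anchored micro-claims could be checked against per-frame object detections and aggregated into a trajectory reward. All names (`MicroClaim`, `verify_claim`, `trajectory_reward`) and the left/right rules are illustrative assumptions, not ViSA's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of rule-driven micro-claim verification.
# Names and rules are illustrative, not ViSA's actual API.

@dataclass(frozen=True)
class MicroClaim:
    frame_id: int    # the imagined frame this claim is anchored to
    subject: str     # e.g. "chair"
    relation: str    # e.g. "left_of"
    reference: str   # e.g. "table"

def verify_claim(claim, detections):
    """Rule-driven check: compare detected x-centers for left/right relations.

    `detections` maps frame_id -> {object_name: (x_center, y_center)},
    with coordinates normalized to [0, 1]. Returns True/False when the
    claim is checkable, or None when either object is missing.
    """
    frame = detections.get(claim.frame_id, {})
    if claim.subject not in frame or claim.reference not in frame:
        return None  # unverifiable claims contribute no reward signal
    sx, _ = frame[claim.subject]
    rx, _ = frame[claim.reference]
    if claim.relation == "left_of":
        return sx < rx
    if claim.relation == "right_of":
        return sx > rx
    return None  # unsupported relation

def trajectory_reward(claims, detections):
    """Fraction of checkable claims that verify true: a grounded score
    for ranking imagined trajectories, rather than a heuristic guess."""
    results = [verify_claim(c, detections) for c in claims]
    checked = [r for r in results if r is not None]
    return sum(checked) / len(checked) if checked else 0.0

claims = [
    MicroClaim(0, "chair", "left_of", "table"),
    MicroClaim(0, "lamp", "right_of", "table"),
]
detections = {0: {"chair": (0.2, 0.5), "table": (0.6, 0.5), "lamp": (0.9, 0.4)}}
print(trajectory_reward(claims, detections))  # → 1.0
```

Because the score counts only claims that a rule can actually check, trajectories are compared on verified evidence, which is the property the summary attributes to ViSA's high-confidence reward signals.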
📝 Abstract
Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling, in which a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from them. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration and that random scoring often reduces answer entropy equally well, exposing systematic action biases and unreliable reward signals. To mitigate these issues, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, no verifier, including ours, achieves consistent scaling, suggesting that current world models form an information bottleneck in which imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, the good, and the ugly of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.