When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual spatial reasoning methods remain unreliable under unseen or alternative viewpoints, and indiscriminately invoking visual imagination can inject misleading information while increasing computational overhead. To address this, the work presents the first systematic analysis of when visual imagination is necessary during inference and what its performance bounds are, and proposes AVIC, a framework that adaptively schedules both the invocation and the scale of a world model according to the sufficiency of the current visual evidence: a multimodal large language model is paired with the world model and dynamically decides whether to engage visual imagination. Experiments on the SAT, MMSI, and R2R benchmarks show that AVIC substantially reduces both world-model calls and language-token consumption while matching or surpassing the reasoning performance of fixed-strategy baselines.
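
Since the paper's code is not reproduced here, the following is a minimal sketch of what such an adaptive schedule could look like, assuming a sufficiency threshold `tau` that decides *when* to imagine and a call budget `max_calls` that bounds *how much*; `ReasoningState`, `sufficiency_score`, `imagine_view`, and `final_answer` are illustrative placeholders for the MLLM and world-model components, not AVIC's actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningState:
    question: str
    views: List[str]                                    # observed viewpoints
    imagined: List[str] = field(default_factory=list)   # world-model renderings

def sufficiency_score(state: ReasoningState) -> float:
    """Stand-in for an MLLM judgment of whether the current visual
    evidence already determines the answer (confidence in [0, 1]).
    Toy heuristic: confidence grows with each imagined view."""
    return min(1.0, 0.4 + 0.25 * len(state.imagined))

def imagine_view(state: ReasoningState) -> str:
    """Stand-in for one world-model call rendering a novel viewpoint."""
    return f"imagined_view_{len(state.imagined)}"

def final_answer(state: ReasoningState) -> str:
    """Stand-in for the MLLM answer conditioned on all available views."""
    return (f"answer derived from {len(state.views)} observed + "
            f"{len(state.imagined)} imagined view(s)")

def adaptive_inference(question: str, views: List[str],
                       tau: float = 0.8, max_calls: int = 4) -> str:
    """Imagine only while evidence is judged insufficient (when),
    and never beyond the call budget (how much)."""
    state = ReasoningState(question, views)
    while len(state.imagined) < max_calls and sufficiency_score(state) < tau:
        state.imagined.append(imagine_view(state))  # pay for one more viewpoint
    return final_answer(state)

print(adaptive_inference("Is the chair left of the table from the doorway?",
                         ["front_view.png"]))
```

Under a scheme like this, a question answerable from the static views alone never triggers the world model, which is where the reported savings in world-model calls and language tokens would come from.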

📝 Abstract
Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
Problem

Research questions and friction points this paper is trying to address.

visual spatial reasoning
world models
test-time imagination
multimodal large language models
adaptive scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive test-time scaling
world models
visual imagination
spatial reasoning
selective control