SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

📅 2026-04-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the fragility of spatial reasoning in multimodal large language models, which the authors attribute to reliance on text-only reasoning that fails to preserve geometric consistency. To overcome this limitation, they propose SpatialImaginer, a framework that combines textual chain-of-thought reasoning with visual imagination through a divide-and-conquer strategy: textual reasoning handles high-level semantic planning, while visual imagination handles geometry-sensitive state transitions and preserves spatial consistency. A difficulty-aware data engine with closed-loop verification trains the model to invoke visual imagination adaptively, only when stable spatial state tracking is required. Experiments show that SpatialImaginer achieves state-of-the-art performance across multiple spatial intelligence benchmarks and is substantially more robust on complex, multi-step spatial reasoning tasks.

📝 Abstract
Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models (MLLMs). Despite promising performance, recent MLLMs often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using textual chain-of-thought for high-level semantic planning and visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
multimodal large language models
spatial reasoning
visual imagination
geometric structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual imagination
spatial reasoning
multimodal large language models
geometry-sensitive state transformation
difficulty-aware data engine