🤖 AI Summary
This work addresses a limitation of large multimodal models (LMMs) in viewpoint-dependent spatial reasoning: their performance is constrained by reliance on a single static image. The authors propose the Thinking with Novel Views (TwNV) paradigm, which integrates generative novel-view synthesis into the reasoning loop. A Reasoner LMM identifies spatial ambiguities and directs a generative Painter module to synthesize alternative views from specified camera poses, then refines its spatial analysis iteratively using the newly generated perspectives. The study systematically validates novel-view generation as a mechanism for enhancing LMMs' spatial intelligence. It examines how instruction format, generation fidelity, and reasoning accuracy relate to one another, and it introduces inference-time visual scaling through multi-turn view refinement. Consistent gains of 1.3–3.9 percentage points are observed across four spatial subtask categories and four LMM architectures, with the largest improvements on viewpoint-sensitive subtasks.
📝 Abstract
Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.
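The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `CameraPose` fields, the `ReasonerOutput` structure, the `Reasoner` and `Painter` callable signatures, and the `max_turns` budget are all assumptions introduced here, and images are represented as opaque string handles so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical numerical view instruction (not the paper's exact format)."""
    azimuth_deg: float           # rotation around the scene, in degrees
    elevation_deg: float         # camera tilt above the horizon, in degrees
    distance_scale: float = 1.0  # relative camera distance (1.0 = unchanged)


@dataclass
class ReasonerOutput:
    answer: str                          # the Reasoner's current best answer
    requested_pose: Optional[CameraPose]  # None once no further view is needed


# Assumed interfaces: images are opaque handles (plain strings here).
Reasoner = Callable[[str, List[str]], ReasonerOutput]
Painter = Callable[[str, CameraPose], str]


def think_with_novel_views(
    question: str,
    image: str,
    reasoner: Reasoner,
    painter: Painter,
    max_turns: int = 3,
) -> Tuple[str, List[str]]:
    """Answer a spatial question, synthesizing extra views while ambiguity remains."""
    views = [image]
    out = reasoner(question, views)
    for _ in range(max_turns):
        if out.requested_pose is None:
            break  # the Reasoner considers the current evidence sufficient
        # Ask the Painter for the requested viewpoint of the original image,
        # then let the Reasoner re-examine the scene with the added view.
        views.append(painter(image, out.requested_pose))
        out = reasoner(question, views)
    return out.answer, views
```

Raising `max_turns` corresponds to the inference-time visual scaling in research question (3): each additional turn gives the Reasoner one more synthesized view to consult before it commits to an answer.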