🤖 AI Summary
This study investigates how external information, such as spatial cues, commonsense knowledge, and chain-of-thought prompts, affects visual spatial reasoning (VSR) performance. Through hypothesis-driven controlled experiments on two public benchmarks, the authors systematically evaluate three representative vision-language models. Their findings reveal that a single, precise spatial cue consistently outperforms multi-context fusion; that weakly relevant or excessive commonsense knowledge degrades performance; and that chain-of-thought prompting helps only when spatial localization is sufficiently accurate. This work is the first to demonstrate that “more information is not always better” in VSR, advocating instead for the selective injection of task-aligned signals and clarifying the conditions under which spatial localization and chain-of-thought reasoning jointly improve performance.
📝 Abstract
Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.
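To make the three injection conditions concrete, below is a minimal, hypothetical Python sketch of how spatial cues, commonsense snippets, and a CoT instruction might be spliced into a VLM prompt at inference time. All names here (`SpatialCue`, `build_prompt`) are illustrative assumptions, not the authors' code or any specific model's API.

```python
# Hypothetical sketch of the three injection conditions described above.
# These names do not come from the paper; they only illustrate how injected
# context might be assembled into a prompt at inference time.

from dataclasses import dataclass


@dataclass
class SpatialCue:
    """A single explicit spatial relation, e.g. 'The cat is in front of the sofa.'"""
    subject: str
    relation: str
    target: str

    def render(self) -> str:
        return f"{self.subject} is {self.relation} {self.target}."


def build_prompt(
    question: str,
    spatial_cues: list[SpatialCue] | None = None,
    commonsense: list[str] | None = None,
    use_cot: bool = False,
) -> str:
    """Assemble an inference-time prompt with optional injected context."""
    parts: list[str] = []
    if spatial_cues:
        # Condition (i): vary the type and number of spatial contexts
        # (one precise cue vs. multi-context fusion).
        parts.append("Spatial context: " + " ".join(c.render() for c in spatial_cues))
    if commonsense:
        # Condition (ii): vary the amount and relevance of injected knowledge.
        parts.append("Background knowledge: " + " ".join(commonsense))
    parts.append("Question: " + question)
    if use_cot:
        # Condition (iii): CoT instruction, interacting with spatial grounding.
        parts.append("Think step by step about the spatial relations before answering.")
    else:
        parts.append("Answer with yes or no.")
    return "\n".join(parts)


if __name__ == "__main__":
    q = "Is the cat in front of the sofa?"
    cue = SpatialCue("The cat", "in front of", "the sofa")
    # Single precise cue plus CoT, the setting the paper reports as strongest:
    print(build_prompt(q, spatial_cues=[cue], use_cot=True))
```

Under the reported findings, the favorable configuration keeps `spatial_cues` to a single precise cue and enables `use_cot` only when that cue is reliable; stacking many cues or loosely related knowledge corresponds to the degraded settings the paper measures.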