AI Summary
This study addresses a critical yet overlooked issue in multimodal large language models: their performance significantly degrades on visual spatial reasoning tasks when chain-of-thought (CoT) prompting is applied. Through a systematic evaluation of 17 models across 13 spatial reasoning benchmarks, the work reveals for the first time that CoT induces a detrimental effect: models overly rely on textual priors while neglecting visual input, leading to increased hallucination and reduced accuracy. The authors introduce the No-Image++ ablation method to demonstrate that models exhibit severe shortcut learning, prioritizing spurious textual cues over genuine visual evidence. These findings not only expose the fragility of current multimodal reasoning systems but also provide a crucial diagnostic tool to guide the development of more robust and grounded multimodal architectures.
Abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT)-based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.