Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines an overlooked failure mode of multimodal large language models: chain-of-thought (CoT) prompting consistently lowers their performance on visual spatial reasoning tasks. In a systematic evaluation of 17 models across 13 spatial reasoning benchmarks, the authors find that CoT-prompted models lean on textual priors at the expense of the visual input. Using a No-Image++ ablation, they show that these models exhibit severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. The findings challenge the efficacy of text-only CoT for spatial tasks and point to the need for vision-centric reasoning paradigms.

📝 Abstract

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
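
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation harness, and it does not reproduce No-Image++, whose details are not given on this page; the image-free condition below is a plain no-image baseline used only as a stand-in. The `Item` type, the `Model` callable signature, the prompt suffixes, and the `compare` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Item:
    """One multiple-choice spatial question (hypothetical schema)."""
    question: str
    choices: tuple[str, ...]
    answer: str
    image: Optional[bytes] = None


# Assumed model interface: (prompt, image or None) -> chosen option text.
Model = Callable[[str, Optional[bytes]], str]

# Illustrative prompt suffixes, not the wording used in the paper.
DIRECT_SUFFIX = "Answer with the option text only."
COT_SUFFIX = "Think step by step, then give the final option text."


def build_prompt(item: Item, suffix: str) -> str:
    options = "\n".join(f"- {choice}" for choice in item.choices)
    return f"{item.question}\n{options}\n{suffix}"


def accuracy(model: Model, items: Sequence[Item], suffix: str,
             with_image: bool = True) -> float:
    """Fraction of items answered correctly under one prompting condition."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for item in items:
        image = item.image if with_image else None
        prediction = model(build_prompt(item, suffix), image)
        correct += int(prediction.strip().lower() == item.answer.strip().lower())
    return correct / len(items)


def compare(model: Model, items: Sequence[Item]) -> dict[str, float]:
    """Direct vs. CoT accuracy, plus a plain image-free baseline and chance."""
    direct = accuracy(model, items, DIRECT_SUFFIX)
    cot = accuracy(model, items, COT_SUFFIX)
    no_image = accuracy(model, items, COT_SUFFIX, with_image=False)
    chance = sum(1 / len(item.choices) for item in items) / len(items)
    return {
        "direct": direct,
        "cot": cot,
        "cot_minus_direct": cot - direct,  # negative means CoT hurt accuracy
        "no_image": no_image,              # well above chance hints at text shortcuts
        "chance": chance,
    }
```

Under these assumptions, a negative `cot_minus_direct` corresponds to the degradation the abstract reports, and a `no_image` score well above `chance` would suggest that answers are recoverable from the text alone.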

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

Visual Spatial Reasoning

Multimodal LLMs

Shortcut Learning

Hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought

Visual Spatial Reasoning

Multimodal LLMs