Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

📅 2026-03-02
🤖 AI Summary
In medical visual question answering (VQA), chain-of-thought (CoT) reasoning often underperforms direct answering (DirA). This work proposes a "medical perception bottleneck" hypothesis, attributing CoT's failure primarily to insufficient cross-modal visual grounding. To address this, the authors introduce two training-free, inference-time grounding interventions: perception anchoring via region-of-interest prompts and description grounding via high-quality textual cues. Experiments across multiple medical VQA benchmarks and diverse model families show that these strategies significantly improve CoT accuracy, mitigate CoT's deficit relative to DirA, and in several settings reverse the gap, making CoT the stronger approach.

📝 Abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a *medical perception bottleneck*: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) *perception anchoring* via region-of-interest cues and (ii) *description grounding* via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
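Since both interventions are training-free prompt modifications, they can be sketched as prompt constructors. The wording and the bounding-box format below are illustrative assumptions, not the authors' exact templates:

```python
# Hypothetical sketch of the two inference-time interventions from the abstract.
# Prompt phrasing and ROI encoding are assumed, not taken from the paper.

def perception_anchoring_prompt(question: str, roi: tuple) -> str:
    """Prepend a region-of-interest cue so reasoning is anchored on the
    relevant image region before the chain of thought begins."""
    x1, y1, x2, y2 = roi
    return (
        f"Focus on the image region with bounding box ({x1}, {y1}, {x2}, {y2}). "
        "Describe the findings in that region first, then reason step by step.\n"
        f"Question: {question}"
    )

def description_grounding_prompt(question: str, description: str) -> str:
    """Inject a high-quality textual description so each reasoning step
    starts from reliable perceptual evidence instead of a weak percept."""
    return (
        "A reliable description of the image is given below. Ground every "
        "reasoning step in this description.\n"
        f"Description: {description}\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )

anchored = perception_anchoring_prompt("Is there a pleural effusion?", (40, 60, 180, 220))
grounded = description_grounding_prompt(
    "Is there a pleural effusion?",
    "Blunting of the right costophrenic angle with a meniscus sign.",
)
```

Either string would then be sent alongside the image to the VLM in place of a plain CoT prompt; the claim under test is that grounding the first perceptual step prevents the chain of thought from compounding early visual errors.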
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
chain-of-thought
visual grounding
perception bottleneck
visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical perception bottleneck
perception anchoring
description grounding
vision-language models
chain-of-thought prompting
Yuan Wu
Department of Computer Science, City University of Hong Kong (Dongguan), China
Zongxian Yang
Department of Computer Science, City University of Hong Kong (Dongguan), China
Jiayu Qian
Department of Computer Science, City University of Hong Kong (Dongguan), China
Songpan Gao
Department of Computer Science, City University of Hong Kong (Dongguan), China
Guanxing Chen
Research Assistant Professor at City University of Hong Kong (Dongguan); Prev. UTokyo, SYSU.
AI for life sciences
Qiankun Li
Research Fellow @ NTU; Ph.D. @ USTC
MLLM · AI4Health · Computer Vision · Pattern Recognition · Trustworthy AI
Y
Yu-An Huang
School of Computer Science, Northwestern Polytechnical University, China
Zhi-An Huang
City University of Hong Kong (Dongguan)
Artificial Intelligence · Bioinformatics · Medical Image Analysis