Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

📅 2026-03-02
🤖 AI Summary
In medical visual question answering (VQA), chain-of-thought (CoT) reasoning often underperforms direct answering (DirA). This work proposes a "medical perception bottleneck" hypothesis, attributing CoT's failure primarily to insufficient cross-modal visual grounding. To address this, the authors introduce two training-free, inference-time grounding interventions: perception anchoring via region-of-interest prompts and description grounding via high-quality textual cues. Experiments across multiple medical VQA benchmarks and diverse model families show that these strategies significantly improve CoT accuracy, mitigate CoT's deficit relative to DirA, and in several settings reverse the gap, making CoT the stronger approach.

📝 Abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a *medical perception bottleneck*: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) *perception anchoring* via region-of-interest cues and (ii) *description grounding* via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
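Since both interventions are training-free prompt modifications, they can be sketched as prompt constructors. The wording and the bounding-box format below are illustrative assumptions, not the authors' exact templates:

```python
# Hypothetical sketch of the two inference-time interventions from the abstract.
# Prompt phrasing and ROI encoding are assumed, not taken from the paper.

def perception_anchoring_prompt(question: str, roi: tuple) -> str:
    """Prepend a region-of-interest cue so reasoning is anchored on the
    relevant image region before the chain of thought begins."""
    x1, y1, x2, y2 = roi
    return (
        f"Focus on the image region with bounding box ({x1}, {y1}, {x2}, {y2}). "
        "Describe the findings in that region first, then reason step by step.\n"
        f"Question: {question}"
    )

def description_grounding_prompt(question: str, description: str) -> str:
    """Inject a high-quality textual description so each reasoning step
    starts from reliable perceptual evidence instead of a weak percept."""
    return (
        "A reliable description of the image is given below. Ground every "
        "reasoning step in this description.\n"
        f"Description: {description}\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )

anchored = perception_anchoring_prompt("Is there a pleural effusion?", (40, 60, 180, 220))
grounded = description_grounding_prompt(
    "Is there a pleural effusion?",
    "Blunting of the right costophrenic angle with a meniscus sign.",
)
```

Either string would then be sent alongside the image to the VLM in place of a plain CoT prompt; the claim under test is that grounding the first perceptual step prevents the chain of thought from compounding early visual errors.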
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
chain-of-thought
visual grounding
perception bottleneck
visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical perception bottleneck
perception anchoring
description grounding
vision-language models
chain-of-thought prompting
Yuan Wu
Department of Computer Science, City University of Hong Kong (Dongguan), China
Zongxian Yang
Department of Computer Science, City University of Hong Kong (Dongguan), China
Jiayu Qian
Department of Computer Science, City University of Hong Kong (Dongguan), China
Songpan Gao
Department of Computer Science, City University of Hong Kong (Dongguan), China
Guanxing Chen
Research Assistant Professor at City University of Hong Kong (Dongguan); Prev. UTokyo, SYSU.
AI for life sciences
Qiankun Li
Research Fellow @ NTU; Ph.D. @ USTC
MLLM · AI4Health · Computer Vision · Pattern Recognition · Trustworthy AI
Y
Yu-An Huang
School of Computer Science, Northwestern Polytechnical University, China
Zhi-An Huang
City University of Hong Kong (Dongguan)
Artificial Intelligence · Bioinformatics · Medical Image Analysis