🤖 AI Summary
In medical visual question answering (VQA), chain-of-thought (CoT) prompting often underperforms direct answering (DirA), a counter-intuitive degradation. This work proposes a "medical perception bottleneck" hypothesis, attributing CoT's failure primarily to insufficient cross-modal alignment rather than weak textual reasoning. To address this, the authors introduce two training-free, inference-time visual grounding interventions: perception anchoring via region-of-interest cues and description grounding via high-quality textual guidance. Experiments across multiple medical VQA benchmarks and diverse model families show that these strategies significantly improve CoT accuracy. The interventions not only mitigate CoT's deficit relative to DirA but also reverse the gap in several settings, making CoT the stronger approach under those conditions.
📝 Abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a *medical perception bottleneck*: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) *perception anchoring* via region-of-interest cues and (ii) *description grounding* via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT–DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available [here](https://github.com/TianYin123/Better_Eyes_Better_Thoughts).
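To make the "training-free, inference-time" nature of perception anchoring concrete, here is a minimal sketch of how region-of-interest cues could be injected into a CoT prompt before it reaches the VLM. The function name, prompt wording, and coordinate format are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of "perception anchoring": prepend region-of-interest
# (ROI) cues to the question so the model attends to the relevant image
# regions before reasoning. All names and prompt text are assumptions for
# illustration only; they are not taken from the paper's code.

def build_anchored_cot_prompt(question, roi_boxes, image_size):
    """Embed normalized ROI coordinates into a chain-of-thought prompt.

    question:   the medical VQA question string.
    roi_boxes:  list of (x0, y0, x1, y1) pixel-space bounding boxes.
    image_size: (width, height) of the input image.
    """
    width, height = image_size
    # Normalize boxes to [0, 1] so the cue is resolution-independent.
    cues = ", ".join(
        f"[{x0 / width:.2f}, {y0 / height:.2f}, "
        f"{x1 / width:.2f}, {y1 / height:.2f}]"
        for x0, y0, x1, y1 in roi_boxes
    )
    return (
        f"Focus on the image regions {cues} (normalized xyxy coordinates). "
        f"First describe the findings within those regions, then reason "
        f"step by step to answer: {question}"
    )

prompt = build_anchored_cot_prompt(
    "Is there a pleural effusion?",
    roi_boxes=[(100, 200, 300, 400)],
    image_size=(512, 512),
)
print(prompt)
```

Because the intervention only rewrites the prompt, it requires no gradient updates or fine-tuning and can be applied to any instruction-following VLM at inference time.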