🤖 AI Summary
This study addresses the underperformance of medical multimodal large language models (MLLMs) in zero-shot medical image understanding, which stems primarily from inadequate visual grounding capabilities, a limitation that has previously lacked systematic investigation. The work presents the first comprehensive analysis identifying inadequate visual grounding as a critical bottleneck in medical MLLMs and introduces VGMED, a novel evaluation benchmark annotated by clinical experts, along with new quantitative metrics and qualitative analysis protocols. Furthermore, the authors propose VGRefine, a training-free inference-time optimization method that enhances localization accuracy through attention redistribution. Extensive experiments across eight state-of-the-art medical MLLMs and six Med-VQA benchmarks (spanning eight imaging modalities and over 110,000 samples) demonstrate that VGRefine consistently improves both visual grounding and question-answering performance, achieving state-of-the-art results.
📝 Abstract
Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance that explicitly assesses the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images. Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across six diverse Med-VQA benchmarks (over 110K VQA samples from eight imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors in medical MLLMs' underperformance. Additional experiments are included in the supplementary material.
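The abstract does not spell out VGRefine's mechanism, only that it redistributes attention at inference time without training. As a rough illustration of that general idea (not the authors' actual method), the sketch below temperature-sharpens a model's attention over image tokens while preserving the total attention mass assigned to the image, so attention concentrates on the strongest image regions. The function name, temperature value, and masking scheme are all illustrative assumptions.

```python
import numpy as np

def refine_image_attention(attn, image_token_mask, temperature=0.5):
    """Illustrative attention redistribution (not the paper's exact method).

    attn: 1-D array of attention weights over all tokens (sums to 1).
    image_token_mask: boolean array, True at image-token positions.
    temperature < 1 sharpens the distribution over image tokens,
    concentrating attention on the highest-weight image regions.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(image_token_mask, dtype=bool)

    image_mass = attn[mask].sum()  # total attention currently paid to the image
    if image_mass == 0:
        return attn.copy()

    # Temperature-sharpen the image-token distribution in log space,
    # then rescale so the image keeps the same overall attention mass.
    sharpened = np.exp(np.log(attn[mask] + 1e-12) / temperature)
    sharpened = sharpened / sharpened.sum() * image_mass

    refined = attn.copy()
    refined[mask] = sharpened
    return refined
```

Because only the relative weights among image tokens change, the output remains a valid distribution and text-token attention is untouched, which is one way a training-free method can sharpen localization without altering model weights.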