🤖 AI Summary
Medical vision-language models (Med-VLMs) give inconsistent answers to semantically equivalent medical visual questions, which we attribute to insufficient medical concept alignment and to training-data biases that favor syntactic shortcuts. To expose the problem, we introduce RoMed, a benchmark of 144k questions built upon original VQA datasets through word-level, sentence-level, and semantic-level perturbations. On RoMed, SOTA models such as LLaVA-Med degrade sharply (e.g., a 40% decline in Recall) compared to the original benchmarks. We propose the Consistency and Contrastive Learning (CCL) framework, which combines two components: (i) knowledge-anchored consistency learning, which aligns the model with medical knowledge rather than shallow feature patterns, and (ii) bias-aware contrastive learning, which suppresses reliance on data-specific priors such as spurious syntactic cues. CCL achieves state-of-the-art performance on three popular VQA benchmarks and improves answer consistency by 50% on the RoMed test set.
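The two learning components named above can be illustrated with generic stand-ins. The following is a minimal sketch, not the CCL implementation: it assumes a symmetric KL divergence between answer distributions as the consistency term and an InfoNCE-style loss over question embeddings as the contrastive term. The function names, the temperature value, and the choice of negatives (questions with similar syntax but different meaning) are all hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Convert raw answer logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_original, logits_variant):
    """Assumed consistency term: symmetric KL between the answer
    distributions for a question and one semantically equivalent rephrasing."""
    p = softmax(logits_original)
    q = softmax(logits_variant)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Assumed contrastive term: pull the anchor toward its rephrasing
    (positive) and away from the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Under these assumptions, identical answer distributions give a consistency loss of zero, and the contrastive loss is small when a question embedding sits close to its rephrasing and far from the negatives. How the two terms are weighted against the standard VQA objective is not specified here.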
📝 Abstract
In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) are fragile in Medical Visual Question Answering: their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset of 144k questions built upon original VQA datasets, with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models such as LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to the original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, which aligns Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, which mitigates data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.
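The abstract does not define how answer consistency is measured. As an illustration only, one plausible definition is the fraction of question groups (an original question plus its rephrasings) for which the model returns the same answer on every variant. The sketch below assumes that definition; the record layout, the `normalize` helper, and the group identifiers are hypothetical.

```python
from collections import defaultdict


def normalize(answer):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as inconsistency."""
    return " ".join(answer.lower().strip().split())


def answer_consistency(records):
    """Return the fraction of groups whose variants all receive the
    same normalized answer.

    records: iterable of (group_id, predicted_answer) pairs, where a
    group holds one question and its semantically equivalent variants.
    """
    groups = defaultdict(set)
    for group_id, answer in records:
        groups[group_id].add(normalize(answer))
    if not groups:
        return 0.0
    consistent = sum(1 for answers in groups.values() if len(answers) == 1)
    return consistent / len(groups)


# Hypothetical predictions: q1 is answered consistently, q2 is not.
records = [
    ("q1", "Yes"),
    ("q1", "yes "),
    ("q1", " YES"),
    ("q2", "pneumonia"),
    ("q2", "no finding"),
]
print(answer_consistency(records))  # 0.5
```

This all-or-nothing score is strict: a single divergent answer marks the whole group as inconsistent. A softer alternative would average pairwise agreement within each group, which is more forgiving when groups contain many variants.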