Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

πŸ“… 2025-12-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Medical vision-language models (MedVLMs) are prone to hallucination because they over-rely on textual priors; existing mitigation strategies either require costly expert annotations or apply global, untargeted interventions, leaving clinical robustness insufficient. To address this, we propose ARCD, a three-tiered contrastive decoding framework guided by anatomical region masks that performs fine-grained re-weighting at the token, attention, and logits levels without model fine-tuning, enabling plug-and-play hallucination suppression. ARCD is presented as the first inference-time intervention framework to jointly achieve anatomical interpretability, training-free deployment, and cross-modal generalizability. Evaluated on chest X-ray, CT, brain MRI, and ocular ultrasound data, ARCD significantly reduces hallucination rates, improves regional understanding accuracy by 12.7%, and boosts diagnostic accuracy by an average of 9.3%.

πŸ“ Abstract
Medical Vision-Language Models (MedVLMs) hold immense promise for clinical application. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
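The page does not include code, so as an illustration, the logits-level contrastive step can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the α-weighting form, and the idea of contrasting region-visible against region-ablated logits are assumptions based on standard contrastive decoding, with the anatomical mask deciding which image regions the two branches see.

```python
import numpy as np

def contrastive_logits(region_logits, ablated_logits, alpha=1.0):
    """Hypothetical logits-level contrastive re-weighting.

    region_logits:  next-token logits when the model sees the anatomical
                    region highlighted by the mask.
    ablated_logits: logits when that region is masked out of the image.
    Tokens whose score drops once the region is hidden (i.e. visually
    grounded tokens) are amplified; tokens driven purely by textual
    priors are unchanged or suppressed.
    """
    return (1.0 + alpha) * region_logits - alpha * ablated_logits

def softmax(z):
    # Numerically stable softmax over a 1-D logits vector.
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy vocabulary: ["normal", "cardiomegaly", "effusion"]
region  = np.array([1.0, 3.0, 0.5])   # cardiac region visible
ablated = np.array([1.0, 2.5, 0.5])   # cardiac region masked out
adjusted = contrastive_logits(region, ablated, alpha=1.0)
# "cardiomegaly" gains probability because its evidence comes from the region.
print(softmax(adjusted))
```

With α = 0 the decoder falls back to ordinary greedy decoding on the region-conditioned logits; larger α strengthens the contrast but can over-penalize tokens the two branches agree on, which is presumably why the paper applies the correction at three levels rather than the logits alone.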
Problem

Research questions and friction points this paper is trying to address.

Mitigate hallucinations in medical VLMs
Provide targeted anatomical region guidance
Enhance diagnostic accuracy across diverse datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play strategy using anatomical region guidance
Three-tiered contrastive decoding at the token, attention, and logits levels
Dynamic re-weighting to suppress hallucinations in medical VLMs
Xiao Liang
School of Computer Science and Technology, Xidian University, China
Chenxi Liu
School of Computer Science and Technology, Xidian University, China
Zhi Ma
China Mobile (Hangzhou) Information Technology Co., Ltd.
Di Wang
School of Computer Science and Technology, Xidian University, China
Bin Jing
School of Biomedical Engineering, Capital Medical University, China
Quan Wang
School of Computer Science and Technology, Xidian University, China
Yuanyuan Shi
Assistant Professor, UCSD