🤖 AI Summary
Medical vision-language models often suffer from hallucinations due to overreliance on linguistic priors, posing risks to clinical safety. This work proposes a training-free, inference-stage decoding method that dynamically reweights token probabilities by computing per-token Visual Grounding Scores (VGS) through visual perturbation contrast. By adaptively enhancing visually grounded outputs and suppressing hallucinatory content, the approach achieves the first perturbation-based, token-level adaptive hallucination mitigation, overcoming the limitations of conventional fixed-weight decoding strategies. Evaluated on MIMIC-Diff-VQA and VQA-RAD benchmarks with LLaVA-Med, CheXagent, and MedGemma, the method yields up to a 9.12% improvement in overall performance and an 8.98% gain in open-ended recall, with only a 2× increase in inference overhead.
📝 Abstract
Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token's visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
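The abstract describes the mechanism only at a high level; the sketch below shows one plausible instantiation of the idea, assuming the VGS is a per-token log-ratio between the next-token distributions conditioned on the original and the visually degraded image, and that reweighting adds this score to the logits. The function name `vgs_decode_step` and the scale `alpha` are illustrative, not the paper's actual API or formula.

```python
import numpy as np

def softmax(x):
    z = x - x.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def vgs_decode_step(logits_orig, logits_distort, alpha=1.0):
    """One decoding step of a VGS-style adaptive reweighting (sketch).

    logits_orig:    next-token logits conditioned on the original image
    logits_distort: logits conditioned on the visually degraded image
    alpha:          illustrative scale for the per-token adjustment
    """
    p = softmax(logits_orig)
    q = softmax(logits_distort)
    # Per-token Visual Grounding Score: how much a token's probability
    # drops when visual evidence is removed. Hallucination-prone tokens
    # keep or gain probability under distortion, so their score is low.
    vgs = np.log(p + 1e-12) - np.log(q + 1e-12)
    # Adaptive reweighting: each token gets its own adjustment, unlike
    # fixed-weight contrastive decoding which applies one global weight.
    return softmax(logits_orig + alpha * vgs)
```

For example, a token whose probability rises when the image is degraded (language-prior driven) is pushed down, while a token whose probability falls (visually grounded) is pushed up, relative to greedy decoding on the original logits.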