🤖 AI Summary
Medical vision-language models often suffer from hallucinations due to overreliance on linguistic priors, posing risks to clinical safety. This work proposes a training-free, inference-stage decoding method that dynamically reweights token probabilities by computing per-token Visual Grounding Scores (VGS) through visual perturbation contrast. By adaptively enhancing visually grounded outputs and suppressing hallucinatory content, the approach achieves the first perturbation-based, token-level adaptive hallucination mitigation, overcoming the limitations of conventional fixed-weight decoding strategies. Evaluated on MIMIC-Diff-VQA and VQA-RAD benchmarks with LLaVA-Med, CheXagent, and MedGemma, the method yields up to a 9.12% improvement in overall performance and an 8.98% gain in open-ended recall, with only a 2× increase in inference overhead.
📝 Abstract
Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token's visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
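The abstract describes the mechanism only at a high level; the sketch below shows one plausible instantiation of the idea, assuming the VGS is a per-token log-ratio between the next-token distributions conditioned on the original and the visually degraded image, and that reweighting adds this score to the logits. The function name `vgs_decode_step` and the scale `alpha` are illustrative, not the paper's actual API or formula.

```python
import numpy as np

def softmax(x):
    z = x - x.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def vgs_decode_step(logits_orig, logits_distort, alpha=1.0):
    """One decoding step of a VGS-style adaptive reweighting (sketch).

    logits_orig:    next-token logits conditioned on the original image
    logits_distort: logits conditioned on the visually degraded image
    alpha:          illustrative scale for the per-token adjustment
    """
    p = softmax(logits_orig)
    q = softmax(logits_distort)
    # Per-token Visual Grounding Score: how much a token's probability
    # drops when visual evidence is removed. Hallucination-prone tokens
    # keep or gain probability under distortion, so their score is low.
    vgs = np.log(p + 1e-12) - np.log(q + 1e-12)
    # Adaptive reweighting: each token gets its own adjustment, unlike
    # fixed-weight contrastive decoding which applies one global weight.
    return softmax(logits_orig + alpha * vgs)
```

For example, a token whose probability rises when the image is degraded (language-prior driven) is pushed down, while a token whose probability falls (visually grounded) is pushed up, relative to greedy decoding on the original logits.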