🤖 AI Summary
This work addresses the critical issue of hallucination in multimodal large language models (MLLMs) for medical visual question answering (VQA), where generated answers often contradict visual evidence, posing significant risks of misdiagnosis. To mitigate this, the authors propose a training-free, plug-and-play visual logical consistency verification framework that establishes a bidirectional reasoning loop grounded in image evidence. The framework extracts semantic units from model outputs, generates verification questions, and enforces consistency between the model’s visual attention and the factual content of the image. Unlike existing introspective approaches that rely on indirect uncertainty estimation, this method directly validates the factual consistency of answers. Extensive experiments demonstrate that the proposed approach substantially outperforms current introspective methods across multiple medical VQA benchmarks and models, while also enhancing the performance of uncertainty-based techniques—all with high computational efficiency.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable capability in assisting disease diagnosis in medical visual question answering (VQA). However, their outputs remain vulnerable to hallucinations (i.e., responses that contradict visual facts), posing significant risks in high-stakes medical scenarios. Recent introspective detection methods, particularly uncertainty-based approaches, offer computational efficiency but are fundamentally indirect, as they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. To address this limitation, we propose Visual Logical Loop Verification (V-Loop), a training-free and plug-and-play framework for hallucination detection in medical VQA. V-Loop introduces a bidirectional reasoning process that forms a visually grounded logical loop to verify factual correctness. Given a primary image-question pair, the MLLM first produces an answer. V-Loop then extracts semantic units from the primary QA pair, generates a verification question by conditioning on the answer unit to re-query the question unit, and enforces visual attention consistency to ensure that answering both the primary and verification questions relies on the same image evidence. If the verification answer matches the expected semantic content, the logical loop closes, indicating factual grounding; otherwise, the primary answer is flagged as hallucinated. Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, remains highly efficient, and further boosts uncertainty-based approaches when used in combination.
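The verification loop described above can be sketched in code. This is a minimal, hypothetical illustration of the control flow only: the function names, the string-matching consistency check, and the mock model are assumptions for demonstration, not the authors' implementation (which additionally enforces visual attention consistency, omitted here).

```python
def extract_semantic_units(question: str, answer: str) -> dict:
    # Split the primary QA pair into a question unit and an answer unit.
    # (Toy version; the paper extracts semantic units from model outputs.)
    return {"question_unit": question, "answer_unit": answer}

def make_verification_question(units: dict) -> str:
    # Condition on the answer unit to re-query the question unit.
    return (f"Is '{units['answer_unit']}' the correct finding for: "
            f"{units['question_unit']}")

def v_loop_verify(model, image, question: str):
    """Return (primary_answer, hallucinated_flag) for one QA pair."""
    answer = model(image, question)              # step 1: primary answer
    units = extract_semantic_units(question, answer)
    vq = make_verification_question(units)       # step 2: verification question
    verification_answer = model(image, vq)       # step 3: re-query same image
    # Step 4: the loop "closes" if the verification answer confirms the
    # expected semantic content (crude substring check as a stand-in).
    loop_closed = units["answer_unit"].lower() in verification_answer.lower()
    return answer, not loop_closed

# Mock MLLMs standing in for a real multimodal model.
def consistent_model(image, prompt):
    if prompt.startswith("Is '"):
        return "Yes, pneumonia is present."      # confirms its own answer
    return "pneumonia"

def inconsistent_model(image, prompt):
    if prompt.startswith("Is '"):
        return "No, the scan appears normal."    # contradicts its own answer
    return "pneumonia"

ans1, hall1 = v_loop_verify(consistent_model, None, "What disease is shown?")
ans2, hall2 = v_loop_verify(inconsistent_model, None, "What disease is shown?")
```

Under this sketch, the consistent model closes the loop (`hall1` is `False`), while the self-contradicting model is flagged as hallucinating (`hall2` is `True`), mirroring the flag-or-pass decision the abstract describes.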