🤖 AI Summary
This study addresses the lack of systematic reliability evaluation for self-verification in medical visual question answering, where an unreliable verifier risks passing incorrect answers as checked. The authors propose a diagnostic framework that separates a verifier’s discrimination capability from its agreement bias, and use it to characterize the “verification mirage”: a regime in which the verifier falsely accepts incorrect generator answers because it over-agrees with the generator. This effect is shown to be task-dependent. Using logistic mixed-effects models, saliency analysis, and cross-model verification, the authors evaluate this behavior across six open-weight vision-language models, five medical VQA datasets, and seven medical tasks. Findings reveal that knowledge-intensive clinical tasks are most susceptible to the mirage, cross-verification reduces but does not eliminate it, multi-turn verification tends to lock in errors rather than correct them, and verifiers under-attend to image evidence relative to generators, which limits their ability to serve as independent safety signals.
📝 Abstract
Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
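To make the decomposition into discrimination capability and agreement bias concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the metric definitions, so everything here is an assumption for illustration: `summarize_verifier` and `VerifierStats` are hypothetical names, agreement bias is taken as the verifier's acceptance rate minus the generator's actual accuracy, and discrimination is taken as the acceptance rate on correct answers minus the acceptance rate on incorrect answers.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class VerifierStats:
    """Hypothetical summary of verifier behavior (definitions are assumptions)."""
    verifier_error: float         # share of verdicts that are wrong
    acceptance_rate: float        # share of answers the verifier accepts
    generator_accuracy: float     # share of answers that are actually correct
    agreement_bias: float         # acceptance_rate - generator_accuracy (assumed)
    false_acceptance_rate: float  # P(accept | generator answer is wrong)
    discrimination: float         # P(accept | correct) - P(accept | wrong) (assumed)


def summarize_verifier(records: Iterable[Tuple[bool, bool]]) -> VerifierStats:
    """Summarize (generator_correct, verifier_accepts) pairs.

    Each record holds whether the generator's answer was correct and
    whether the verifier accepted it.
    """
    rows: List[Tuple[bool, bool]] = list(records)
    n = len(rows)
    if n == 0:
        raise ValueError("at least one record is required")

    n_correct = sum(1 for correct, _ in rows if correct)
    n_wrong = n - n_correct
    accept_correct = sum(1 for correct, accept in rows if correct and accept)
    accept_wrong = sum(1 for correct, accept in rows if not correct and accept)

    # The verifier errs when it accepts a wrong answer or rejects a correct one.
    false_rejects = n_correct - accept_correct
    verifier_error = (accept_wrong + false_rejects) / n

    acceptance_rate = (accept_correct + accept_wrong) / n
    generator_accuracy = n_correct / n

    # Conditional rates are undefined when a class is empty; report NaN then.
    true_acceptance = accept_correct / n_correct if n_correct else math.nan
    false_acceptance = accept_wrong / n_wrong if n_wrong else math.nan

    return VerifierStats(
        verifier_error=verifier_error,
        acceptance_rate=acceptance_rate,
        generator_accuracy=generator_accuracy,
        agreement_bias=acceptance_rate - generator_accuracy,
        false_acceptance_rate=false_acceptance,
        discrimination=true_acceptance - false_acceptance,
    )


if __name__ == "__main__":
    # Toy data: 6 correct answers all accepted, 4 wrong answers of which 3 are accepted.
    toy = [(True, True)] * 6 + [(False, True)] * 3 + [(False, False)]
    print(summarize_verifier(toy))
```

Under these assumed definitions, a mirage-like regime would show up as high `verifier_error` together with high `agreement_bias`, with most of the error coming from `false_acceptance_rate` rather than from rejected correct answers.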