Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of small vision-language models to generate plausible but incorrect answers in medical visual question answering, which undermines clinical reliability. It extends game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to open-ended medical VQA, and proposes a semantically aware Wasserstein stopping criterion that replaces lexical order matching, so that decoding converges once near-synonymous candidate answers reach semantic consensus. This avoids the extra iterations caused by ranking swaps among clinically equivalent answers. On VQA-RAD and PathVQA, the method yields consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, Qwen3-VL-2B improves by 3.5 percentage points (p < 0.01) and surpasses the greedy 4B model. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%.

📝 Abstract

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.
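
The contrast between order matching and a Wasserstein-style stopping rule can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the criterion compares two probability distributions over the same candidate answers (the abstract does not say which two), and it replaces a learned semantic distance with a deliberately simple ground cost: 0 between answers in the same hand-labelled synonym cluster and 1 otherwise. Under that cost the Wasserstein-1 distance reduces to the total variation distance between cluster masses. The function names, the example answers, and the tolerance `tol=0.05` are all hypothetical.

```python
from collections import defaultdict


def cluster_mass(probs, cluster_of):
    """Sum the probability of all answers that share a semantic cluster."""
    mass = defaultdict(float)
    for answer, p in probs.items():
        mass[cluster_of[answer]] += p
    return dict(mass)


def wasserstein_clustered(p, q, cluster_of):
    """Wasserstein-1 distance under a 0/1 ground cost (assumed for this sketch).

    Moving mass inside a cluster costs 0 and across clusters costs 1, so the
    optimal transport cost equals the total variation distance between the
    per-cluster masses of p and q.
    """
    mp, mq = cluster_mass(p, cluster_of), cluster_mass(q, cluster_of)
    clusters = set(mp) | set(mq)
    return 0.5 * sum(abs(mp.get(c, 0.0) - mq.get(c, 0.0)) for c in clusters)


def order_match(p, q):
    """Classic rule: stop only when both rankings are identical."""
    def rank(d):
        return sorted(d, key=d.get, reverse=True)
    return rank(p) == rank(q)


def wasserstein_stop(p, q, cluster_of, tol=0.05):
    """Semantic rule: stop when the distributions are close in W1 (tol is assumed)."""
    return wasserstein_clustered(p, q, cluster_of) <= tol


# Hypothetical candidates: the first two are clinically equivalent.
cluster_of = {"cardiomegaly": "A", "enlarged heart": "A", "normal heart size": "B"}
p = {"cardiomegaly": 0.45, "enlarged heart": 0.40, "normal heart size": 0.15}
q = {"cardiomegaly": 0.40, "enlarged heart": 0.46, "normal heart size": 0.14}
r = {"cardiomegaly": 0.20, "enlarged heart": 0.10, "normal heart size": 0.70}

print(order_match(p, q), wasserstein_stop(p, q, cluster_of))
print(order_match(p, r), wasserstein_stop(p, r, cluster_of))
```

Here `p` and `q` swap the top two answers, so order matching would keep iterating, while the clustered distance is about 0.01 and the semantic rule stops. Against `r`, which shifts mass to a clinically different answer, both rules continue. A closer approximation of the described method would presumably use embedding-based costs and a general optimal transport solver in place of the 0/1 cluster cost.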

Problem

Research questions and friction points this paper is trying to address.

Medical Visual Question Answering

vision-language models

hallucination

reliable generation

small language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wasserstein equilibrium decoding

Medical Visual Question Answering

vision-language models