🤖 AI Summary
This work addresses the tendency of existing medical visual question answering (VQA) models to rely on language priors or dataset biases to produce "shortcut" answers, often neglecting critical visual evidence and thereby compromising clinical reliability. To mitigate this issue, the authors propose InViC, a lightweight plug-in framework that guides the model to focus on image regions aligned with the question intent through question-aware visual cue extraction. InViC integrates cue token extraction (CTE), an attention-mask bottleneck mechanism, and LoRA fine-tuning within a two-stage training strategy. This approach compels the model to generate answers based on compressed, intent-relevant visual cues, effectively suppressing shortcut behaviors that ignore visual input. Evaluated on the VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019 benchmarks, InViC substantially outperforms both zero-shot inference and standard LoRA fine-tuning, enhancing both accuracy and trustworthiness in medical VQA.
📝 Abstract
Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote attention to intent-aligned visual evidence. To discourage the model from bypassing visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that combining intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.
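The two core ideas of the abstract can be sketched concretely. Below is a minimal, illustrative NumPy sketch (not the authors' implementation) of (a) a CTE-style step, where K learnable query vectors, conditioned on the question embedding, cross-attend over dense visual tokens to yield K compact cue tokens, and (b) a Stage-I attention mask in which text tokens may attend to cue tokens but not to raw visual tokens. All function names, shapes, and the additive question-conditioning scheme are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_cue_tokens(visual_tokens, question_emb, cue_queries):
    """Illustrative CTE step.
    visual_tokens: (N, d) dense visual tokens
    question_emb:  (d,)   pooled question embedding
    cue_queries:   (K, d) learnable queries
    Returns (K, d) question-conditioned cue tokens."""
    # Condition each learnable query on the question intent (simple additive
    # conditioning here; the paper may use a different scheme).
    q = cue_queries + question_emb                                   # (K, d)
    scores = q @ visual_tokens.T / np.sqrt(visual_tokens.shape[1])   # (K, N)
    attn = softmax(scores, axis=-1)
    return attn @ visual_tokens                                      # (K, d)

def stage1_attention_mask(n_vis, n_cue, n_text):
    """Boolean mask (True = attention allowed) over the token sequence
    [visual | cue | text], illustrating the Stage-I cue bottleneck:
    text tokens see cue tokens and earlier text, but NOT raw visual
    tokens; cue tokens may still attend to the visual tokens."""
    n = n_vis + n_cue + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))   # standard causal baseline
    mask[n_vis + n_cue:, :n_vis] = False          # block text -> raw visual
    return mask
```

In Stage II, dropping the blocking line restores standard causal attention, so the decoder can jointly exploit raw visual tokens and the cue tokens, matching the two-stage schedule described above.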