Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

📅 2024-07-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical large vision-language models (MLVLMs) suffer from two critical limitations in pathological visual question answering (VQA): severe hallucination and poor long-tail pathology recognition. To address these, we propose two plug-and-play prompting strategies requiring no model fine-tuning: (1) pathology-semantic explanation prompting, which explicitly injects anatomy–pathology relational knowledge to constrain reasoning; and (2) weak-learner decision injection prompting, which distills confidence-based predictions from a lightweight classifier into structured textual tokens and fuses them into the LLM's input. Evaluated on MIMIC-CXR-JPG and CheXpert, the method improves the diagnostic F1 score by up to 0.27. It also generalizes to generic LVLMs: under the POPE evaluation protocol, it suppresses false negative predictions and raises recall by approximately 0.07, improving diagnostic robustness.

📝 Abstract
Large Vision-Language Models (LVLMs) have achieved significant success in recent years, and they have been extended to the medical domain. Although demonstrating satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and CheXpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Based on POPE metrics, our approach effectively suppresses the false negative predictions of existing LVLMs and improves recall by approximately 0.07.
Problem

Research questions and friction points this paper is trying to address.

Reduce hallucination in Medical LVLMs for accurate pathology diagnosis.
Address imbalanced training data to improve learning of minority pathologies.
Enhance Visual Question Answering performance in medical diagnostics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detailed pathology explanations in the prompt reduce hallucination.
A cheap weak learner, fine-tuned for a target metric, provides textual judgments that guide the MLVLM.
Together, these prompting strategies improve diagnostic F1 and recall without fine-tuning the MLVLM.
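The two strategies above can be sketched as simple prompt construction. The following is a minimal illustrative sketch, not the authors' actual implementation: the explanation text, prompt template, and function names are all hypothetical, and the weak learner's output is assumed to be a binary judgment with a confidence score.

```python
# Hypothetical sketch of the paper's two prompting strategies.
# Templates and the explanation text are illustrative, not the authors' exact prompts.

# Strategy 1: a dictionary of detailed pathology explanations injected into the prompt.
PATHOLOGY_EXPLANATIONS = {
    "atelectasis": (
        "Atelectasis is a partial or complete collapse of lung tissue, typically "
        "appearing as increased opacity with volume loss on a chest X-ray."
    ),
}

def build_prompt(pathology: str, weak_learner_positive: bool, confidence: float) -> str:
    """Compose a VQA prompt that (1) explains the queried pathology and
    (2) textually injects a weak learner's judgment and confidence."""
    explanation = PATHOLOGY_EXPLANATIONS.get(pathology, "")
    # Strategy 2: render the weak learner's decision as text for the MLVLM.
    judgment = "present" if weak_learner_positive else "absent"
    return (
        f"Definition: {explanation}\n"
        f"A specialized classifier predicts that {pathology} is {judgment} "
        f"(confidence {confidence:.2f}).\n"
        f"Question: Considering the image and the information above, "
        f"does this chest X-ray show {pathology}? Answer yes or no."
    )

prompt = build_prompt("atelectasis", True, 0.91)
```

The resulting string would be paired with the image and sent to the MLVLM; because both strategies operate purely on the textual input, no weights of the MLVLM are modified.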
Danfeng Guo
Computer Science Department, University of California, Los Angeles, CA, USA
D. Terzopoulos
Computer Science Department, University of California, Los Angeles, CA, USA; VoxelCloud, Inc., Los Angeles, CA, USA