🤖 AI Summary
State-of-the-art large multimodal models (LMMs) exhibit deceptively high performance on medical visual question answering (Med-VQA) benchmarks but show poor robustness on fine-grained clinical diagnosis: GPT-4o, GPT-4V, and Gemini Pro achieve below-chance accuracy (<50%) on specialized diagnostic questions, while LLaVA-Med struggles even on more general ones. Method: We propose ProbMed, a probing evaluation framework tailored to medical diagnosis. Contribution/Results: (1) It pairs original questions with negation questions containing hallucinated attributes to expose logical inconsistencies in model answers; (2) it introduces procedural diagnosis, requiring reasoning across multiple diagnostic dimensions per image (modality recognition, organ identification, clinical findings, abnormalities, and positional grounding) for structured, multi-step assessment; (3) results from CheXagent show that expertise transfers across different modalities of the same organ, empirically validating the continued importance of specialized domain knowledge. Experiments demonstrate systematic failure of mainstream LMMs on fine-grained diagnostic questions, underscoring the need for more robust Med-VQA evaluation.
📝 Abstract
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that, when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Specifically, probing evaluation pairs each original question with a negation question containing hallucinated attributes, while procedural diagnosis requires reasoning across multiple diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models such as GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Moreover, models like LLaVA-Med struggle even with more general questions, while results from CheXagent demonstrate that expertise can transfer across different modalities of the same organ, showing that specialized domain knowledge remains crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields such as medical diagnosis, where current models remain far from applicable.
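The probing evaluation described above can be sketched as a paired-scoring loop: a model earns credit on an image only if it answers both the original question and its adversarial negation probe correctly, which prevents a model from scoring well by always agreeing. This is a minimal illustrative sketch, not the authors' actual implementation; all names (`QAPair`, `paired_accuracy`) and the example questions are hypothetical.

```python
# Hedged sketch of ProbMed-style probing evaluation: each original
# yes/no question is paired with a "negation" probe containing a
# hallucinated attribute, and credit is given only when BOTH
# questions in the pair are answered correctly.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    original_q: str    # e.g. "Is there an effusion in this chest X-ray?"
    original_ans: str  # ground-truth answer: "yes" or "no"
    negation_q: str    # probe with a hallucinated attribute
    negation_ans: str  # ground truth for the probe (typically "no")


def paired_accuracy(model: Callable[[str], str], pairs: List[QAPair]) -> float:
    """Fraction of pairs where both the original and the probe are correct."""
    correct = sum(
        1
        for p in pairs
        if model(p.original_q).strip().lower() == p.original_ans
        and model(p.negation_q).strip().lower() == p.negation_ans
    )
    return correct / len(pairs) if pairs else 0.0


# A degenerate model that always answers "yes" looks perfect on
# positive originals alone, but paired scoring exposes it.
always_yes = lambda q: "yes"
pairs = [
    QAPair("Is there an effusion?", "yes",
           "Is there a fracture in this image?", "no"),
]
print(paired_accuracy(always_yes, pairs))  # 0.0 under paired scoring
```

Scoring the pair jointly is what makes the probe informative: a hallucination-prone model can match a positive ground truth by chance, but it cannot simultaneously reject an attribute that is not in the image.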