🤖 AI Summary
Existing large vision-language models (LVLMs) for medical applications are predominantly evaluated with simplistic visual question answering (VQA) benchmarks, overlooking critical clinical competencies required in radiology. Method: We introduce RadVUQA, the first multidimensional, fine-grained evaluation benchmark designed specifically for radiology, which systematically assesses models across five clinically relevant dimensions: anatomical understanding, multimodal comprehension, quantitative and spatial reasoning, physiological knowledge, and robustness. RadVUQA comprises a curated multimodal dataset grounded in real clinical imaging and expert-annotated textual reports, and incorporates structured evaluation metrics, adversarial testing, and cross-modal alignment analysis. Contribution/Results: Extensive experiments reveal substantial deficiencies in current general-purpose and medical-specialized LVLMs, particularly in multimodal comprehension and quantitative reasoning, indicating a significant gap between current capabilities and clinical deployment readiness. The code is publicly released to foster reproducible, clinically grounded LVLM evaluation.
📝 Abstract
Recent advances in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, attracting significant attention in the AI community. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments concentrate on simple Visual Question Answering (VQA) over multi-modality data, while ignoring the deeper characteristics of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models' ability to visually identify biological structures; 2) Multimodal comprehension, the capability to interpret linguistic and visual instructions and produce the desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models' spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models' comprehension of the functions and mechanisms of organs and systems; and 5) Robustness, assessing the models' capabilities on unharmonized and synthetic data. The results indicate that both generalized and medical-specific LVLMs have critical deficiencies, with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal a large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code is available at https://github.com/Nandayang/RadVUQA.