Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

📅 2026-04-30
📈 Citations: 0
Influential: 0
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
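The two headline metrics in the abstract, top-3 diagnostic accuracy and triage sensitivity, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy cases, and above all the case-insensitive exact string match between predicted and reference diagnoses are assumptions made for illustration. The abstract does not state how predictions were matched to reference diagnoses.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose reference diagnosis appears in the first k predictions.

    Assumption: a hit is a case-insensitive exact string match. A real
    evaluation would likely need synonym or ontology-aware matching.
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("one ranked prediction list is required per reference")
    if not references:
        return 0.0
    hits = 0
    for preds, ref in zip(ranked_predictions, references):
        top_k = {p.strip().lower() for p in preds[:k]}
        if ref.strip().lower() in top_k:
            hits += 1
    return hits / len(references)


def sensitivity(predicted_urgent: Sequence[bool], actual_urgent: Sequence[bool]) -> float:
    """True positives / (true positives + false negatives) for the urgent class."""
    if len(predicted_urgent) != len(actual_urgent):
        raise ValueError("prediction and label sequences must have equal length")
    true_pos = sum(1 for p, a in zip(predicted_urgent, actual_urgent) if p and a)
    false_neg = sum(1 for p, a in zip(predicted_urgent, actual_urgent) if a and not p)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Hypothetical toy cases, not data from the paper.
toy_predictions = [
    ["eczema", "psoriasis", "tinea"],
    ["melanoma", "nevus", "basal cell carcinoma"],
    ["acne", "rosacea", "lupus"],
    ["urticaria", "scabies", "eczema"],
]
toy_references = ["psoriasis", "squamous cell carcinoma", "Acne", "eczema"]

print(top_k_accuracy(toy_predictions, toy_references, k=3))
print(sensitivity([True, False, True, False], [True, True, False, False]))
```

Under this definition, a case counts as correct at top-3 even when the reference diagnosis is ranked third, which is why top-3 figures are always at least as high as top-1 figures on the same cases.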
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs
Clinical Dermatology
Real-World Evaluation
Diagnostic Accuracy
Severity Triage
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language models
real-world evaluation
clinical dermatology
diagnostic accuracy
severity-based triage
Roy Jiang
Department of Dermatology, Yale School of Medicine, Yale University, New Haven, CT, USA
Hyunjae Kim
Yale University
Natural Language Processing, Biomedical Informatics, Healthcare
Zhenyue Qin
Department of Bioinformatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
Morten Lee
Yale School of Medicine, Yale University, New Haven, CT, USA
Margaret MacGibeny
Department of Dermatology, Yale School of Medicine, Yale University, New Haven, CT, USA
Ailish Hanly
Department of Dermatology, Yale School of Medicine, Yale University, New Haven, CT, USA
Angela Sadlowski
Yale School of Medicine, Yale University, New Haven, CT, USA
Shanin Chowdhury
Yale School of Medicine, Yale University, New Haven, CT, USA
Xuguang Ai
Biomedical Informatics & Data Science, Yale University
AI in Healthcare, Data Science, NLP, Biomedical Informatics
Jeffrey Gehlhausen
Department of Dermatology, Yale School of Medicine, Yale University, New Haven, CT, USA
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis