Towards Trustworthy Dermatology MLLMs: A Benchmark and Multimodal Evaluator for Diagnostic Narratives

📅 2025-11-12
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Multimodal large language models (MLLMs) lack reliable, scalable automatic evaluation methods for dermatological diagnostic report generation. Method: We introduce DermBench—the first real-world clinical image–report benchmark (4,000 cases)—and propose DermEval, a reference-free multimodal evaluation model. DermEval pioneers a clinically grounded, fine-grained scoring framework integrating structured critique with dimension-wise scoring, enabling fully automated, reproducible assessment without human-written references. Contribution/Results: On a 4,500-case test set, DermEval achieves a mean absolute deviation of only 0.117 from expert ratings (on a 5-point scale), significantly outperforming existing approaches. This work establishes the first integration of an expert-curated clinical benchmark with multidimensional automatic evaluation, substantially enhancing consistency, interpretability, and clinical credibility in diagnostic report assessment.

📝 Abstract
Multimodal large language models (MLLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique along with an overall score and per-dimension ratings. This capability enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse dataset of 4,500 cases demonstrate that DermBench and DermEval achieve close alignment with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, providing reliable measurement of diagnostic ability and trustworthiness across different multimodal LLMs.
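As a reading aid, the sketch below shows one way the mean-deviation figures quoted above (0.251 for the DermBench judge, 0.117 for DermEval) can be computed. The function, variable names, and sample scores are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: mean absolute deviation between automatic and expert ratings
# on a 5-point scale. The sample scores below are invented for illustration.

def mean_absolute_deviation(auto_scores, expert_scores):
    """Average absolute gap between automatic and expert per-case ratings."""
    assert len(auto_scores) == len(expert_scores) and auto_scores
    return sum(abs(a - e) for a, e in zip(auto_scores, expert_scores)) / len(auto_scores)

# Hypothetical per-case overall scores (1-5) from an evaluator and from experts.
expert_ratings = [4.0, 3.5, 5.0, 2.0]
evaluator_scores = [4.0, 3.5, 4.5, 2.5]
print(mean_absolute_deviation(evaluator_scores, expert_ratings))  # 0.25
```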
Problem

Research questions and friction points this paper is trying to address.

Reliably evaluating dermatology diagnostic narratives generated by multimodal LLMs
Consistently assessing the clinical trustworthiness of AI-generated dermatology reports
Measuring diagnostic accuracy and bias in AI-generated skin-condition narratives
Innovation

Methods, ideas, or system contributions that make the work stand out.

DermBench, a benchmark pairing 4,000 real-world dermatology images with expert-certified diagnostic narratives
DermEval, a reference-free multimodal evaluator producing structured critiques with overall and per-dimension scores (illustrated in the sketch after this list)
A combined framework enabling reproducible, clinically grounded assessment of multimodal models
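To make the evaluator's output concrete, here is a hypothetical sketch of the structured critique described in the abstract: a free-text critique, an overall score, and per-dimension ratings. The dimension names and the data layout are assumptions for illustration only; the paper's actual schema is not reproduced here.

```python
# Hypothetical output shape for a reference-free evaluator such as DermEval:
# a structured critique plus overall and per-dimension scores.
# Dimension names below are assumed, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class EvaluationResult:
    critique: str                                          # free-text rationale for the scores
    overall_score: float                                   # overall quality on a 1-5 scale
    dimension_scores: dict = field(default_factory=dict)   # per-dimension 1-5 ratings

example = EvaluationResult(
    critique="Lesion morphology described accurately; differential omits one plausible diagnosis.",
    overall_score=3.5,
    dimension_scores={
        "diagnostic_accuracy": 3.0,        # assumed dimension
        "description_completeness": 4.0,   # assumed dimension
        "clinical_safety": 3.5,            # assumed dimension
    },
)
print(example.overall_score, example.dimension_scores["diagnostic_accuracy"])
```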
Yuhao Shen
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Jiahe Qian
Institute of Automation, Chinese Academy of Sciences
Shuping Zhang
Department of Dermatology, The First Affiliated Hospital, Shantou University Medical College
Zhangtianyi Chen
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Tao Lu
Department of Dermatology, The First Affiliated Hospital, Shantou University Medical College
Juexiao Zhou
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
AI for Healthcare · Ethical AI · Bioinformatics · Privacy · AGI