🤖 AI Summary
Multimodal large language models (MLLMs) lack reliable, scalable automatic evaluation methods for dermatological diagnostic report generation. Method: We introduce DermBench, the first real-world clinical image-report benchmark for this task (4,000 cases), and propose DermEval, a reference-free multimodal evaluation model. DermEval applies a clinically grounded, fine-grained scoring framework that integrates structured critique with dimension-wise scoring, enabling fully automated, reproducible assessment without human-written references. Contribution/Results: On a 4,500-case test set, DermEval achieves a mean absolute deviation of only 0.117 from expert ratings on a 5-point scale, outperforming existing approaches. To our knowledge, this is the first work to pair an expert-curated clinical benchmark with multidimensional automatic evaluation, improving the consistency, interpretability, and clinical credibility of diagnostic report assessment.
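To make the headline alignment number concrete, here is a minimal sketch of how a mean absolute deviation between automatic scores and expert ratings would be computed; the toy values below are illustrative assumptions, not data from the paper.

```python
# Minimal sketch of the alignment metric: mean absolute deviation (MAD)
# between an automatic evaluator's scores and expert ratings on a 5-point
# scale. The numbers below are made up for illustration only.
expert_ratings = [4.0, 3.5, 5.0, 2.0, 4.5]
model_scores   = [4.2, 3.4, 4.8, 2.3, 4.4]

mad = sum(abs(m - e) for m, e in zip(model_scores, expert_ratings)) / len(expert_ratings)
print(f"MAD: {mad:.3f}")  # lower means closer agreement with experts
```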
📝 Abstract
Multimodal large language models (MLLMs) are increasingly used to generate dermatology diagnostic narratives directly from images, yet reliable evaluation remains the primary bottleneck to responsible clinical deployment. We introduce an evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. DermBench pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator: given an image and a generated narrative, it produces a structured critique together with an overall score and per-dimension ratings. This enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse set of 4,500 cases show that DermBench and DermEval align closely with expert ratings, with mean deviations of 0.251 and 0.117 on a 5-point scale, respectively, providing reliable measurement of the diagnostic ability and trustworthiness of different multimodal LLMs.
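As an illustration of the structured, per-dimension output described above, the following sketch shows one plausible shape for an evaluator's result; the class name, field names, and dimension labels are our assumptions, since the abstract does not specify an exact schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a DermEval-style result: a structured critique,
# an overall score, and per-dimension ratings on a 5-point scale.
# All names and dimension labels here are illustrative assumptions.
@dataclass
class EvalResult:
    critique: str                    # structured natural-language critique
    overall_score: float             # overall quality, 1-5
    dimension_scores: dict[str, float] = field(default_factory=dict)

result = EvalResult(
    critique="Morphology described accurately; differential diagnosis is incomplete.",
    overall_score=3.5,
    dimension_scores={"diagnostic_accuracy": 3.0, "completeness": 3.5, "clarity": 4.0},
)
print(result.overall_score, result.dimension_scores)
```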