🤖 AI Summary
Multimodal large language models (MLLMs) lack reliable, scalable automatic evaluation methods for dermatological diagnostic report generation. Method: We introduce DermBench, the first real-world clinical image-report benchmark for this task (4,000 cases), and propose DermEval, a reference-free multimodal evaluation model. DermEval applies a clinically grounded, fine-grained scoring framework that integrates structured critique with dimension-wise scoring, enabling fully automated, reproducible assessment without human-written references. Contribution/Results: On a 4,500-case test set, DermEval achieves a mean absolute deviation of only 0.117 from expert ratings on a 5-point scale, outperforming existing approaches. To our knowledge, this is the first work to pair an expert-curated clinical benchmark with multidimensional automatic evaluation, improving the consistency, interpretability, and clinical credibility of diagnostic report assessment.
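To make the headline alignment number concrete, here is a minimal sketch of how a mean absolute deviation between automatic scores and expert ratings would be computed; the toy values below are illustrative assumptions, not data from the paper.

```python
# Minimal sketch of the alignment metric: mean absolute deviation (MAD)
# between an automatic evaluator's scores and expert ratings on a 5-point
# scale. The numbers below are made up for illustration only.
expert_ratings = [4.0, 3.5, 5.0, 2.0, 4.5]
model_scores   = [4.2, 3.4, 4.8, 2.3, 4.4]

mad = sum(abs(m - e) for m, e in zip(model_scores, expert_ratings)) / len(expert_ratings)
print(f"MAD: {mad:.3f}")  # lower means closer agreement with experts
```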
📝 Abstract
Multimodal large language models (MLLMs) are increasingly used to generate dermatology diagnostic narratives directly from images, yet reliable evaluation remains the primary bottleneck to responsible clinical deployment. We introduce an evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. DermBench pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator: given an image and a generated narrative, it produces a structured critique together with an overall score and per-dimension ratings. This enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse set of 4,500 cases show that DermBench and DermEval align closely with expert ratings, with mean deviations of 0.251 and 0.117 on a 5-point scale, respectively, providing reliable measurement of the diagnostic ability and trustworthiness of different multimodal LLMs.
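As an illustration of the structured, per-dimension output described above, the following sketch shows one plausible shape for an evaluator's result; the class name, field names, and dimension labels are our assumptions, since the abstract does not specify an exact schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a DermEval-style result: a structured critique,
# an overall score, and per-dimension ratings on a 5-point scale.
# All names and dimension labels here are illustrative assumptions.
@dataclass
class EvalResult:
    critique: str                    # structured natural-language critique
    overall_score: float             # overall quality, 1-5
    dimension_scores: dict[str, float] = field(default_factory=dict)

result = EvalResult(
    critique="Morphology described accurately; differential diagnosis is incomplete.",
    overall_score=3.5,
    dimension_scores={"diagnostic_accuracy": 3.0, "completeness": 3.5, "clarity": 4.0},
)
print(result.overall_score, result.dimension_scores)
```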