🤖 AI Summary
This study addresses the challenges that existing medical multimodal large language models (MLLMs) face in rare-disease scenarios: data scarcity, insufficient prior knowledge, and the absence of benchmarks that evaluate multimodal, multi-image clinical reasoning. To bridge this gap, the authors construct MMRareBench, to their knowledge the first multimodal, multi-image evaluation benchmark for rare diseases, covering four workflow-aligned clinical tasks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. Built from 1,756 question-answer pairs and 7,958 associated medical images curated from PMC case reports, the benchmark incorporates Orphanet-anchored ontology alignment, evidence-grounded annotations, track-specific leakage control, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance. Medical-domain models trail general-purpose models substantially on multi-image tasks despite competitive diagnostic scores, a pattern consistent with a “capacity dilution effect” in which medical fine-tuning narrows the diagnostic gap but may erode multi-image evidence integration.
📝 Abstract
Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.