🤖 AI Summary
This study addresses the lack of open, multicenter benchmarks for evaluating how well AI assistants perform in routine digital pathology diagnostics. The authors introduce DALPHIN, the first open multicentric visual question answering (VQA) benchmark for pathology AI copilots, comprising 1,236 images from 300 cases that span 130 diagnoses, 14 subspecialties, and six countries, together with a human performance benchmark from 31 pathologists in 10 countries. The study evaluates two general-purpose models (GPT-5 and Gemini 2.5 Pro) and the pathology-specific PathChat+ under both sequential and independent answer generation. PathChat+ shows no statistically significant difference from expert-level performance in four of six tasks, compared with two of six for Gemini 2.5 Pro and one of six for GPT-5. The benchmark is publicly released with sequestered, indirectly accessible ground truth, and the data, methods, and evaluation platform are available to support future research on pathology AI copilots.
📝 Abstract
Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1,236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general-purpose copilots (GPT-5, Gemini 2.5 Pro) and one pathology-specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert-level performance in four of six tasks for PathChat+, two of six tasks for Gemini 2.5 Pro, and one of six tasks for GPT-5. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand-challenge.org.