🤖 AI Summary
Existing medical benchmarks inadequately assess large language models’ expert-level diagnostic capabilities in complex clinical scenarios, particularly lacking rigorous evaluation of causal reasoning, differential diagnosis, and subspecialty depth. To address this gap, we introduce DiagnosisArena—the first stringent, clinically grounded benchmark comprising 1,113 real-world case–diagnosis pairs spanning 28 medical specialties, curated from 10 top-tier peer-reviewed journals and validated through multiple rounds of AI-assisted and expert human review, with thorough checks to prevent data leakage. We further propose a standardized framework for evaluating diagnostic accuracy. Empirical results reveal severe limitations: state-of-the-art reasoning models—including o3-mini (45.82%), o1 (31.09%), and DeepSeek-R1 (17.79%)—fall far short of expert-level performance, exposing fundamental deficits in advanced clinical reasoning. This work establishes a rigorous new standard for evaluating diagnostic competence in medical AI.
📝 Abstract
The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to systematically benchmark the diagnostic capabilities of current models. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.
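The reported numbers are per-case accuracies over the case–diagnosis pairs. A minimal sketch of such a scoring loop is shown below; the `normalize` step and the exact-match criterion are illustrative assumptions for this sketch, not the paper's actual evaluation framework (which may use model- or expert-based judging of diagnostic equivalence):

```python
# Hypothetical sketch: accuracy over case–diagnosis pairs.
# The normalization and exact-match criterion are assumptions,
# not DiagnosisArena's actual scoring protocol.

def normalize(diagnosis: str) -> str:
    """Lowercase, drop periods, and collapse whitespace for a naive match."""
    return " ".join(diagnosis.lower().replace(".", "").split())

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of cases whose predicted diagnosis matches the reference."""
    assert len(predictions) == len(references), "one prediction per case"
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

if __name__ == "__main__":
    preds = ["Acute Pancreatitis", "lupus nephritis", "sarcoidosis"]
    refs = ["acute pancreatitis", "IgA nephropathy", "Sarcoidosis."]
    print(f"accuracy = {accuracy(preds, refs):.4f}")
```

In practice, exact string matching undercounts correct diagnoses phrased differently (e.g. synonyms or differing specificity), which is why benchmarks of this kind typically pair such a loop with a stronger equivalence judge.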