🤖 AI Summary
Existing benchmarks struggle to evaluate how well large language models synthesize medical evidence across multiple steps and generate expert-level clinical guidelines. To address this gap, this work introduces MedProbeBench, the first benchmark that uses high-quality clinical guidelines as expert-level references. Its evaluation framework, MedProbe-Eval, combines more than 1,200 task-adaptive rubric criteria for holistic quality assessment with fine-grained evidence verification grounded in more than 5,130 atomic claims. An evaluation of 17 LLMs and deep research agents shows that current systems still fall well short of expert performance in evidence integration and guideline generation.
📝 Abstract
Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench
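As a rough illustration of the two evaluation components named above, the sketch below shows one way rubric criteria and atomic claims could be aggregated into scores. It is not the MedProbe-Eval implementation: the class names, the per-criterion weights, the weighted-average aggregation, and the definition of precision as the fraction of supported claims are all assumptions made for this example, and the `satisfied` and `supported` judgments are taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricCriterion:
    """One rubric criterion (hypothetical structure, not from the paper)."""
    description: str
    weight: float    # assumed importance weight
    satisfied: bool  # assumed to come from a human or LLM grader


@dataclass
class AtomicClaim:
    """One atomic claim extracted from a generated guideline (hypothetical)."""
    text: str
    supported: bool  # assumed verdict: does the cited evidence back the claim?


def rubric_score(criteria: List[RubricCriterion]) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return sum(c.weight for c in criteria if c.satisfied) / total


def evidence_precision(claims: List[AtomicClaim]) -> float:
    """Fraction of atomic claims whose cited evidence supports them."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.supported) / len(claims)


if __name__ == "__main__":
    criteria = [
        RubricCriterion("Covers diagnostic criteria", 2.0, True),
        RubricCriterion("States treatment recommendations", 1.0, False),
        RubricCriterion("Maintains neutral tone", 1.0, True),
    ]
    claims = [
        AtomicClaim("Claim A", True),
        AtomicClaim("Claim B", True),
        AtomicClaim("Claim C", False),
        AtomicClaim("Claim D", True),
    ]
    print(f"rubric score: {rubric_score(criteria):.2f}")
    print(f"evidence precision: {evidence_precision(claims):.2f}")
```

Keeping the two scores separate mirrors the abstract's split between holistic quality and evidence verification: a report can read well against the rubric while still citing sources that do not support its claims.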