MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks struggle to evaluate how well large language models integrate evidence across multiple steps and generate expert-level clinical guidelines. To address this gap, this work introduces MedProbeBench, the first benchmark that uses high-quality clinical guidelines as expert-level references. Its evaluation framework, MedProbe-Eval, combines holistic rubrics comprising more than 1,200 task-adaptive criteria with fine-grained evidence verification grounded in more than 5,130 atomic claims. An evaluation of 17 LLMs and deep research agents shows critical gaps in evidence integration and guideline generation, leaving current systems well short of expert-level guideline development.

📝 Abstract

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench
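
To make the two evaluation components concrete, the following is a minimal sketch of how a rubric score and an evidence-precision score could be computed. It is not the MedProbe-Eval implementation: the class names, the per-criterion weights, the binary satisfied/supported judgments, and the aggregation formulas are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    """One task-adaptive rubric criterion (hypothetical structure)."""
    description: str
    weight: float      # assumed: each criterion carries a weight
    satisfied: bool    # assumed: a judge marks each criterion pass/fail


@dataclass
class AtomicClaim:
    """One atomic claim extracted from a generated guideline (hypothetical)."""
    text: str
    supported: bool    # assumed: True if the cited evidence backs the claim


def rubric_score(criteria: list[RubricCriterion]) -> float:
    """Weighted fraction of satisfied criteria; 0.0 if there is no weight."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return sum(c.weight for c in criteria if c.satisfied) / total


def evidence_precision(claims: list[AtomicClaim]) -> float:
    """Fraction of atomic claims judged supported; 0.0 for an empty list."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.supported) / len(claims)


# Toy data, invented purely to exercise the functions.
criteria = [
    RubricCriterion("States diagnostic criteria", 2.0, True),
    RubricCriterion("Grades strength of recommendation", 1.0, False),
    RubricCriterion("Covers contraindications", 1.0, True),
]
claims = [
    AtomicClaim("Claim A", True),
    AtomicClaim("Claim B", True),
    AtomicClaim("Claim C", False),
    AtomicClaim("Claim D", True),
]
```

Keeping the two scores separate reflects the distinction drawn in the abstract: the rubric score addresses overall guideline quality, while evidence precision addresses whether individual claims are actually backed by their sources.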

Problem

Research questions and friction points this paper is trying to address.

evidence integration

clinical guidelines

benchmarking

expert-level judgment

medical AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

deep evidence integration

clinical guideline benchmarking

expert-level evaluation