MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing benchmarks struggle to evaluate how well large language models synthesize medical evidence across multiple steps and generate expert-level clinical guidelines. To address this gap, this work introduces MedProbeBench, which the authors describe as the first benchmark to use high-quality clinical guidelines as expert-level references. Its evaluation framework, MedProbe-Eval, combines more than 1,200 task-adaptive rubric criteria with fine-grained evidence verification grounded in more than 5,130 atomic claims. An evaluation of 17 LLMs and deep research agents shows that current systems still fall well short of expert performance in evidence integration and guideline generation.

📝 Abstract
Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench
Problem

Research questions and friction points this paper is trying to address.

evidence integration
clinical guidelines
benchmarking
expert-level judgment
medical AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

deep evidence integration
clinical guideline benchmarking
expert-level evaluation
fine-grained evidence verification
large language models in medicine
Authors
Jiyao Liu
Fudan University
Low-level Vision, AI for Healthcare/Science, MLLMs, Agent
Jianghan Shen
Nanjing University, Nanjing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Sida Song
Fudan University, Shanghai, China
Tianbin Li
Shanghai Artificial Intelligence Laboratory
Machine Learning, Computer Vision, General Intelligence
Xiaojia Liu
Fudan University, Shanghai, China
Rongbin Li
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Ziyan Huang
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Jiashi Lin
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Junzhi Ning
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Changkai Ji
Fudan University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Siqi Luo
Shanghai Jiao Tong University
AIGC, Computer Vision, Image Editing, AI4Science
Wenjie Li
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Chenglong Ma
Fudan University; Shanghai Innovation Institute
multi-modal models, generative models, medical image analysis
Ming Hu
Monash University | Shanghai AI Laboratory
Jing Xiong
The University of Hong Kong
Natural Language Processing, Automated Theorem Proving
Jin Ye
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Bin Fu
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Ningsheng Xu
Fudan University, Shanghai, China
Yirong Chen
Stanford University
Lei Jin
Fudan University, Shanghai, China
Hong Chen
Professor of computer science, Renmin University of China
Data privacy, Management of Data, Database
Junjun He
Shanghai Jiao Tong University