GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current AI clinician evaluations rely predominantly on multiple-choice questions or manual scoring, which fail to capture the cognitive depth, completeness, robustness, and safety required in real-world clinical decision-making. To address this, we propose GAPS, a four-dimensional (Grounding, Adequacy, Perturbation, Safety) automated assessment framework built on: (1) guideline-grounded, end-to-end generation of high-fidelity clinical test items; (2) a DeepResearch agent that emulates GRADE/PICO evidence appraisal through ReAct-style iterative reasoning; and (3) an LLM-based adjudication panel enabling reproducible, objective scoring. GAPS is the first framework to enable clinically anchored, fully automated, multi-dimensional evaluation of AI clinician capabilities. Experimental results reveal systematic deficiencies across leading models in deep clinical reasoning, answer completeness, and adversarial robustness, highlighting critical gaps that must be closed before safe, reliable clinical deployment.

📝 Abstract
Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
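The abstract's scoring stage, an ensemble of LLM judges grading rubric items, can be sketched minimally as majority voting per rubric item followed by averaging. This is an illustrative sketch only: the function name, the 0/1 verdict encoding, and the sample rubric items are hypothetical, not taken from the paper.

```python
from collections import Counter

def aggregate_judges(verdicts_by_judge):
    """Majority-vote each rubric item across judges, then average.

    verdicts_by_judge: one dict per LLM judge, mapping rubric-item id -> 0/1
    (1 = the answer satisfied that item). Returns the fraction of rubric
    items whose strict-majority verdict is 1.
    """
    items = verdicts_by_judge[0].keys()
    passed = 0
    for item in items:
        votes = Counter(judge[item] for judge in verdicts_by_judge)
        if votes[1] > votes[0]:  # strict majority marks the item satisfied
            passed += 1
    return passed / len(items)

# Three hypothetical judges grading a three-item clinical rubric
judges = [
    {"dose": 1, "contraindication": 1, "follow_up": 0},
    {"dose": 1, "contraindication": 0, "follow_up": 0},
    {"dose": 1, "contraindication": 1, "follow_up": 1},
]
print(aggregate_judges(judges))  # two of three items carry a majority
```

Majority voting is one simple aggregation choice; a real adjudication panel could equally use mean scores or require unanimity for safety-critical items.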
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI clinicians' cognitive depth and robustness
Assessing answer completeness and safety in clinical AI
Automating benchmark creation to overcome scalability limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline constructs evidence-based clinical benchmarks
Dual graph and tree representations model clinical reasoning
LLM ensemble judges score using synthesized GRADE-aligned rubrics
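The ReAct-style loop mentioned above interleaves model reasoning with tool calls until an answer is produced. The sketch below is a generic, self-contained illustration under stated assumptions: the stubbed policy (`fake_llm`), the `search` tool, and all strings are hypothetical stand-ins, not the DeepResearch agent's actual interface.

```python
def react_loop(question, tools, llm_step, max_steps=5):
    """Minimal ReAct loop: the policy proposes (thought, action, argument);
    tool observations are appended to the trace until it emits 'finish'."""
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = llm_step(trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            trace.append(f"Answer: {arg}")
            return arg, trace
        obs = tools[action](arg)  # e.g., retrieve a guideline snippet
        trace.append(f"Action: {action}[{arg}] -> Observation: {obs}")
    return None, trace

# Stub policy standing in for the LLM, for demonstration only.
def fake_llm(trace):
    if not any(line.startswith("Action:") for line in trace):
        return "Need guideline evidence first", "search", "adjuvant therapy"
    return "Evidence found; emit the rubric item", "finish", "Must cite adjuvant therapy guideline"

tools = {"search": lambda q: f"guideline snippet about {q}"}
answer, trace = react_loop("Synthesize a rubric item", tools, fake_llm)
print(answer)  # -> Must cite adjuvant therapy guideline
```

In the paper's setting, the real policy would be an LLM prompted for GRADE-consistent, PICO-driven appraisal, and the tools would be evidence-retrieval calls rather than a lambda.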
Xiuyuan Chen
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Tao Sun
Ant Group
Dexin Su
Ant Group
Ailing Yu
Ant Group
Junwei Liu
Ant Group
Zhe Chen
Ant Group
Gangzeng Jin
Ant Group
Xin Wang
Ant Group
Jingnan Liu
Ant Group
Hansong Xiao
Ant Group
Hualei Zhou
Ant Group
Dongjie Tao
Ant Group
Chunxiao Guo
Ant Group
Minghui Yang
Ant Group
Yuan Xia
Ant Group
Jing Zhao
Ant Group
Qianrui Fan
Ant Group
Yanyun Wang
MPhil Student, The Hong Kong University of Science and Technology (Guangzhou)
Shuai Zhen
Ant Group
Kezhong Chen
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Jun Wang
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Zewen Sun
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Heng Zhao
The Rockefeller University
Tian Guan
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Shaodong Wang
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Geyun Chang
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Jiaming Deng
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Hongchengcheng Chen
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
K. Feng
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Ruzhen Li
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
J. Geng
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Changtai Zhao
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Guihu Lin
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Peihao Li
Tsinghua University; Tongyi Lab, Alibaba
Liqi Liu
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.
Peng Wei
Ant Group
Jian Wang
Ant Group
Jinjie Gu
Ant Group
Ping Wang
School of Software and Microelectronics, Peking University, Beijing, China.
Fan Yang
Department of Thoracic Surgery, Peking University People’s Hospital, Beijing, China.