🤖 AI Summary
Current AI clinician evaluations rely predominantly on multiple-choice questions or manual scoring, failing to capture the cognitive depth, completeness, robustness, and safety required in real-world clinical decision-making. To address this, we propose GAPS, a framework that evaluates models along four dimensions (Grounding, Adequacy, Perturbation, Safety) through a fully automated pipeline: (1) guideline-grounded, end-to-end generation of high-fidelity clinical test items; (2) a DeepResearch agent that emulates GRADE/PICO evidence appraisal with ReAct-style iterative reasoning to synthesize scoring rubrics; and (3) an LLM-based adjudication panel that enables reproducible, objective scoring. GAPS is the first framework to enable clinically anchored, fully automated, multi-dimensional evaluation of AI clinician capabilities. Experimental results reveal systematic deficiencies across leading models in deep clinical reasoning, answer completeness, and adversarial robustness, highlighting critical gaps that stand in the way of safe, reliable clinical deployment.
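As a rough illustration of the adjudication step, the sketch below shows one way an LLM judge panel could turn per-rubric-item verdicts into per-axis GAPS scores. The names (`RubricItem`, `panel_score`) and the majority-vote-then-average aggregation are assumptions for illustration, not the paper's actual implementation; the keyword-matching stubs stand in for real LLM judge calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class RubricItem:
    axis: str       # one of "G", "A", "P", "S"
    criterion: str  # guideline-derived checkpoint the answer should satisfy

def panel_score(answer: str,
                rubric: list[RubricItem],
                judges: list[Callable[[str, RubricItem], bool]]) -> dict[str, float]:
    """Majority-vote each rubric item across judges, then average per GAPS axis."""
    per_axis: dict[str, list[float]] = {}
    for item in rubric:
        votes = [judge(answer, item) for judge in judges]
        verdict = 1.0 if sum(votes) > len(votes) / 2 else 0.0
        per_axis.setdefault(item.axis, []).append(verdict)
    return {axis: mean(hits) for axis, hits in per_axis.items()}

# Toy usage: keyword-matching stubs stand in for real LLM judge calls.
rubric = [
    RubricItem("G", "recommends guideline first-line therapy"),
    RubricItem("S", "flags the relevant contraindication"),
]
stub_judges = [lambda ans, item: item.criterion.split()[0][:9] in ans.lower()] * 3
print(panel_score("Recommends metformin first-line and flags renal contraindication.",
                  rubric, stub_judges))
```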
📝 Abstract
Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating **G**rounding (cognitive depth), **A**dequacy (answer completeness), **P**erturbation (robustness), and **S**afety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed that our automatically generated questions are of high quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increasing reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
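To make the rubric-synthesis step concrete, here is a minimal, hypothetical sketch of a ReAct-style loop in which an agent alternates reasoning and evidence lookups framed as PICO queries, then emits checkable rubric items. All names (`search_evidence`, `agent_step`, `synthesize_rubric`) and the scripted responses are illustrative assumptions; a real DeepResearch agent would call an LLM and a retrieval backend at those points.

```python
from typing import Optional

def search_evidence(pico_query: str) -> str:
    """Stand-in for retrieval over the guideline's evidence neighborhood."""
    return f"[evidence snippet retrieved for: {pico_query}]"

def agent_step(trace: str) -> str:
    """Stand-in for the DeepResearch agent's LLM call (scripted for this demo)."""
    if "Observation:" in trace:
        return ("FINAL: name the guideline first-line therapy; "
                "state the GRADE strength of the recommendation")
    return ("ACTION: search[P: adults with newly diagnosed T2DM, I: metformin, "
            "C: other oral agents, O: HbA1c reduction]")

def synthesize_rubric(question: str, max_turns: int = 4) -> Optional[list[str]]:
    """ReAct-style loop: think/act, observe retrieved evidence, stop at a final rubric."""
    trace = f"Question: {question}"
    for _ in range(max_turns):
        step = agent_step(trace)
        if step.startswith("FINAL:"):
            # Split the agent's final output into discrete, checkable rubric items.
            return [item.strip() for item in step.removeprefix("FINAL:").split(";")]
        pico = step.removeprefix("ACTION: search[").rstrip("]")
        trace += f"\nAction: {step}\nObservation: {search_evidence(pico)}"
    return None  # budget exhausted without a final rubric

print(synthesize_rubric("What is first-line pharmacotherapy for type 2 diabetes?"))
```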