🤖 AI Summary
Current AI clinician evaluations rely predominantly on multiple-choice questions or manual scoring, failing to capture the cognitive depth, completeness, robustness, and safety required in real-world clinical decision-making. To address this, we propose GAPS, a framework that evaluates models along four dimensions (Grounding, Adequacy, Perturbation, Safety) through a fully automated pipeline: (1) guideline-grounded, end-to-end generation of high-fidelity clinical test items; (2) a DeepResearch agent that emulates GRADE/PICO evidence appraisal with ReAct-style iterative reasoning to synthesize scoring rubrics; and (3) an LLM-based adjudication panel that enables reproducible, objective scoring. GAPS is the first framework to enable clinically anchored, fully automated, multi-dimensional evaluation of AI clinician capabilities. Experimental results reveal systematic deficiencies across leading models in deep clinical reasoning, answer completeness, and adversarial robustness, highlighting critical gaps that stand in the way of safe, reliable clinical deployment.
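As a rough illustration of the adjudication step, the sketch below shows one way an LLM judge panel could turn per-rubric-item verdicts into per-axis GAPS scores. The names (`RubricItem`, `panel_score`) and the majority-vote-then-average aggregation are assumptions for illustration, not the paper's actual implementation; the keyword-matching stubs stand in for real LLM judge calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class RubricItem:
    axis: str       # one of "G", "A", "P", "S"
    criterion: str  # guideline-derived checkpoint the answer should satisfy

def panel_score(answer: str,
                rubric: list[RubricItem],
                judges: list[Callable[[str, RubricItem], bool]]) -> dict[str, float]:
    """Majority-vote each rubric item across judges, then average per GAPS axis."""
    per_axis: dict[str, list[float]] = {}
    for item in rubric:
        votes = [judge(answer, item) for judge in judges]
        verdict = 1.0 if sum(votes) > len(votes) / 2 else 0.0
        per_axis.setdefault(item.axis, []).append(verdict)
    return {axis: mean(hits) for axis, hits in per_axis.items()}

# Toy usage: keyword-matching stubs stand in for real LLM judge calls.
rubric = [
    RubricItem("G", "recommends guideline first-line therapy"),
    RubricItem("S", "flags the relevant contraindication"),
]
stub_judges = [lambda ans, item: item.criterion.split()[0][:9] in ans.lower()] * 3
print(panel_score("Recommends metformin first-line and flags renal contraindication.",
                  rubric, stub_judges))
```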
📝 Abstract
Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating **G**rounding (cognitive depth), **A**dequacy (answer completeness), **P**erturbation (robustness), and **S**afety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed that our automatically generated questions are of high quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increasing reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
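To make the rubric-synthesis step concrete, here is a minimal, hypothetical sketch of a ReAct-style loop in which an agent alternates reasoning and evidence lookups framed as PICO queries, then emits checkable rubric items. All names (`search_evidence`, `agent_step`, `synthesize_rubric`) and the scripted responses are illustrative assumptions; a real DeepResearch agent would call an LLM and a retrieval backend at those points.

```python
from typing import Optional

def search_evidence(pico_query: str) -> str:
    """Stand-in for retrieval over the guideline's evidence neighborhood."""
    return f"[evidence snippet retrieved for: {pico_query}]"

def agent_step(trace: str) -> str:
    """Stand-in for the DeepResearch agent's LLM call (scripted for this demo)."""
    if "Observation:" in trace:
        return ("FINAL: name the guideline first-line therapy; "
                "state the GRADE strength of the recommendation")
    return ("ACTION: search[P: adults with newly diagnosed T2DM, I: metformin, "
            "C: other oral agents, O: HbA1c reduction]")

def synthesize_rubric(question: str, max_turns: int = 4) -> Optional[list[str]]:
    """ReAct-style loop: think/act, observe retrieved evidence, stop at a final rubric."""
    trace = f"Question: {question}"
    for _ in range(max_turns):
        step = agent_step(trace)
        if step.startswith("FINAL:"):
            # Split the agent's final output into discrete, checkable rubric items.
            return [item.strip() for item in step.removeprefix("FINAL:").split(";")]
        pico = step.removeprefix("ACTION: search[").rstrip("]")
        trace += f"\nAction: {step}\nObservation: {search_evidence(pico)}"
    return None  # budget exhausted without a final rubric

print(synthesize_rubric("What is first-line pharmacotherapy for type 2 diabetes?"))
```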