CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

📅 2025-12-10
🤖 AI Summary
Problem: Existing LLM evaluation benchmarks focus primarily on static exams or isolated dialogues and fail to capture models' capabilities in dynamic, longitudinal clinical pathways. Method: We propose CP-Env, a controllable agentic hospital environment that integrates patient-flow simulation, multi-role collaboration, and a configurable clinical pathway engine to enable end-to-end assessment of clinical decision-making. We introduce a three-tiered evaluation framework covering Clinical Efficacy, Process Competency, and Professional Ethics, grounded in multi-agent simulation, state modeling driven by a clinical knowledge graph, and an automated, hierarchical metric system. Results: Experiments reveal that most models struggle as pathway complexity increases, exhibiting hallucinations and losing critical diagnostic details; excessive reasoning steps can sometimes be counterproductive; and top-performing models tend to rely more on internalized knowledge than on external tool invocation. CP-Env moves beyond static benchmarks toward end-to-end, context-aware evaluation of medical LLMs.

📝 Abstract
Medical care follows complex clinical pathways that extend beyond isolated physician-patient encounters, emphasizing decision-making and transitions between stages. Current benchmarks, which focus on static exams or isolated dialogues, inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP-Env, a controllable agentic hospital environment designed to evaluate LLMs across end-to-end clinical pathways. CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios for agent interaction that range from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings. Mirroring the adaptive flow of care in real hospitals, it enables branching, long-horizon task execution. We propose a three-tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to exhibit reduced tool dependency through internalized knowledge. CP-Env advances the development of medical AI agents through comprehensive end-to-end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/CP-Env.
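To make "branching, long-horizon task execution" concrete, here is a minimal Python sketch of a pathway as a small state machine. It is not CP-Env's implementation: the stage names come from the scenarios the abstract lists, while `PatientState`, `next_stage`, `run_pathway`, and the routing rules are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Stage names mirror the scenarios named in the abstract; the identifiers are assumptions.
TRIAGE = "triage"
CONSULT = "specialist_consultation"
TESTING = "diagnostic_testing"
MDT = "multidisciplinary_team_meeting"
DONE = "done"


@dataclass
class PatientState:
    """Hypothetical patient record that drives the branching decisions."""
    complaint: str
    needs_tests: bool = False
    is_complex: bool = False
    visited: list = field(default_factory=list)


def next_stage(stage: str, patient: PatientState) -> str:
    """Illustrative routing rule; the real environment's logic is not specified here."""
    if stage == TRIAGE:
        return CONSULT
    if stage == CONSULT:
        if patient.needs_tests:
            return TESTING
        return MDT if patient.is_complex else DONE
    if stage == TESTING:
        return MDT if patient.is_complex else DONE
    return DONE  # the MDT meeting closes the pathway in this sketch


def run_pathway(patient: PatientState, max_steps: int = 10) -> list:
    """Walk the pathway from triage until it ends or the step budget runs out."""
    stage = TRIAGE
    for _ in range(max_steps):
        if stage == DONE:
            break
        patient.visited.append(stage)
        stage = next_stage(stage, patient)
    return patient.visited


if __name__ == "__main__":
    print(run_pathway(PatientState("cough")))
    print(run_pathway(PatientState("chest pain", needs_tests=True, is_complex=True)))
```

In this sketch a simple case ends after the consultation, while a complex case that needs tests passes through all four stages. An evaluated model would stand in for the physician at each stage, so an error made early (for example, a dropped diagnostic detail) carries into later stages.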
Problem

Research questions and friction points this paper is trying to address.

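Evaluating LLMs on dynamic, end-to-end clinical pathways rather than static exams or isolated dialogues
Simulating a hospital ecosystem in which patient and physician agents interact
Assessing clinical efficacy, process competency, and professional ethics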
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controllable agentic hospital environment for LLM evaluation
Simulates hospital ecosystem with patient and physician agents
Three-tiered evaluation framework covering clinical efficacy, process competency, and professional ethics
Authors

Yakun Zhu
Shanghai Jiao Tong University
Zhongzhen Huang
Shanghai Jiao Tong University
Medical Image Analysis, Vision and Language
Qianhan Feng
The Chinese University of Hong Kong
Linjie Mu
Shanghai Jiao Tong University
Yannian Gu
Shanghai Jiao Tong University
Shaoting Zhang
Shanghai AI Lab; SenseTime Research
Medical Image Analysis, Computer Vision, Foundation Models
Qi Dou
The Chinese University of Hong Kong
Xiaofan Zhang
Shanghai Jiao Tong University