AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

📅 2024-05-13
🏛️ arXiv.org
📈 Citations: 44
Influential: 4
📄 PDF
🤖 AI Summary
Existing clinical benchmarks rely predominantly on static question answering, failing to capture the dynamic, sequential nature of real-world diagnosis and treatment. To address this, the authors introduce a multimodal agent benchmark grounded in authentic clinical scenarios, spanning nine medical specialties and seven languages, that simulates patient interaction, data collection under incomplete information, and iterative tool use to emulate end-to-end clinical workflows. Contributions include: (1) a dynamic clinical simulation environment supporting persistent interaction, collaborative tool use, and cross-case memory; (2) patient-centered evaluation metrics; and (3) validation against real-world EHRs alongside a bias-robustness analysis. Methodologically, the benchmark integrates a multimodal agent architecture, tool-augmented reasoning (note-taking, retrieval, reflection, experiential learning), and cross-lingual clinical knowledge modeling. Experiments reveal that static MedQA substantially overestimates diagnostic ability: in AgentClinic's sequential decision-making format, accuracy can fall to below one tenth of the original; Claude-3.5 agents achieve the best overall performance; and Llama-3 improves by up to 92% (relative) with the persistent note-taking tool, highlighting tool utilization as a critical bottleneck for clinical agents.

📝 Abstract
Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvement with the notebook tool, which allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment first enables.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in dynamic clinical decision-making scenarios
Assessing multimodal tool usage in simulated medical environments
Measuring performance gaps across medical specialties and languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal agent benchmark for clinical simulations
Sequential decision-making with incomplete information
Tool usage evaluation across medical specialties
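The sequential, tool-augmented setup described above can be pictured as a loop in which a doctor agent queries a patient agent turn by turn and writes to a notebook that persists across cases. A minimal sketch follows; all class and function names are illustrative and are not the benchmark's actual API, and the scripted patient replies stand in for an LLM-driven dialogue.

```python
class Notebook:
    """Notes that persist across cases, mimicking the persistent
    notebook tool credited with Llama-3's large relative gains."""
    def __init__(self):
        self.entries = []

    def write(self, note):
        self.entries.append(note)

    def read(self):
        return "\n".join(self.entries)


def run_case(patient_replies, notebook, max_turns=3):
    """One simulated case: gather findings turn by turn under
    incomplete information, then commit a diagnosis and a note."""
    findings = []
    for turn in range(min(max_turns, len(patient_replies))):
        # In the real benchmark an LLM chooses questions and tool calls;
        # here we simply consume scripted patient replies.
        findings.append(patient_replies[turn])
    diagnosis = "flu" if "fever" in findings else "unknown"
    notebook.write(f"findings: {findings} -> {diagnosis}")
    return diagnosis


notebook = Notebook()
d1 = run_case(["fever", "cough"], notebook)   # -> "flu"
d2 = run_case(["headache"], notebook)         # -> "unknown"
# Notes from both cases remain available for later cases.
print(d1, d2, len(notebook.entries))
```

The point of the sketch is the separation of per-case state (`findings`) from cross-case state (`notebook`), which is what distinguishes this interactive evaluation from static single-question benchmarks.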
Samuel Schmidgall
Google DeepMind
AI Agents · LLM agents · Large Language Models · Medical AI
Rojin Ziaei
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Carl Harris
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Eduardo Reis
Department of Radiology, Stanford University, Stanford, CA, USA; Hospital Israelita Albert Einstein, Sao Paulo, Brazil
Jeffrey Jopling
Department of Surgery, Johns Hopkins University, Baltimore, MD, USA
Michael Moor
MD, PhD. Assistant Professor at ETH Zurich. Previously: Stanford, Computer Science.
Medical AI · Foundation models · LLMs · Agents · Reasoning