The Optimization Paradox in Clinical AI Multi-Agent Systems

📅 2025-06-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study identifies an “optimization paradox” in clinical AI multi-agent systems: a system with individually optimized agents (information accuracy: 85.5%) achieves lower final diagnostic accuracy (67.7%) than a partially co-designed multi-agent system (77.4%). Method: Leveraging 2,400 real-world abdominal disease electronic health records from MIMIC-CDM, we propose a three-stage task decomposition framework—information acquisition, interpretation, and differential diagnosis—and comparatively evaluate single-agent versus multi-agent diagnostic pipelines. We establish an end-to-end evaluation framework encompassing diagnostic accuracy, procedural adherence, and cost efficiency. Contribution/Results: Our work provides the first empirical evidence that inter-agent information flow coordination and interface compatibility are more decisive for clinical performance than isolated agent optimization. Crucially, it demonstrates that end-to-end, system-level validation—not component-level benchmarking—is a prerequisite for safe and effective clinical deployment of AI multi-agent systems.
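The three-stage decomposition described above (information acquisition → interpretation → differential diagnosis) can be sketched as a simple pipeline that is wired up either with one model for every stage (single-agent) or a specialized model per stage (multi-agent). This is an illustrative sketch only; the stage names follow the summary, while the agent functions, class, and composition are hypothetical placeholders, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage maps the previous stage's text output to its own output.
Stage = Callable[[str], str]

@dataclass
class DiagnosticPipeline:
    acquire: Stage    # stage 1: gather patient information (e.g., from an EHR)
    interpret: Stage  # stage 2: interpret the gathered findings
    diagnose: Stage   # stage 3: produce a differential diagnosis

    def run(self, case: str) -> str:
        # System-level accuracy depends on how well each stage's output
        # matches the next stage's expected input -- the inter-agent
        # information flow the paper argues dominates component quality.
        info = self.acquire(case)
        findings = self.interpret(info)
        return self.diagnose(findings)

# Single-agent system: one model performs all three tasks.
def one_model(prompt: str) -> str:
    return f"model({prompt})"

single_agent = DiagnosticPipeline(one_model, one_model, one_model)

# Multi-agent system: a specialized (hypothetical) model per task.
multi_agent = DiagnosticPipeline(
    acquire=lambda c: f"acquirer({c})",
    interpret=lambda i: f"interpreter({i})",
    diagnose=lambda f: f"diagnostician({f})",
)

print(single_agent.run("case"))  # model(model(model(case)))
print(multi_agent.run("case"))   # diagnostician(interpreter(acquirer(case)))
```

The paper's "optimization paradox" is visible at exactly this seam: swapping in a stronger `acquire` or `interpret` component can still lower `run`'s end-to-end quality if its output format or content no longer suits the downstream stage, which is why the authors call for end-to-end rather than component-level validation.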

📝 Abstract
Multi-agent artificial intelligence systems are increasingly deployed in clinical settings, yet the relationship between component-level optimization and system-wide performance remains poorly understood. We evaluated this relationship using 2,400 real patient cases from the MIMIC-CDM dataset across four abdominal pathologies (appendicitis, pancreatitis, cholecystitis, diverticulitis), decomposing clinical diagnosis into information gathering, interpretation, and differential diagnosis. We evaluated single-agent systems (one model performing all tasks) against multi-agent systems (specialized models for each task) using comprehensive metrics spanning diagnostic outcomes, process adherence, and cost efficiency. Our results reveal a paradox: while multi-agent systems generally outperformed single agents, the component-optimized or "Best of Breed" system, with superior components and excellent process metrics (85.5% information accuracy), significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for a top multi-agent system). This finding underscores that successful integration of AI in healthcare requires not just component-level optimization but also attention to information flow and compatibility between agents. Our findings highlight the need for end-to-end system validation rather than reliance on component metrics alone.
Problem

Research questions and friction points this paper is trying to address.

Understanding component-level vs. system-level performance in clinical AI
Evaluating single- vs. multi-agent systems on diagnostic accuracy
Highlighting the need for end-to-end validation in AI healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated multi-agent against single-agent clinical AI systems
Used 2,400 patient cases from the MIMIC-CDM dataset
Highlighted the importance of end-to-end system validation
Suhana Bedi
PhD Student, Stanford University
Generative AI in healthcare · Multimodal data fusion · Data Commons
Iddah Mlauzi
Department of Computer Science, Stanford University, USA
Daniel Shin
Department of Computer Science, Stanford University, USA
Sanmi Koyejo
Assistant Professor, Stanford University
Machine Learning · Healthcare AI · Neuroinformatics
Nigam H. Shah
Department of Medicine, Stanford School of Medicine, USA