AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic evaluation of large language model (LLM) agents for multimodal clinical risk prediction, and the difficulty multi-agent collaboration has in integrating heterogeneous medical data. The authors present a new LLM agent benchmark for multimodal clinical prediction built on large-scale, real-world data: temporal electronic health records, medical images, radiology reports, and clinical notes. They evaluate agents in unimodal and multimodal settings and quantify the gap in predictive performance and calibration between single-agent and multi-agent frameworks. The results show that single-agent frameworks outperform naive multi-agent systems, handle multimodal data better, and are better calibrated. The code and evaluation framework are open-sourced as a reference point for future work on agentic systems in healthcare.

📝 Abstract

Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health record data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single-agent and multi-agent systems. Our findings highlight that single-agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.
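
The abstract reports that single-agent frameworks are "better calibrated" but does not say here which calibration metric is used. As an illustration only, the sketch below computes expected calibration error (ECE) for a binary risk prediction task, which is one common way to quantify calibration. The function name, the equal-width binning, and the default of 10 bins are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Illustrative ECE for binary predictions (assumed metric, not the paper's).

    probs  : predicted probability of the positive class, each in [0, 1]
    labels : ground-truth outcomes, each 0 or 1
    n_bins : number of equal-width probability bins (assumed default)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one prediction is required")

    # Group (probability, label) pairs into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of samples that fall in it.
        ece += (len(members) / total) * abs(mean_prob - observed_rate)
    return ece
```

A lower value means predicted risks track observed outcome rates more closely. Under this sketch, comparing a single-agent and a multi-agent system would amount to calling the function once on each system's predicted probabilities against the same labels.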

Problem

Research questions and friction points this paper is trying to address.

multimodal clinical prediction

LLM agents

clinical decision support

heterogeneous healthcare data

agent collaboration

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents

multimodal clinical prediction

agent benchmarking