🤖 AI Summary
This study addresses the absence of a systematic evaluation framework for large language model (LLM) agents in healthcare. The authors propose a seven-dimensional assessment framework tailored to medical AI agents, encompassing cognition, knowledge management, interaction, adaptive learning, safety and ethics, agent architecture, and core clinical tasks. The framework is operationalized into 29 measurable sub-dimensions and applied through a systematic literature review of 49 studies, using a three-tier annotation scheme (fully/partially/not implemented) for quantitative mapping and co-occurrence analysis. Findings reveal that external knowledge integration is widely implemented (about 76% fully implemented), whereas event-triggered activation (about 92% not implemented) and drift detection (about 98% not implemented) are critically underdeveloped. Multi-agent architectures dominate (about 82% fully implemented), yet action-oriented tasks such as treatment planning remain notably underexplored (about 59% not implemented).
📝 Abstract
Large Language Model (LLM)-based agents that plan, use tools, and act have begun to shape healthcare and medicine. Reported studies demonstrate competence on tasks ranging from EHR analysis and differential diagnosis to treatment planning and research workflows. Yet the literature largely consists of overviews that are either broad surveys or narrow dives into a single capability (e.g., memory, planning, reasoning), leaving healthcare work without a common frame. We address this by reviewing 49 studies using a seven-dimensional taxonomy (Cognitive Capabilities, Knowledge Management, Interaction Patterns, Adaptation & Learning, Safety & Ethics, Framework Typology, and Core Tasks & Subtasks) with 29 operational sub-dimensions. Using explicit inclusion and exclusion criteria and a labeling rubric (Fully Implemented ✓, Partially Implemented $\Delta$, Not Implemented ✗), we map each study to the taxonomy and report quantitative summaries of capability prevalence and co-occurrence patterns. Our empirical analysis surfaces clear asymmetries. For instance, the External Knowledge Integration sub-dimension under Knowledge Management is commonly realized (~76% ✓), whereas the Event-Triggered Activation sub-dimension under Interaction Patterns is largely absent (~92% ✗) and the Drift Detection & Mitigation sub-dimension under Adaptation & Learning is rare (~98% ✗). Architecturally, the Multi-Agent Design sub-dimension under Framework Typology is the dominant pattern (~82% ✓), while orchestration layers remain mostly partial. Across Core Tasks & Subtasks, information-centric capabilities lead (e.g., Medical Question Answering & Decision Support and Benchmarking & Simulation), while action- and discovery-oriented areas such as Treatment Planning & Prescription still show substantial gaps (~59% ✗).
Together, these findings provide an empirical baseline indicating that current agents excel at retrieval-grounded advising but require stronger adaptation and compliance platforms to move from early-stage systems to dependable systems.
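As a rough illustration of how prevalence and co-occurrence summaries like those reported above could be derived from a three-tier rubric, the following minimal sketch works on a toy annotation table. The study names, sub-dimension keys, labels, and helper functions are all hypothetical and are not taken from the reviewed studies or from the authors' analysis code; co-occurrence is assumed here to mean that two sub-dimensions are both marked fully implemented in the same study.

```python
from collections import Counter
from itertools import combinations

# Three-tier rubric labels (names chosen for this sketch only).
FULL, PARTIAL, NONE = "full", "partial", "none"

# Hypothetical toy annotations: study -> {sub-dimension: label}.
# These values are invented for illustration, not data from the review.
annotations = {
    "study_a": {"external_knowledge": FULL, "multi_agent": FULL, "drift_detection": NONE},
    "study_b": {"external_knowledge": FULL, "multi_agent": PARTIAL, "drift_detection": NONE},
    "study_c": {"external_knowledge": PARTIAL, "multi_agent": FULL, "drift_detection": NONE},
    "study_d": {"external_knowledge": FULL, "multi_agent": FULL, "drift_detection": FULL},
}


def prevalence(annotations, sub_dimension):
    """Return the share of studies carrying each label for one sub-dimension."""
    counts = Counter(labels[sub_dimension] for labels in annotations.values())
    total = len(annotations)
    return {label: counts[label] / total for label in (FULL, PARTIAL, NONE)}


def co_occurrence(annotations):
    """Count, per pair of sub-dimensions, the studies where both are fully implemented."""
    pairs = Counter()
    for labels in annotations.values():
        # Sort names so each pair has one canonical ordering.
        full = sorted(name for name, label in labels.items() if label == FULL)
        pairs.update(combinations(full, 2))
    return pairs


ext_prevalence = prevalence(annotations, "external_knowledge")
pair_counts = co_occurrence(annotations)
```

Counting only fully implemented pairs is one possible design choice; a variant could weight partial implementations or track co-absence instead.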