🤖 AI Summary
This work addresses a limitation of existing medical AI agents: they handle each case in isolation, so they cannot accumulate cross-case experience or correct recurring errors. The authors propose Evo-MedAgent, a training-free, self-evolving memory module that gives frozen large language models inter-case learning at test time through three stores: a memory of past clinical episodes, a bank of diagnostic rules refined by reflection, and a controller that tracks per-tool reliability. Per-case overhead is limited to one additional retrieval pass and one reflection call. On ChestAgentBench, it raises multiple-choice accuracy from 0.68 to 0.79 for GPT-5-mini and from 0.76 to 0.87 for Gemini-3 Flash. With a strong base model, the evolving memory improves performance more than orchestrating external tools does on qualitative diagnostic tasks.
📝 Abstract
Tool-augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question-answering modules to interpret chest X-rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool-use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo-MedAgent, a self-evolving memory module that equips a medical agent with the capacity for inter-case learning at test time. Our memory comprises three complementary stores: (1) Retrospective Clinical Episodes that retrieve problem-solving experiences from similar past cases, (2) an Adaptive Procedural Heuristics bank curating priority-tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3) a Tool Reliability Controller that tracks per-tool trustworthiness. On ChestAgentBench, Evo-MedAgent raises multiple-choice question (MCQ) accuracy from 0.68 to 0.79 on GPT-5-mini, and from 0.76 to 0.87 on Gemini-3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo-MedAgent requires no training, its per-case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.
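To make the three-store design concrete, the following is a minimal sketch of how such a memory could be organized. It is not the authors' implementation. Every name here (`EvolvingMemory`, `Episode`, `Rule`, `retrieve`, `update`) is hypothetical, and the retrieval metric (token-overlap Jaccard similarity) and the reliability estimate (Laplace-smoothed agreement rate) are assumptions chosen for simplicity; the abstract does not specify either. The reflection call itself is left outside the sketch: its outputs (a lesson and an optional new rule) are simply passed in to `update`.

```python
from dataclasses import dataclass, field
from typing import Optional


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a stand-in for a real text encoder."""
    return set(text.lower().split())


@dataclass
class Episode:
    question: str  # the past case as it was posed
    lesson: str    # what was learned while solving it


@dataclass
class Rule:
    text: str
    priority: int  # assumption: higher values are surfaced first


@dataclass
class EvolvingMemory:
    episodes: list = field(default_factory=list)    # past clinical episodes
    rules: list = field(default_factory=list)       # procedural heuristics
    tool_stats: dict = field(default_factory=dict)  # tool -> [agreed, used]

    def retrieve(self, question: str, k: int = 2) -> list:
        """Single retrieval pass: the k most similar past episodes."""
        query = _tokens(question)

        def jaccard(ep: Episode) -> float:
            other = _tokens(ep.question)
            union = query | other
            return len(query & other) / len(union) if union else 0.0

        return sorted(self.episodes, key=jaccard, reverse=True)[:k]

    def top_rules(self, k: int = 3) -> list:
        """Highest-priority heuristics to place in the prompt."""
        return sorted(self.rules, key=lambda r: r.priority, reverse=True)[:k]

    def tool_reliability(self, tool: str) -> float:
        """Laplace-smoothed agreement rate; 0.5 for a tool never seen."""
        agreed, used = self.tool_stats.get(tool, [0, 0])
        return (agreed + 1) / (used + 2)

    def update(self, question: str, lesson: str,
               new_rule: Optional[Rule], tool_outcomes: dict) -> None:
        """Store the outputs of one (external) reflection call."""
        self.episodes.append(Episode(question, lesson))
        if new_rule is not None:
            self.rules.append(new_rule)
        for tool, agreed in tool_outcomes.items():
            stats = self.tool_stats.setdefault(tool, [0, 0])
            stats[0] += int(agreed)
            stats[1] += 1
```

In this sketch a case would be handled by calling `retrieve` and `top_rules` once before prompting the frozen model, then calling `update` once with the reflection output, which mirrors the stated per-case budget of one retrieval pass and one reflection call.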