🤖 AI Summary
Problem: Medical imaging lacks dedicated automated machine learning (AutoML) tools, and existing multi-step modeling pipelines remain challenging to automate end-to-end. Method: This paper introduces M3Builder, the first fully automated multi-agent system tailored for medical imaging, comprising four collaborative agents that jointly orchestrate data preprocessing, environment configuration, self-debugging, and model training. It proposes a medical imaging-specific multi-agent architecture with a structured workspace and establishes M3Bench, a benchmark spanning anatomical regions, imaging modalities, and dimensionalities (2D and 3D). The system supports plug-and-play large language models (LLMs) and extensible task orchestration. Contribution/Results: Evaluated on M3Bench, the system achieves a 94.29% task success rate with Claude-3.7-Sonnet as the agent core, outperforming existing ML agent designs, and provides empirical evidence that fully automated medical imaging modeling is feasible and effective.
📝 Abstract
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well-prepared tools limits their applicability in the medical domain, which requires training specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi-agent system designed to automate machine learning (ML) in medical imaging. At its core, M3Builder employs four specialized agents that collaborate to tackle complex, multi-step medical ML workflows, from automated data processing and environment configuration to self-contained auto-debugging and model training. These agents operate within a medical imaging ML workspace, a structured environment that provides agents with free-text descriptions of datasets, training code, and interaction tools, enabling seamless communication and task execution. (ii) To evaluate progress in automated medical imaging ML, we propose M3Bench, a benchmark comprising four general tasks on 14 training datasets, across five anatomies and three imaging modalities, covering both 2D and 3D data. (iii) We experiment with seven state-of-the-art large language models serving as agent cores for our system, including the Claude series, GPT-4o, and DeepSeek-V3. Compared to existing ML agentic designs, M3Builder shows superior performance in completing ML tasks in medical imaging, achieving a 94.29% success rate with Claude-3.7-Sonnet as the agent core, which indicates strong potential for fully automated machine learning in medical imaging.