ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-driven autonomous coding agents perform well on general software and ML tasks but face severe limitations in scientific domains, particularly medical imaging, which demands long-duration training, high-dimensional data processing, and domain-specific preprocessing and verification; moreover, no dedicated evaluation benchmark exists for this setting. Method: We introduce the first end-to-end autonomous agent benchmark for medical imaging, comprising 20 cross-modal, multi-task challenges drawn from high-impact competitions and requiring agents to independently execute data preprocessing, model training, and result submission under realistic computational and time constraints. We evaluate agent frameworks with GPT-5, Gemini, and Claude backends, augmented with medical imaging-specific pipelines, in the first systematic assessment of their full scientific workflow capabilities. Contribution/Results: We define a domain-specific evaluation paradigm and uncover a fundamental adaptability bottleneck: state-of-the-art agents, including AIDE, ML-Master, and R&D-Agent, achieve 0th-percentile performance relative to human experts across all challenges.
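The summary describes agents that must carry a challenge from raw data to a scored submission within a fixed time budget. As a rough illustration only (this is not the ReX-MLE harness; the stage functions, budget value, and control flow are assumptions), a minimal end-to-end run loop might look like the sketch below.

```python
# Hypothetical sketch of an end-to-end challenge run under a wall-clock budget.
# The stage callables (preprocess, train, predict) and the budget value are
# illustrative assumptions, not the actual ReX-MLE evaluation harness.
import time

TIME_BUDGET_S = 24 * 3600  # assumed per-challenge wall-clock limit


def run_challenge(preprocess, train, predict, data_dir, out_path):
    """Run preprocess -> train -> predict, aborting if the budget is exhausted."""
    start = time.monotonic()

    def check_budget():
        left = TIME_BUDGET_S - (time.monotonic() - start)
        if left <= 0:
            raise TimeoutError("wall-clock budget exhausted")
        return left

    dataset = preprocess(data_dir)      # e.g. resample volumes, normalize intensities
    check_budget()
    model = train(dataset)              # long-running; the agent must budget epochs itself
    check_budget()
    predict(model, dataset, out_path)   # write the submission file for scoring
    return out_path
```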

📝 Abstract
Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
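The "0th percentile" finding refers to ranking an agent's submission score against the human competition leaderboard. A minimal sketch of that comparison is shown below; the assumption that higher scores are better and the example numbers are hypothetical, and ReX-MLE's per-challenge scoring rules may differ.

```python
# Minimal sketch of percentile scoring against a human leaderboard.
# Assumes higher scores are better; example values are illustrative only.
from bisect import bisect_left


def human_percentile(agent_score: float, human_scores: list[float]) -> float:
    """Percentage of human leaderboard entries the agent strictly outscores."""
    ranked = sorted(human_scores)
    beaten = bisect_left(ranked, agent_score)  # entries strictly below the agent's score
    return 100.0 * beaten / len(ranked)


# An agent scoring below every human entry lands in the 0th percentile.
print(human_percentile(0.41, [0.62, 0.71, 0.74, 0.78, 0.83]))  # -> 0.0
```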
Problem

Research questions and friction points this paper is trying to address.

Benchmarks autonomous agents on complex medical imaging tasks
Evaluates full end-to-end workflows under realistic constraints
Exposes domain-knowledge and engineering limitations in AI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for medical imaging agent evaluation
End-to-end workflow assessment under constraints
Identifies domain-knowledge and engineering bottlenecks
Roshan Kenia
Xiaoman Zhang
Harvard University
AI for Medicine · Medical Image Analysis
Pranav Rajpurkar
Department of Biomedical Informatics, Harvard Medical School, Boston, MA