ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-driven autonomous coding agents perform well on general software and ML tasks but face severe limitations in scientific domains, particularly medical imaging, which demands long-duration training, high-dimensional data processing, and domain-specific preprocessing and verification; moreover, no dedicated evaluation benchmark exists for this setting. Method: We introduce the first end-to-end autonomous agent benchmark for medical imaging, comprising 20 cross-modal, multi-task challenges drawn from high-impact competitions and requiring agents to independently execute data preprocessing, model training, and result submission under realistic computational and time constraints. We evaluate agent frameworks with GPT-5, Gemini, and Claude backends, augmented with medical imaging-specific pipelines, in the first systematic assessment of their full scientific workflow capabilities. Contribution/Results: We define a domain-specific evaluation paradigm and uncover a fundamental adaptability bottleneck: state-of-the-art agents, including AIDE, ML-Master, and R&D-Agent, achieve 0th-percentile performance relative to human experts across all challenges.
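The summary describes agents that must carry a challenge from raw data to a scored submission within a fixed time budget. As a rough illustration only (this is not the ReX-MLE harness; the stage functions, budget value, and control flow are assumptions), a minimal end-to-end run loop might look like the sketch below.

```python
# Hypothetical sketch of an end-to-end challenge run under a wall-clock budget.
# The stage callables (preprocess, train, predict) and the budget value are
# illustrative assumptions, not the actual ReX-MLE evaluation harness.
import time

TIME_BUDGET_S = 24 * 3600  # assumed per-challenge wall-clock limit


def run_challenge(preprocess, train, predict, data_dir, out_path):
    """Run preprocess -> train -> predict, aborting if the budget is exhausted."""
    start = time.monotonic()

    def check_budget():
        left = TIME_BUDGET_S - (time.monotonic() - start)
        if left <= 0:
            raise TimeoutError("wall-clock budget exhausted")
        return left

    dataset = preprocess(data_dir)      # e.g. resample volumes, normalize intensities
    check_budget()
    model = train(dataset)              # long-running; the agent must budget epochs itself
    check_budget()
    predict(model, dataset, out_path)   # write the submission file for scoring
    return out_path
```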

📝 Abstract
Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
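The "0th percentile" finding refers to ranking an agent's submission score against the human competition leaderboard. A minimal sketch of that comparison is shown below; the assumption that higher scores are better and the example numbers are hypothetical, and ReX-MLE's per-challenge scoring rules may differ.

```python
# Minimal sketch of percentile scoring against a human leaderboard.
# Assumes higher scores are better; example values are illustrative only.
from bisect import bisect_left


def human_percentile(agent_score: float, human_scores: list[float]) -> float:
    """Percentage of human leaderboard entries the agent strictly outscores."""
    ranked = sorted(human_scores)
    beaten = bisect_left(ranked, agent_score)  # entries strictly below the agent's score
    return 100.0 * beaten / len(ranked)


# An agent scoring below every human entry lands in the 0th percentile.
print(human_percentile(0.41, [0.62, 0.71, 0.74, 0.78, 0.83]))  # -> 0.0
```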
Problem

Research questions and friction points this paper is trying to address.

Benchmarks autonomous agents on complex medical imaging tasks
Evaluates full end-to-end workflows under realistic constraints
Exposes domain-knowledge and engineering limitations in AI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for medical imaging agent evaluation
End-to-end workflow assessment under constraints
Identifies domain-knowledge and engineering bottlenecks
Roshan Kenia
Xiaoman Zhang
Harvard University
AI for Medicine · Medical Image Analysis
Pranav Rajpurkar
Department of Biomedical Informatics, Harvard Medical School, Boston, MA