BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

📅 2026-05-15

🤖 AI Summary
Existing biomedical benchmarks mostly test question answering or tool use and do not systematically evaluate end-to-end multimodal machine learning. The authors propose BioXArena, an end-to-end multimodal biomedical benchmark for large language model agents, comprising 76 tasks across nine domains. Agents must write executable training code, handle heterogeneous inputs (sequences, images, text, and structures, among others), and submit predictions for private test samples. Evaluation is standardized: every run gets a single GPU and a two-hour budget, labels are hidden, and biology-aware metrics are normalized to a 0 to 1 scale, which allows fair comparison across 11 agent configurations. MLEvolve paired with Gemini-3.1-Pro achieves the highest average score of 0.666, followed by GPT-5.4 at 0.636, yet no single agent dominates all domains. The authors state that they will publicly release all tasks, graders, execution runners, leaderboard results, and agent trajectories.
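To make the fixed-budget protocol concrete, here is a minimal sketch of how a runner might enforce a wall-clock limit on an agent-written script and check that it produced a submission. This is not the BioXArena runner, which is not described on this page: the function name `run_with_budget`, the file name `submission.csv`, and the status strings are all hypothetical, and GPU pinning is not shown.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The source states a two-hour budget; expressing it in seconds is our choice.
TWO_HOURS = 2 * 60 * 60


def run_with_budget(script, workdir, budget_seconds=TWO_HOURS):
    """Run an agent-written script under a wall-clock budget.

    Returns "ok" if the script exits cleanly and wrote submission.csv
    (a hypothetical file name), "timeout" if it exceeded the budget,
    and "failed" otherwise.
    """
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            timeout=budget_seconds,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode != 0 or not (Path(workdir) / "submission.csv").exists():
        return "failed"
    return "ok"


# Toy demonstration: a "solution" that only writes a two-line submission.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "solution.py"
    script.write_text(
        "open('submission.csv', 'w').write('id,prediction\\n0,0.5\\n')\n"
    )
    status = run_with_budget(script, tmp, budget_seconds=60)
```

In this sketch a run that exceeds the budget is simply reported as a timeout. How the actual benchmark scores timed-out or failed runs is not stated here.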
📝 Abstract
Large language model (LLM) agents are increasingly capable of automating components of machine learning development, yet existing biomedical benchmarks mainly focus on question answering, reasoning, and tool usage, or evaluate only narrow aspects of biomedical ML coding. We present BioXArena, a biomedical machine learning benchmark designed to evaluate whether agents can generate task-specific model training pipelines for heterogeneous and multi-modal biomedical datasets. BioXArena contains 76 end-to-end tasks across 9 domains, including sequence modeling, single-cell analysis, structural biology, network biology, chemical biology, perturbation dynamics, phenotype-disease modeling, biomedical imaging, and text-integrated learning. Each task is curated from primary biomedical sources into a unified evaluation framework with hidden labels, held-out graders, and biology-aware metrics normalized to a 0 to 1 scale. Agents are required to write executable code, train predictive models, and generate submissions for private test samples. Most tasks involve multiple input modalities, including tabular data, images, natural language, molecular sequences, omics matrices, and protein structures. We evaluate 11 agent configurations in a standardized 2-hour single-GPU environment. MLEvolve with Gemini-3.1-Pro achieves the highest average score of 0.666, followed by GPT-5.4 with 0.636, while no single agent consistently dominates across all domains. We additionally perform extensive ablation studies, robustness evaluations, scaling analyses, cost analyses, and failure-mode investigations to better understand how model backbones, agent scaffolds, inference budgets, and biomedical domains influence BioML coding performance. We will publicly release all benchmark tasks, graders, execution runners, leaderboard results, and agent trajectories.
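The abstract says that metrics are normalized to a 0 to 1 scale and that agents are compared by average score, but it does not give the formulas. The sketch below shows one plausible reading: a clipped linear rescaling between a worst and a best reference value, followed by a plain mean over tasks and a per-domain mean. The functions `normalize` and `aggregate`, the reference values, and the example scores are all assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def normalize(raw, worst, best):
    """Map a raw metric onto [0, 1], where `worst` maps to 0 and `best` to 1.

    Handles higher-is-better metrics (worst < best) and lower-is-better
    metrics (worst > best); values outside the range are clipped.
    """
    if worst == best:
        raise ValueError("worst and best must differ")
    scaled = (raw - worst) / (best - worst)
    return min(1.0, max(0.0, scaled))


def aggregate(results):
    """Return (overall mean, per-domain means) from (domain, score) pairs."""
    by_domain = defaultdict(list)
    for domain, score in results:
        by_domain[domain].append(score)
    per_domain = {d: mean(s) for d, s in by_domain.items()}
    overall = mean(score for _, score in results)
    return overall, per_domain


# Hypothetical task results, not taken from the paper.
results = [
    # correlation-like metric on [-1, 1]
    ("sequence modeling", normalize(0.5, worst=-1.0, best=1.0)),
    # accuracy-like metric on [0, 1]
    ("sequence modeling", normalize(0.25, worst=0.0, best=1.0)),
    # error-like metric where lower is better
    ("biomedical imaging", normalize(2.0, worst=8.0, best=0.0)),
]
overall, per_domain = aggregate(results)
```

Reporting per-domain means next to the overall mean matters for a benchmark like this one, since the abstract notes that no single agent dominates across all domains. Whether the reported averages weight tasks or domains equally is not specified in the abstract.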
Problem

Research questions and friction points this paper is trying to address.

biomedical machine learning
LLM agents
multi-modal data
benchmarking
model training pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal biomedical benchmark
LLM agents
end-to-end ML pipeline generation
executable code synthesis
BioML evaluation
Authors

Loka Li
Mohamed bin Zayed University of Artificial Intelligence
Machine Learning, Causality

Duzhen Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Multimodal, Large Language Models, Continual Learning, AI4Science

Xingbo Du
Postdoc at MBZUAI
AI/LLM4CO

Leonard Song
Mohamed bin Zayed University of Artificial Intelligence

Zixiao Wang
University of Science and Technology of China

Assanali Aukenov
Mohamed bin Zayed University of Artificial Intelligence

Noel Thomas
Mohamed bin Zayed University of Artificial Intelligence

Shakhnazar Sailaukan
Mohamed bin Zayed University of Artificial Intelligence

Yonghan Yang
Mohamed bin Zayed University of Artificial Intelligence

Feilong Chen
Huawei Inc.; Previously CASIA
(Native) Multimodal LLM, Multimodal Generation, Multimodal Reasoning, Omni-modal LLM

Jiahua Dong
Mohamed bin Zayed University of Artificial Intelligence

Kun Zhang
Carnegie Mellon University & Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Causal discovery and inference, machine learning, representation learning

Bin Zhang
Mohamed bin Zayed University of Artificial Intelligence

Le Song
CTO, GenBio AI; Professor, MBZUAI
AI, AI for Science, Machine Learning