BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the reproducibility and fair-comparison problems in evaluating biomedical research agents, which arise from implementation differences across systems. The authors propose a decoupled evaluation framework that partitions the evaluation pipeline into six plug-and-play layers and uses lightweight provider adapters to reduce the integration overhead for new models, benchmarks, or tools. The resulting open-source toolkit integrates 147 biomedical benchmarks, 75 tools spanning nine functional categories, six context-management strategies, six agent harnesses, and twelve backbone models, enabling fair and efficient agent development and assessment. The toolkit achieves state-of-the-art performance on eight representative benchmarks, with an average improvement of 15.03 percentage points over the prior state of the art, and releases its code, configurations, and per-task execution traces.

📝 Abstract

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that both alleviates it and provides an arena for fairly comparing foundation models as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring) and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses and 6 context-management strategies, which equip 12 backbones with competitive research capabilities and achieve state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena

Problem

Research questions and friction points this paper is trying to address.

deep research agents

biomedical evaluation

foundation models

benchmark inconsistency

engineering overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

BioMedArena

deep research agents

foundation models