ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluations of autonomous research agents rarely test their ability to construct complex execution environments, a task that involves resolving intricate software dependencies, aligning hardware and framework versions, and setting up distributed configurations. To close this gap, this work incorporates environment synthesis into agent evaluation for the first time, introducing a benchmark built from real-world scientific code repositories. The benchmark leverages dependency resolution, version matching, and containerization techniques to generate reproducible environment-construction challenges. Experiments show that state-of-the-art agents perform poorly on these tasks, failing mainly through incomplete dependency resolution and brittle version coupling. The study thereby establishes the first systematic evaluation framework for environment configuration in automated scientific research.
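As a rough illustration of how such environment-construction challenges could be represented, the Python sketch below encodes one task as a repository plus a target execution setting. The class, field names, and the example URL are hypothetical, not the paper's actual schema.

from dataclasses import dataclass

@dataclass
class EnvSynthesisTask:
    """One benchmark instance: a repository plus a target execution setting."""
    repo_url: str            # real-world research repository to set up
    docs: list[str]          # documentation files handed to the agent
    target_setting: str      # hardware / framework constraint to satisfy
    entry_command: str       # command whose successful run defines success
    timeout_s: int = 3600    # wall-clock budget for the runtime check

# Illustrative instance (placeholder URL and paths):
task = EnvSynthesisTask(
    repo_url="https://github.com/example/research-project",
    docs=["README.md", "docs/INSTALL.md"],
    target_setting="single GPU, CUDA 11.8, PyTorch 2.x",
    entry_command="python train.py --config configs/base.yaml",
)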

📝 Abstract
Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution environment; constructing one requires resolving complex software dependencies, aligning hardware and framework versions, and configuring distributed execution, and this capability remains largely unbenchmarked. We introduce ResearchEnvBench, a benchmark for environment synthesis in research code execution. Given a research repository, documentation, and a target execution setting, agents must construct an environment in which the code executes successfully at runtime. Evaluations on diverse research repositories reveal a substantial gap in current state-of-the-art agents, with failures dominated by incomplete dependency resolution and brittle version coupling. ResearchEnvBench provides a realistic testbed for advancing autonomous agents toward reproducible scientific research.
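Success in this setup is defined at runtime: the synthesized environment must actually run the research code, not merely finish installing packages. Below is a minimal sketch of such a check, assuming (as one plausible containerization choice, not confirmed by the abstract) that the agent's output is a Dockerfile; all names here are illustrative.

import subprocess

def environment_executes(context_dir: str, entry_command: str,
                         tag: str = "envbench-task", timeout_s: int = 3600) -> bool:
    """Build the agent-produced image, then verify the entry command runs."""
    build = subprocess.run(["docker", "build", "-t", tag, context_dir])
    if build.returncode != 0:
        return False  # build-time failure: unresolved deps, bad version pins
    try:
        run = subprocess.run(["docker", "run", "--rm", tag, "sh", "-c", entry_command],
                             timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # hung at runtime, e.g. a stalled distributed setup
    return run.returncode == 0  # success = the research code actually executes

print(environment_executes("./agent_output", "python train.py --epochs 1"))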
Problem

Research questions and friction points this paper is trying to address.

environment synthesis
research code execution
dependency resolution
autonomous agents
reproducible research
Innovation

Methods, ideas, or system contributions that make the work stand out.

environment synthesis
autonomous agents
research reproducibility
dependency resolution
benchmarking
Yubang Wang
Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institution; College of Computer Science and Artificial Intelligence, Fudan University; OpenMOSS; Wuhan University
Chenxi Zhang
College of Computer Science and Artificial Intelligence, Fudan University; OpenMOSS
Bowen Chen
Institute of Trustworthy Embodied AI, Fudan University; OpenMOSS; Nanjing University
Zezheng Huai
Shanghai Innovation Institution; College of Computer Science and Artificial Intelligence, Fudan University; OpenMOSS; Jilin University
Zihao Dai
Shanghai Innovation Institution; College of Computer Science and Artificial Intelligence, Fudan University; OpenMOSS
Xinchi Chen
Professor at Fudan University, Shanghai, China
Large Language Models, Embodied AI, Natural Language Processing, Information Retrieval, etc.
Yuxin Wang
Fudan University
Yining Zheng
Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institution; College of Computer Science and Artificial Intelligence, Fudan University; OpenMOSS
Jingjing Gong
SII
Machine Learning, AI for Science, Large Language Model, Embodied AI
Xipeng Qiu
Shanghai Innovation Institution; College of Computer Science and Artificial Intelligence, Fudan University; OpenMOSS