🤖 AI Summary
Current benchmarks for autonomous research agents largely fail to evaluate their ability to construct complex execution environments, particularly with respect to intricate software dependencies, alignment of hardware and framework versions, and distributed configurations. To address this gap, this work systematically incorporates environment synthesis into agent evaluation for the first time, introducing ResearchEnvBench, a benchmark built from real-world scientific code repositories. The benchmark uses dependency resolution, version matching, and containerization to generate reproducible environment-construction challenges. Experimental results show that state-of-the-art agents perform poorly on these tasks, with failures dominated by incomplete dependency resolution and fragile version coupling. The study thereby establishes the first comprehensive evaluation framework for environment configuration in automated scientific research.
📝 Abstract
Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution environment. Constructing such an environment requires resolving complex software dependencies, aligning hardware and framework versions, and configuring distributed execution, yet this capability remains largely unbenchmarked. We introduce ResearchEnvBench, a benchmark for environment synthesis in research code execution. Given a research repository, its documentation, and a target execution setting, agents must construct an environment in which the code executes successfully at runtime. Evaluations on diverse research repositories reveal a substantial gap in current state-of-the-art agents, with failures dominated by incomplete dependency resolution and brittle version coupling. ResearchEnvBench provides a realistic testbed for advancing autonomous agents toward reproducible scientific research.
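The abstract does not specify how runtime success is scored, but the criterion described ("an environment that successfully executes at runtime") suggests an outcome-based check: the agent-built environment passes only if the repository's entry command runs to completion. A minimal sketch of such a check, assuming a simple exit-code criterion and illustrative function names not taken from the paper:

```python
# Hypothetical sketch of a runtime-success check for an agent-built
# environment: run the repository's entry command and treat exit code 0
# as success. All names here are illustrative, not ResearchEnvBench's API.
import subprocess
import sys


def environment_succeeds(command: list, timeout: int = 60) -> bool:
    """Return True iff the target command runs to completion with exit code 0."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # A missing interpreter, broken binary, or hang counts as failure.
        return False
    return result.returncode == 0


# Stand-in "repository entry points": importing packages the environment must
# provide. A missing or version-mismatched dependency makes the check fail.
ok = environment_succeeds([sys.executable, "-c", "import json, csv"])
bad = environment_succeeds([sys.executable, "-c", "import nonexistent_pkg_xyz"])
print(ok, bad)  # True False
```

In a containerized setup as described in the summary, the same check would be run inside the image the agent produced (e.g. via `docker run`), so that incomplete dependency resolution or brittle version coupling surfaces directly as a failed execution.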