🤖 AI Summary
Automated environment configuration lacks a reliable evaluation standard, which hinders the scalable automation of software engineering. To address this, we introduce Multi-Docker-Eval, the first benchmark for automated environment configuration grounded in real-world scenarios. It comprises 40 authentic repositories across nine programming languages and supports multi-language Dockerized environment construction. We propose a two-dimensional evaluation framework, fail-to-pass rate (F2P) and resource consumption, enabling systematic assessment of large language models and agent-based frameworks. Key findings reveal that model scale is not decisive; open-source models such as DeepSeek-V3.1 and Kimi-K2 are competitive with proprietary ones in both success rate (F2P at most 37.7%) and efficiency; and agent architecture design and language-specific adaptability significantly influence build success. Multi-Docker-Eval provides a reproducible, extensible evaluation infrastructure to advance research in automated environment configuration.
📝 Abstract
Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in reaching executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models like DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) the agent framework and programming language also significantly influence success rate. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
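As a rough illustration of the F2P (fail-to-pass) metric mentioned above, the sketch below shows one common way such a metric is computed in SWE evaluation: the fraction of tests that failed before environment configuration but pass afterwards. The exact definition used by Multi-Docker-Eval is not given here, and the function and argument names are illustrative assumptions.

```python
def fail_to_pass_rate(before: dict, after: dict) -> float:
    """Fraction of tests failing before configuration that pass after.

    `before` and `after` map test names to booleans (True = passed).
    Names and convention are assumptions, not the benchmark's actual API.
    """
    failed_before = [t for t, passed in before.items() if not passed]
    if not failed_before:
        return 0.0  # no failing tests to fix; edge-case convention varies
    fixed = sum(1 for t in failed_before if after.get(t, False))
    return fixed / len(failed_before)


# Example: two tests failed initially, one passes after setup -> 0.5
before = {"test_a": False, "test_b": False, "test_c": True}
after = {"test_a": True, "test_b": False, "test_c": True}
print(fail_to_pass_rate(before, after))
```

Under this reading, the reported 37.7% ceiling would mean that even the best model converts only a little over a third of initially failing test suites into passing ones.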