🤖 AI Summary
Automated environment configuration lacks a reliable evaluation standard, which hinders the scalable automation of software engineering. To address this, we introduce Multi-Docker-Eval, the first benchmark for automated environment configuration grounded in real-world scenarios. It comprises 40 authentic repositories across nine programming languages and supports multi-language Dockerized environment construction. We propose a two-dimensional evaluation framework, fail-to-pass rate (F2P) and resource consumption, enabling systematic assessment of large language models and agent-based frameworks. Key findings reveal that model scale is not decisive; open-source models such as DeepSeek-V3.1 and Kimi-K2 are competitive with proprietary ones in both success rate (F2P at most 37.7%) and efficiency; and agent architecture design and language-specific adaptability significantly influence build success. Multi-Docker-Eval provides a reproducible, extensible evaluation infrastructure to advance research in automated environment configuration.
📝 Abstract
Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in reaching executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models like DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) the agent framework and programming language also significantly influence success rate. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
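As a rough illustration of the F2P (fail-to-pass) metric mentioned above, the sketch below shows one common way such a metric is computed in SWE evaluation: the fraction of tests that failed before environment configuration but pass afterwards. The exact definition used by Multi-Docker-Eval is not given here, and the function and argument names are illustrative assumptions.

```python
def fail_to_pass_rate(before: dict, after: dict) -> float:
    """Fraction of tests failing before configuration that pass after.

    `before` and `after` map test names to booleans (True = passed).
    Names and convention are assumptions, not the benchmark's actual API.
    """
    failed_before = [t for t, passed in before.items() if not passed]
    if not failed_before:
        return 0.0  # no failing tests to fix; edge-case convention varies
    fixed = sum(1 for t in failed_before if after.get(t, False))
    return fixed / len(failed_before)


# Example: two tests failed initially, one passes after setup -> 0.5
before = {"test_a": False, "test_b": False, "test_c": True}
after = {"test_a": True, "test_b": False, "test_c": True}
print(fail_to_pass_rate(before, after))
```

Under this reading, the reported 37.7% ceiling would mean that even the best model converts only a little over a third of initially failing test suites into passing ones.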