🤖 AI Summary
This work addresses the scarcity of verifiable datasets for large language model agents in software engineering, a limitation caused primarily by the complexity and poor scalability of constructing executable environments across programming languages. The authors propose MEnvAgent, a multi-agent framework built on a planning-execution-verification architecture that automatically constructs and repairs Dockerized software environments across multiple programming languages. An environment reuse mechanism incrementally patches previously built environments, which reduces computational overhead. On MEnvBench, a new benchmark of 1,000 tasks across 10 languages, MEnvAgent improves the Fail-to-Pass (F2P) rate by 8.6% and reduces time cost by 43% relative to baselines. The authors also use MEnvAgent to build MEnvData-SWE, described as the largest open-source polyglot dataset of realistic, verifiable Docker environments to date, together with solution trajectories. Code, benchmark, and dataset are publicly released.
📝 Abstract
The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.
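The abstract does not define the Fail-to-Pass (F2P) rate, so the sketch below assumes the convention common in SWE-style benchmarks: a test counts as Fail-to-Pass if it fails before the reference fix is applied and passes afterwards, and a task counts toward the rate if its environment yields at least one such test. The function names and the dictionary-based result format are hypothetical and are not taken from the MEnvAgent code.

```python
from typing import Dict, List, Tuple

# Hypothetical result format: test name -> True if the test passed.
TestResults = Dict[str, bool]


def fail_to_pass_tests(before: TestResults, after: TestResults) -> List[str]:
    """Return the tests that failed before the fix and pass after it."""
    return sorted(
        name
        for name, passed in after.items()
        if passed and before.get(name) is False
    )


def f2p_rate(tasks: List[Tuple[TestResults, TestResults]]) -> float:
    """Fraction of tasks with at least one Fail-to-Pass test (assumed definition)."""
    if not tasks:
        return 0.0
    verified = sum(1 for before, after in tasks if fail_to_pass_tests(before, after))
    return verified / len(tasks)


if __name__ == "__main__":
    before = {"test_a": False, "test_b": True, "test_c": False}
    after = {"test_a": True, "test_b": True, "test_c": False}
    print(fail_to_pass_tests(before, after))
    print(f2p_rate([(before, after), (before, before)]))
```

Under this assumed definition, an environment in which every test passes both before and after the fix contributes nothing to the rate, because it cannot show that the fix changed behavior.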