MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

📅 2026-01-30
🤖 AI Summary
This work addresses the scarcity of verifiable datasets for LLM agents in software engineering, a bottleneck caused mainly by the complexity and poor scalability of constructing executable environments across programming languages. The authors propose MEnvAgent, a multi-agent framework with a Planning-Execution-Verification architecture that automatically constructs and repairs Dockerized environments for multiple languages. Its Environment Reuse Mechanism reduces computational overhead by incrementally patching historical environments. On MEnvBench, a new benchmark of 1,000 tasks across 10 languages, MEnvAgent improves the Fail-to-Pass (F2P) rate by 8.6% and reduces time cost by 43% relative to baselines. The authors also use MEnvAgent to build MEnvData-SWE, described as the largest open-source polyglot dataset of realistic, verifiable Docker environments to date, together with solution trajectories. Code, benchmark, and dataset are publicly released.

📝 Abstract
The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.
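To make the Fail-to-Pass criterion concrete, here is a minimal Python sketch. It assumes the common convention that a test counts as fail-to-pass when it fails before the reference fix is applied and passes afterwards. The function names, the dictionary layout of test results, and the rate definition (share of instances with at least one fail-to-pass test) are illustrative assumptions, not the paper's implementation or its exact metric.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: test name -> True if the test passed.
TestResults = Dict[str, bool]


def fail_to_pass_tests(before: TestResults, after: TestResults) -> List[str]:
    """Return tests that failed before the fix and pass after it.

    Tests that did not run before the fix are ignored (an assumption).
    """
    return sorted(
        name
        for name, passed in after.items()
        if passed and before.get(name) is False
    )


def f2p_rate(instances: List[Tuple[TestResults, TestResults]]) -> float:
    """Share of instances with at least one fail-to-pass test.

    This is one plausible reading of an F2P rate, chosen for illustration.
    """
    if not instances:
        return 0.0
    verified = sum(1 for before, after in instances if fail_to_pass_tests(before, after))
    return verified / len(instances)
```

Under this reading, an environment is only useful for verification if the test suite can actually distinguish the broken state from the fixed one, which is why a build that merely succeeds is not enough.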
Problem

Research questions and friction points this paper is trying to address.

verifiable datasets
executable environments
software engineering
polyglot environment
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent Planning-Execution-Verification
Environment Reuse Mechanism
Polyglot Environment Construction
Verifiable Software Engineering
Scalable Dataset Generation
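
The Planning-Execution-Verification loop and the Environment Reuse Mechanism listed above can be pictured with the following Python sketch. Everything in it is a hypothetical illustration: `StepResult`, `construct_environment`, the `history` mapping from repository to a previously built image, and the `plan`, `execute`, and `verify` callables are assumed names, and the control flow is only one plausible way to combine retry-on-failure with reuse of a historical environment. It is not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StepResult:
    ok: bool
    log: str = ""
    image: Optional[str] = None  # tag of the image produced by a build step


def construct_environment(
    repo: str,
    history: Dict[str, str],
    plan: Callable[[Optional[str], Optional[str]], List[str]],
    execute: Callable[[Optional[str], List[str]], StepResult],
    verify: Callable[[str], StepResult],
    max_rounds: int = 3,
) -> Optional[str]:
    """Build an environment for `repo`, retrying with failure feedback."""
    # Reuse (assumed form): start from an earlier image of the same repo, if any.
    base = history.get(repo)
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        commands = plan(base, feedback)        # planning: choose setup commands
        built = execute(base, commands)        # execution: apply them on top of base
        if not built.ok or built.image is None:
            feedback = built.log               # feed the build error back to planning
            continue
        checked = verify(built.image)          # verification: run the test suite
        if checked.ok:
            history[repo] = built.image        # remember the image for later reuse
            return built.image
        feedback = checked.log                 # feed the test failure back to planning
    return None
```

In this sketch the three callables would be backed by LLM agents and a Docker runtime. Passing `base` into planning and execution is what lets a later task for the same repository start from an existing image and apply only the missing steps.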
Authors
Chuanzhe Guo: Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin, China
Jingjing Wu: Vis, Baidu Inc. (CV, MLLM)
Sijun He: Baidu Inc., Shenzhen, China
Yang Chen: Baidu Inc., Shenzhen, China
Zhaoqi Kuang: Baidu Inc., Shenzhen, China
Shilong Fan: Baidu Inc., Shenzhen, China
Bingjin Chen: Baidu Inc., Shenzhen, China
Siqi Bao: Baidu (Natural Language Processing, Medical Image Analysis)
Jing Liu: Baidu Inc. (Large Language Model, Information Retrieval, Agents)
Hua Wu: Baidu Inc., Shenzhen, China
Qingfu Zhu: Harbin Institute of Technology (NLP, Code LLM)
Wanxiang Che: Professor, Harbin Institute of Technology (Natural Language Processing)
Haifeng Wang: Baidu (NLP, MT, Search, Speech, Data Mining)