Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

πŸ“… 2025-10-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current software engineering agents face two key bottlenecks in environment configuration: (1) the absence of large-scale, high-quality benchmarks, and (2) overreliance on end-to-end success metrics, which hinders fine-grained failure diagnosis. Method: We introduce Enconda-bench, the first fine-grained diagnostic benchmark designed specifically for environment configuration. It constructs realistic multi-step execution trajectories by injecting real-world README errors and validating outcomes via Docker automation, covering planning, perception-driven diagnosis, feedback-driven repair, and execution. Crucially, it provides a process-level capability decomposition framework that moves beyond traditional black-box evaluation. Results: Experiments show that state-of-the-art agents can localize configuration errors but struggle to translate feedback into effective repairs. Evaluations across multiple leading LLMs and agent frameworks demonstrate that Enconda-bench substantially improves interpretability and pinpoints internal capability bottlenecks in environment configuration tasks.
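As a rough illustration of the task-construction recipe the summary describes, here is a minimal Python sketch of README error injection. The error taxonomy, function names, and corruption rules below are assumptions for exposition, not the authors' implementation.

```python
import random
import re

# Hypothetical sketch of README error injection. The paper injects realistic
# README errors to build task instances; the two injectors here are
# illustrative assumptions, not the benchmark's actual taxonomy.

def inject_version_error(readme: str) -> str:
    """Corrupt one pinned dependency version, e.g. 'torch==2.1.0' -> 'torch==9.9.9'."""
    return re.sub(r"(==)\d+(?:\.\d+)*", r"\g<1>9.9.9", readme, count=1)

def inject_missing_step(readme: str) -> str:
    """Delete one 'pip install' line to simulate an omitted setup step."""
    lines = readme.splitlines()
    targets = [i for i, line in enumerate(lines) if line.strip().startswith("pip install")]
    if targets:
        del lines[random.choice(targets)]
    return "\n".join(lines)

ERROR_INJECTORS = [inject_version_error, inject_missing_step]

def make_task_instance(readme: str) -> dict:
    """Pair a corrupted README with its clean original so a fix can be checked later."""
    injector = random.choice(ERROR_INJECTORS)
    return {
        "original": readme,
        "corrupted": injector(readme),
        "error_type": injector.__name__,
    }
```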

πŸ“ Abstract
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
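To make "process-level analysis combined with end-to-end executability" concrete, here is a hedged sketch of how per-stage trajectory scores might be aggregated. The stage names follow the abstract, but the data layout and metric are illustrative assumptions, not the benchmark's scoring code.

```python
from dataclasses import dataclass

# Illustrative aggregation of process-level scores alongside end-to-end
# executability; field names and the averaging scheme are assumptions.

@dataclass
class TrajectoryScores:
    planning_ok: bool   # did the agent produce a valid setup plan?
    diagnosis_ok: bool  # did it localize the injected README error?
    repair_ok: bool     # did it turn feedback into a correct fix?
    execution_ok: bool  # did the final configuration build/run end-to-end?

def capability_profile(runs: list[TrajectoryScores]) -> dict[str, float]:
    """Per-stage success rates; these expose bottlenecks (e.g. strong
    diagnosis but weak repair) that a single end-to-end rate would hide."""
    n = len(runs)
    return {
        "planning": sum(r.planning_ok for r in runs) / n,
        "diagnosis": sum(r.diagnosis_ok for r in runs) / n,
        "repair": sum(r.repair_ok for r in runs) / n,
        "end_to_end": sum(r.execution_ok for r in runs) / n,
    }
```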
Problem

Research questions and friction points this paper is trying to address.

Evaluating software engineering agents' process-level capabilities in environment configuration
Diagnosing where and why agents succeed or fail during environment setup
Assessing agents' ability to translate error feedback into effective corrections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process-level trajectory assessment for agent capabilities
Automated realistic error injection in README files
Docker-validated, scalable environment configuration evaluation (a minimal sketch follows this list)
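The sketch below shows one plausible shape for the Docker validation step: run a candidate setup script in a fresh container and treat a zero exit code as a correctly configured environment. The base image, script name, timeout, and success criterion are assumptions, not the authors' pipeline.

```python
import subprocess
from pathlib import Path

def validate_in_docker(repo_dir: str, setup_script: str,
                       image: str = "python:3.11-slim") -> bool:
    """Run a candidate setup script inside a fresh container; exit code 0
    is treated as a successfully configured environment."""
    workdir = Path(repo_dir).resolve()
    # Hypothetical script name; written next to the repository under test.
    (workdir / "setup_under_test.sh").write_text(setup_script)
    try:
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{workdir}:/workspace", "-w", "/workspace",
             image, "bash", "setup_under_test.sh"],
            capture_output=True, timeout=1800,
        )
    except subprocess.TimeoutExpired:
        return False  # hung installs count as failures
    return result.returncode == 0
```

One plausible use during benchmark construction: a task instance is kept only if the clean README's setup succeeds while the corrupted variant fails, confirming the injected error is actually observable.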
πŸ”Ž Similar Papers
No similar papers found.
Jiayi Kuang
Youtu-LLM Team, Tencent Youtu Lab
Yinghui Li
Youtu-LLM Team, Tencent Youtu Lab
Xin Zhang
Youtu-LLM Team, Tencent Youtu Lab
Yangning Li
Youtu-LLM Team, Tencent Youtu Lab
Di Yin
Tencent
LLM, NLP, MLLM
Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent
Ying Shen
Sun Yat-sen University
Philip S. Yu
Professor of Computer Science, University of Illinois at Chicago
Data mining, Database, Privacy