AI Summary
Current software engineering agents face two key bottlenecks in environment configuration: (1) the absence of large-scale, high-quality benchmarks, and (2) overreliance on end-to-end success metrics, which hinders fine-grained failure diagnosis. Method: We introduce Enconda-bench, the first fine-grained diagnostic benchmark designed specifically for environment configuration. It constructs realistic, multi-step execution trajectories by injecting real-world README-based errors and validating outcomes via Docker automation, covering planning, perception-driven diagnosis, feedback-driven repair, and execution. Crucially, it provides a process-level capability decomposition framework that moves beyond traditional black-box evaluation. Results: Experiments show that state-of-the-art agents reliably localize configuration errors but exhibit weak repair capabilities. Enconda-bench has been validated across multiple leading LLMs and agent frameworks, substantially improving interpretability and pinpointing internal capability bottlenecks in environment configuration tasks.
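The error-injection step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, the `bad_version_pin` error type, and the record schema are all assumptions; the real benchmark injects a wider range of README errors and validates the resulting setup inside Docker.

```python
# Hypothetical sketch of Enconda-bench-style task construction: corrupt a
# pinned dependency version in README setup instructions, keeping the
# ground-truth fix so downstream Docker validation can check repairs.
# All names here are illustrative assumptions, not the paper's schema.
import re

def inject_version_error(readme: str) -> tuple[str, dict]:
    """Return (broken_readme, record); record holds the ground-truth repair."""
    match = re.search(r"(\S+)==(\d+)(\.\d+)*", readme)
    if match is None:
        return readme, {}
    original = match.group(0)
    # Bump the major version to a (likely nonexistent) release.
    broken = f"{match.group(1)}=={int(match.group(2)) + 90}.0.0"
    return readme.replace(original, broken, 1), {
        "error_type": "bad_version_pin",
        "broken": broken,
        "ground_truth": original,
    }

readme = "Install deps:\n\n    pip install torch==2.1.0 numpy\n"
broken_readme, record = inject_version_error(readme)
```

An agent evaluated on this instance would first need to perceive that the pinned version cannot be installed, then repair the README line back toward the recorded ground truth.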
Abstract
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and execution of the final environment configuration. Our task instances are constructed automatically by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
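The contrast between process-level and end-to-end evaluation can be made concrete with a small sketch. Assuming a trajectory is annotated step by step against ground truth (the `Step` dataclass and stage names below are illustrative assumptions, not the paper's schema), per-stage scores expose exactly the failure pattern the abstract describes: diagnosis succeeds while repair fails, so end-to-end success alone would report only a single failure.

```python
# Illustrative process-level scoring: score each stage of a setup trajectory
# separately instead of reporting only the end-to-end outcome. The stage
# names and Step fields are assumptions for the sketch, not the paper's API.
from dataclasses import dataclass

@dataclass
class Step:
    stage: str      # "planning" | "diagnosis" | "repair" | "execution"
    correct: bool   # did this step match the ground-truth annotation?

def capability_scores(trajectory: list[Step]) -> dict[str, float]:
    """Per-stage accuracy over a trajectory; absent stages score 0.0."""
    stages = ("planning", "diagnosis", "repair", "execution")
    scores = {}
    for stage in stages:
        steps = [s for s in trajectory if s.stage == stage]
        scores[stage] = (
            sum(s.correct for s in steps) / len(steps) if steps else 0.0
        )
    return scores

traj = [
    Step("planning", True),
    Step("diagnosis", True),   # agent localizes the injected README error
    Step("repair", False),     # ...but fails to turn feedback into a fix
    Step("execution", False),
]
```

Here `capability_scores(traj)` separates a perfect diagnosis score from a zero repair score, whereas an aggregate end-to-end metric would collapse both into "failed".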