AI Summary
Current software engineering agents face two key bottlenecks in environment configuration: (1) the absence of large-scale, high-quality benchmarks, and (2) overreliance on end-to-end success metrics, which hinders fine-grained failure diagnosis. Method: We introduce Enconda-bench, the first fine-grained diagnostic benchmark designed specifically for environment configuration. It constructs realistic, multi-step execution trajectories by injecting real-world README-based errors and validating outcomes via Docker automation, covering planning, perception-driven diagnosis, feedback-driven repair, and execution. Crucially, it provides a process-level capability decomposition framework that moves beyond traditional black-box evaluation. Results: Experiments show that state-of-the-art agents reliably localize configuration errors but exhibit weak repair capabilities. Enconda-bench has been validated across multiple leading LLMs and agent frameworks, substantially improving interpretability and pinpointing internal capability bottlenecks in environment configuration tasks.
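The error-injection step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, the `bad_version_pin` error type, and the record schema are all assumptions; the real benchmark injects a wider range of README errors and validates the resulting setup inside Docker.

```python
# Hypothetical sketch of Enconda-bench-style task construction: corrupt a
# pinned dependency version in README setup instructions, keeping the
# ground-truth fix so downstream Docker validation can check repairs.
# All names here are illustrative assumptions, not the paper's schema.
import re

def inject_version_error(readme: str) -> tuple[str, dict]:
    """Return (broken_readme, record); record holds the ground-truth repair."""
    match = re.search(r"(\S+)==(\d+)(\.\d+)*", readme)
    if match is None:
        return readme, {}
    original = match.group(0)
    # Bump the major version to a (likely nonexistent) release.
    broken = f"{match.group(1)}=={int(match.group(2)) + 90}.0.0"
    return readme.replace(original, broken, 1), {
        "error_type": "bad_version_pin",
        "broken": broken,
        "ground_truth": original,
    }

readme = "Install deps:\n\n    pip install torch==2.1.0 numpy\n"
broken_readme, record = inject_version_error(readme)
```

An agent evaluated on this instance would first need to perceive that the pinned version cannot be installed, then repair the README line back toward the recorded ground truth.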
Abstract
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and execution of the final environment configuration. Our task instances are constructed automatically by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
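The contrast between process-level and end-to-end evaluation can be made concrete with a small sketch. Assuming a trajectory is annotated step by step against ground truth (the `Step` dataclass and stage names below are illustrative assumptions, not the paper's schema), per-stage scores expose exactly the failure pattern the abstract describes: diagnosis succeeds while repair fails, so end-to-end success alone would report only a single failure.

```python
# Illustrative process-level scoring: score each stage of a setup trajectory
# separately instead of reporting only the end-to-end outcome. The stage
# names and Step fields are assumptions for the sketch, not the paper's API.
from dataclasses import dataclass

@dataclass
class Step:
    stage: str      # "planning" | "diagnosis" | "repair" | "execution"
    correct: bool   # did this step match the ground-truth annotation?

def capability_scores(trajectory: list[Step]) -> dict[str, float]:
    """Per-stage accuracy over a trajectory; absent stages score 0.0."""
    stages = ("planning", "diagnosis", "repair", "execution")
    scores = {}
    for stage in stages:
        steps = [s for s in trajectory if s.stage == stage]
        scores[stage] = (
            sum(s.correct for s in steps) / len(steps) if steps else 0.0
        )
    return scores

traj = [
    Step("planning", True),
    Step("diagnosis", True),   # agent localizes the injected README error
    Step("repair", False),     # ...but fails to turn feedback into a fix
    Step("execution", False),
]
```

Here `capability_scores(traj)` separates a perfect diagnosis score from a zero repair score, whereas an aggregate end-to-end metric would collapse both into "failed".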