Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

📅 2025-11-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current LLM-based autonomous agents face two key bottlenecks in end-to-end software development evaluation: (1) existing benchmarks lack realism, failing to reflect genuine development complexity; and (2) architectural heterogeneity impedes fair, controlled comparisons. To address these, we propose E2EDevBench—the first dynamic benchmark grounded in authentic software development workflows—and a hybrid evaluation framework integrating functional testing with LLM-driven requirement consistency verification. By standardizing the underlying agent architecture to control confounding variables, our approach enables fine-grained, comparable empirical analysis of critical design dimensions—including task decomposition and multi-agent collaboration. Experimental results show that state-of-the-art agents satisfy only ~50% of functional requirements, primarily due to requirement omission and insufficient self-verification capability. This work establishes a reproducible, empirically grounded evaluation paradigm to advance agent capabilities in requirement understanding, planning, and verification.

📝 Abstract

The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges, including overly simplistic benchmarks and the difficulty of conducting fair comparisons between different agent architectures due to confounding implementation variables. To address these limitations, we first construct a challenging and dynamically curated E2EDevBench to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we conduct a controlled empirical study on three representative agent architectures implemented upon a unified foundation to isolate the impact of workflow design. Our findings reveal that state-of-the-art agents can fulfill approximately 50% of requirements on E2EDevBench, but their success is critically dependent on the architectural strategy for task decomposition and collaboration. Furthermore, our analysis indicates that the primary bottleneck is the omission of requirements and inadequate self-verification. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents, guiding future research toward enhancing requirement comprehension and planning.
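
To make the hybrid evaluation idea concrete, here is a minimal sketch of how a test-based score and a per-requirement verdict could be reported side by side. It is not the paper's implementation: the names `HybridReport`, `evaluate`, and the `judge` callable are hypothetical, and `judge` stands in for whatever LLM-based verifier decides whether a requirement is satisfied. How the paper actually aggregates its scores is not stated in this summary, so the two rates are kept separate rather than combined.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class HybridReport:
    """Illustrative container; field names are assumptions, not from the paper."""
    test_pass_rate: float      # fraction of functional test cases that passed
    requirement_rate: float    # fraction of requirements the judge accepted
    omitted: list[str]         # requirements the judge marked as unmet


def evaluate(
    test_results: Sequence[bool],
    requirements: Sequence[str],
    judge: Callable[[str], bool],
) -> HybridReport:
    """Report a functional-test score next to a requirement-level score.

    `judge` is a hypothetical stand-in for an LLM-based verifier that
    returns True when a requirement is judged to be implemented.
    """
    # Functional side: plain pass rate over the executed test cases.
    test_rate = sum(test_results) / len(test_results) if test_results else 0.0

    # Requirement side: ask the judge about each requirement individually,
    # so unmet requirements can be listed rather than only counted.
    omitted = [req for req in requirements if not judge(req)]
    met = len(requirements) - len(omitted)
    req_rate = met / len(requirements) if requirements else 0.0

    return HybridReport(test_rate, req_rate, omitted)


if __name__ == "__main__":
    # Toy judge for demonstration only: accepts a fixed set of requirements.
    implemented = {"user login"}
    report = evaluate(
        test_results=[True, True, False, True],
        requirements=["user login", "export to CSV"],
        judge=lambda req: req in implemented,
    )
    print(report)
```

Keeping the list of unmet requirements, rather than only a rate, mirrors the abstract's point that requirement omission is a main failure mode: it shows which requirements were missed, not just how many.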

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based agents in realistic software development scenarios

Addressing simplistic benchmarks and unfair architectural comparisons

Identifying requirement omission and self-verification as key bottlenecks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed E2EDevBench for realistic development scenarios

Proposed hybrid evaluation framework combining functional and LLM-based verification

Conducted controlled study on agent architectures to isolate workflow impact

🔎 Similar Papers

Large Language Model-Based Agents for Software Engineering: A Survey