ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of AI programming agents in end-to-end software project development, as existing benchmarks predominantly focus on problem-level code repair. To bridge this gap, we introduce ProjDevBench—the first multidimensional benchmark specifically designed to assess full-cycle software development capabilities. It encompasses 20 tasks spanning eight categories and integrates both automated online judging and large language model–assisted code review to holistically evaluate agents across system architecture design, functional correctness, and iterative refinement. Experiments on six state-of-the-art LLM-based coding agents reveal an overall pass rate of only 27.38%, highlighting significant deficiencies in complex system design, time complexity optimization, and resource management. These findings underscore the critical role of ProjDevBench in filling the current void in comprehensive, end-to-end AI programming evaluation.

📝 Abstract
Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.
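The abstract does not spell out how the 27.38% acceptance rate is aggregated. As an illustration only, an OJ-style harness typically marks a task "accepted" when every hidden test case passes, then reports the fraction of accepted tasks. Everything below (`TaskResult`, the field names, the sample data) is a hypothetical sketch, not code from the ProjDevBench repository.

```python
# Hypothetical sketch of an OJ-style acceptance-rate computation.
# Names and data are illustrative assumptions, not taken from ProjDevBench.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    category: str
    passed: int   # hidden test cases passed
    total: int    # hidden test cases in the task

def accepted(result: TaskResult) -> bool:
    # OJ convention: a task counts as accepted only if all test cases pass.
    return result.total > 0 and result.passed == result.total

def acceptance_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks fully accepted, as a percentage.
    if not results:
        return 0.0
    return 100.0 * sum(accepted(r) for r in results) / len(results)

# Fabricated results for one agent, to show the calculation.
demo = [
    TaskResult("t01", "data-structures", 10, 10),
    TaskResult("t02", "system-design", 4, 12),
    TaskResult("t03", "algorithms", 8, 8),
    TaskResult("t04", "resource-management", 0, 6),
]
print(f"{acceptance_rate(demo):.2f}%")  # 2 of 4 tasks accepted -> 50.00%
```

Under this all-or-nothing convention, partially passing tasks (like `t02` above) contribute nothing, which is consistent with the paper's framing of a single overall acceptance rate across its 20 tasks.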
Problem

Research questions and friction points this paper is trying to address.

AI coding agents
end-to-end project development
benchmark
system architecture design
code evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end project development
coding agents
benchmark
system architecture design
LLM-assisted code review
Pengrui Lu
University of California, Merced, CA, USA; Shanghai Jiao Tong University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China
Shiqi Zhang
Shanghai Jiao Tong University, Shanghai, China
Yunzhong Hou
Australian National University
Computer Vision · Machine Learning
Lyumanshan Ye
Shanghai Jiao Tong University
Human-Computer Interaction
Chaoyi Huang
Shanghai Jiao Tong University, Shanghai, China
Zixi Chen
Shanghai Jiao Tong University, Shanghai, China
Ji Zeng
Shanghai Jiao Tong University, Shanghai, China
Hantao Jiang
Shanghai Jiao Tong University, Shanghai, China
Pengfei Liu
Associate professor at Shanghai Jiao Tong University
LLM
Yiwei Wang
University of California at Merced
Natural Language Processing · Vision Language Models
Ming-Hsuan Yang
University of California at Merced; Google DeepMind
Computer Vision · Machine Learning · Artificial Intelligence