🤖 AI Summary
This study addresses the lack of empirical research on the architectural integrity and maintainability of large-scale software systems generated by AI-powered integrated development environments (IDEs). The authors propose a Feature-Driven Human-in-the-Loop (FD-HITL) framework that integrates structured requirement specifications, code generation via the Cursor AI IDE, static analysis tools (SonarQube and CodeScene), and manual expert evaluation to systematically assess design quality. Applied across ten cross-domain projects averaging 16,965 lines of code and 114 files each, the approach achieved a 91% functional correctness rate yet uncovered 4,498 design flaws in total (1,305 reported by CodeScene across nine categories and 3,193 by SonarQube across eleven). These defects broadly violated core software engineering principles, such as the Single Responsibility Principle (SRP), Separation of Concerns (SoC), and Don't Repeat Yourself (DRY), manifesting as code duplication, excessive complexity, large methods, and inadequate error handling. The work highlights significant structural risks in AI-generated systems beyond functional correctness and demonstrates how FD-HITL enhances controllability and design quality.
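To make the flagged principles concrete, here is a minimal, hypothetical Python sketch (not drawn from the study's dataset) of the kind of defect static analyzers report, followed by a refactoring that restores SRP and DRY:

```python
# Hypothetical illustration: the first function mixes validation,
# persistence, and notification in one place (an SRP violation);
# inlining the email check at every call site would also violate DRY.

def register_user_bad(db, name, email):
    if "@" not in email:                  # validation, inlined
        raise ValueError("invalid email")
    db["users"].append({"name": name, "email": email})  # persistence
    return f"Welcome, {name}!"            # notification text, inlined

# Refactored: one responsibility per function, and the shared
# validation logic lives in a single place.
def validate_email(email):
    if "@" not in email:
        raise ValueError("invalid email")

def save_user(db, name, email):
    db["users"].append({"name": name, "email": email})

def register_user(db, name, email):
    validate_email(email)
    save_user(db, name, email)
    return f"Welcome, {name}!"
```

The refactored version is what tools like SonarQube and CodeScene push toward: smaller units with a single reason to change, and duplicated checks hoisted into one helper.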
📝 Abstract
A new generation of AI coding tools, including AI-powered IDEs equipped with agentic capabilities, can generate code within the context of an entire project. These AI IDEs are increasingly perceived as capable of producing project-level code at scale. However, there is limited empirical evidence on the extent to which they can generate large-scale software systems and on the design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large-scale projects and to evaluate the design quality of the projects it generates. First, we propose a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated project descriptions. Using Cursor with the FD-HITL framework, we generated 10 projects across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects with two static analysis tools, CodeScene and SonarQube, to detect design issues. CodeScene identified 1,305 design issues in 9 categories, and SonarQube identified 3,193 issues in 11 categories. Our findings show that (1) when used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues, and Accessibility Issues; (4) these design issues violate design principles such as SRP, SoC, and DRY. The replication package is available at https://github.com/Kashifraz/DIinAGP