🤖 AI Summary
Existing benchmarks lack a dedicated framework for evaluating code generation models' ability to implement novel functionality within real-world codebases. Method: The paper introduces FEA-Bench, a repository-level benchmark for new feature implementation, built from pull requests collected across 83 GitHub repositories; each task instance pairs code changes with relevant unit tests, requiring models to jointly generate new components and edit existing code across files. Rule-based and intent-based filtering isolates tasks focused on incremental feature development, and test pairing ensures every solution is verifiable. Results: State-of-the-art LLMs perform substantially worse on FEA-Bench than on conventional code generation benchmarks, exposing fundamental limitations in multi-file contextual reasoning and requirement-code alignment.
📝 Abstract
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts of the repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
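The construction pipeline described above (collect PRs, filter for feature intent, pair with unit tests, verify) can be sketched in Python. This is a minimal illustration, not the paper's implementation: the `TaskInstance` fields, the keyword heuristic in `is_feature_task`, and the helper names are all hypothetical stand-ins for the benchmark's actual rule-based filtering and verification machinery.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical schema for one FEA-Bench-style task instance."""
    repo: str                      # source GitHub repository
    pr_title: str                  # pull request title used for intent filtering
    feature_description: str       # natural-language requirement given to the LLM
    edited_files: list = field(default_factory=list)   # files the gold patch touches
    unit_tests: list = field(default_factory=list)     # paired tests that must pass


def is_feature_task(pr_title: str) -> bool:
    # Toy rule-based filter: keep PRs whose titles signal a new feature,
    # and drop obvious bug fixes. The real benchmark combines rule-based
    # and intent-based filtering; these keywords are illustrative only.
    title = pr_title.lower()
    feature_words = ("add", "feature", "support", "implement")
    return any(w in title for w in feature_words) and "fix" not in title


def is_solved(test_results: dict, task: TaskInstance) -> bool:
    # Test-driven verification: a candidate solution counts as correct
    # only if every unit test paired with the task passes.
    return all(test_results.get(t, False) for t in task.unit_tests)


task = TaskInstance(
    repo="example/repo",
    pr_title="Add support for streaming responses",
    feature_description="Allow the client to consume responses incrementally.",
    edited_files=["client.py", "transport.py"],
    unit_tests=["test_streaming_basic", "test_streaming_cancel"],
)
```

A task is admitted by the filter (`is_feature_task(task.pr_title)` is true here) and judged solved only when all of its paired tests pass, which is what makes the benchmark's evaluation verifiable rather than reliant on textual similarity to the gold patch.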