FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation models lack dedicated benchmarks for evaluating their ability to implement novel functionality within real-world codebases. Method: We introduce FEA-Bench, the first repository-level benchmark for new feature implementation, constructed from real pull requests across 83 GitHub repositories; each task comprises changed code and corresponding unit tests, requiring models to jointly generate new components and edit existing code across files. We define and evaluate LLM capabilities in incremental feature development (requirement understanding, cross-file collaborative editing, and test-driven verification) and propose intent-driven task filtering and verifiable test pairing. Results: Experiments show substantial performance degradation of state-of-the-art LLMs on FEA-Bench, exposing fundamental limitations in multi-file contextual reasoning and requirement-code alignment.

📝 Abstract
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts of the repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
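The abstract describes pairing each task's code changes with unit test files so a candidate solution can be verified by running the tests. A minimal sketch of that test-driven verification loop is below; the function name `verify_candidate`, the file names `feature.py`/`test_feature.py`, and the toy `slugify` feature are all illustrative stand-ins, not details from the paper.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path


def verify_candidate(feature_code: str, test_code: str) -> bool:
    """Write a model-generated feature implementation next to its paired
    unit tests and report whether the tests pass (exit code 0).

    A toy stand-in for benchmark-style test-driven verification; real
    harnesses would apply a multi-file patch to a repository checkout.
    """
    with tempfile.TemporaryDirectory() as repo:
        Path(repo, "feature.py").write_text(textwrap.dedent(feature_code))
        Path(repo, "test_feature.py").write_text(textwrap.dedent(test_code))
        result = subprocess.run(
            [sys.executable, "-m", "unittest", "test_feature"],
            cwd=repo,
            capture_output=True,
        )
        return result.returncode == 0


# A trivial "new feature" and its paired unit test.
candidate = """
    def slugify(title):
        return title.strip().lower().replace(" ", "-")
"""
tests = """
    import unittest
    from feature import slugify

    class TestSlugify(unittest.TestCase):
        def test_basic(self):
            self.assertEqual(slugify("  New Feature "), "new-feature")

    if __name__ == "__main__":
        unittest.main()
"""

print(verify_candidate(candidate, tests))  # → True
```

An incorrect candidate (say, one that returns the title unchanged) would fail the paired test and be scored as unresolved, which is what makes each task instance automatically verifiable.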
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to implement new features at the repository level
Assesses code completion and editing in incremental development
Highlights challenges in automated repository-level code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FEA-Bench evaluates repository-level code generation.
Uses pull requests and unit tests for validation.
Assesses LLMs' code completion and editing abilities.
Wei Li
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Xin Zhang
Microsoft Research Asia
Zhongxin Guo
Microsoft Research Asia
Shaoguang Mao
Technical Staff, Moonshot.AI
Wen Luo
Peking University
Guangyue Peng
Peking University
Yangyu Huang
Microsoft Research Asia
Houfeng Wang
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Scarlett Li
Microsoft Research Asia