FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation models lack dedicated benchmarks for evaluating their ability to implement novel functionality within real-world codebases. Method: We introduce FEA-Bench, the first repository-level benchmark for new feature implementation, constructed from real pull requests across 83 GitHub repositories; each task comprises changed code and corresponding unit tests, requiring models to jointly generate new components and edit existing code across files. We define and evaluate LLM capabilities in incremental feature development (requirement understanding, cross-file collaborative editing, and test-driven verification) and propose intent-driven task filtering and verifiable test pairing. Results: Experiments show substantial performance degradation of state-of-the-art LLMs on FEA-Bench, exposing fundamental limitations in multi-file contextual reasoning and requirement-code alignment.

📝 Abstract
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts of the repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
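The abstract describes pairing each task's code changes with unit test files so a candidate solution can be verified by running the tests. A minimal sketch of that test-driven verification loop is below; the function name `verify_candidate`, the file names `feature.py`/`test_feature.py`, and the toy `slugify` feature are all illustrative stand-ins, not details from the paper.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path


def verify_candidate(feature_code: str, test_code: str) -> bool:
    """Write a model-generated feature implementation next to its paired
    unit tests and report whether the tests pass (exit code 0).

    A toy stand-in for benchmark-style test-driven verification; real
    harnesses would apply a multi-file patch to a repository checkout.
    """
    with tempfile.TemporaryDirectory() as repo:
        Path(repo, "feature.py").write_text(textwrap.dedent(feature_code))
        Path(repo, "test_feature.py").write_text(textwrap.dedent(test_code))
        result = subprocess.run(
            [sys.executable, "-m", "unittest", "test_feature"],
            cwd=repo,
            capture_output=True,
        )
        return result.returncode == 0


# A trivial "new feature" and its paired unit test.
candidate = """
    def slugify(title):
        return title.strip().lower().replace(" ", "-")
"""
tests = """
    import unittest
    from feature import slugify

    class TestSlugify(unittest.TestCase):
        def test_basic(self):
            self.assertEqual(slugify("  New Feature "), "new-feature")

    if __name__ == "__main__":
        unittest.main()
"""

print(verify_candidate(candidate, tests))  # → True
```

An incorrect candidate (say, one that returns the title unchanged) would fail the paired test and be scored as unresolved, which is what makes each task instance automatically verifiable.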
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to implement new features at the repository level
Assesses code completion and editing in incremental development
Highlights challenges in automated repository-level code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FEA-Bench evaluates repository-level code generation.
Uses pull requests and unit tests for validation.
Assesses LLMs' code completion and editing abilities.
Wei Li
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Xin Zhang
Microsoft Research Asia
Zhongxin Guo
Microsoft Research Asia
Shaoguang Mao
Technical Staff, Moonshot.AI
Wen Luo
Peking University
Guangyue Peng
Peking University
Yangyu Huang
Microsoft Research Asia
Houfeng Wang
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Scarlett Li
Microsoft Research Asia