Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

📅 2024-10-29
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Existing code generation benchmarks (e.g., HumanEval, MBPP) predominantly evaluate single-function completion, failing to reflect the contextual complexity and functional rigor of real-world software development.

Method: We introduce REPOCOD, a novel benchmark derived from 11 real-world open-source projects, comprising 980 tasks—58.3% of which require cross-file or cross-repository context—and featuring average solution lengths of 331.6 tokens and cyclomatic complexity of 9.00. Each task is accompanied by an average of 313.5 developer-written, executable test cases.

Contribution/Results: REPOCOD establishes the first evaluation paradigm integrating real-project grounding, high-context dependency, and strong functional validation. Experiments across ten state-of-the-art LLMs reveal a maximum pass@1 score of only 29.7%, substantially lower than performance on conventional benchmarks—demonstrating a significant capability gap in handling realistic software engineering tasks.
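For readers unfamiliar with the metric, pass@1 is the probability that a single sampled completion passes all of a task's tests. The summary does not show how it is computed; a minimal sketch of the standard unbiased pass@k estimator popularized by the HumanEval line of work (illustrative, not REPOCOD's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn (without replacement) from n sampled completions is correct,
    given that c of the n samples passed all tests."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))
```

With k = 1 the estimator reduces to the empirical pass rate c/n, which is the quantity reported in the results above.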

📝 Abstract
Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. Thus, a natural question arises: do LLMs achieve code completion performance comparable to that of human developers? Unfortunately, one cannot answer this question using existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use poor code correctness metrics, providing misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, with more than 58% of them requiring file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none of the models achieve more than 30% pass@1 on REPOCOD, indicating the necessity of building stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether LLMs' high benchmark scores reflect real-world coding performance
Addressing the lack of realistic, cross-file dependencies in existing benchmarks
Assessing repository-level code generation capabilities of LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

REPOCOD benchmark for real-world coding tasks
Includes 980 whole-function generation tasks
Retrieval-augmented generation improves performance
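The page does not describe how retrieval augmentation supplies repository context to the model. As an illustration only, here is a toy retriever that ranks repository snippets by identifier overlap with the target function's signature; all names and the similarity measure are hypothetical, not REPOCOD's actual pipeline (which may use dense embeddings or sparse retrieval such as BM25):

```python
import re

def tokenize(code: str) -> set[str]:
    """Extract the set of identifiers from a code snippet."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def retrieve_context(query: str, corpus: dict[str, str], top_k: int = 2) -> list[str]:
    """Return names of the top_k corpus snippets ranked by Jaccard
    similarity between identifier sets (a crude stand-in for real
    retrieval). Retrieved snippets would be prepended to the prompt."""
    q = tokenize(query)
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q & tokenize(kv[1])) / max(1, len(q | tokenize(kv[1]))),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

# Hypothetical repository snippets keyed by file path.
repo = {
    "io_utils.py": "def load_data(path): return read_csv(path)",
    "viz.py": "def plot(x): show(x)",
}
print(retrieve_context("def load_data(path): ...", repo, top_k=1))
```

The point of the sketch is the shape of the pipeline, retrieve relevant cross-file snippets, then condition generation on them, which is what allows models to handle the 58.3% of REPOCOD tasks that need context beyond the current function.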