🤖 AI Summary
Existing code generation benchmarks (e.g., HumanEval, MBPP) predominantly evaluate single-function completion and fail to reflect the contextual complexity and functional rigor of real-world software development. Method: We introduce REPOCOD, a benchmark derived from 11 real-world open-source projects, comprising 980 tasks, 58.3% of which require file-level or repository-level context, with an average solution length of 331.6 tokens and an average cyclomatic complexity of 9.00. Each task is accompanied by an average of 313.5 developer-written, executable test cases. Contribution/Results: REPOCOD establishes an evaluation setting that combines real-project grounding, high context dependency, and strong functional validation. Experiments on ten state-of-the-art LLMs show a best pass@1 of only 29.7%, substantially lower than performance on conventional benchmarks, revealing a significant capability gap on realistic software engineering tasks.
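The pass@1 figures above follow the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks: generate n samples per task, count the c that pass the tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch (function name and example numbers are illustrative, not from REPOCOD):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct.

    Returns 1 - C(n-c, k) / C(n, k); for k=1 this reduces to c / n.
    """
    if n - c < k:
        # Every possible draw of k samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 samples pass the tests, so pass@1 is 3/10.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

With k=1 and a single sample per task, the estimator is simply the fraction of tasks whose generated solution passes all developer-written tests.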
📝 Abstract
Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. A natural question, then, is whether LLMs achieve code completion performance comparable to that of human developers. Unfortunately, one cannot answer this question using existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use weak code correctness metrics, which can lead to misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. REPOCOD also has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) among existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none achieves more than 30% pass@1 on REPOCOD, indicating the need for stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD.
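Cyclomatic complexity, the metric the abstract uses to characterize solution difficulty, counts the number of independent paths through a function: roughly 1 plus the number of branch points. A rough sketch of this idea using Python's `ast` module (this approximation, and the sample functions, are illustrative; it is not the tool used by the REPOCOD authors):

```python
import ast

# Approximate McCabe cyclomatic complexity as 1 + the number of
# decision points (branching constructs and boolean operators).
DECISION_NODES = (ast.If, ast.For, ast.While,
                  ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

simple = "def f(x):\n    return x + 1\n"
branchy = (
    "def g(x):\n"
    "    if x > 0:\n"
    "        return 1\n"
    "    elif x < 0:\n"
    "        return -1\n"
    "    return 0\n"
)
print(cyclomatic_complexity(simple))   # 1 (straight-line code)
print(cyclomatic_complexity(branchy))  # 3 (two branch points)
```

An average complexity of 9.00 thus means REPOCOD's canonical solutions contain around eight branch points each, far more control flow than typical single-function benchmark solutions.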