🤖 AI Summary
Existing code generation benchmarks (e.g., HumanEval, MBPP) predominantly evaluate single-function completion and fail to reflect the contextual complexity and functional rigor of real-world software development. Method: We introduce REPOCOD, a benchmark derived from 11 real-world open-source projects, comprising 980 tasks, 58.3% of which require file-level or repository-level context, with an average solution length of 331.6 tokens and an average cyclomatic complexity of 9.00. Each task is accompanied by an average of 313.5 developer-written, executable test cases. Contribution/Results: REPOCOD establishes an evaluation setting that combines real-project grounding, high context dependency, and strong functional validation. Experiments on ten state-of-the-art LLMs show a best pass@1 of only 29.7%, substantially lower than performance on conventional benchmarks, revealing a significant capability gap on realistic software engineering tasks.
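The pass@1 figures above follow the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks: generate n samples per task, count the c that pass the tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch (function name and example numbers are illustrative, not from REPOCOD):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct.

    Returns 1 - C(n-c, k) / C(n, k); for k=1 this reduces to c / n.
    """
    if n - c < k:
        # Every possible draw of k samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 samples pass the tests, so pass@1 is 3/10.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

With k=1 and a single sample per task, the estimator is simply the fraction of tasks whose generated solution passes all developer-written tests.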
📝 Abstract
Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. A natural question, then, is whether LLMs achieve code completion performance comparable to that of human developers. Unfortunately, one cannot answer this question using existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use weak code correctness metrics, which can lead to misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. REPOCOD also has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) among existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none achieves more than 30% pass@1 on REPOCOD, indicating the need for stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD.
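Cyclomatic complexity, the metric the abstract uses to characterize solution difficulty, counts the number of independent paths through a function: roughly 1 plus the number of branch points. A rough sketch of this idea using Python's `ast` module (this approximation, and the sample functions, are illustrative; it is not the tool used by the REPOCOD authors):

```python
import ast

# Approximate McCabe cyclomatic complexity as 1 + the number of
# decision points (branching constructs and boolean operators).
DECISION_NODES = (ast.If, ast.For, ast.While,
                  ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

simple = "def f(x):\n    return x + 1\n"
branchy = (
    "def g(x):\n"
    "    if x > 0:\n"
    "        return 1\n"
    "    elif x < 0:\n"
    "        return -1\n"
    "    return 0\n"
)
print(cyclomatic_complexity(simple))   # 1 (straight-line code)
print(cyclomatic_complexity(branchy))  # 3 (two branch points)
```

An average complexity of 9.00 thus means REPOCOD's canonical solutions contain around eight branch points each, far more control flow than typical single-function benchmark solutions.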