RepoMasterEval: Evaluating Code Completion via Real-World Repositories

📅 2024-08-07
🏛️ arXiv.org
📈 Citations: 6
✨ Influential: 0
🤖 AI Summary
Existing code completion benchmarks primarily focus on function- or class-level completion, rely on manually crafted textual prompts, and fail to reflect realistic development scenarios, where completions are fine-grained (e.g., inline or block-level) and lack natural-language descriptions. Method: We propose the first benchmark tailored to real-world development: it samples complete file structures from open-source Python and TypeScript repositories and employs dynamic context-aware masking. We introduce a novel mutation-testing-driven test case augmentation mechanism to rigorously assess functional correctness, complemented by human validation and a multi-model comparative evaluation framework. Contribution/Results: In an evaluation of six state-of-the-art models, the benchmark's automated scores show high consistency with actual developer performance, and a one-month industrial deployment confirms that its feedback is accurate enough to guide model iteration and optimization.
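The masking step in the summary can be pictured with a short sketch. The following Python is a minimal, hypothetical illustration of turning one source file into a fill-in-the-middle benchmark datum; the names (BenchmarkDatum, mask_snippet) and the naive random span selection are placeholders, since the paper's dynamic context-aware masking is not specified at this level of detail.

```python
import random
from dataclasses import dataclass

@dataclass
class BenchmarkDatum:
    prefix: str        # file content before the masked span
    suffix: str        # file content after the masked span
    ground_truth: str  # the masked snippet the model must reproduce
    test_dir: str      # test suite used to judge functional correctness

def mask_snippet(source: str, test_dir: str,
                 min_lines: int = 2, max_lines: int = 8) -> BenchmarkDatum:
    """Mask a contiguous block of lines from a source file, keeping the
    surrounding code as the completion context (no natural-language prompt).
    Assumes the file has more than min_lines lines."""
    lines = source.splitlines(keepends=True)
    span = random.randint(min_lines, min(max_lines, len(lines) - 1))
    start = random.randint(0, len(lines) - span)
    return BenchmarkDatum(
        prefix="".join(lines[:start]),
        suffix="".join(lines[start + span:]),
        ground_truth="".join(lines[start:start + span]),
        test_dir=test_dir,
    )
```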

📝 Abstract
With the growing reliance on automated code completion tools in software development, the need for robust evaluation benchmarks has become critical. However, existing benchmarks focus on code generation tasks at the function and class level and provide rich textual descriptions to prompt the model. By contrast, such descriptive prompts are commonly unavailable in real development, and code completion can occur in a wider range of situations, such as in the middle of a function or a code block. These limitations make the evaluation poorly aligned with the practical scenarios of code completion tools. In this paper, we propose RepoMasterEval, a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories. Each benchmark datum is generated by masking a code snippet (the ground truth) in a source code file that is covered by existing test suites. To improve the test accuracy on model-generated code, we employ mutation testing to measure the effectiveness of the test cases, and we manually craft new test cases for those test suites with low mutation scores. Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical to the accuracy of the benchmark and that RepoMasterEval is able to report differences in model performance in real-world scenarios. A one-month deployment of RepoMasterEval at a collaborating company also revealed that the benchmark gives accurate feedback during model training and that its score correlates highly with the model's performance in practice. Based on our findings, we call on the software engineering community to build more LLM benchmarks tailored to code generation tools that take the practical and complex development environment into consideration.
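The abstract's mutation-testing step can be sketched concretely: inject small faults (mutants) into the code under test, run the existing tests, and compute the fraction of mutants the tests catch; a low score flags a suite that needs manual augmentation. This is only an illustration under assumptions the paper does not confirm: the operator list is a tiny hand-rolled subset, and the pytest invocation stands in for whatever test runner each repository actually uses.

```python
import re
import subprocess
from pathlib import Path

# A tiny, hand-rolled set of mutation operators (pattern -> replacement).
# Real mutation tools apply far more operators than this illustrative subset.
MUTATION_OPERATORS = [
    (r"==", "!="),
    (r"<=", "<"),
    (r"\+", "-"),
    (r"\bTrue\b", "False"),
]

def tests_pass(repo_dir: Path) -> bool:
    """Run the repository's test suite; True iff every test passes."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def mutation_score(repo_dir: Path, target_file: Path) -> float:
    """Fraction of mutants killed (detected) by the test suite.
    A low score signals weak tests that need manual augmentation."""
    original = target_file.read_text()
    killed = total = 0
    for pattern, replacement in MUTATION_OPERATORS:
        mutant, n_subs = re.subn(pattern, replacement, original, count=1)
        if n_subs == 0:
            continue  # this operator does not apply to the file
        total += 1
        target_file.write_text(mutant)
        if not tests_pass(repo_dir):
            killed += 1  # the tests caught the injected fault
        target_file.write_text(original)  # restore before the next mutant
    return killed / total if total else 0.0
```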
Problem

Research questions and friction points this paper is trying to address.

Evaluating code completion tools using real-world repository scenarios
Addressing limitations of existing descriptive-prompt-based evaluation benchmarks
Improving test accuracy through mutation testing and manual augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed benchmark from real-world repositories, with completions scored by running each repository's test suite (see the sketch after this list)
Employed mutation testing for test effectiveness
Manually augmented test cases for accuracy
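A sketch of the scoring step referenced above, under the same assumptions as the earlier snippets: the model's completion is spliced into the masked file and judged by whether the repository's (augmented) test suite passes. evaluate_completion is a hypothetical helper and pytest stands in for the per-repository test runner; the paper's actual harness, including its TypeScript side, is not described at this level.

```python
import subprocess
from pathlib import Path

def evaluate_completion(repo_dir: Path, source_rel: str,
                        prefix: str, completion: str, suffix: str) -> bool:
    """Pass/fail functional correctness: splice the model's completion into
    the masked file and run the repository's (augmented) test suite."""
    target = repo_dir / source_rel
    original = target.read_text()
    try:
        target.write_text(prefix + completion + suffix)
        result = subprocess.run(["python", "-m", "pytest", "-q"],
                                cwd=repo_dir, capture_output=True)
        return result.returncode == 0
    finally:
        target.write_text(original)  # always restore the repository state
```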
Authors
Qinyun Wu
ByteDance, Beijing, China
Chao Peng
ByteDance, Beijing, China
Pengfei Gao
ByteDance, Beijing, China
Ruida Hu
Harbin Institute of Technology, Shenzhen, China
Haoyu Gan
ByteDance, Beijing, China
Bo Jiang
ByteDance, Shenzhen, China
Jinhe Tang
ByteDance, Beijing, China
Zhiwen Deng
ByteDance, Hangzhou, China
Zhanming Guan
ByteDance, Beijing, China
Cuiyun Gao
Harbin Institute of Technology, Shenzhen, China
Xia Liu
ByteDance, Shenzhen, China
Ping Yang
ByteDance, Beijing, China