🤖 AI Summary
Existing code benchmarks predominantly focus on code generation and neglect the editing tasks that dominate real-world software development, such as debugging, translation, polishing, and requirement switching.
Method: We introduce CodeEditorBench, the first comprehensive benchmark for evaluating the code editing capabilities of large language models (LLMs) across the software development lifecycle. It spans four task categories, multiple programming languages, and varying difficulty levels. Our methodology centers on editing as the core operation, featuring a multi-source heterogeneous task design and a prompt-sensitivity analysis framework. We employ manually curated, challenging examples, standardized prompt templates, and multidimensional evaluation metrics to systematically assess 19 state-of-the-art LLMs.
Contribution/Results: Results reveal substantial performance gaps, with closed-source models (GPT-4, Gemini-Ultra) significantly outperforming open-source counterparts. To foster reproducibility and further advancement, we fully open-source all prompts, datasets, and evaluation tools, enabling rigorous research and iterative improvement of code editing capabilities.
📝 Abstract
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks that focus solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4) outperform open-source models on CodeEditorBench, highlighting differences in model performance based on problem type and prompt sensitivity. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.
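To make the debugging-style editing task concrete, here is a minimal, hypothetical sketch of a pass/fail harness in the spirit of CodeEditorBench's evaluation: a model receives buggy code, proposes an edited version, and the edit is judged by executing it against unit tests. All names (`evaluate_edit`, `solve`, the sample tests) are our own illustrative assumptions, not the benchmark's actual format or tooling.

```python
# Hypothetical sketch: judge a model's code edit by running unit tests.
# Names and task format are illustrative, not CodeEditorBench's real API.

def evaluate_edit(edited_code: str, test_cases: list[tuple]) -> bool:
    """Execute an edited solution and check it against all test cases."""
    namespace: dict = {}
    exec(edited_code, namespace)  # load the (possibly fixed) function
    solve = namespace["solve"]
    return all(solve(*args) == expected for args, expected in test_cases)

# Buggy original: off-by-one, range(n) excludes n itself.
buggy = "def solve(n):\n    return sum(range(n))"
# A model's proposed edit that fixes the bug.
edited = "def solve(n):\n    return sum(range(n + 1))"

tests = [((3,), 6), ((1,), 1), ((0,), 0)]
print(evaluate_edit(buggy, tests))   # False: the bug is still present
print(evaluate_edit(edited, tests))  # True: the edit passes all tests
```

A real harness would additionally sandbox execution and enforce time limits, but the core pass/fail logic of editing-as-evaluation is the same.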