CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

📅 2024-04-04
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
Existing code benchmarks predominantly focus on code generation, neglecting critical editing tasks prevalent in real-world software development, such as debugging, translation, optimization, and requirement switching.

Method: We introduce the first comprehensive benchmark for evaluating large language models' (LLMs') code editing capabilities across the full software development lifecycle. It encompasses four task categories, multiple programming languages, and varying difficulty levels. Our methodology centers on editing as the core operation, featuring a multi-source heterogeneous task design and a prompt-sensitivity analysis framework. We employ manually curated challenging examples, standardized prompt templates, and multidimensional evaluation metrics to systematically assess 19 state-of-the-art LLMs.

Contribution/Results: Results reveal substantial performance gaps, with closed-source models (GPT-4, Gemini-Ultra) significantly outperforming open-source counterparts. To foster reproducibility and advancement, we fully open-source all prompts, datasets, and evaluation tools, enabling rigorous research and iterative improvement of code editing capabilities.

📝 Abstract
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4) outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.
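To make the task framing concrete, a debug-style editing instance can be sketched as: the model receives buggy source code and must emit an edited version, which is then checked by executing it against test cases. The instance layout and the `passes_tests` helper below are illustrative assumptions for this summary, not CodeEditorBench's actual data format or harness.

```python
# Hypothetical sketch of a debug-style code-editing instance and its check.
# The schema and helper names are assumptions, not the benchmark's own format.

buggy_code = """
def average(xs):
    return sum(xs) / (len(xs) - 1)   # bug: off-by-one denominator
"""

edited_code = """
def average(xs):
    return sum(xs) / len(xs)         # fixed denominator
"""

# Each case pairs an input list with its expected average.
test_cases = [([2, 4, 6], 4.0), ([5], 5.0)]

def passes_tests(source: str, cases) -> bool:
    """Execute a candidate edit and run it against the test cases."""
    namespace = {}
    exec(source, namespace)          # load the candidate program
    fn = namespace["average"]
    try:
        return all(abs(fn(xs) - want) < 1e-9 for xs, want in cases)
    except Exception:                # runtime errors count as failure
        return False

print(passes_tests(buggy_code, test_cases))   # False
print(passes_tests(edited_code, test_cases))  # True
```

Execution-based checking of this kind is what distinguishes editing benchmarks from string-match metrics: the model's edit is judged by behavior, not by textual similarity to a reference patch.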
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' code editing performance in real-world scenarios
Evaluating diverse code editing tasks like debugging and translating
Comparing closed-source and open-source LLMs on editing capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CodeEditorBench for LLM evaluation
Covers diverse real-world coding scenarios
Evaluates 19 LLMs on editing tasks
👥 Authors
Jiawei Guo
BUPT & M-A-P
LLM, MLLM
Ziming Li
HKUST
Xueling Liu
Multimodal Art Projection Research Community
Kaijing Ma
Fudan University
Computer Vision, Machine Learning
Tianyu Zheng
M-A-P & TikTok Researcher
LLM
Zhouliang Yu
The SphereLab, CUHK
Reinforcement Learning, LLM, Formal AI
Ding Pan
HKUST
Yizhi Li
University of Manchester, M-A-P
LLM, Reasoning, Post-training, Computational Music
Ruibo Liu
RS @ Google DeepMind
ASI
Yue Wang
Multimodal Art Projection Research Community
Shuyue Guo
Multimodal Art Projection Research Community
Xingwei Qu
HKUST, University of Manchester
Xiang Yue
Carnegie Mellon University
Natural Language Processing, Large Language Models, Machine Learning
Ge Zhang
Multimodal Art Projection Research Community, University of Waterloo, Vector Institute
Wenhu Chen
Assistant Professor at University of Waterloo
Natural Language Processing, Artificial Intelligence, Deep Learning
Jie Fu
Multimodal Art Projection Research Community, HKUST