RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code debugging benchmarks predominantly focus on function-level repair, neglecting realistic repository-level scenarios; moreover, current repository-level datasets suffer from severe limitations in task diversity, language coverage, and error-type breadth. To address this gap, we introduce RepoDebug, the first multi-task, multi-language, repository-level debugging benchmark, comprising 8 programming languages, 3 debugging task categories, and 22 fine-grained error subtypes, constructed from real-world open-source repositories via a hybrid of human annotation and automated processing. We design a comprehensive, multi-dimensional evaluation protocol and systematically assess 10 state-of-the-art large language models (LLMs). Our results reveal that even the strongest model, Claude 3.5 Sonnet, achieves only modest performance on repository-level repair. This work fills a critical gap in evaluating LLMs' practical debugging capabilities in realistic software engineering contexts and establishes a new standard for industry-relevant model assessment.
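To make the benchmark's structure concrete, here is a minimal Python sketch of one plausible layout for a single RepoDebug-style instance. The field names, task label, and error-subtype value are illustrative assumptions, not RepoDebug's actual release schema, which is not shown on this page.

from dataclasses import dataclass

# Hypothetical layout of one repository-level debugging instance.
# All field names and example values are assumptions for illustration,
# not RepoDebug's actual data format.
@dataclass
class DebugInstance:
    repo: str            # source open-source repository
    language: str        # one of the 8 supported languages
    task: str            # one of the 3 debugging task categories
    error_subtype: str   # one of the 22 fine-grained error subtypes
    buggy_files: dict    # path -> contents of files containing the bug
    context_files: dict  # path -> contents of surrounding repository context
    gold_fix: dict       # path -> repaired contents, used for scoring

example = DebugInstance(
    repo="example/json-parser",        # hypothetical repository
    language="Java",
    task="automatic program repair",   # assumed task name
    error_subtype="off-by-one index",  # assumed subtype name
    buggy_files={"src/Parser.java": "..."},
    context_files={"src/Lexer.java": "..."},
    gold_fix={"src/Parser.java": "..."},
)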

📝 Abstract
Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce developers' time consumption and enhance their efficiency. Significant advancements in debugging datasets have been made to promote the development of code debugging. However, these datasets primarily focus on assessing LLMs' function-level code repair capabilities, neglecting the more complex and realistic repository-level scenarios, which leads to an incomplete understanding of LLMs' challenges in repository-level debugging. While several repository-level datasets have been proposed, they often suffer from limitations such as limited diversity of tasks, languages, and error types. To mitigate this challenge, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset with 22 subtypes of errors that supports 8 commonly used programming languages and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnet, the best-performing model, still fails to perform well in repository-level debugging.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs on repository-level debugging tasks (a scoring sketch follows this list)
Addresses limitations in task and language diversity
Identifies challenges in complex multi-error code repair
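As a concrete illustration of what such an evaluation can look like, the following minimal Python sketch aggregates per-language and per-task accuracy from judged model outputs. The record format and the boolean exact-match notion of "correct" are assumptions; the paper's multi-dimensional protocol may use different metrics.

from collections import defaultdict

# Minimal sketch: aggregate accuracy per (language, task) cell, the kind of
# breakdown a multi-language, multi-task benchmark reports. The input record
# format and the boolean 'correct' field are illustrative assumptions.
def aggregate(results):
    totals = defaultdict(lambda: [0, 0])  # (language, task) -> [n_correct, n_total]
    for r in results:
        key = (r["language"], r["task"])
        totals[key][0] += int(r["correct"])
        totals[key][1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in totals.items()}

# Usage with two hypothetical judged outputs:
scores = aggregate([
    {"language": "Python", "task": "repair", "correct": True},
    {"language": "Java",   "task": "repair", "correct": False},
])
print(scores)  # {('Python', 'repair'): 1.0, ('Java', 'repair'): 0.0}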
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-language repository-level debugging dataset
Supports 8 programming languages and 3 tasks
Includes 22 subtypes of diverse errors
Jingjing Liu
School of Computer Science and Engineering, Beihang University, Beijing, China
Zeming Liu
School of Computer Science and Engineering, Beihang University, Beijing, China
Zihao Cheng
School of Computer Science and Engineering, Beihang University, Beijing, China
Mengliang He
East China Normal University, Shanghai, China
Xiaoming Shi
East China Normal University, Shanghai, China
Yuhang Guo
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Xiangrong Zhu
PhD Student
Human-computer interaction (HCI)
Yuanfang Guo
Beihang University
Multimedia security, AI security, Graph Neural Networks, Multimedia processing
Yunhong Wang
Professor, School of Computer Science and Engineering, Beihang University
Biometrics, Pattern Recognition, Image Processing, Computer Vision
Haifeng Wang
Baidu Inc., Beijing, China