🤖 AI Summary
Existing programming benchmarks are heavily biased toward Python and lack comprehensive evaluation of multilingual debugging capabilities. Method: We introduce MDEVAL, the first large-scale multilingual code debugging benchmark comprising 3.6K samples across 18 programming languages, covering automatic program repair, code review, and defect identification. We propose xDebugGen, a cross-language defect injection framework, to construct MDEVAL-INSTRUCT—a dedicated instruction-tuning dataset—and train xDebugCoder, a specialized multilingual debugging model capable of modeling language-specific defects (e.g., Rust ownership violations, C memory errors). Our approach integrates syntax-aware defect modeling, multi-task evaluation, and synthetic-data-driven instruction fine-tuning. Contribution/Results: Experiments reveal that leading open-source models significantly underperform proprietary models (e.g., GPT, Claude) on multilingual debugging tasks. MDEVAL establishes a standardized evaluation platform and provides a strong baseline model, advancing research in multilingual intelligent code debugging.
📝 Abstract
Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are limited in language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose MDEVAL, the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Furthermore, we introduce the debugging instruction corpus MDEVAL-INSTRUCT, constructed by injecting bugs into correct multilingual queries and solutions via xDebugGen. We then train xDebugCoder, a multilingual debugger, on MDEVAL-INSTRUCT as a strong baseline specifically designed to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting substantial room for improvement in multilingual code debugging scenarios.
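To make the defect-injection idea concrete, here is a minimal sketch of how a cross-language bug-injection step might pair correct and buggy code. The rule shown (dropping a `mut` keyword from a correct Rust snippet to create a "Missing Mut" defect) and the function name `inject_missing_mut` are hypothetical illustrations of the general approach, not the actual xDebugGen implementation.

```python
import re

def inject_missing_mut(rust_snippet: str) -> str:
    """Illustrative injection rule (hypothetical): remove the first
    `mut` from a `let mut` binding in a correct Rust snippet,
    producing a 'Missing Mut' compile-time defect."""
    return re.sub(r"\blet mut\b", "let", rust_snippet, count=1)

# A correct Rust solution and its injected buggy counterpart.
correct = 'fn main() { let mut total = 0; total += 1; println!("{}", total); }'
buggy = inject_missing_mut(correct)

# The (buggy, correct) pair then forms one instruction-tuning sample.
sample = {"task": "APR", "language": "Rust", "buggy": buggy, "fixed": correct}
```

Each injected defect yields a supervised pair: the model sees the buggy snippet and is trained to recover the original correct solution.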