🤖 AI Summary
Existing programming benchmarks are heavily biased toward Python and lack comprehensive evaluation of multilingual debugging capabilities. Method: We introduce MDEVAL, the first large-scale multilingual code debugging benchmark comprising 3.6K samples across 18 programming languages, covering automatic program repair, code review, and defect identification. We propose xDebugGen, a cross-language defect injection framework, to construct MDEVAL-INSTRUCT—a dedicated instruction-tuning dataset—and train xDebugCoder, a specialized multilingual debugging model capable of modeling language-specific defects (e.g., Rust ownership violations, C memory errors). Our approach integrates syntax-aware defect modeling, multi-task evaluation, and synthetic-data-driven instruction fine-tuning. Contribution/Results: Experiments reveal that leading open-source models significantly underperform proprietary models (e.g., GPT, Claude) on multilingual debugging tasks. MDEVAL establishes a standardized evaluation platform and provides a strong baseline model, advancing research in multilingual intelligent code debugging.
📝 Abstract
Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are limited in language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose MDEVAL, the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Furthermore, we introduce the debugging instruction corpus MDEVAL-INSTRUCT, constructed by injecting bugs into correct multilingual queries and solutions via xDebugGen. We then train xDebugCoder, a multilingual debugger, on MDEVAL-INSTRUCT as a strong baseline specifically designed to handle bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting substantial room for improvement in multilingual code debugging scenarios.
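To make the defect-injection idea concrete, here is a minimal sketch of how a cross-language bug-injection step might pair correct and buggy code. The rule shown (dropping a `mut` keyword from a correct Rust snippet to create a "Missing Mut" defect) and the function name `inject_missing_mut` are hypothetical illustrations of the general approach, not the actual xDebugGen implementation.

```python
import re

def inject_missing_mut(rust_snippet: str) -> str:
    """Illustrative injection rule (hypothetical): remove the first
    `mut` from a `let mut` binding in a correct Rust snippet,
    producing a 'Missing Mut' compile-time defect."""
    return re.sub(r"\blet mut\b", "let", rust_snippet, count=1)

# A correct Rust solution and its injected buggy counterpart.
correct = 'fn main() { let mut total = 0; total += 1; println!("{}", total); }'
buggy = inject_missing_mut(correct)

# The (buggy, correct) pair then forms one instruction-tuning sample.
sample = {"task": "APR", "language": "Rust", "buggy": buggy, "fixed": correct}
```

Each injected defect yields a supervised pair: the model sees the buggy snippet and is trained to recover the original correct solution.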