🤖 AI Summary
To address the weak semantic understanding and markedly inferior end-to-end debugging performance of small-scale LLMs (e.g., 7B-parameter models), this work makes three contributions. (1) DEBUGEVAL, the first multi-granularity debugging benchmark, covering the bug localization, bug identification, code repair, and code recognition stages and systematically exposing the limitations of 7B models in deep semantic reasoning. (2) COAST, a multi-agent collaborative data-synthesis framework that integrates role-based prompting, program analysis, and execution-feedback-driven iterative generation to autonomously produce high-quality, diverse debugging data. (3) Supervised fine-tuning of a 7B model on COAST-generated data, which yields substantial gains over both human-annotated and GPT-4-synthesized data on DEBUGEVAL, approaching the debugging performance of GPT-3.5 and markedly narrowing the gap with large models. Together, these results establish the first scalable, interpretable multi-agent paradigm for debugging-data synthesis.
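The summary does not spell out COAST's internals, but its description (role-based prompting plus execution-feedback-driven iterative generation) maps naturally onto a simple agent loop. The sketch below is illustrative only: `call_llm`, the role prompts, and the keep/discard policy are assumptions for exposition, not the paper's actual implementation.

```python
import subprocess
import sys
import tempfile

def call_llm(role_prompt: str, user_prompt: str) -> str:
    """Placeholder chat-completion call; wire in a real LLM client here."""
    raise NotImplementedError

def run_with_tests(code: str, tests: str, timeout: int = 10) -> bool:
    """Run candidate code plus assert-style tests in a subprocess;
    a zero exit code serves as the execution-feedback 'pass' signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, timeout=timeout)
    return result.returncode == 0

def synthesize_sample(task: str, max_rounds: int = 3):
    """One hypothetical COAST-style round-trip: generate a correct solution
    and tests, inject a bug, then iteratively repair with execution feedback.
    Samples that never pass their tests are discarded rather than kept for SFT."""
    correct = call_llm("You write correct, well-tested Python.", task)
    tests = call_llm("You write plain assert-based unit tests.", task)
    buggy = call_llm("You inject one realistic bug into the given code.", correct)
    feedback = ""
    for _ in range(max_rounds):
        fix = call_llm("You locate and repair the bug in the given code.",
                       buggy + feedback)
        if run_with_tests(fix, tests):  # execution feedback gates data quality
            return {"buggy": buggy, "fixed": fix, "tests": tests}
        feedback = "\n# The previous repair failed its tests; try again."
    return None  # discard samples that never converge
```

The key design point this sketch captures is that an executed test suite, not a single LLM's self-assessment, decides which synthesized debugging samples survive into the SFT corpus.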
📝 Abstract
Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in code generation tasks. Human debugging typically follows a multi-stage process comprising Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks focus predominantly on the Code Repair stage, which offers only a limited perspective for evaluating the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark that evaluates the debugging abilities of LLMs by emulating the multi-stage human debugging process. Evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform their larger counterparts, highlighting their limitations in comprehending code semantics. Motivated by this, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5.
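To make the four-stage setup concrete, here is a minimal, hypothetical representation of a DEBUGEVAL-style task instance. The field names, schema, and toy example are assumptions for illustration, not the benchmark's actual data format.

```python
from dataclasses import dataclass

# The four DEBUGEVAL stages, as named in the abstract.
TASKS = ("bug_localization", "bug_identification",
         "code_repair", "code_recognition")

@dataclass
class DebugSample:
    task: str        # one of TASKS
    buggy_code: str  # the program under inspection
    question: str    # the task-specific query posed to the model
    answer: str      # gold label: a line number, bug type, patch, or verdict

# A toy bug-localization instance: the off-by-one on line 2 is the fault.
sample = DebugSample(
    task="bug_localization",
    buggy_code="def mean(xs):\n    return sum(xs) / len(xs) + 1",
    question="Which line contains the bug?",
    answer="line 2",
)
print(sample.task, "->", sample.answer)
```

Framing all four stages over the same (buggy code, question, answer) shape is what lets a single benchmark probe localization and identification, not just end-to-end repair.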