From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 13

✨ Influential: 2

🤖 AI Summary

Current large language model (LLM)-based code generators suffer from limited test pass rates, particularly on complex tasks, because errors at different granularities, from syntax violations to algorithmic flaws, are hard to localize and repair and often require manual debugging. To address this, we propose MGDebugger, the first multi-granularity debugger featuring a hierarchical debugging paradigm: it decomposes code into a tree-structured hierarchy of subfunctions and isolates and repairs errors bottom-up. MGDebugger integrates an LLM-simulated Python executor that traces execution and tracks key variable states to pinpoint errors. It further combines hierarchical decomposition, iterative debugging, multi-stage prompt engineering, and automated subfunction unit test generation. On HumanEval, MGDebugger improves accuracy by 18.9% over the seed generations; on HumanEvalFix, it attains a 97.6% repair success rate. It also fixes bugs across different categories and difficulty levels, indicating robustness on complex programming tasks.

📝 Abstract

While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked by subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
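The bottom-up order described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `SubfunctionNode`, `debug_bottom_up`, and the `passes_tests` and `repair` callables are hypothetical names, and the toy "repair" below is a string replacement standing in for an LLM call. The sketch only shows that children are debugged before their parent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubfunctionNode:
    """One node in the (hypothetical) subfunction tree."""
    name: str
    source: str
    children: List["SubfunctionNode"] = field(default_factory=list)


def debug_bottom_up(
    node: SubfunctionNode,
    passes_tests: Callable[[SubfunctionNode], bool],
    repair: Callable[[SubfunctionNode], str],
    max_attempts: int = 3,
    order: Optional[List[str]] = None,
) -> List[str]:
    """Debug children first (post-order), then the node itself.

    Returns the names of the nodes in the order they were processed.
    """
    if order is None:
        order = []
    for child in node.children:
        debug_bottom_up(child, passes_tests, repair, max_attempts, order)
    attempts = 0
    # Retry the repair step until the node's tests pass or attempts run out.
    while not passes_tests(node) and attempts < max_attempts:
        node.source = repair(node)
        attempts += 1
    order.append(node.name)
    return order


# Toy stand-ins (assumptions): a node "fails" while its source contains "BUG".
def toy_passes_tests(node: SubfunctionNode) -> bool:
    return "BUG" not in node.source


def toy_repair(node: SubfunctionNode) -> str:
    return node.source.replace("BUG", "")


tokenize = SubfunctionNode("tokenize", "split BUG")
parse = SubfunctionNode("parse", "build tree", [tokenize])
compute = SubfunctionNode("compute", "sum BUG values")
root = SubfunctionNode("main", "parse then compute", [parse, compute])

order = debug_bottom_up(root, toy_passes_tests, toy_repair)
print(order)
```

Processing leaves first means that by the time a parent is examined, the subfunctions it calls have already been checked, so a failure at the parent is more likely to come from the parent's own logic.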

Problem

Research questions and friction points this paper is trying to address.

Addresses subtle errors in LLM-generated code requiring human fixes

Targets bugs at multiple levels of granularity, from syntax errors to algorithmic flaws

Improves debugging accuracy and repair rates in generated code

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical tree structure for multi-granularity debugging

LLM-simulated Python executor for error tracing

Bottom-up iterative bug resolution approach
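To make the idea of tracking variable states concrete, here is a minimal sketch that records local variables line by line. It deliberately swaps in a different technique: the paper's executor is simulated by an LLM, whereas this example runs the code for real and uses Python's standard `sys.settrace` hook. `trace_variable_states` and `running_sum` are hypothetical names chosen for illustration.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple


def trace_variable_states(
    func: Callable[..., Any], *args: Any
) -> Tuple[Any, List[Tuple[int, Dict[str, Any]]]]:
    """Run func(*args) and record its locals before each line executes.

    Each record is (line offset from the `def` line, snapshot of locals).
    """
    states: List[Tuple[int, Dict[str, Any]]] = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        # Only trace frames that belong to the target function.
        if frame.f_code is not target_code:
            return None
        if event == "line":
            offset = frame.f_lineno - target_code.co_firstlineno
            states.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Restore whatever trace function was active before.
        sys.settrace(previous)
    return result, states


def running_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total


result, states = trace_variable_states(running_sum, [1, 2, 3])
for offset, snapshot in states:
    print(offset, snapshot)
```

A debugger can compare such snapshots against the values a test case expects at each step; the first snapshot that diverges points to the line where the error is introduced.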
