DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing large language model (LLM)-based program repair approaches rely mostly on outcome-level failure signals, such as stack traces, and have no visibility into intermediate runtime states, which often leads to inaccurate root-cause localization and incorrect patches. The paper proposes DebugRepair, a self-directed debugging framework that brings simulated debugging into the LLM repair pipeline. DebugRepair purifies test semantics to reduce noisy test context, uses simulated instrumentation to insert targeted debugging statements (with a rule-based fallback) that capture runtime evidence, and then runs a debugging-driven, multi-turn conversation that progressively refines candidate patches. The method is evaluated on three benchmarks across Java and Python against 15 approaches. On Defects4J, it correctly fixes 224 bugs with GPT-3.5, outperforming prior state-of-the-art LLM-based methods by 26.2%, and 295 bugs with DeepSeek-V3, 59 more than the second-best baseline. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings.

📝 Abstract

Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.
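The debugging-driven repair loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single Python function under test, uses `sys.settrace` as a stand-in for simulated instrumentation, and replaces the LLM with a hypothetical `stub_model` callable. The names `Attempt`, `run_test_with_trace`, `repair`, and `stub_model` are all invented for illustration.

```python
import sys
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One repair turn: candidate source, test outcome, observed runtime state."""
    source: str
    passed: bool
    trace: List[str]


def run_test_with_trace(source: str, func_name: str, args: tuple, expected) -> Tuple[bool, List[str]]:
    """Run one test and record the target function's locals when it exits."""
    trace: List[str] = []

    def tracer(frame, event, arg):
        if frame.f_code.co_name != func_name:
            return None  # do not trace unrelated frames
        if event == "return":  # also fires (with arg=None) when an exception propagates
            trace.append(f"return={arg!r} locals={dict(frame.f_locals)!r}")
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        namespace: dict = {}
        exec(source, namespace)  # toy setting only: never exec untrusted code
        passed = namespace[func_name](*args) == expected
    except Exception as exc:
        trace.append(f"raised {type(exc).__name__}: {exc}")
        passed = False
    finally:
        sys.settrace(previous)
    return passed, trace


def repair(buggy_source: str, func_name: str, args: tuple, expected,
           propose_patch: Callable[[List[Attempt]], str],
           max_turns: int = 3) -> Tuple[Optional[str], List[Attempt]]:
    """Multi-turn loop: each new candidate sees all prior attempts and their traces."""
    history: List[Attempt] = []
    candidate = buggy_source
    for _ in range(max_turns):
        passed, trace = run_test_with_trace(candidate, func_name, args, expected)
        history.append(Attempt(candidate, passed, trace))
        if passed:
            return candidate, history
        candidate = propose_patch(history)
    return None, history


BUGGY = """
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) - 1)
"""


def stub_model(history: List[Attempt]) -> str:
    # Hypothetical stand-in for an LLM call; a real model would read the trace
    # (total=12 over 3 items, yet 6.0 returned) and infer the wrong divisor.
    return history[-1].source.replace("(len(xs) - 1)", "len(xs)")


patch, history = repair(BUGGY, "mean", ([2, 4, 6],), 4.0, stub_model)
```

In this sketch the first turn fails and records the intermediate value of `total`, which a stack trace alone would not expose; the second turn passes with the patched divisor. A real pipeline would additionally need test purification and sandboxed execution, neither of which is modeled here.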

Problem

Research questions and friction points this paper is trying to address.

Automated Program Repair

Large Language Models

Runtime State

Debugging

Program Repair

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-directed debugging

runtime state evidence

simulated instrumentation