🤖 AI Summary
Existing code-generation agents rely primarily on static analysis or trial-and-error testing for bug repair and struggle to leverage runtime debugging information effectively. This work proposes Debug2Fix, a novel framework that, for the first time, deeply integrates an interactive debugger into a coding agent. Through a sub-agent architecture, Debug2Fix coordinates the debugging and repair processes, allowing the agent to pinpoint and fix complex bugs with human-like precision based on runtime program state. Evaluated on the GitBug-Java and SWE-Bench-Live benchmarks, the approach outperforms baseline methods by over 20% and substantially boosts the repair capabilities of weaker models (such as GPT-5 and Claude Haiku 4.5) to match or surpass stronger counterparts like Claude Sonnet 4.5, demonstrating that thoughtful tool design can effectively compensate for inherent model limitations.
📝 Abstract
While coding agents have made significant progress in automating many aspects of software development, there remains substantial room for improvement in their bug-fixing capabilities. Debugging and the investigation of runtime behavior remain largely a manual, developer-driven process. Popular coding agents typically rely on either static analysis of the code or iterative test-fix cycles, which amount to trial-and-error debugging. We posit that there is a wealth of rich runtime information that developers routinely access while debugging, which agents are currently deprived of due to design limitations. Despite how prevalent debuggers are in modern IDEs and command-line tools, they have surprisingly not made their way into coding agents. In this work, we introduce Debug2Fix, a novel framework that incorporates interactive debugging as a core component of a software engineering agent via a subagent architecture. We integrate debuggers for Java and Python into our agent framework, evaluate on GitBug-Java and SWE-Bench-Live, and achieve over 20% improvement in performance compared to the baseline for certain models. Furthermore, with our framework, weaker models such as GPT-5 and Claude Haiku 4.5 match or exceed the performance of stronger models like Claude Sonnet 4.5, showing that better tool design is often just as important as switching to a more expensive model. Finally, we conduct systematic ablations demonstrating the importance of both the subagent architecture and the debugger integration.
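The "rich runtime information" the abstract refers to can be illustrated with a minimal Python sketch. This is not the Debug2Fix implementation; it is a toy example using the standard-library `sys.settrace` hook (the same machinery underlying Python's `pdb`) to capture local variables while a buggy function runs, exposing the kind of program state an interactive debugger would surface to an agent:

```python
import sys

def buggy_average(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) - 1)  # bug: divisor should be len(xs)

snapshots = []

def tracer(frame, event, arg):
    # Record line number and local variables for each line executed
    # inside buggy_average, mimicking a debugger stepping through it.
    if frame.f_code.co_name == "buggy_average" and event == "line":
        snapshots.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

sys.settrace(tracer)
result = buggy_average([2, 4, 6])
sys.settrace(None)

print(result)  # 6.0 rather than the correct 4.0
```

Inspecting `snapshots` shows `total` reaching 12 before the division, so the faulty divisor (2 instead of 3) is directly observable at runtime, whereas a purely static reading of the code offers no such evidence.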