🤖 AI Summary
Existing LLM-based automated program repair (APR) approaches suffer from limited code-context understanding and incomplete test-suite coverage, often yielding "Draft Patches" that are partially correct or overfitted to the tests. To address this, the authors propose Refine, a patch refinement framework with three core components: (1) disambiguation of vague issue descriptions and code context to improve defect localization; (2) test-time scaling to generate diverse candidate patches; and (3) an LLM-driven code review mechanism that aggregates partially correct patches and refines them into complete, correct fixes. Refine is a modular component compatible with mainstream APR systems. On SWE-Bench Lite, Refine achieves a 51.67% resolution rate, boosting AutoCodeRover by 14.67%; it improves the resolution rate on SWE-Bench Verified by 12.2% and yields an average 14% improvement when integrated into multiple APR systems.
📝 Abstract
Large Language Models (LLMs) have recently shown strong potential in automatic program repair (APR), especially in repository-level settings where the goal is to generate patches from natural language issue descriptions, large codebases, and regression tests. Despite this promise, current LLM-based APR techniques often fail to produce correct fixes due to limited understanding of code context and over-reliance on incomplete test suites. As a result, they frequently generate Draft Patches: partially correct patches that either incompletely address the bug or overfit to the test cases. In this work, we propose a novel patch refinement framework, Refine, that systematically transforms Draft Patches into correct ones. Refine addresses three key challenges: disambiguating vague issue and code context, diversifying patch candidates through test-time scaling, and aggregating partial fixes via an LLM-powered code review process. We implement Refine as a general refinement module that can be integrated into both agent-based and workflow-based APR systems. Our evaluation on the SWE-Bench Lite benchmark shows that Refine achieves state-of-the-art results among workflow-based approaches and approaches the best-known performance across all APR categories. Specifically, Refine boosts AutoCodeRover's performance by 14.67%, achieving a score of 51.67% and surpassing all prior baselines. On SWE-Bench Verified, Refine improves the resolution rate by 12.2%, and when integrated across multiple APR systems, it yields an average improvement of 14%, demonstrating its broad effectiveness and generalizability. These results highlight refinement as a missing component in current APR pipelines and the potential of agentic collaboration to close the gap between near-correct and correct patches. We also open-source our code.
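The three stages the abstract names (disambiguation, test-time scaling, review-based aggregation) can be pictured as a simple pipeline. The sketch below is purely illustrative: every function name and data structure is a hypothetical stand-in (the paper's actual prompts, agents, and interfaces are not shown), and the LLM calls are replaced with trivial placeholders so the control flow runs end to end.

```python
# Hypothetical sketch of a Refine-style pipeline; all names are
# illustrative assumptions, not the paper's actual API.

def disambiguate(issue: str, code_context: str) -> str:
    """Stage 1: resolve vague issue text against the code context.
    Placeholder: in the real system an LLM would rewrite the issue."""
    return f"{issue.strip()} [context: {code_context.strip()}]"

def generate_candidates(spec: str, n: int = 3) -> list[str]:
    """Stage 2: test-time scaling, i.e. sampling several diverse
    candidate (draft) patches for the clarified problem."""
    return [f"patch-{i}: fix for {spec}" for i in range(n)]

def review_and_aggregate(candidates: list[str]) -> str:
    """Stage 3: LLM-powered code review that merges partially
    correct drafts. Placeholder: pick one candidate deterministically."""
    return max(candidates, key=len)

def refine(issue: str, code_context: str) -> str:
    """Run all three stages and return a single refined patch."""
    spec = disambiguate(issue, code_context)
    drafts = generate_candidates(spec)
    return review_and_aggregate(drafts)
```

In the real framework each placeholder would be an LLM (or multi-agent) step, and the aggregation stage would combine complementary partial fixes rather than select a single draft.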