🤖 AI Summary
This study addresses the lack of systematic comparison among architectural paradigms for large language model (LLM)-driven automated program repair systems, a gap that prevents practitioners from making informed trade-offs among effectiveness, cost, and reliability. To this end, we establish a unified benchmark and evaluation framework and conduct the first controlled comparative experiment across four representative architectures: fixed workflow, single-agent system, multi-agent system, and general-purpose code agent. Our results demonstrate that architectural design and iteration depth, rather than model capability alone, govern repair performance. The general-purpose code agent achieves the best overall results by leveraging a universal tool interface, while the other architectures exhibit distinct trade-offs: fixed workflows are efficient yet brittle, single-agent systems balance flexibility and cost, and multi-agent approaches offer strong generalization at the expense of high computational overhead and susceptibility to reasoning drift.
📝 Abstract
Large language models (LLMs) have shown promise for automated patching, but their effectiveness depends strongly on how they are integrated into patching systems. While prior work explores prompting strategies and individual agent designs, the field lacks a systematic comparison of patching architectures. In this paper, we present a controlled evaluation of four LLM-based patching paradigms -- fixed workflow, single-agent system, multi-agent system, and general-purpose code agent -- using a unified benchmark and evaluation framework. We analyze patch correctness, failure modes, token usage, and execution time across real-world vulnerability tasks. Our results reveal clear architectural trade-offs: fixed workflows are efficient but brittle, single-agent systems balance flexibility and cost, and multi-agent designs improve generalization at the expense of substantially higher overhead and an increased risk of reasoning drift on complex tasks. Surprisingly, general-purpose code agents achieve the strongest overall patching performance, benefiting from universal tool interfaces that support effective adaptation across vulnerability types. Overall, we show that architectural design and iteration depth, rather than model capability alone, dominate the reliability and cost of LLM-based automated patching.