🤖 AI Summary
This work investigates how large language model (LLM)-based coding agents repair multi-hunk bugs: real-world defects that require coordinated fixes across multiple non-contiguous code regions. It presents the first behavioral analysis framework grounded in repair trajectories, together with a fine-grained metric suite that systematically evaluates localization precision, repair accuracy, and computational cost for Claude Code, Codex, Gemini-cli, and Qwen Code on 372 multi-hunk bugs. Key findings: repair accuracy ranges from 25.8% to 93.3% and declines as defect dispersion increases; agents do not fail fast, so unsuccessful attempts incur substantially higher inference overhead; and top-performing agents exhibit stronger semantic consistency and better suppress regression errors. The accompanying Maple framework, which supplies repository-level context, boosts Gemini-cli's accuracy by 30%.
📝 Abstract
Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these agents on 372 multi-hunk bugs from the Hunk4J dataset, analyzing 1,488 repair trajectories using fine-grained metrics that capture localization, repair accuracy, regression behavior, and operational dynamics. Results reveal substantial variation: repair accuracy ranges from 25.8% (Qwen Code) to 93.3% (Claude Code) and consistently declines with increasing bug dispersion and complexity. High-performing agents demonstrate superior semantic consistency, achieving positive regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast: failed repairs consume 39%-343% more tokens and take 43%-427% longer to execute. Additionally, we developed Maple to provide agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by 30% through enhanced localization. Through fine-grained metrics and trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair.
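The abstract's metric suite is not defined here, but the kinds of quantities it reports (repair accuracy, regression behavior, and the token overhead of failed attempts) can be sketched over per-attempt trajectory records. The schema and function names below are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One repair attempt on a multi-hunk bug (hypothetical schema)."""
    agent: str
    plausible: bool        # patched project passes the full test suite
    tokens: int            # total tokens consumed by the attempt
    failures_before: int   # failing tests before the patch
    failures_after: int    # failing tests after the patch

def repair_accuracy(trajs):
    """Fraction of attempts yielding a plausible (test-passing) patch."""
    return sum(t.plausible for t in trajs) / len(trajs)

def regression_delta(trajs):
    """Mean change in failing-test count; negative means regressions reduced."""
    return sum(t.failures_after - t.failures_before for t in trajs) / len(trajs)

def failed_token_overhead(trajs):
    """Relative extra token cost of failed attempts vs. successful ones
    (e.g. 0.39 would correspond to '39% more tokens')."""
    ok = [t.tokens for t in trajs if t.plausible]
    bad = [t.tokens for t in trajs if not t.plausible]
    mean = lambda xs: sum(xs) / len(xs)
    return mean(bad) / mean(ok) - 1.0
```

For example, two attempts, one plausible at 10k tokens and one failed at 30k tokens that also introduces new test failures, give an accuracy of 0.5 and a 200% token overhead for the failed attempt.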