Agentic Bug Reproduction for Effective Automated Program Repair at Google

📅 2025-02-03
🤖 AI Summary
Problem: Manually writing Bug Reproduction Tests (BRTs) for industrial-scale private codebases is costly and labor-intensive, and automated BRT generation remains challenging. Method: This paper proposes BRT Agent, a framework that uses a fine-tuned large language model (LLM) as a code-editing agent, combined with multi-turn reasoning and iterative test validation, to generate executable BRTs end to end from real-world bug reports drawn from Google's internal issue tracker. It further introduces the Ensemble Pass Rate (EPR) metric, which uses the generated BRTs to rank the candidate fixes produced by an Automated Program Repair (APR) system. Results: Evaluated on 80 human-reported Google bugs, BRT Agent achieves a 28% plausible BRT generation rate, 18 percentage points above LIBRO's 10%. Providing the generated BRTs to the APR system yields 30% more bugs with plausible fixes. When fixes are ranked by EPR, the top-1 candidate from a pool of 20 is a plausible fix in 70% of cases.

📝 Abstract
Bug reports often lack sufficient detail for developers to reproduce and fix the underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the bug is present and pass when it has been resolved, are crucial for debugging, but they are rarely included in bug reports, both in open-source and in industrial settings. Thus, automatically generating BRTs from bug reports has the potential to accelerate the debugging process and lower time to repair. This paper investigates automated BRT generation within an industry setting, specifically at Google, focusing on the challenges of a large-scale, proprietary codebase and considering real-world industry bugs extracted from Google's internal issue tracker. We adapt and evaluate a state-of-the-art BRT generation technique, LIBRO, and present our agent-based approach, BRT Agent, which makes use of a fine-tuned Large Language Model (LLM) for code editing. Our BRT Agent significantly outperforms LIBRO, achieving a 28% plausible BRT generation rate, compared to 10% by LIBRO, on 80 human-reported bugs from Google's internal issue tracker. We further investigate the practical value of generated BRTs by integrating them with an Automated Program Repair (APR) system at Google. Our results show that providing BRTs to the APR system results in 30% more bugs with plausible fixes. Additionally, we introduce Ensemble Pass Rate (EPR), a metric which leverages the generated BRTs to select the most promising fixes from all fixes generated by the APR system. Our evaluation of EPR for Top-K and threshold-based fix selection demonstrates promising results and trade-offs. For example, EPR correctly selects a plausible fix from a pool of 20 candidates in 70% of cases, based on its top-1 ranking.
Problem

Research questions and friction points this paper is trying to address.

Automated Bug Reproduction Test generation
Improving Automated Program Repair effectiveness
Large-scale proprietary codebase challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses fine-tuned Large Language Model
Generates Bug Reproduction Tests automatically
Integrates BRTs with Automated Program Repair
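One way the integration of BRTs with APR could rank candidate fixes is sketched below. The abstract does not give the formula for Ensemble Pass Rate, so this sketch assumes it is the fraction of generated BRTs that pass on a candidate fix; the function names, the `results` layout, and the sample data are all hypothetical. It shows the two selection modes the abstract mentions, Top-K and threshold-based.

```python
from typing import Dict, List


def ensemble_pass_rate(outcomes: List[bool]) -> float:
    """Assumed definition: share of BRTs in the ensemble that pass on one fix."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def rank_fixes(results: Dict[str, List[bool]]) -> List[str]:
    """Order candidate fixes from highest to lowest assumed EPR."""
    return sorted(
        results,
        key=lambda fix: ensemble_pass_rate(results[fix]),
        reverse=True,
    )


def select_top_k(results: Dict[str, List[bool]], k: int) -> List[str]:
    """Top-K selection: keep the k best-ranked fixes."""
    return rank_fixes(results)[:k]


def select_by_threshold(results: Dict[str, List[bool]], threshold: float) -> List[str]:
    """Threshold selection: keep every fix whose assumed EPR meets the threshold."""
    return [
        fix
        for fix in rank_fixes(results)
        if ensemble_pass_rate(results[fix]) >= threshold
    ]


# Hypothetical data: pass/fail outcome of three generated BRTs per candidate fix.
results = {
    "fix_a": [True, False, False],
    "fix_b": [True, True, True],
    "fix_c": [True, True, False],
}

top_1 = select_top_k(results, 1)
above_threshold = select_by_threshold(results, 0.6)
```

Top-K always returns a fixed number of candidates, whereas a threshold can return none or many, which is one source of the trade-offs the abstract refers to.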
Runxiang Cheng
University of Illinois at Urbana–Champaign
Michele Tufano
Google
Jurgen Cito
TU Wien, Austria
J. Cambronero
Google
Pat Rondon
Google
Renyao Wei
Google
Aaron Sun
CS PhD Student, UMass Amherst
Computer Vision
Satish Chandra
Google