Studying and Understanding the Effectiveness and Failures of Conversational LLM-Based Repair

📅 2025-03-19

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

This study investigates the effectiveness and failure mechanisms of conversational large language models (e.g., ChatGPT) in automated program repair (APR). Using the ChatRepair framework, we conduct an empirical analysis on single-function bugs from Defects4J, integrating quantitative evaluation with qualitative root-cause analysis. We compare fill-in-the-blank versus full-function repair strategies, examine iterative patching behavior, and systematically dissect failure modes. Our analysis reveals three dominant failure patterns—semantic misinterpretation, context truncation, and iterative degradation—and shows that over 50% of defects remain uncorrected. Crucially, we identify strategy selection, context modeling, and feedback mechanism design as key determinants of repair robustness. Based on these findings, we propose empirically grounded, verifiable improvements to enhance the reliability of conversational APR—offering both theoretical insights and practical guidance for advancing LLM-based software engineering tools.

Technology Category

Application Category

📝 Abstract

Automated program repair (APR) is designed to automate the process of bug-fixing. In recent years, thanks to the rapid development of large language models (LLMs), automated repair has achieved remarkable progress. Advanced APR techniques powered by conversational LLMs, most notably ChatGPT, have exhibited impressive repair abilities and gained increasing popularity due to the capabilities of the underlying LLMs in providing repair feedback and performing iterative patch improvement. Despite the superiority, conversational APR techniques still fail to repair a large number of bugs. For example, a state-of-the-art conversational technique ChatRepair does not correctly repair over half of the single-function bugs in the Defects4J dataset. To understand the effectiveness and failures of conversational LLM-based repair and provide possible directions for improvement, we studied the exemplary ChatRepair with a focus on comparing the effectiveness of its cloze-style and full function repair strategies, assessing its key iterative component for patch improvement, and analyzing the repair failures. Our study has led to a series of findings, which we believe provide key implications for future research.

Problem

Research questions and friction points this paper is trying to address.

Evaluating effectiveness of conversational LLM-based program repair.

Analyzing failures in repairing bugs using ChatRepair techniques.

Comparing cloze-style and full function repair strategies.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses conversational LLMs for automated program repair.

Compares cloze-style and full function repair strategies.

Analyzes iterative patch improvement and repair failures.

🔎 Similar Papers

A Systematic Literature Review on Large Language Models for Automated Program Repair