Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing

📅 2024-10-10

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

LLM-based Automated Program Repair (LAPR) exhibits poor robustness against semantically equivalent code variants. Method: This paper proposes MT-LAPR, a metamorphic testing framework dedicated to LAPR, which defines nine semantic-preserving Metamorphic Relations (MRs) widely recognized by developers across token-, statement-, and block-level perturbations. It empirically identifies a positive correlation between code readability and LAPR robustness, and accordingly trains a CodeT5-based code editing model that improves readability as a preprocessing step in the LAPR workflow. Contribution/Results: Evaluated on Defects4J and QuixBugs with four LLMs, 34.4%–48.5% of the test cases generated by MT-LAPR expose LAPR instability on average. The proposed preprocessing step improves repair robustness by up to 49.32%. MT-LAPR thus provides a methodology for both evaluating and enhancing the robustness of LLM-driven program repair.

📝 Abstract

In recent years, Large Language Model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance and have been pervasively applied and studied in both industry and academia. Nonetheless, LLMs have been shown to be highly sensitive to input prompts, with slight differences in the expression of semantically equivalent programs potentially causing repair failures. Therefore, it is crucial to conduct robustness testing on LAPR techniques before their practical deployment. However, related research is scarce. To this end, we propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques, which summarizes nine Metamorphic Relations (MRs) widely recognized by developers across three perturbation levels: token, statement, and block. Our proposed MRs are then applied to buggy code to generate test cases that are semantically equivalent to the original yet may affect the inference of LAPR. Experiments are carried out on two extensively examined bug-fixing datasets, i.e., Defects4J and QuixBugs, and four recently released LLMs capable of bug fixing. They demonstrate that 34.4%–48.5% of the test cases expose the instability of LAPR techniques on average, showing the effectiveness of MT-LAPR and uncovering a positive correlation between code readability and the robustness of LAPR techniques. Inspired by these findings, this paper uses the test cases generated by MT-LAPR as samples to train a CodeT5-based code editing model aimed at improving code readability, and then embeds it into the LAPR workflow as a data preprocessing step. Extensive experiments demonstrate that this approach enhances the robustness of LAPR by up to 49.32%.
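The share of test cases that expose instability can be sketched as a simple ratio. The snippet below is only an illustration under stated assumptions: the function name `instability_rate`, the `repairs_correctly` callback, and the rule that a test case "exposes instability" whenever the repair outcome differs between the original program and its variant are all hypothetical choices, not the paper's published metric definition.

```python
from typing import Callable, Iterable, Tuple


def instability_rate(
    pairs: Iterable[Tuple[str, str]],
    repairs_correctly: Callable[[str], bool],
) -> float:
    """Fraction of (original, variant) pairs whose repair outcome differs.

    `repairs_correctly` is a hypothetical stand-in for running an LAPR
    technique on a buggy program and validating the produced patch.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    unstable = sum(
        1
        for original, variant in pairs
        if repairs_correctly(original) != repairs_correctly(variant)
    )
    return unstable / len(pairs)


# Toy stand-in: pretend the repair tool fails whenever it sees the token "tmp".
def toy_repair_oracle(code: str) -> bool:
    return "tmp" not in code


toy_pairs = [
    ("a = 1", "tmp = 1"),   # outcome flips: unstable
    ("b = 2", "c = 2"),     # outcome unchanged: stable
    ("x = 3", "tmp_x = 3"), # outcome flips: unstable
    ("y = 4", "z = 4"),     # outcome unchanged: stable
]
rate = instability_rate(toy_pairs, toy_repair_oracle)
```

In a real study the oracle would invoke the model and run the project's test suite; the toy oracle here exists only so the sketch runs on its own.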

Problem

Research questions and friction points this paper is trying to address.

Assessing the robustness of LLM-powered Automated Program Repair (LAPR) techniques against semantically equivalent code variants.

Developing a Metamorphic Testing framework (MT-LAPR) for LAPR robustness evaluation.

Improving LAPR robustness by enhancing code readability with a CodeT5-based code editing model.

Innovation

Methods, ideas, or system contributions that make the work stand out.

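MT-LAPR, a Metamorphic Testing framework dedicated to testing LAPR robustness.

A CodeT5-based code editing model, trained on MT-LAPR test cases, improves code readability as a preprocessing step for LAPR.

Nine Metamorphic Relations at the token, statement, and block levels generate semantically equivalent test cases.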
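To make the idea of a semantic-preserving Metamorphic Relation concrete, here is a minimal sketch of one token-level perturbation: renaming a variable. This page does not list the nine MRs, so variable renaming is an assumed example, and the helper names (`RenameVariable`, `rename_variable`, `run`) are hypothetical. Python is used for brevity, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (a token-level perturbation)."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Covers reads and writes of the variable inside the function body.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Covers the case where the renamed identifier is a parameter.
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source: str, old: str, new: str) -> str:
    """Return a variant of `source` with `old` renamed to `new`."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


# A buggy program: the accumulator is decremented instead of incremented.
BUGGY = """
def total(xs):
    acc = 0
    for x in xs:
        acc -= x
    return acc
"""


def run(source: str, arg):
    """Execute `source` and call its `total` function on `arg`."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](arg)


variant = rename_variable(BUGGY, "acc", "running_sum")

# The variant must behave exactly like the original, bug included.
same_behaviour = run(BUGGY, [1, 2, 3]) == run(variant, [1, 2, 3])
```

Because the variant preserves the original behaviour, including the bug, a robust repair technique would be expected to fix both versions; a differing outcome would count as a robustness failure under this sketch's assumptions.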

🔎 Similar Papers

A Systematic Literature Review on Large Language Models for Automated Program Repair