Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0

🤖 AI Summary
LLM-based Automated Program Repair (LAPR) exhibits poor robustness against semantically equivalent code variants. Method: This paper proposes MT-LAPR, a metamorphic testing framework for LAPR that defines nine semantic-preserving metamorphic relations, widely recognized by developers, across token-, statement-, and block-level perturbations. It empirically identifies a positive correlation between code readability and LAPR robustness, and accordingly introduces a readability-oriented preprocessing step to enhance robustness. Contribution/Results: Evaluated on Defects4J and QuixBugs, 34.4% to 48.5% of the test cases generated by MT-LAPR expose LAPR instability on average. The proposed preprocessing step improves repair robustness by up to 49.32%. MT-LAPR offers a reproducible methodology for both evaluating and enhancing LAPR robustness.

๐Ÿ“ Abstract
In recent years, Large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance and have been pervasively applied and studied in both industry and academia. Nonetheless, LLMs were proved to be highly sensitive to input prompts, with slight differences in the expressions of semantically equivalent programs potentially causing repair failures. Therefore, it is crucial to conduct robustness testing on LAPR techniques before their practical deployment. However, related research is scarce. To this end, we propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques, which summarizes nine widely-recognized Metamorphic Relations (MRs) by developers across three perturbation levels: token, statement, and block. Afterward, our proposed MRs are applied to buggy codes to generate test cases, which are semantically equivalent yet to affect the inference of LAPR. Experiments are carried out on two extensively examined bug-fixing datasets, i.e., Defect4J and QuixBugs, and four bug-fixing abled LLMs released recently, demonstrating that 34.4% - 48.5% of the test cases expose the instability of LAPR techniques on average, showing the effectiveness of MT-LAPR and uncovering a positive correlation between code readability and the robustness of LAPR techniques. Inspired by the above findings, this paper uses the test cases generated by MT-LAPR as samples to train a CodeT5-based code editing model aiming at improving code readability and then embeds it into the LAPR workflow as a data preprocessing step. Extensive experiments demonstrate that this approach significantly enhances the robustness of LAPR by 49.32% at most.
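The metamorphic-testing idea in the abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python rather than the Java of Defects4J, the rename is a naive regex-based token-level perturbation (not necessarily one of the paper's nine MRs), `toy_repair` is a hypothetical stand-in for an LLM repair tool, and the instability measure (share of originally repairable bugs whose perturbed variant is no longer repaired) is one plausible reading of the abstract rather than the paper's exact metric.

```python
import re
from typing import Callable, List, Tuple


def rename_identifier(code: str, old: str, new: str) -> str:
    """Naive token-level perturbation: rename an identifier on word boundaries.

    Assumes `old` does not also occur inside strings or comments.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def instability_rate(
    pairs: List[Tuple[str, str]],
    repair: Callable[[str], str],
    passes: Callable[[str], bool],
) -> float:
    """Share of repairable originals whose perturbed variant is not repaired."""
    repairable = [(o, p) for o, p in pairs if passes(repair(o))]
    if not repairable:
        return 0.0
    unstable = sum(1 for _, p in repairable if not passes(repair(p)))
    return unstable / len(repairable)


BUGGY = (
    "def add_all(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total -= x\n"  # bug: should accumulate with +=
    "    return total\n"
)


def toy_repair(code: str) -> str:
    """Hypothetical brittle 'repair tool' that keys on the name `total`."""
    return code.replace("total -=", "total +=")


def passes_tests(code: str) -> bool:
    """Run the candidate patch against a one-assertion test suite."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return namespace["add_all"]([1, 2, 3]) == 6
    except Exception:
        return False


pairs = [
    (BUGGY, rename_identifier(BUGGY, "x", "item")),    # repair still succeeds
    (BUGGY, rename_identifier(BUGGY, "total", "t0")),  # repair now fails
]
rate = instability_rate(pairs, toy_repair, passes_tests)
print(rate)
```

Both perturbed programs behave exactly like the original buggy program, so any change in the repair outcome is attributable to the surface form of the input rather than to its semantics.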
Problem

Research questions and friction points this paper is trying to address.

Assessing the robustness of LLM-powered Automated Program Repair (LAPR) techniques.
Developing a Metamorphic Testing framework (MT-LAPR) for LAPR robustness evaluation.
Improving LAPR robustness by enhancing code readability with a CodeT5-based model.
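The preprocessing idea in the last item amounts to composing a readability-improving step with the repair step. The sketch below shows only that composition; `toy_normalize` and `brittle_repair` are hypothetical stand-ins (the paper trains a CodeT5-based editing model, which is not reproduced here), and the identifier names are invented for illustration.

```python
import re
from typing import Callable

Repair = Callable[[str], str]


def with_preprocessing(normalize: Callable[[str], str], repair: Repair) -> Repair:
    """Return a repair function that normalizes its input before repairing."""
    def pipeline(buggy_code: str) -> str:
        return repair(normalize(buggy_code))
    return pipeline


def toy_normalize(code: str) -> str:
    """Stand-in for a learned readability model: restore a descriptive name."""
    return re.sub(r"\bt0\b", "total", code)


def brittle_repair(code: str) -> str:
    """Hypothetical repair tool that only recognizes the name `total`."""
    return code.replace("total -=", "total +=")


perturbed = "t0 = 0\nt0 -= 5\n"
robust_repair = with_preprocessing(toy_normalize, brittle_repair)
print(robust_repair(perturbed))
```

Keeping the normalizer as a separate callable means the repair tool itself is untouched, which matches the abstract's description of the model as a data preprocessing step embedded in the workflow.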
Innovation

Methods, ideas, or system contributions that make the work stand out.

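The MT-LAPR framework systematically tests the robustness of LAPR techniques.
A CodeT5-based code editing model improves code readability as a preprocessing step for LAPR.
Nine Metamorphic Relations generate semantically equivalent test cases at the token, statement, and block levels.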
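To make "semantically equivalent test cases" concrete, here is a minimal sketch of one structural semantic-preserving transformation: rewriting `if c: A else: B` as `if not c: B else: A`. This particular relation is chosen for illustration and is not claimed to be one of the paper's nine MRs, which the summary above does not enumerate. It uses Python's standard `ast` module and requires Python 3.9+ for `ast.unparse`.

```python
import ast


class NegateIfElse(ast.NodeTransformer):
    """Rewrite `if c: A else: B` as `if not c: B else: A`."""

    def visit_If(self, node: ast.If) -> ast.AST:
        self.generic_visit(node)  # transform nested if-statements first
        if node.orelse:  # leave if-statements without an else branch alone
            node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
            node.body, node.orelse = node.orelse, node.body
        return node


def apply_relation(code: str) -> str:
    """Return a source variant with every if/else branch pair swapped."""
    tree = NegateIfElse().visit(ast.parse(code))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


ORIGINAL = (
    "def sign(x):\n"
    "    if x < 0:\n"
    "        return -1\n"
    "    else:\n"
    "        return 1\n"
)
VARIANT = apply_relation(ORIGINAL)
print(VARIANT)
```

The condition is still evaluated exactly once and the same branch body runs for every input, so the variant should behave identically to the original while looking different to a model that reads the source text.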
Pengyu Xue
School of Computer Science and Technology, Shandong University, Qingdao, China
Linhao Wu
School of Computer Science and Technology, Shandong University, Qingdao, China
Zhen Yang
School of Computer Science and Technology, Shandong University, Qingdao, China
Xinyi Li
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Zhongxing Yu
Shandong University
Programming Language, Formal Methods, Software Engineering
Zhi Jin
Sun Yat-Sen University, Associate Professor
Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning
Yan Xiao
School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen, China
Jingwen Wu
School of Computer Science and Technology, Shandong University, Qingdao, China