🤖 AI Summary
This study addresses the challenge of overfitting patches in automated program repair (APR), often caused by inadequate test suites, which undermines repair reliability. It presents the first large-scale empirical evaluation of code representations and deep learning models for patch correctness prediction, systematically assessing four code representations—including abstract syntax trees (ASTs), control flow graphs, and code property graphs (CPGs)—across 15 benchmarks with 11 deep learning classifiers. The findings reveal that graph-based representations, particularly CPGs, significantly outperform others; combining sequential and heuristic representations improves prediction performance by 13.5% on average; graph neural network (GNN)-based models achieve an average accuracy of 82.6%; and TREETRAIN integrated with AST effectively filters out 87.09% of overfitted patches. These results demonstrate the generalizability of advanced code representations in enhancing existing APR approaches.
📝 Abstract
Automated program repair (APR) attempts to generate correct patches and has drawn wide attention from both academia and industry in the past decades. However, APR continuously struggles with the patch overfitting issue caused by weak test suites. To address the overfitting problem, the community has proposed an increasing number of approaches to predict patch correctness (APCA approaches). Among them, deep learning-based approaches, which automatically learn discriminative code features, have been emerging strongly. Such approaches typically encode input code snippets into well-designed representations and build a binary classification model for correctness prediction. Despite being fundamental to reasoning about patch correctness, code representation has not been systematically investigated. To bridge this gap, we perform the first extensive study to evaluate the performance of different code representations on predicting patch correctness, based on more than 500 trained APCA models. The experimental results on 15 benchmarks, covering four categories of code representations and 11 classifiers, show that graph-based code representations, which are under-explored in the literature, consistently outperform the others, e.g., an average accuracy of 82.6% for CPG across three GNN models. Moreover, we demonstrate that such representations can achieve comparable or better performance for three previous APCA approaches, e.g., TREETRAIN with AST filters out 87.09% of overfitting patches. We further find that integrating sequence-based representations into heuristic-based representations yields an average improvement of 13.5% across five metrics. Overall, our study highlights the potential and challenges of utilizing code representation to reason about patch correctness, thus increasing the usability of off-the-shelf APR tools and reducing the manual debugging effort of developers in practice.
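To make the APCA pipeline described above concrete, here is a minimal sketch (not the paper's implementation): each patch is encoded as a feature vector and a binary classifier separates correct from overfitting patches. The feature names and toy data are hypothetical, and a from-scratch logistic regression stands in for the paper's deep learning classifiers.

```python
# Minimal sketch of patch correctness prediction (APCA):
# encode each patch as a feature vector, then train a binary classifier.
# Feature names and the toy dataset below are hypothetical illustrations.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Plain logistic regression trained by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return 1 for a patch predicted correct, 0 for overfitting."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Toy feature vectors per patch: [normalized edit distance, changed files,
# whether the patch passes an augmented test suite] -- hypothetical features.
X = [[0.1, 1, 1], [0.9, 3, 0], [0.2, 1, 1], [0.8, 2, 0]]
y = [1, 0, 1, 0]  # 1 = correct patch, 0 = overfitting patch
w, b = train_logreg(X, y)
print(predict(w, b, [0.15, 1, 1]))
```

The studied representations (sequence, AST, CFG, CPG) would replace the hand-crafted vector here with learned embeddings, and the GNN models play the role of this toy classifier.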