🤖 AI Summary
Code cloning is a major contributor to high maintenance costs and security risks; however, existing AST-based deep learning approaches suffer from insufficient semantic representation. Method: This paper systematically evaluates the effectiveness of various graph representations—including ASTs, CFGs, DFGs, and FA-ASTs—and their combinations, in conjunction with GNN architectures (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Graph Matching Networks (GMN)) for code clone detection. Contribution/Results: We reveal a strong coupling between graph fusion strategies and model architecture: AST+CFG+DFG significantly improves accuracy for GCN and GAT, whereas FA-AST degrades performance due to structural redundancy; remarkably, GMN achieves superior performance using the AST alone, outperforming most fused variants. The best configurations attain 98.2% accuracy across multiple benchmarks. This work establishes, for the first time, optimal pairings between GNN architectures and graph representations, providing a reusable, principled modeling guideline for industrial-scale code clone detection.
📝 Abstract
As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten vulnerability risks, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) dominate deep learning-based code clone detection due to their precise representation of syntactic structure, but they inherently lack semantic depth. Recent studies address this by enriching AST-based representations with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs). However, the effectiveness of these enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain open questions, warranting further investigation to unlock their full potential for code clone detection. In this paper, we present a comprehensive empirical study that rigorously evaluates the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations (CFG, DFG, and Flow-Augmented ASTs (FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid representations affect GNNs differently: while AST+CFG+DFG consistently enhances accuracy for convolution- and attention-based models (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST frequently introduces structural complexity that harms performance. Notably, Graph Matching Networks (GMN) outperform the other architectures even with standard AST representations, highlighting their superior cross-code similarity detection and reducing the need for enriched structures.
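To make the fusion idea concrete, the sketch below shows one way to merge AST, CFG, and DFG edges over the same statement nodes into a single adjacency matrix and run one GCN propagation step (the symmetric-normalization rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)). The node count, edge lists, and feature sizes are invented toy values, not the paper's actual pipeline, which would derive these graphs from parsed source code.

```python
import numpy as np

# Toy fused graph: 4 statement nodes shared by three edge sets.
# All edges and dimensions here are illustrative assumptions.
num_nodes = 4
ast_edges = [(0, 1), (1, 2)]   # syntactic parent-child links
cfg_edges = [(1, 3)]           # control-flow successor
dfg_edges = [(2, 3)]           # def-use data dependence

def fused_adjacency(edge_sets, n):
    """Union all edge sets into one symmetric adjacency with self-loops."""
    a = np.eye(n)
    for edges in edge_sets:
        for u, v in edges:
            a[u, v] = a[v, u] = 1.0
    return a

def gcn_layer(a, h, w):
    """One GCN layer: symmetric degree normalization, linear map, ReLU."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a @ d_inv_sqrt @ h @ w, 0.0)

rng = np.random.default_rng(0)
h = rng.normal(size=(num_nodes, 8))   # initial node features (e.g. token embeddings)
w = rng.normal(size=(8, 16))          # learnable weight matrix

a = fused_adjacency([ast_edges, cfg_edges, dfg_edges], num_nodes)
h_out = gcn_layer(a, h, w)
print(h_out.shape)  # (4, 16)
```

In a full detector, per-node outputs would be pooled into a graph embedding per code fragment, and a pair of embeddings compared (e.g. by cosine similarity) to decide whether the fragments are clones; GMN instead matches nodes across the two graphs during encoding, which is one plausible reason it needs less graph enrichment.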