AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection

📅 2025-06-17
🤖 AI Summary
Problem: Code clones are a major contributor to high maintenance costs and security risks, yet existing AST-based deep learning approaches suffer from insufficient semantic representation. Method: This paper systematically evaluates graph representations (ASTs, CFGs, DFGs, FA-ASTs, and their combinations) together with three GNN architectures (GCN, GAT, GMN) for code clone detection. Contribution/Results: The study reveals a strong coupling between graph fusion strategy and model architecture: AST+CFG+DFG significantly improves accuracy for GCN and GAT, whereas FA-AST degrades performance due to structural redundancy. GMN achieves superior performance using AST alone, outperforming most fused variants. Our approach attains 98.2% accuracy across multiple benchmarks. This work establishes, for the first time, the optimal matching relationships between GNN architectures and graph representations, providing a reusable, principled modeling guideline for industrial-scale code clone detection.

📝 Abstract
As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten vulnerability risks, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) dominate deep learning-based code clone detection because they represent syntactic structure precisely, but they inherently lack semantic depth. Recent studies address this by enriching AST-based representations with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs). However, the effectiveness of the various enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain open questions that warrant further investigation. In this paper, we present a comprehensive empirical study that rigorously evaluates the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations (CFG, DFG, Flow-Augmented ASTs (FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid representations affect GNNs differently: while AST+CFG+DFG consistently enhances accuracy for convolution- and attention-based models (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST frequently introduces structural complexity that harms performance. Notably, Graph Matching Networks (GMN) outperform the other models even with standard AST representations, highlighting their superior cross-code similarity detection and reducing the need for enriched structures.
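To make the idea of an AST+CFG+DFG hybrid graph concrete, here is a minimal sketch that layers simplified control-flow and data-flow edges on top of an AST. It is not the paper's pipeline, which this summary does not describe. It uses Python's standard `ast` module on Python source purely for illustration. The function name `build_hybrid_graph`, the edge-kind labels, and the simplifications (CFG edges only between consecutive statements, DFG edges from the latest textual definition to each use) are all assumptions.

```python
import ast
from collections import defaultdict


def build_hybrid_graph(source):
    """Toy AST+CFG+DFG graph: returns (labels, edges).

    labels: AST node type names indexed by node id.
    edges:  dict mapping "ast" / "cfg" / "dfg" to a set of (src, dst) pairs.
    """
    tree = ast.parse(source)
    node_ids, labels = {}, []
    for node in ast.walk(tree):
        # Operator/context objects may be shared, so register each object once.
        if id(node) not in node_ids:
            node_ids[id(node)] = len(labels)
            labels.append(type(node).__name__)

    edges = defaultdict(set)
    names = []
    for node in ast.walk(tree):
        src = node_ids[id(node)]
        # AST edges: parent -> child.
        for child in ast.iter_child_nodes(node):
            edges["ast"].add((src, node_ids[id(child)]))
        # Simplified CFG edges: each statement -> the next one in the same body.
        body = getattr(node, "body", None)
        if isinstance(body, list):
            for first, second in zip(body, body[1:]):
                edges["cfg"].add((node_ids[id(first)], node_ids[id(second)]))
        if isinstance(node, (ast.Name, ast.arg)):
            names.append(node)

    def order(n):
        # Per line: parameters first, then reads, then writes,
        # so `x = x + 1` reads the previous definition of x.
        if isinstance(n, ast.arg):
            rank = 0
        elif isinstance(n.ctx, ast.Load):
            rank = 1
        else:
            rank = 2
        return (n.lineno, rank, n.col_offset)

    # Simplified DFG edges: latest textual definition -> use (ignores branching).
    last_def = {}
    for n in sorted(names, key=order):
        if isinstance(n, ast.arg):
            last_def[n.arg] = node_ids[id(n)]
        elif isinstance(n.ctx, ast.Store):
            last_def[n.id] = node_ids[id(n)]
        elif isinstance(n.ctx, ast.Load) and n.id in last_def:
            edges["dfg"].add((last_def[n.id], node_ids[id(n)]))
    return labels, dict(edges)


SAMPLE = """
def total(values):
    acc = 0
    for v in values:
        acc = acc + v
    return acc
"""

labels, edges = build_hybrid_graph(SAMPLE)
for kind in ("ast", "cfg", "dfg"):
    print(kind, len(edges[kind]))
```

Keeping the three edge kinds in separate sets makes it easy to feed a model AST alone, AST+CFG, or AST+CFG+DFG, which is the kind of comparison the study performs. A real CFG would also need branch and loop back-edges, which this sketch omits.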
Problem

Research questions and friction points this paper is trying to address.

Evaluating hybrid AST-graph representations for code clone detection
Assessing compatibility of enriched ASTs with GNN architectures
Determining optimal graph structures for accurate clone detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

AST+CFG+DFG hybrid graphs improve accuracy for GCN and GAT
FA-AST adds structural complexity that often reduces performance
GMN performs best even with the standard AST alone
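As a rough illustration of how a graph-based detector can score a pair of code graphs, the sketch below runs an untrained, GCN-style neighbourhood averaging over one-hot node labels, mean-pools the result into a graph embedding, and compares two embeddings with cosine similarity. This is a simplification and not any of the models evaluated in the paper: there are no learned weights, plain mean aggregation replaces GCN's normalised propagation, and nothing here resembles GMN's cross-graph matching. The names `propagate`, `embed`, and `cosine` are hypothetical.

```python
import math


def propagate(features, edges, rounds=2):
    """Average each node's features with its neighbours' (self-loop included)."""
    n = len(features)
    neighbours = [{i} for i in range(n)]
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for _ in range(rounds):
        features = [
            [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
             for d in range(len(features[i]))]
            for i in range(n)
        ]
    return features


def embed(labels, edges, vocab):
    """One-hot encode node labels, propagate, then mean-pool into one vector."""
    one_hot = [[1.0 if label == v else 0.0 for v in vocab] for label in labels]
    hidden = propagate(one_hot, edges)
    return [sum(column) / len(hidden) for column in zip(*hidden)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two tiny hand-written graphs: (node labels, undirected edges).
graph_a = (["Assign", "Name", "Constant"], [(0, 1), (0, 2)])
graph_b = (["Return", "Name"], [(0, 1)])
vocab = sorted(set(graph_a[0]) | set(graph_b[0]))

same = cosine(embed(*graph_a, vocab), embed(*graph_a, vocab))
different = cosine(embed(*graph_a, vocab), embed(*graph_b, vocab))
print(same, different)
```

In a trained detector the averaging step would be followed by learned weight matrices and nonlinearities, and the similarity score would be thresholded or passed to a classifier to decide whether the pair is a clone.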
Zixian Zhang
University of Galway, Ireland; CRT-AI, Irish National Centre for Research Training in Artificial Intelligence
Takfarinas Saber
Lecturer, University of Galway, Ireland
Complex Software Systems, Operational Research, Evolutionary Computation, Artificial Intelligence