🤖 AI Summary
Code cloning is a major contributor to high maintenance costs and security risks; however, existing AST-based deep learning approaches suffer from insufficient semantic representation. Method: This paper systematically evaluates the effectiveness of various graph representations—including ASTs, CFGs, DFGs, and FA-ASTs—and their combinations, in conjunction with GNN architectures (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Graph Matching Networks (GMN)) for code clone detection. Contribution/Results: We reveal a strong coupling between graph fusion strategies and model architecture: AST+CFG+DFG significantly improves accuracy for GCN and GAT, whereas FA-AST degrades performance due to structural redundancy; remarkably, GMN achieves superior performance using the AST alone, outperforming most fused variants. The best configurations attain 98.2% accuracy across multiple benchmarks. This work establishes, for the first time, optimal pairings between GNN architectures and graph representations, providing a reusable, principled modeling guideline for industrial-scale code clone detection.
📝 Abstract
As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten vulnerability risks, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) dominate deep learning-based code clone detection due to their precise representation of syntactic structure, but they inherently lack semantic depth. Recent studies address this by enriching AST-based representations with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs). However, the effectiveness of these enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain open questions, warranting further investigation to unlock their full potential for code clone detection. In this paper, we present a comprehensive empirical study that rigorously evaluates the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations (CFG, DFG, and Flow-Augmented ASTs (FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid representations affect GNNs differently: while AST+CFG+DFG consistently enhances accuracy for convolution- and attention-based models (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST frequently introduces structural complexity that harms performance. Notably, Graph Matching Networks (GMN) outperform the other architectures even with standard AST representations, highlighting their superior cross-code similarity detection and reducing the need for enriched structures.
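To make the fusion idea concrete, the sketch below shows one way to merge AST, CFG, and DFG edges over the same statement nodes into a single adjacency matrix and run one GCN propagation step (the symmetric-normalization rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)). The node count, edge lists, and feature sizes are invented toy values, not the paper's actual pipeline, which would derive these graphs from parsed source code.

```python
import numpy as np

# Toy fused graph: 4 statement nodes shared by three edge sets.
# All edges and dimensions here are illustrative assumptions.
num_nodes = 4
ast_edges = [(0, 1), (1, 2)]   # syntactic parent-child links
cfg_edges = [(1, 3)]           # control-flow successor
dfg_edges = [(2, 3)]           # def-use data dependence

def fused_adjacency(edge_sets, n):
    """Union all edge sets into one symmetric adjacency with self-loops."""
    a = np.eye(n)
    for edges in edge_sets:
        for u, v in edges:
            a[u, v] = a[v, u] = 1.0
    return a

def gcn_layer(a, h, w):
    """One GCN layer: symmetric degree normalization, linear map, ReLU."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a @ d_inv_sqrt @ h @ w, 0.0)

rng = np.random.default_rng(0)
h = rng.normal(size=(num_nodes, 8))   # initial node features (e.g. token embeddings)
w = rng.normal(size=(8, 16))          # learnable weight matrix

a = fused_adjacency([ast_edges, cfg_edges, dfg_edges], num_nodes)
h_out = gcn_layer(a, h, w)
print(h_out.shape)  # (4, 16)
```

In a full detector, per-node outputs would be pooled into a graph embedding per code fragment, and a pair of embeddings compared (e.g. by cosine similarity) to decide whether the fragments are clones; GMN instead matches nodes across the two graphs during encoding, which is one plausible reason it needs less graph enrichment.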