🤖 AI Summary
Existing code similarity metrics such as BLEU, CodeBLEU, and TSED are largely confined to string-level or syntactic representations and often fail to capture deeper semantic relationships between programs. This work proposes CSSG (Code Similarity using Semantic Graphs), an approach that, for the first time, incorporates program dependence graphs into code similarity modeling. By explicitly encoding control dependencies and variable interactions, CSSG builds a semantics-aware representation of code. The method integrates control-flow analysis with graph representation learning and consistently outperforms existing metrics on the CodeContests+ dataset at distinguishing more similar code from less similar code, both within a single programming language and across languages.
📝 Abstract
Existing code similarity metrics, such as BLEU, CodeBLEU, and TSED, largely rely on surface-level string overlap or abstract syntax tree structures, and often fail to capture deeper semantic relationships between programs. We propose CSSG (Code Similarity using Semantic Graphs), a novel metric that leverages program dependence graphs to explicitly model control dependencies and variable interactions, providing a semantics-aware representation of code. Experiments on the CodeContests+ dataset show that CSSG consistently outperforms existing metrics in distinguishing more similar code from less similar code under both monolingual and cross-lingual settings, demonstrating that dependency-aware graph representations offer a more effective alternative to surface-level or syntax-based similarity measures.