🤖 AI Summary
This paper addresses the challenge of detecting "essence clones", a subtype of Type-3 clones in which code fragments share an identical logical core but diverge substantially in syntax. To tackle this, the authors propose a semantic-aware detection method grounded in information theory. Specifically, the method quantifies the semantic importance of individual code lines using information entropy and mutual information, which enables the construction of semantically weighted line-level representations. These representations are integrated into the ECScan clone detection framework, shifting the paradigm from syntax-based matching to logic-core-driven identification. Experimental evaluation on real-world projects shows that the approach achieves an average F1-score of 85%, significantly outperforming state-of-the-art tools. The method exhibits high precision, strong generalizability across diverse codebases, and favorable scalability, making it suitable for large-scale industrial applications.
📝 Abstract
Code cloning, a widespread practice in software development, involves replicating code fragments to save time but often at the expense of software maintainability and quality. In this paper, we address the specific challenge of detecting "essence clones", a complex subtype of Type-3 clones that share critical logic despite differing peripheral code. Traditional techniques often fail to detect essence clones because of their syntactic focus. To overcome this limitation, we introduce ECScan, a novel detection tool that leverages information theory to assess the semantic importance of code lines. By assigning weights to each line based on its information content, ECScan emphasizes core logic over peripheral code differences. Our comprehensive evaluation across various real-world projects shows that ECScan significantly outperforms existing tools in detecting essence clones, achieving an average F1-score of 85%. It demonstrates robust performance across all clone types and offers exceptional scalability. This study advances clone detection by providing a practical tool for developers to enhance code quality and reduce maintenance burdens, emphasizing the semantic aspects of code through an innovative information-theoretic approach.
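To make the idea of weighting lines by information content concrete, here is a minimal Python sketch. It is not ECScan's actual algorithm, which the abstract does not specify. It assumes a simple scheme: a line's weight is the summed self-information, -log2 p(token), of its tokens, with probabilities estimated from a reference corpus. Two fragments are then compared by the weighted share of lines they have in common. The names `tokenize`, `token_probabilities`, `line_information`, and `weighted_similarity`, along with the tokenizer regex and the handling of unseen tokens, are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integer literals, single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(line):
    """Split one source line into coarse lexical tokens."""
    return TOKEN_RE.findall(line)


def token_probabilities(corpus_lines):
    """Estimate token probabilities from a reference corpus of code lines."""
    counts = Counter(tok for line in corpus_lines for tok in tokenize(line))
    total = sum(counts.values())
    probs = {tok: n / total for tok, n in counts.items()}
    unseen = 1.0 / (total + 1)  # assumed fallback probability for unseen tokens
    return probs, unseen


def line_information(line, probs, unseen):
    """Weight of a line: summed self-information (in bits) of its tokens."""
    return sum(-math.log2(probs.get(tok, unseen)) for tok in tokenize(line))


def weighted_similarity(frag_a, frag_b, probs, unseen):
    """Information-weighted Jaccard similarity over whitespace-normalized lines."""
    def normalize(line):
        return " ".join(tokenize(line))

    a = {normalize(l) for l in frag_a if l.strip()}
    b = {normalize(l) for l in frag_b if l.strip()}
    shared = sum(line_information(l, probs, unseen) for l in a & b)
    total = sum(line_information(l, probs, unseen) for l in a | b)
    return shared / total if total else 0.0


if __name__ == "__main__":
    core = "result = hash(key) ^ rotate(seed, 13);"
    # Toy corpus: peripheral lines are frequent, the core line is rare.
    corpus = (["int i = 0;"] * 20 + ["log(msg);"] * 20
              + ["return result;"] * 20 + [core])
    probs, unseen = token_probabilities(corpus)

    frag_a = ["int i = 0;", core, "return result;"]
    frag_b = ["log(msg);", core, "return result;"]
    print(weighted_similarity(frag_a, frag_b, probs, unseen))
```

In this toy setup the two fragments share two of four distinct lines, so an unweighted line-level Jaccard score is 0.5. Because the shared core line is built from rare tokens, it carries most of the information, and the weighted score comes out higher. This mirrors the intuition in the abstract that core logic should count for more than peripheral code.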