Large Language Models for cross-language code clone detection

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses cross-language code clone detection, a practical challenge in multi-language software development. We evaluate five large language models (LLMs), including Llama-2 and CodeLlama, under eight prompts, benchmarking them against pretrained code embedding models (e.g., CodeBERT) and classical baseline approaches. Results show that LLMs reach F1 scores of up to 0.99 on straightforward programming examples, but perform worse on programs tied to complex programming challenges and do not necessarily grasp what a "code clone" means across languages. In contrast, a basic binary classifier trained on embeddings that place code from different languages in a shared representation space outperforms all LLMs by approximately 1 and 20 percentage points on XLCoST and CodeNet, respectively. Our key contributions are twofold: (i) empirical evidence of LLMs’ limitations in cross-language clone detection, and (ii) evidence that pretrained code embeddings provide representations suitable for state-of-the-art cross-language clone detection.
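To make the reported metrics concrete, the following minimal sketch shows how an F1 score over clone/non-clone labels and a gap in percentage points could be computed. The function names and the sample labels are hypothetical illustrations and are not taken from the paper or its datasets.

```python
def f1_score(y_true, y_pred):
    """Binary F1, where label 1 marks a clone pair and 0 a non-clone pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def percentage_point_gap(score_a, score_b):
    """Difference between two scores in [0, 1], expressed in percentage points."""
    return (score_a - score_b) * 100


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(f1_score(y_true, y_pred))
```

Note that a gap in percentage points is an absolute difference between two scores, not a relative improvement.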

📝 Abstract
With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five LLMs and eight prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and embedding models are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.99, for straightforward programming examples. However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of "code clones" in a cross-lingual setting. We show that embedding models used to represent code fragments from different programming languages in the same representation space enable the training of a basic classifier that outperforms all LLMs by ~1 and ~20 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.
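The embedding-plus-classifier idea described in the abstract can be sketched roughly as follows. This is not the authors' implementation: the hashed bag-of-tokens `embed` function is a toy stand-in for a pretrained embedding model, and the pair features, the logistic-regression trainer, and all names and example snippets are assumptions chosen so the sketch runs with the Python standard library alone.

```python
import math
import re
import zlib

DIM = 64  # size of the toy embedding space (hypothetical choice)


def embed(code):
    """Toy stand-in for a pretrained embedding model:
    hashed bag of tokens, L2-normalised, shared across languages."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code.lower()):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pair_features(code_a, code_b):
    """Combine two embeddings into one feature vector for the pair."""
    ea, eb = embed(code_a), embed(code_b)
    diffs = [abs(x - y) for x, y in zip(ea, eb)]
    prods = [x * y for x, y in zip(ea, eb)]
    return diffs + prods


def train_logreg(X, y, epochs=200, lr=0.5):
    """Plain stochastic gradient descent for logistic regression."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def predict(model, code_a, code_b):
    """Return 1 (clone) or 0 (non-clone) for a pair of code fragments."""
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, pair_features(code_a, code_b))) + b
    return 1 if z > 0 else 0


# Hypothetical training pairs (Python vs. C), for illustration only.
pairs = [
    ("def add(a, b): return a + b", "int add(int a, int b) { return a + b; }", 1),
    ("def fact(n): return 1 if n < 2 else n * fact(n - 1)",
     "int fact(int n) { return n < 2 ? 1 : n * fact(n - 1); }", 1),
    ("def add(a, b): return a + b", "void greet() { puts(\"hello\"); }", 0),
    ("def fact(n): return 1 if n < 2 else n * fact(n - 1)",
     "int max(int a, int b) { return a > b ? a : b; }", 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = train_logreg(X, y)
print(predict(model, "def sub(a, b): return a - b",
              "int sub(int a, int b) { return a - b; }"))
```

In a realistic setup, `embed` would be replaced by vectors from a pretrained code embedding model, and the classifier would be trained and evaluated on labeled pairs from a dataset such as XLCoST or CodeNet.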
Problem

Research questions and friction points this paper is trying to address.

Cross-language
Code Clone Detection
Software Development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual Code Clone Detection
Unified Code Representation
Specialized Code Transformation Model