Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features

📅 2024-08-12

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

To address the insufficient joint modeling of semantic and structural features in source code similarity detection, this paper proposes a multi-feature fusion method based on GraphCodeBERT. On top of the pre-trained model, it adds a learnable, task-specific output feature layer and uses feature concatenation to combine the newly extracted structural and semantic features with the original contextual representations, training the whole pipeline end to end. Unlike conventional fine-tuning, the design embeds the auxiliary output feature directly into the classification pipeline and optimizes it jointly with the classifier. Experiments on standard benchmarks, including Bench4BL and POJ-104, show improvements in accuracy, recall, and F1-score. These results support combining multiple feature representations for code similarity assessment and suggest one way to adapt pre-trained models to code analysis tasks.

📝 Abstract

This paper presents a novel approach to source code similarity detection that integrates an additional output feature into the classification process, with the goal of improving model performance. Our approach is based on the GraphCodeBERT model, extended with a custom output feature layer and a concatenation mechanism for improved feature representation. The model was trained and evaluated, achieving promising results in terms of precision, recall, and F-measure. The implementation details, including the model architecture and training strategies, are discussed. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/graphcodebert-feature-integration.
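
The abstract reports results as precision, recall, and F-measure. As a reminder of how these are computed for a binary "similar / not similar" decision, here is a minimal sketch. The function name `precision_recall_f1` and the toy labels are illustrative assumptions, not taken from the paper or its repository.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = similar pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels for six code pairs (hypothetical, for illustration only).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these toy labels there are two true positives, one false positive, and one false negative, so all three scores come out to 2/3.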

Problem

Research questions and friction points this paper is trying to address.

Enhancing code similarity detection using GraphCodeBERT

Integrating extra features to boost model accuracy

Improving precision and recall in code comparison

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends GraphCodeBERT with custom feature layer

Integrates additional output feature for classification

Uses concatenation for improved feature representation
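
The three points above can be illustrated with a small, framework-free sketch: a pooled encoder embedding is concatenated with one extra scalar feature, and the fused vector is passed to a linear classifier. This is a sketch under stated assumptions, not the authors' implementation. The names `concat_features` and `LinearClassifier`, the placeholder embedding, and the value of the extra feature are all hypothetical; in the actual approach the embedding would come from GraphCodeBERT and the weights would be learned during training.

```python
import math
import random


def concat_features(pooled_embedding, extra_feature):
    """Append one scalar feature to the encoder's pooled embedding."""
    return list(pooled_embedding) + [float(extra_feature)]


class LinearClassifier:
    """One linear layer followed by softmax, sized for the fused vector."""

    def __init__(self, input_dim, num_classes=2, seed=0):
        rng = random.Random(seed)
        # Small random weights stand in for parameters that would be trained.
        self.weights = [[rng.uniform(-0.01, 0.01) for _ in range(input_dim)]
                        for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def predict_proba(self, features):
        logits = [sum(w * x for w, x in zip(row, features)) + b
                  for row, b in zip(self.weights, self.bias)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


HIDDEN_SIZE = 768               # assumed hidden size of the base encoder
pooled = [0.0] * HIDDEN_SIZE    # placeholder for a real encoder output
extra = 0.42                    # hypothetical additional output feature

fused = concat_features(pooled, extra)           # length HIDDEN_SIZE + 1
classifier = LinearClassifier(input_dim=len(fused))
probs = classifier.predict_proba(fused)          # [P(not similar), P(similar)]
```

The relevant design point is that the classifier's input dimension grows by the number of added features, so the extra signal is weighted jointly with the contextual representation rather than applied as a separate post-processing step.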
