LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning

📅 2026-04-17

🤖 AI Summary
This study addresses the challenges that large language model (LLM)-generated code poses for academic integrity, code quality, and software security by proposing a method for detecting whether code was written by an LLM or a human. The approach applies supervised contrastive learning to LLM-generated code identification for the first time, combining comment removal as a preprocessing step with a structure-aware GraphCodeBERT encoder. A two-stage fine-tuning strategy makes the embeddings more discriminative, and a multilayer perceptron (MLP) classifier then produces the final prediction. On the GPTSniffer and Whodunit datasets, the model achieves accuracies of 78% and 94.65%, with corresponding F1 scores of 78% and 94.64%, outperforming existing methods on both. t-SNE visualizations further show that the learned embeddings form more compact and separable clusters.
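The comment-removal preprocessing step mentioned above can be illustrated with a minimal sketch. The summary does not say how comments are stripped, so everything here is an assumption: the function name `strip_c_style_comments` is hypothetical, the input is assumed to use C-style comment syntax (as in Java, C, or C++), and newer constructs such as Java text blocks are not handled.

```python
def strip_c_style_comments(source: str) -> str:
    """Remove // line comments and /* block */ comments.

    Illustrative only: string and char literals are copied verbatim so that
    comment markers inside them (e.g. "http://a") are not treated as comments.
    """
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if ch in ('"', "'"):
            # Copy a string or char literal, honouring backslash escapes.
            quote = ch
            out.append(ch)
            i += 1
            while i < n:
                out.append(source[i])
                if source[i] == "\\" and i + 1 < n:
                    out.append(source[i + 1])
                    i += 2
                    continue
                if source[i] == quote:
                    i += 1
                    break
                i += 1
        elif ch == "/" and nxt == "/":
            # Line comment: skip to end of line but keep the newline itself.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: skip past the closing */ (or to EOF if unterminated).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = 'int x = 1; // set x\nString s = "http://a"; /* note */ int y = 2;\n'
    print(strip_c_style_comments(sample))
```

Stripping comments before encoding plausibly keeps the detector from relying on commenting style rather than the code itself; that rationale is an interpretation and is not stated in the summary.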

📝 Abstract
The rapid proliferation of Large Language Models (LLMs) in software development has made distinguishing AI-generated code from human-written code a critical challenge with implications for academic integrity, code quality assurance, and software security. We present LLMSniffer, a detection framework that fine-tunes GraphCodeBERT using a two-stage supervised contrastive learning pipeline augmented with comment removal preprocessing and an MLP classifier. Evaluated on two benchmark datasets (GPTSniffer and Whodunit), LLMSniffer achieves substantial improvements over prior baselines: accuracy increases from 70% to 78% on GPTSniffer (F1: 68% to 78%) and from 91% to 94.65% on Whodunit (F1: 91% to 94.64%). t-SNE visualizations confirm that contrastive fine-tuning yields well-separated, compact embeddings. We release our model checkpoints, datasets, code, and a live interactive demo to facilitate further research.
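To make the supervised contrastive objective concrete, here is a small sketch of a supervised contrastive loss in the commonly used form where each anchor is pulled toward same-label samples and pushed away from the rest of the batch. The abstract does not give the exact loss variant, temperature, or batch construction, so those are assumptions; the names `supcon_loss` and `l2_normalize` and the default `temperature=0.1` are hypothetical, and the plain lists below stand in for GraphCodeBERT embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For anchor i with positives P(i) (same label, excluding i):
        loss_i = -1/|P(i)| * sum_{p in P(i)} log softmax_{a != i}(z_i . z_a / t)[p]
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    # Labels that agree with the geometry give a lower loss than labels that do not.
    print(supcon_loss(points, [0, 0, 1, 1]))
    print(supcon_loss(points, [0, 1, 0, 1]))
```

Minimizing a loss of this kind rewards embeddings in which same-class samples cluster tightly, which is consistent with the compact, well-separated t-SNE clusters reported above.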
Problem

Research questions and friction points this paper is trying to address.

LLM-generated code detection
code origin identification
AI-generated code
software security
academic integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Contrastive Learning
GraphCodeBERT
LLM-Generated Code Detection
Comment Removal
Code Embedding