Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

📅 2024-09-02

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 0

🤖 AI Summary

To address the security risks introduced by LLM-generated code, this paper proposes a lightweight detection method based on structured software metrics, targeting two limitations of existing detectors: insufficient cross-model validation and opaque “black-box” mechanisms. Focusing on Claude 3 Haiku, the authors conduct function-level and class-level empirical analyses on the CodeSearchNet dataset, extracting 22 interpretable static software metrics (e.g., Code Lines and Cyclomatic Complexity) and finding that Claude 3 Haiku tends to produce longer functions but shorter classes than humans. They build Random Forest and XGBoost classifiers on these features. The work is presented as the first to systematically identify and validate such statistical deviations as efficient, token-agnostic discriminative criteria. Experimental results show detection accuracies of 82% at the function level and 66% at the class level, indicating that structured software metrics can distinguish Claude 3 Haiku-generated code from human-written code, more reliably for functions than for classes.
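As a rough illustration of what extracting static software metrics can look like, the sketch below computes three simple metrics for a Python function using only the standard library. It assumes Python snippets, and the metric definitions (which nodes count as decision points, how code lines are counted) are assumptions made for this example, not the paper's definitions or tooling. `function_metrics` and `SAMPLE` are hypothetical names.

```python
import ast

# Node types treated as decision points. This list is an assumption for the
# sketch, not the metric definition used in the paper.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def function_metrics(source: str) -> dict:
    """Return a few static metrics for the first function found in `source`."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree)
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)))

    # Code lines: non-blank lines in the function's span that are not pure
    # comments (docstring lines are counted as code in this sketch).
    lines = source.splitlines()[func.lineno - 1:func.end_lineno]
    code_lines = sum(1 for ln in lines
                     if ln.strip() and not ln.strip().startswith("#"))

    # Approximate cyclomatic complexity: 1 + decision points, with each extra
    # operand of an and/or expression adding one more path.
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1

    return {"code_lines": code_lines,
            "cyclomatic_complexity": complexity,
            "parameters": len(func.args.args)}


# Hypothetical snippet used only to exercise the function.
SAMPLE = '''
def clamp(x, lo, hi):
    # keep x inside [lo, hi]
    if x < lo:
        return lo
    if x > hi and hi is not None:
        return hi
    return x
'''

metrics = function_metrics(SAMPLE)
```

Each snippet then becomes one row of numeric features, which is the kind of input a tree-based classifier can consume directly.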

📝 Abstract

Using Large Language Models (LLMs) to generate source code has gained popularity among software developers. However, LLM-generated code can introduce suboptimal, defective, and vulnerable code. This makes it necessary to devise methods for the accurate detection of LLM-generated code. Toward this goal, we perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on the CodeSearchNet dataset. We divide our analyses into two parts: function-level and class-level. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We then analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is. In the following step, we use the unique characteristics of Claude 3-generated code to build Machine Learning (ML) models and identify which features of the code snippets make them more detectable by ML models. Our results indicate that Claude 3 tends to generate longer functions, but shorter classes, than humans, and that this characteristic can be used to detect Claude 3-generated code with ML models at 82% and 66% accuracy for function-level and class-level snippets, respectively.
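To make the idea of detecting generated code from a length metric concrete, here is a minimal sketch of a one-feature decision stump, the simplest building block of tree-based classifiers. It is not the paper's model: `fit_stump` is a hypothetical helper, and the feature values and labels are invented placeholders, not data or results from the study. The rule direction (longer function implies LLM-generated) simply mirrors the tendency stated in the abstract.

```python
def fit_stump(values, labels):
    """Pick the threshold t that maximizes training accuracy of the rule
    `value > t  ->  label 1` (1 = LLM-generated, 0 = human-written)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        preds = [1 if v > t else 0 for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:  # keep the first threshold reaching the best accuracy
            best_t, best_acc = t, acc
    return best_t, best_acc


# Invented placeholder values: code lines per function and their labels.
code_lines = [5, 8, 12, 9, 20, 25, 18, 7]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, train_accuracy = fit_stump(code_lines, labels)
```

A real evaluation would measure accuracy on held-out snippets rather than on the training data, and would combine many metrics rather than a single one.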

Problem

Research questions and friction points this paper is trying to address.

Detecting LLM-generated code to prevent vulnerable software integration

Addressing limitations in cross-model validation and opaque detection methods

Investigating how granularity (function level vs. class level) affects detection of Claude 3 Haiku-generated code

Innovation

Methods, ideas, or system contributions that make the work stand out.

Case study of Claude 3 Haiku using 22 interpretable software metrics

Machine learning classifiers trained at function and class granularities

Feature analysis shows that function and class length help distinguish Claude 3 Haiku-generated code from human-written code

🔎 Similar Papers

CodeMirage: Hallucinations in Code Generated by Large Language Models