Investigating Training Data Detection in AI Coders

📅 2025-07-23

🤖 AI Summary

This study addresses the compliance, privacy, and intellectual property risks that arise when proprietary or sensitive code appears in CodeLLM training data, focusing on training data detection (TDD). It introduces CodeSnitch, the first function-level benchmark dataset for this task, and proposes controllable mutation strategies grounded in the Type-1 to Type-4 code clone taxonomy. It then evaluates the robustness of seven state-of-the-art detection methods across eight mainstream CodeLLMs and three programming languages. The experiments reveal critical limitations of existing approaches on code: high false-positive rates, poor generalization, and pronounced language bias. These findings provide an empirical foundation for training data provenance and point to improvements centered on code semantics and structural properties, supporting the responsible deployment of AI-powered code generation models.

📝 Abstract

Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely unexplored. This gap is particularly important because code has a structured syntax and similarity criteria that differ from those of natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight CodeLLMs. To support this evaluation, we introduce CodeSnitch, a function-level benchmark dataset comprising 9,000 code samples in three programming languages, each explicitly labeled as either included in or excluded from CodeLLM training. Beyond evaluation on the original CodeSnitch, we design targeted mutation strategies to test the robustness of TDD methods under three distinct settings. These mutation strategies are grounded in the well-established Type-1 to Type-4 code clone detection taxonomy. Our study provides a systematic assessment of current TDD techniques for code and offers insights to guide the development of more effective and robust detection methods.

Problem

Research questions and friction points this paper is trying to address.

Detect proprietary code in CodeLLM outputs to ensure compliance

Evaluate the effectiveness and robustness of TDD methods on code data

Develop a benchmark dataset for training data detection in CodeLLMs
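
To make the idea of a TDD score concrete, here is a minimal sketch of one scoring rule known from natural-language settings, Min-K% Prob, which averages the least likely tokens of a sample. The source does not say which seven methods were evaluated, so this is only an illustration of the general technique. The function names, the threshold, and the log-probability values are hypothetical; in practice the per-token log-probabilities would come from the CodeLLM under test.

```python
def min_k_score(token_logprobs, k_percent=20):
    """Mean of the lowest k_percent of token log-probabilities.

    A higher (less negative) score suggests the model finds even the
    hardest tokens unsurprising, which is treated as a hint of membership.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Number of tokens to keep: ceil(len * k_percent / 100), at least 1.
    n = max(1, (len(token_logprobs) * k_percent + 99) // 100)
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def predict_member(token_logprobs, threshold, k_percent=20):
    """Label a sample as 'seen in training' when its score clears the threshold."""
    return min_k_score(token_logprobs, k_percent) >= threshold


# Hypothetical per-token log-probabilities for two code samples.
likely_seen = [-0.1, -0.3, -0.2, -0.5, -0.1]
likely_unseen = [-0.1, -4.0, -0.2, -6.0, -0.3]

print(min_k_score(likely_seen, k_percent=40))
print(min_k_score(likely_unseen, k_percent=40))
```

The threshold is usually tuned on samples with known labels, which is the role a labeled benchmark such as CodeSnitch plays.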

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates seven TDD methods on code data

Introduces CodeSnitch benchmark dataset for validation

Tests robustness with mutation strategies grounded in the Type-1 to Type-4 code clone taxonomy
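
As a rough illustration of what clone-based mutations can look like, the sketch below produces a Type-1 variant (same code, with comments and layout changed) and a Type-2 variant (parameters and assigned variables renamed as well) of a Python function. It is not the paper's mutation tooling, which the source does not describe; the helper names and the sample function are hypothetical, and the renaming is deliberately naive (it ignores nested scopes and possible name collisions).

```python
import ast


class _Rename(ast.NodeTransformer):
    """Apply a fixed old-name -> new-name mapping to variables and parameters."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def type1_mutation(source):
    """Type-1 style: round-trip through the parser, dropping comments and original layout."""
    return ast.unparse(ast.parse(source))


def type2_mutation(source):
    """Type-2 style: additionally rename parameters and assigned variables."""
    tree = ast.parse(source)
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg not in names:
            names.append(node.arg)
        elif (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
              and node.id not in names):
            names.append(node.id)
    # Assumption: the generated names v0, v1, ... do not clash with existing ones.
    mapping = {name: f"v{i}" for i, name in enumerate(names)}
    return ast.unparse(_Rename(mapping).visit(tree))


original = '''
def total(prices, tax):
    # sum prices and apply tax
    subtotal = sum(prices)
    return subtotal * (1 + tax)
'''

print(type1_mutation(original))
print(type2_mutation(original))
```

Both variants keep the function's behavior, so a detector that is robust in the sense studied here should score them similarly to the original sample. `ast.unparse` requires Python 3.9 or later.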
