Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets

📅 2026-04-24
🤖 AI Summary
This work proposes FunPoison, a defense against the unauthorized use of code datasets for training large language models. FunPoison injects short, compilable, side-effect-free weak-use fragments into executed code paths, which degrades the performance of models trained on the data without authorization even when only 10% of the dataset is poisoned. The poisoned code retains 100% compilability and functional correctness, and the method remains robust against various advanced code sanitization techniques. Its core technical contributions are reusable statement-level templates, type-aware code synthesis, an automatic repair mechanism, and conservative safety checks, which together keep the injected code both stealthy and safe. To the best of our knowledge, this is the first approach to combine high defensive efficacy, full functional integrity, and resilience to state-of-the-art data-cleaning pipelines.

📝 Abstract
The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.
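
The injection idea described in the abstract can be illustrated with a minimal Python sketch. This is not FunPoison itself: the templates, the `_wu` variable prefix, and the helper names `inject_weak_use` and `select_poisoned` are hypothetical, and the sketch omits the automatic repair, safety checking, and type-aware synthesis the abstract mentions. It only shows the general shape of the technique: choose a fraction of samples, insert a small fragment that touches only a fresh local variable at the start of each function body (so it lies on an executed path), and confirm that the result still compiles and behaves as before.

```python
import ast
import random

# Hypothetical statement-level templates. "{v}" is replaced by a fresh local
# variable name, so each fragment only reads and writes its own variable.
TEMPLATES = [
    "{v} = 0\n{v} += 1",
    "{v} = []\n{v}.append(None)",
    "{v} = ''\n{v} = {v}.strip()",
]


def inject_weak_use(source: str, seed: int = 0) -> str:
    """Insert one template fragment at the start of every function body.

    Assumption: names beginning with "_wu" do not already occur in `source`.
    """
    rng = random.Random(seed)
    tree = ast.parse(source)
    # Collect the functions first so the tree is not mutated while walking it.
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    for index, func in enumerate(functions):
        fragment = ast.parse(rng.choice(TEMPLATES).format(v=f"_wu{index}")).body
        # Keep an existing docstring as the first statement of the function.
        position = 1 if ast.get_docstring(func) is not None else 0
        func.body[position:position] = fragment
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9 or newer


def select_poisoned(n_samples: int, rate: float = 0.1, seed: int = 0) -> set:
    """Pick the indices of the samples to poison (10% by default)."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_samples), round(n_samples * rate)))


if __name__ == "__main__":
    original = "def add(a, b):\n    return a + b\n"
    poisoned = inject_weak_use(original)
    compile(poisoned, "<poisoned>", "exec")  # raises SyntaxError if broken
    print(poisoned)
```

Because every fragment is confined to a newly introduced local variable, the function's return value and observable behavior are unchanged in this sketch. A real system would also need to guarantee fresh names and verify equivalence far more carefully than a single compile check.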

Problem

Research questions and friction points this paper is trying to address.

dataset poisoning
code datasets
functionality preservation
unauthorized use
compilability

Innovation

Methods, ideas, or system contributions that make the work stand out.

functionality-preserving poisoning
code dataset protection
weak-use fragments
type-aware synthesis
compilability preservation