🤖 AI Summary
This paper addresses the challenge of detecting whether large language models (LLMs) have been fine-tuned on copyright-protected data under strict black-box conditions. The authors propose TRACE, a framework based on private-key-guided, distortion-free watermark rewriting that leverages the "radioactivity" effect of fine-tuning together with an entropy-gated scoring mechanism to achieve high-sensitivity detection without degrading text quality or task performance. Its core contribution is reliable attribution in a fully black-box setting: no access to internal model signals (e.g., logits), no need for a clean reference dataset, and support for multi-dataset provenance tracing, even after continued pretraining on non-watermarked corpora. Extensive experiments across multiple LLM families and datasets yield statistically significant detections (p < 0.05) with high accuracy, empirically validating the practical feasibility of copyright-use traceability.
📝 Abstract
Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.
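To make the entropy-gated detection idea concrete, here is a minimal, self-contained sketch of how such a statistical test could work. It is not the paper's actual implementation: the gate threshold `gate`, the green-list fraction `gamma`, and the toy next-token distributions are all illustrative assumptions; the paper's distortion-free watermark and key-derived green lists are more involved.

```python
import math

def entropy(probs):
    # Shannon entropy (nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def binom_sf(k, n, p):
    # Exact upper tail P(X >= k) for X ~ Binomial(n, p).
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def entropy_gated_score(tokens, dists, green, gate=1.0, gamma=0.5):
    """Score only tokens generated at high-entropy positions.

    tokens : generated token ids from the suspect model
    dists  : the model's next-token distributions at each position
    green  : hypothetical key-derived "green list" (fraction gamma of vocab)
    Returns (hits, scored, p_value) under H0: hits ~ Binomial(scored, gamma),
    i.e. an unwatermarked model lands in the green list at the base rate gamma.
    """
    scored = hits = 0
    for tok, dist in zip(tokens, dists):
        if entropy(dist) < gate:  # skip low-entropy (near-forced) tokens:
            continue              # they carry little watermark signal
        scored += 1
        hits += tok in green
    p_value = binom_sf(hits, scored, gamma) if scored else 1.0
    return hits, scored, p_value

# Toy usage: one peaked (low-entropy) position is excluded from scoring.
uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]
hits, scored, p = entropy_gated_score(
    tokens=[0, 1, 2, 3],
    dists=[uniform, peaked, uniform, uniform],
    green={0, 2},
)
print(hits, scored, p)  # 2 of 3 scored tokens are green
```

Gating on entropy reflects the intuition stated in the abstract: at low-uncertainty positions the model's output is nearly determined by context regardless of any watermark, so including them only dilutes the test statistic.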