🤖 AI Summary
This paper addresses the challenge of detecting whether large language models (LLMs) have been fine-tuned on copyright-protected data under strict black-box conditions. The authors propose TRACE, a framework based on private-key-guided, distortion-free watermark rewriting that leverages the "radioactivity" effect of fine-tuning together with an entropy-gated scoring mechanism to achieve high-sensitivity detection without degrading text quality or task performance. Its core contribution is reliable attribution in a fully black-box setting: no access to internal model signals (e.g., logits), no need for a clean reference dataset, and support for multi-dataset provenance tracing, even after continued pretraining on non-watermarked corpora. Extensive experiments across multiple LLM families and datasets yield statistically significant detections (p < 0.05) with high accuracy, empirically validating the practical feasibility of copyright-use traceability.
📝 Abstract
Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.
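To make the entropy-gated detection idea concrete, here is a minimal, self-contained sketch of how such a statistical test could work. It is not the paper's actual implementation: the gate threshold `gate`, the green-list fraction `gamma`, and the toy next-token distributions are all illustrative assumptions; the paper's distortion-free watermark and key-derived green lists are more involved.

```python
import math

def entropy(probs):
    # Shannon entropy (nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def binom_sf(k, n, p):
    # Exact upper tail P(X >= k) for X ~ Binomial(n, p).
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def entropy_gated_score(tokens, dists, green, gate=1.0, gamma=0.5):
    """Score only tokens generated at high-entropy positions.

    tokens : generated token ids from the suspect model
    dists  : the model's next-token distributions at each position
    green  : hypothetical key-derived "green list" (fraction gamma of vocab)
    Returns (hits, scored, p_value) under H0: hits ~ Binomial(scored, gamma),
    i.e. an unwatermarked model lands in the green list at the base rate gamma.
    """
    scored = hits = 0
    for tok, dist in zip(tokens, dists):
        if entropy(dist) < gate:  # skip low-entropy (near-forced) tokens:
            continue              # they carry little watermark signal
        scored += 1
        hits += tok in green
    p_value = binom_sf(hits, scored, gamma) if scored else 1.0
    return hits, scored, p_value

# Toy usage: one peaked (low-entropy) position is excluded from scoring.
uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]
hits, scored, p = entropy_gated_score(
    tokens=[0, 1, 2, 3],
    dists=[uniform, peaked, uniform, uniform],
    green={0, 2},
)
print(hits, scored, p)  # 2 of 3 scored tokens are green
```

Gating on entropy reflects the intuition stated in the abstract: at low-uncertainty positions the model's output is nearly determined by context regardless of any watermark, so including them only dilutes the test statistic.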