AI Summary
This work addresses the challenge of tracing the training data sources of closed-source large language models (LLMs) by proposing the first dataset watermarking method for such models with provable detection. The approach embeds a dataset-level signal by rephrasing text so that randomly selected word pairs co-occur slightly more often. A statistical test on co-occurrence patterns in model-generated text then determines whether the model was trained on the watermarked dataset. Evaluation across multiple base models and benchmark datasets shows reliable detection (p < 0.01) in the fine-tuning stage, even when the watermarked dataset makes up only approximately 1% of the total fine-tuning tokens, while preserving the benchmark's utility and semantic integrity.
Abstract
Large language models (LLMs) are pre-trained and post-trained on vast amounts of loosely curated data, raising the possibility that these models may have been trained on proprietary datasets or the same benchmarks used for evaluation. This motivates the need for dataset watermarking: designing datasets such that training on them leaves detectable signatures in the resulting model. Prior work has explored this problem for open models. We introduce the first dataset watermarking method for closed LLMs with provable detection. In particular, we embed a dataset-level watermark signal by increasing the co-occurrence frequency of randomly selected word pairs through rephrasing, and detect it using a statistical test on co-occurrence patterns in model-generated outputs. We evaluate our method with multiple base models and benchmark datasets and show that it reliably detects the watermark ($p <0.01$) in the fine-tuning stage. Notably, our method remains effective in a data mixture setting where the watermarked dataset constitutes only approximately $1\%$ of the total fine-tuning tokens. Furthermore, we show that our method preserves the utility and semantic integrity of the benchmark.
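To make the detection step concrete, the following is a minimal sketch of how a co-occurrence test on model outputs might look. The abstract does not specify the test statistic, so this assumes a one-sided exact binomial test against a known baseline co-occurrence rate and treats each (output, pair) combination as an independent trial. The function names, the `baseline_rate` parameter, and the tokenization are illustrative assumptions, not the authors' implementation.

```python
import math
import re


def count_cooccurrences(texts, pairs):
    """Count (text, pair) combinations in which both words of the pair appear."""
    hits = 0
    for text in texts:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        hits += sum(1 for a, b in pairs if a in tokens and b in tokens)
    return hits


def binomial_tail(k, n, p0):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0), with 0 < p0 < 1.

    Terms are computed in log space so large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p0) + (n - i) * math.log(1 - p0)
        )
        total += math.exp(log_pmf)
    return min(1.0, total)


def detect_watermark(outputs, secret_pairs, baseline_rate, alpha=0.01):
    """Return (p_value, detected) for the hypothetical binomial test.

    Assumption: under the null hypothesis (model not trained on the
    watermarked data), each secret pair co-occurs in an output with
    probability baseline_rate, independently across outputs and pairs.
    """
    n = len(outputs) * len(secret_pairs)
    k = count_cooccurrences(outputs, secret_pairs)
    p_value = binomial_tail(k, n, baseline_rate)
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Toy data: the pair ("river", "amber") stands in for a secret word pair.
    outputs = ["The river glowed amber at dusk."] * 20
    p_value, detected = detect_watermark(outputs, [("river", "amber")], 0.05)
    print(p_value, detected)
```

In practice the baseline rate would itself have to be estimated, for example from control word pairs or from a model known not to have seen the data, and the independence assumption is a simplification that a real test would need to justify.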