🤖 AI Summary
This work addresses the challenge of data provenance tracing for LLM fine-tuning in black-box settings. We propose a fine-grained watermarking audit method based on imperceptible Unicode characters, which embeds invisible watermarks in document chunks, establishes a cue-reply matching mechanism, and incorporates counterfactual control sets with rank-based statistical testing to enable verifiable source attribution under strict false-positive control. Our core contribution lies in achieving both low intrusiveness and high robustness: experiments show a detection failure rate below 0.1% and zero false positives across more than 18,000 challenges; per-document detection rates exceed 45% and remain stable even when the marked collection constitutes less than 0.33% of the fine-tuning data, significantly outperforming existing black-box auditing approaches.
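The rank-based statistical test mentioned above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's code: it assumes the audit produces a single recovery score per watermark, and computes the standard exchangeability p-value, where with N counterfactual watermarks the false-positive rate of rejecting at level alpha is provably bounded by alpha.

```python
import random

def rank_test_pvalue(true_score: float, counterfactual_scores: list[float]) -> float:
    """Rank-based p-value: if the published watermark were exchangeable with
    the N counterfactuals (i.e., the model was NOT trained on the marked
    text), then P(p <= alpha) <= alpha. Formula: (1 + #{c >= s}) / (1 + N)."""
    n_ge = sum(1 for c in counterfactual_scores if c >= true_score)
    return (1 + n_ge) / (1 + len(counterfactual_scores))

# Toy audit: reply-recovery score of the published watermark vs. 99 held-out
# counterfactual watermarks that were never embedded in any document.
random.seed(0)
counterfactuals = [random.random() * 0.2 for _ in range(99)]  # all low scores
true_score = 0.9  # strong reply recovery for the real watermark
p = rank_test_pvalue(true_score, counterfactuals)
print(p)  # 0.01: the real watermark outranks every counterfactual
```

Because the p-value can never drop below 1/(N+1), the size of the counterfactual set directly controls the smallest achievable false-positive bound.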
📝 Abstract
We address the problem of auditing whether sensitive or copyrighted texts were used to fine-tune large language models (LLMs) under black-box access. Prior signals, such as verbatim regurgitation and membership inference, are unreliable at the level of individual documents or require altering the visible text. We introduce a text-preserving watermarking framework that embeds sequences of invisible Unicode characters into documents. Each watermark is split into a cue (embedded in odd chunks) and a reply (embedded in even chunks). At audit time, we submit prompts that contain only the cue; the presence of the corresponding reply in the model's output provides evidence of memorization consistent with training on the marked text. To obtain sound decisions, we compare the score of the published watermark against a held-out set of counterfactual watermarks and apply a ranking test with a provable false-positive-rate bound. The design is (i) minimally invasive (no visible text changes), (ii) scalable to many users and documents via a large watermark space and multi-watermark attribution, and (iii) robust to common passive transformations. We evaluate on open-weight LLMs and multiple text domains, analyzing regurgitation dynamics, sensitivity to training set size, and interference under multiple concurrent watermarks. Our results demonstrate reliable post-hoc provenance signals with bounded FPR under black-box access. We experimentally observe a failure rate of less than 0.1% when detecting a reply after fine-tuning with 50 marked documents. Conversely, no spurious reply was recovered in over 18,000 challenges, corresponding to 100% TPR at 0% FPR. Moreover, detection rates remain relatively stable as the dataset size increases, maintaining a per-document detection rate above 45% even when the marked collection accounts for less than 0.33% of the fine-tuning data.