🤖 AI Summary
This work reveals the fragility of large language model (LLM) watermarks under knowledge distillation: student models can systematically evade watermark inheritance, enabling unauthorized appropriation of teacher capabilities. To address this, the authors propose, for the first time, two watermark-removal paradigms covering the entire distillation pipeline: (i) pre-distillation removal via untargeted and targeted paraphrasing of the training data, and (ii) post-distillation removal via inference-time watermark neutralization (WN). Experiments show that WN thoroughly eliminates inherited watermarks while preserving knowledge transfer efficiency, incurs low computational overhead, and generalizes across diverse teacher-student model pairs and mainstream watermarking schemes. This study is the first to systematically probe the robustness of watermark radioactivity against adversarial distillation, and it introduces an end-to-end, efficient, and broadly applicable watermark-removal framework.
📝 Abstract
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes, and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
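To make the notion of "detecting an inherited watermark" concrete, here is a minimal toy sketch of a green-list statistical detector in the style of common LLM watermarking schemes (e.g. the KGW family). It is an illustration under simplifying assumptions (tiny integer vocabulary, previous token seeds the green list), not the detector or watermarking scheme used in the paper: if a student model's outputs over-use green-listed tokens, the z-score stays high, and the removal attacks described above aim to drive it back toward zero.

```python
import hashlib
import math

def watermark_z_score(tokens, vocab_size=50, green_ratio=0.5):
    """Toy green-list detector: hash each token to derive a pseudo-random
    'green list' for the next position, count how often the next token is
    green, and return a z-score against the chance rate `green_ratio`.
    Illustrative sketch only; real schemes operate on LLM tokenizers."""
    hits = 0
    for prev, cur in zip(tokens, tokens[1:]):
        # Seed the green/red vocabulary partition with the previous token.
        seed = int(hashlib.sha256(str(prev).encode()).hexdigest(), 16)
        green = {(seed + i) % vocab_size
                 for i in range(int(vocab_size * green_ratio))}
        if cur in green:
            hits += 1
    n = len(tokens) - 1
    # Under the null (unwatermarked text), hits ~ Binomial(n, green_ratio).
    return (hits - green_ratio * n) / math.sqrt(
        n * green_ratio * (1 - green_ratio))

def generate_watermarked(length, vocab_size=50, start=0):
    """Emit a sequence that always picks a green token (maximal watermark)."""
    tokens = [start]
    for _ in range(length):
        seed = int(hashlib.sha256(str(tokens[-1]).encode()).hexdigest(), 16)
        tokens.append(seed % vocab_size)  # first element of the green list
    return tokens

wm = generate_watermarked(40)
print(f"watermarked z-score: {watermark_z_score(wm):.2f}")  # well above 4
```

A detection threshold around z = 4 is a common convention in this line of work; fully watermarked text of this length scores far above it, while paraphrased or neutralized text should fall below.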