Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

📅 2025-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work reveals the fragility of large language model (LLM) watermarks under knowledge distillation: student models can systematically evade watermark inheritance, enabling unauthorized appropriation of teacher capabilities. To demonstrate this, the authors propose two watermark-removal paradigms covering the entire distillation pipeline: (i) pre-distillation data rewriting via untargeted and targeted paraphrasing of the training data, and (ii) post-distillation inference-time watermark neutralization (WN). Experiments show that WN preserves knowledge transfer fidelity while eliminating watermark traces, incurs negligible computational overhead, and generalizes across diverse LLM architectures and mainstream watermarking schemes. The study is the first to systematically analyze this failure mode of watermark "radioactivity," in which inherited watermarks can be made to degrade or vanish during distillation, and introduces an end-to-end, efficient, and broadly applicable watermark-removal framework.

📝 Abstract
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
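The watermark radioactivity the abstract relies on can be illustrated with a KGW-style green/red-list scheme, where detection counts pseudo-randomly designated "green" tokens and computes a z-score against the unwatermarked expectation; a student trained on watermarked teacher outputs inherits the green-token bias and thus a high z-score. A minimal, self-contained sketch, assuming a hash-based green partition with fraction `GAMMA` and a private `key` (all names here are illustrative, not the paper's or any scheme's actual implementation):

```python
import hashlib

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green"

def is_green(prev_token: int, token: int, key: str = "secret") -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the
    preceding token and a private key (KGW-style bigram partition)."""
    h = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return h[0] < int(GAMMA * 256)

def watermark_z_score(tokens: list[int], key: str = "secret") -> float:
    """z-score of the observed green-token count against the null
    hypothesis of unwatermarked text (expected green fraction GAMMA)."""
    n = len(tokens) - 1  # number of scored bigrams
    greens = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    expected = GAMMA * n
    var = n * GAMMA * (1 - GAMMA)
    return (greens - expected) / var ** 0.5
```

Watermarked (or radioactively inherited) text yields a large positive z-score; the paraphrasing and neutralization attacks studied here aim to push it back toward zero.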
Problem

Research questions and friction points this paper is trying to address.

Robustness of LLM watermarking against adversarial attacks
Prevention of unauthorized knowledge distillation in LLMs
Effectiveness of watermark removal techniques in maintaining knowledge transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-distillation watermark removal
Post-distillation watermark neutralization
Maintains knowledge transfer efficiency
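The post-distillation idea above can be sketched for a KGW-style additive-bias watermark: if the student has inherited a tendency to up-weight green tokens, an estimated bias can be subtracted from green-token logits at decoding time. This is only a plausible illustration under the assumption that the green partition (keyed here by a hash) can be reconstructed or estimated; `DELTA_EST` and the helper names are hypothetical, and the paper's actual WN procedure may differ:

```python
import hashlib
import numpy as np

GAMMA = 0.25      # assumed green-list fraction
DELTA_EST = 2.0   # assumed estimate of the logit bias the student inherited

def is_green(prev_token: int, token: int, key: str = "secret") -> bool:
    """Same hash-based bigram partition as used for detection."""
    h = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return h[0] < int(GAMMA * 256)

def neutralize_logits(logits: np.ndarray, prev_token: int,
                      key: str = "secret") -> np.ndarray:
    """Subtract the estimated inherited bias from green-token logits
    before sampling, pushing the green-token rate back toward GAMMA."""
    out = logits.copy()
    for t in range(len(out)):
        if is_green(prev_token, t, key):
            out[t] -= DELTA_EST
    return out
```

Applied at every decoding step, this cancels the statistical signal the detector looks for while leaving the relative ordering of non-green tokens untouched.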
🔎 Similar Papers
2024-06-17 · North American Chapter of the Association for Computational Linguistics · Citations: 2