PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs

📅 2025-10-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current watermarking techniques for open-source large language models (LLMs) suffer from weak detectability, poor robustness against downstream modifications (e.g., fine-tuning or model merging), and inadequate intellectual property protection. To address these challenges, this paper proposes a joint-training-based text watermarking method that co-optimizes a watermark policy model and the target LLM. It incorporates pattern-alignment regularization and knowledge-distillation-guided adversarial perturbations to ensure precise watermark embedding and strong robustness in generated text. Experimental evaluation on LLaMA-3.2, LLaMA-3, and Phi-2 demonstrates that our method achieves over 92% watermark detection accuracy—even after aggressive parameter fine-tuning or model merging—surpassing baseline methods by more than 15 percentage points. This work provides a practical, deployable solution for copyright attribution and content authenticity verification in open-source LLMs.

Technology Category

Application Category

📝 Abstract
Text watermarking for large language models (LLMs) enables model owners to verify text origin and protect intellectual property. While watermarking methods for closed-source LLMs are relatively mature, extending them to open-source models remains challenging, as developers cannot control the decoding process. Consequently, owners of open-source LLMs lack practical means to verify whether text was generated by their models. A core difficulty lies in embedding watermarks directly into model weights without hurting detectability. A promising idea is to distill watermarks from a closed-source model into an open one, but this suffers from (i) poor detectability due to mismatch between learned and predefined patterns, and (ii) fragility to downstream modifications such as fine-tuning or model merging. To overcome these limitations, we propose PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO jointly trains a watermark policy model with the LLM, producing patterns that are easier for the model to learn and more consistent with detection criteria. A regularization term further simulates downstream perturbations and penalizes degradation in watermark detectability, ensuring robustness under model edits. Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO substantially improves both watermark detectability and resilience to model modifications.
Problem

Research questions and friction points this paper is trying to address.

Developing robust text watermarking for open-source LLMs
Embedding watermarks directly into model weights without detectability loss
Ensuring watermark resilience to downstream modifications like fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly trains watermark policy model with LLM
Uses regularization to simulate downstream perturbations
Ensures robustness under model edits and modifications
🔎 Similar Papers
2024-06-17North American Chapter of the Association for Computational LinguisticsCitations: 2