MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

πŸ“… 2025-12-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing watermarking methods for open-weight language models struggle to simultaneously ensure robust detection and preserve text generation quality. Method: This paper proposes a policy-fine-tuning–based watermark enhancement framework. Unlike conventional weight-fine-tuning approaches (e.g., GaussMark), it formulates the watermark signal as an optimizable reward in reinforcement learning and performs on-policy policy optimization in the representation space, guided by a quality-aware regularization term to achieve fine-grained, low-disturbance watermark embedding. Watermark injection is controlled by a secret key, enabling reliable detection. Results: Experiments demonstrate that the method maintains near-original model performance in generation quality (as measured by BLEU and perplexity) while achieving detection accuracy comparable to inference-time watermarking. It exhibits strong robustness against rewriting, fine-tuning, and other adversarial attacks, and generalizes effectively to unseen data.
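The objective described above (watermark signal as an RL reward, offset by a quality-aware regularizer against the original model) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `combined_reward`, the per-sample KL-style penalty, and the coefficient `beta` are assumptions for exposition.

```python
def combined_reward(watermark_score: float,
                    logprob_policy: float,
                    logprob_ref: float,
                    beta: float = 0.1) -> float:
    """Hypothetical per-sample reward in the spirit of MarkTune's objective.

    watermark_score: detection-statistic signal for the generated text
                     (higher = more detectable under the secret key).
    logprob_policy:  log-probability of the sample under the fine-tuned policy.
    logprob_ref:     log-probability under the frozen reference (original) model.
    beta:            strength of the quality regularizer.
    """
    # Single-sample KL-style estimate penalizing drift from the
    # reference model, which stands in for the quality regularizer.
    quality_penalty = logprob_policy - logprob_ref
    return watermark_score - beta * quality_penalty
```

An on-policy optimizer (e.g., a policy-gradient method) would then maximize the expectation of this reward over generated samples, trading watermark detectability against divergence from the original model's distribution.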

πŸ“ Abstract
Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
Problem

Research questions and friction points this paper is trying to address.

Improves watermark detectability without sacrificing text quality
Addresses challenges of watermarking open-weight language models
Enhances robustness against attacks while preserving generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy fine-tuning framework for watermarking
Treats watermark signal as reward with quality regularization
Steers watermark-aware weight updates in representation space
πŸ”Ž Similar Papers