🤖 AI Summary
This work identifies a cross-model watermark collision problem in logit-based watermarking for large language models (LLMs): benign, watermark-free texts generated by one model may inadvertently trigger the watermark detector of another, causing false positives. The issue is pervasive in downstream tasks such as translation and paraphrasing, and it severely undermines watermark reliability for copyright protection and content provenance. The authors formally define watermark collision as a novel, generic adversarial paradigm and provide a theoretical argument that it universally threatens all logit-based watermarking schemes. Through cross-model embedding/detection experiments and multi-task empirical evaluation across mainstream LLMs, they demonstrate the prevalence and severity of such collisions, which reduce watermark detection accuracy by over 40% on average. Moving beyond conventional targeted attacks, this work establishes a new robustness benchmark and offers a foundational perspective for watermark resilience research in LLMs.
📝 Abstract
The proliferation of content generated by large language models (LLMs) raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread adoption of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks such as paraphrasing or translation. In this paper, we introduce watermark collision as a novel and general philosophy for watermark attacks, aimed at enhancing attack performance on top of any other attack method. We also provide a comprehensive demonstration that watermark collision poses a threat to all logit-based watermark algorithms, impacting not only specific attack scenarios but also downstream applications.
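To make the collision setting concrete, the sketch below shows a minimal KGW-style logit-based watermark detector, which is the general family of schemes the abstract refers to. It is an illustrative toy, not the authors' code: `VOCAB_SIZE`, `GAMMA`, `DELTA`, and the SHA-256 seeding are placeholder choices. The key point is that each watermarking party uses its own secret key, so when text produced under one key is scored by a detector holding a different key, the green-token statistics of the two schemes interact, which is the setting in which collisions arise.

```python
import hashlib
import random

VOCAB_SIZE = 50_000  # placeholder vocabulary size
GAMMA = 0.5          # fraction of the vocabulary in the "green" list
DELTA = 2.0          # logit bias added to green tokens at generation time (embedding side)

def green_list(prev_token: int, key: str) -> set[int]:
    # Seed a PRNG from the previous token and a secret key, then
    # pseudo-randomly partition the vocabulary into green/red tokens.
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def detect_z(tokens: list[int], key: str) -> float:
    # Count tokens that fall in their context's green list and compare
    # against the GAMMA baseline expected for unwatermarked text.
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_list(prev, key))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / (GAMMA * (1 - GAMMA) * n) ** 0.5
```

A text generated with the green list of key A scores a high z-value under key A's detector and a near-zero z-value under an unrelated key; the paper studies what happens when tasks like paraphrasing or translation chain models whose keyed statistics are not independent in this way.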