AI Summary
This work addresses watermarking for provenance tracking and copyright protection of LLM-generated text, tackling the critical vulnerability of existing watermarks to detection and removal. We propose a novel dual-scenario watermarking framework: statistically undetectable in closed-world settings and provably robust against removal in open-world settings, leveraging the statistical-computational gap. Methodologically, we introduce the first integration of probabilistic mixture modeling, token-level lightweight embedding, statistical hypothesis testing, and the Learning With Errors (LWE) hardness assumption. Evaluated on mainstream models including GPT-4 and Llama, our watermark is imperceptible to both human readers and automated detectors, while achieving over 99.2% retention under strong sanitization attacks, including model distillation and synonym substitution, significantly outperforming state-of-the-art approaches.
Abstract
Given a text, can we determine whether it was generated by a large language model (LLM) or by a human? A widely studied approach to this problem is watermarking. We propose an undetectable and elementary watermarking scheme in the closed setting. In the harder open setting, where the adversary has access to most of the model, we propose an unremovable watermarking scheme.
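To illustrate the kind of token-level statistical hypothesis testing mentioned above, here is a minimal sketch of a standard keyed "green-list" watermark detector: a one-sided z-test asking whether tokens from a secretly keyed half of the vocabulary are over-represented. This is a generic illustration, not the paper's actual construction; the names `SECRET_KEY`, `is_green`, and `detect_watermark` are hypothetical, and the assumed green fraction `gamma = 0.5` is an illustrative choice.

```python
import hashlib
import math

SECRET_KEY = b"demo-key"  # hypothetical watermark key, for illustration only


def is_green(token: str) -> bool:
    # A keyed hash deterministically partitions the vocabulary into
    # an (approximately) 50% "green" list known only to the key holder.
    h = hashlib.sha256(SECRET_KEY + token.encode("utf-8")).digest()
    return h[0] % 2 == 0


def detect_watermark(tokens, gamma=0.5, alpha=1e-3):
    """One-sided z-test: reject H0 (unwatermarked text) when the
    observed green-token count is significantly above the chance
    rate gamma. Returns (watermark_detected, z_score)."""
    n = len(tokens)
    green = sum(is_green(t) for t in tokens)
    # Under H0, green ~ Binomial(n, gamma); use the normal approximation.
    z = (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return p_value < alpha, z
```

Text produced by a watermarked sampler that biases generation toward green tokens yields a large positive z-score, while ordinary human text stays near zero, so false positives can be controlled by the significance level `alpha`.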