SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

📅 2025-08-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM text watermarking methods rely on white-box access, logit manipulation, or model fine-tuning, rendering them incompatible with black-box API-based LLMs and multilingual settings and often degrading text quality. Method: We propose the first general-purpose, post-hoc, multi-bit watermarking framework for API-callable LLMs, requiring no modification of model parameters or logits. During inference, it performs key-directed statistical matching over hidden-layer features extracted by sparse autoencoders, combining deterministic feature analysis with rejection sampling for personalized watermark embedding. Contribution/Results: The approach admits a theoretical trade-off between detection success rate and computational overhead. Evaluated across four datasets, it achieves 99.7% F1 detection accuracy, preserves generation quality, and supports cross-lingual and cross-domain plug-and-play deployment, significantly outperforming prior art.
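
To make the mechanism concrete, here is a minimal sketch of the kind of key-directed rejection sampling the summary describes. It assumes a black-box `generate(prompt)` API, a deterministic `extract_features(text)` statistic (e.g., aggregated SAE activations), and a hash-based `key_to_target` derivation; all names, dimensions, and thresholds are illustrative assumptions, not the paper's actual implementation.

```python
import hashlib
import numpy as np

def key_to_target(secret_key: str, message_bits: str, dim: int = 16) -> np.ndarray:
    """Derive a deterministic target feature statistic from the key and message.

    Hypothetical construction: hash the key and message into a seed,
    then draw a pseudo-random target vector in [0, 1]^dim.
    """
    seed = int.from_bytes(
        hashlib.sha256(f"{secret_key}|{message_bits}".encode()).digest()[:8], "big"
    )
    rng = np.random.default_rng(seed)
    return rng.random(dim)

def embed_watermark(prompt, generate, extract_features, secret_key, message_bits,
                    max_candidates: int = 16, threshold: float = 0.25):
    """Rejection sampling: draw black-box completions until one whose
    feature statistic is close enough to the key-derived target."""
    target = key_to_target(secret_key, message_bits)
    best_text, best_dist = None, float("inf")
    for _ in range(max_candidates):
        text = generate(prompt)            # black-box API call, no logit access
        stats = extract_features(text)     # deterministic feature statistic
        dist = float(np.linalg.norm(stats - target))
        if dist < best_dist:
            best_text, best_dist = text, dist
        if dist <= threshold:              # accept early once the match is good enough
            break
    return best_text, best_dist
```

Spending more candidates buys a closer match to the target, which is the compute/detection trade-off the summary refers to.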

📝 Abstract
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation, which excludes API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality, because it samples LLM outputs rather than modifying them. We provide theoretical guarantees relating watermark success probability to compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
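
On the detection side (not spelled out in the abstract), a plausible counterpart is to recompute the same deterministic feature statistic on the suspect text and score it against the key-derived target for each candidate message. The sketch below reuses the hypothetical `key_to_target` and `extract_features` helpers from the embedding sketch above; it is an assumption about the general scheme, not the paper's detector.

```python
import numpy as np

def detect_watermark(text, extract_features, secret_key, candidate_messages,
                     threshold: float = 0.25):
    """Score a suspect text against each candidate multi-bit message.

    Returns (best_message, distance) if some key-derived target is matched
    within the threshold, otherwise (None, distance of the closest miss).
    """
    stats = extract_features(text)
    best_msg, best_dist = None, float("inf")
    for bits in candidate_messages:
        target = key_to_target(secret_key, bits)   # same derivation as at embedding time
        dist = float(np.linalg.norm(stats - target))
        if dist < best_dist:
            best_msg, best_dist = bits, dist
    if best_dist <= threshold:
        return best_msg, best_dist
    return None, best_dist
```

Because both sides derive the target only from the secret key and the message, detection needs neither the generating model nor its logits, matching the black-box setting described above.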
Problem

Research questions and friction points this paper is trying to address.

Existing methods degrade text quality and need model access
Current approaches exclude API-based models and multilingual cases
Lack of scalable watermarking for closed-source LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time feature-based rejection sampling
Multi-bit watermarking without logit alteration
Sparse Autoencoders for superior detection accuracy
Zhuohao Yu
Peking University
Natural Language Processing, Software Engineering
Xingru Jiang
Peking University
Weizheng Gu
Peking University
Yidong Wang
Peking University
Shikun Zhang
Peking University
Wei Ye
Peking University