🤖 AI Summary
Existing LLM text watermarking methods rely on white-box access, logit manipulation, or model fine-tuning, rendering them incompatible with black-box API-based LLMs and multilingual settings, while often degrading text quality.
Method: We propose the first general-purpose, post-hoc, multi-bit watermarking framework for API-callable LLMs—requiring no model parameter or logit modification. During inference, it leverages key-directed statistical matching over sparse autoencoder–extracted hidden-layer features, integrating deterministic feature analysis and rejection sampling for personalized watermark embedding.
Contribution/Results: Our approach theoretically balances detection success rate and computational overhead. Evaluated across four datasets, it achieves 99.7% F1 detection accuracy, preserves generation quality, and supports cross-lingual and cross-domain plug-and-play deployment—significantly outperforming prior art.
📝 Abstract
Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods either compromise text quality or require white-box model access and logit manipulation, which excludes API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. The framework naturally generalizes across languages and domains while preserving text quality, since it samples from the LLM's outputs rather than modifying them. We provide theoretical guarantees relating watermark success probability to compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out of the box with closed-source LLMs while enabling content attribution.