SLAM: Structural Linguistic Activation Marking for Language Models

📅 2026-05-06

🤖 AI Summary
This work proposes SLAM, a white-box method for embedding reliably detectable watermarks in large language model output without compromising the quality of the generated text. In contrast to conventional approaches that perturb the next-token distribution, it encodes the watermark in sparse residual-stream directions that correspond to linguistic structure (such as voice, tense, and clause order), using sparse autoencoders to identify those directions. During generation, the model's activations are causally steered along these directions to embed the watermark, leaving lexical sampling unconstrained. Evaluated on Gemma-2 2B and 9B, the method achieves 100% detection accuracy at a cost of only 1-2 reward points, compared to 7.5-11.5 for the KGW, EWD, and Unigram baselines, while preserving naturalness and diversity at near-unwatermarked levels. The trade-off is robustness: SLAM resists word-level edits but is vulnerable to paraphrases that restructure syntax.
📝 Abstract
LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection with measurable quality loss. We present SLAM (Structural Linguistic Activation Marking), a novel white-box watermarking scheme that sidesteps this cost by writing the mark into structural geometry rather than token frequencies: sparse autoencoders identify residual-stream directions encoding linguistic structure (e.g., voice, tense, clause order), and we causally steer those directions at generation time, leaving lexical sampling and semantics unconstrained. On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy with a quality cost of only 1-2 reward points (compared to 7.5-11.5 for KGW, EWD, and Unigram), with naturalness and diversity preserved at near-unwatermarked levels across both models. The trade-off is a complementary robustness profile: SLAM resists word-level edits but is vulnerable to paraphrase that restructures syntax (at a quality cost), the converse of token-distribution methods.
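The steering idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a single unit-norm direction `d`, drawn at random here as a stand-in for a direction an SAE would supply. It also assumes a hypothetical detector that thresholds the mean projection onto `d`; the abstract does not describe the actual detection rule. All function names and constants below are illustrative.

```python
import math
import random

def unit(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def projection(h, d):
    """Scalar projection of hidden state h onto unit direction d."""
    return sum(a * b for a, b in zip(h, d))

def steer(h, d, target):
    """Shift h along d so its projection equals `target`.

    Only the component along d changes; everything orthogonal to d
    is left untouched.
    """
    delta = target - projection(h, d)
    return [a + delta * b for a, b in zip(h, d)]

def detect(hidden_states, d, threshold):
    """Hypothetical detector (an assumption, not the paper's rule):
    flag the text if the mean projection onto d exceeds threshold."""
    mean_proj = sum(projection(h, d) for h in hidden_states) / len(hidden_states)
    return mean_proj > threshold

# Toy data: 32 random 16-dimensional "residual stream" vectors.
random.seed(0)
DIM, N_TOKENS = 16, 32
d = unit([random.gauss(0.0, 1.0) for _ in range(DIM)])  # stand-in for an SAE direction
plain = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_TOKENS)]
marked = [steer(h, d, target=4.0) for h in plain]

print(detect(marked, d, threshold=2.0))
print(detect(plain, d, threshold=2.0))
```

In a real model the steering would be applied to residual-stream activations during generation (for example through a forward hook), and detection would need white-box access to recompute activations on the candidate text.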
Problem

Research questions and friction points this paper is trying to address.

LLM watermarking
text quality
detection accuracy
structural linguistics
token distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

watermarking
structural linguistic representation
sparse autoencoder
residual stream steering
language model security