The Attentional White Bear Effect in Transformer Language Models

📅 2026-05-27
🤖 AI Summary
This study investigates whether instruction-based content suppression genuinely erases prohibited concepts from a model's internal representations or merely blocks their explicit output. Through representation probing, attention analysis, and semantic leakage experiments, the authors systematically evaluate the internal states of multiple Transformer models under various suppression strategies. Their findings reveal a fundamental gap between behavioral alignment and representational alignment: even when models successfully avoid generating sensitive terms, the underlying prohibited concepts remain strongly encoded in hidden states and significantly influence attention routing and downstream generation. This phenomenon is robust across several mainstream model families, highlighting a critical limitation of current alignment methods at the representational level.
📝 Abstract
Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.
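As a rough illustration of what "recoverable from hidden representations" can mean in practice, the sketch below trains a linear probe on mean-pooled hidden states and reports held-out accuracy. It is not the authors' procedure: the hidden states are synthetic (random noise plus a planted concept direction), and every name here (`mean_pool`, `fit_probe`, `probe_accuracy`, `synthetic_hidden_states`) is hypothetical. In a real study the arrays would come from a model's hidden layers under suppression and control prompts.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.
    hidden: (batch, seq, dim); mask: (batch, seq), 1 for real tokens."""
    mask = mask[..., None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1.0, None)

def fit_probe(X, y, lr=0.5, steps=500, l2=1e-3):
    """Logistic-regression probe fitted with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept present)
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose sign(logit) matches the label."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def synthetic_hidden_states(n=400, seq=8, dim=16, strength=1.0, seed=0):
    """ASSUMPTION: stand-in for real model activations.
    Label 1 shifts every token along a fixed direction, label 0 shifts it back."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    hidden = rng.normal(size=(n, seq, dim))
    hidden += strength * (2 * y - 1)[:, None, None] * direction
    mask = np.ones((n, seq), dtype=int)
    return hidden, mask, y

if __name__ == "__main__":
    hidden, mask, y = synthetic_hidden_states()
    X = mean_pool(hidden, mask)
    w, b = fit_probe(X[:300], y[:300])          # train split
    print("held-out probe accuracy:", probe_accuracy(X[300:], y[300:], w, b))
```

High held-out accuracy on such a probe indicates only that the concept is linearly decodable from the pooled states; it says nothing by itself about whether the model uses that information, which is why the abstract pairs probing with attention and leakage analyses.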
Problem

Research questions and friction points this paper is trying to address.

Attentional White Bear Effect
instruction-based suppression
internal representation
behavioral alignment
semantic leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

representational alignment
instruction-based suppression
attention routing
semantic leakage
Transformer language models