The Attentional White Bear Effect in Transformer Language Models

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether instruction-based content suppression removes prohibited concepts from a model’s internal representations or merely blocks their explicit output. Using representational probing, attention analysis, and semantic leakage experiments, the authors evaluate the internal states of multiple Transformer models under suppression. Their findings reveal a fundamental gap between behavioral alignment and representational alignment: even when models successfully avoid generating the prohibited terms, the underlying concepts remain highly recoverable from hidden states, continue to influence attention routing, and measurably shape downstream generation. The effect persists across pooling strategies, indirect semantic controls, and multiple model families, pointing to a limitation of instruction-based suppression at the representational level.

📝 Abstract

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.
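To make "recoverable from hidden representations" concrete, here is a minimal sketch of a linear probe over mean-pooled hidden states. It is not the authors' code: the data are synthetic stand-ins for real model activations, and every name (mean_pool, fit_linear_probe, make_synthetic_states, the signal parameter) is a hypothetical chosen for illustration. The assumption is that a concept counts as recoverable when a simple linear classifier trained on pooled states predicts its presence well above chance on held-out examples.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.

    hidden: (batch, seq, dim) array; mask: (batch, seq) array of 0/1.
    """
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)


def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe trained with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # L2-regularised gradient step
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples where the probe's sign matches the label."""
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))


def make_synthetic_states(n=400, seq=8, dim=32, signal=1.0, seed=0):
    """Synthetic stand-in for hidden states (an assumption, not real data).

    Label 1 means the concept is present; it shifts every token vector
    along one fixed random direction by +signal, label 0 by -signal.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    hidden = rng.normal(size=(n, seq, dim))
    hidden += signal * (2 * y - 1)[:, None, None] * direction
    mask = np.ones((n, seq), dtype=int)
    return hidden, mask, y


def held_out_accuracy(signal, n_train=300):
    """Train the probe on the first n_train examples, score on the rest."""
    hidden, mask, y = make_synthetic_states(signal=signal)
    X = mean_pool(hidden, mask)
    w, b = fit_linear_probe(X[:n_train], y[:n_train])
    return probe_accuracy(X[n_train:], y[n_train:], w, b)


if __name__ == "__main__":
    print("concept encoded:", held_out_accuracy(signal=1.0))
    print("no-signal control:", held_out_accuracy(signal=0.0))
```

With real models, the pooled matrix X would instead come from a chosen layer's hidden states for prompts with and without the suppression instruction. The no-signal control plays the role of a baseline: a probe that scores well there would indicate a flaw in the setup, not an encoded concept.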

Problem

Research questions and friction points this paper is trying to address.

Attentional White Bear Effect

instruction-based suppression

internal representation

behavioral alignment

semantic leakage

Innovation

Methods, ideas, or system contributions that make the work stand out.

representational alignment

instruction-based suppression

attention routing