Attention Layers Add Into Low-Dimensional Residual Subspaces

📅 2025-08-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies that Transformer attention outputs are confined to a surprisingly low-dimensional subspace; this intrinsic low-rank structure, induced by the attention output projection matrix, is a fundamental cause of the "dead feature" problem in sparse dictionary learning, which arises from a geometric mismatch between randomly initialized dictionary atoms and the activation manifold. To address this, the authors propose subspace-constrained training: principal component analysis of the attention outputs identifies the active low-dimensional subspace, and the sparse autoencoder (SAE) dictionary is initialized within it. The method dramatically improves feature utilization: in million-feature SAEs trained on attention outputs, the dead-feature ratio drops from 87% to under 1%, and the approach generalizes across diverse model architectures and datasets by directly tackling geometric misalignment at initialization.

📝 Abstract
While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are confined to a surprisingly low-dimensional subspace, where about 60% of the directions account for 99% of the variance, a phenomenon that is induced by the attention output projection matrix and consistently observed across diverse model families and datasets. Critically, we identify this low-rank structure as a fundamental cause of the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.
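The abstract's central measurement, the number of principal directions needed to explain 99% of activation variance, can be reproduced with a PCA-style spectral check. The sketch below uses simulated activations (the paper's actual models and data are not available here; the shapes and scales are assumptions chosen so that roughly 60% of directions dominate, mirroring the reported figure):

```python
import numpy as np

# Simulated stand-in for attention outputs (n_tokens, d_model): strong
# signal in a 150-dim subspace, weak noise elsewhere. Purely illustrative.
rng = np.random.default_rng(0)
n_tokens, d_model = 4096, 256
basis = np.linalg.qr(rng.standard_normal((d_model, d_model)))[0]
scales = np.concatenate([np.full(150, 10.0), np.full(d_model - 150, 0.1)])
acts = rng.standard_normal((n_tokens, d_model)) * scales @ basis.T

# PCA via SVD of the centered activations; squared singular values
# are proportional to variance along each principal direction.
centered = acts - acts.mean(axis=0)
sing = np.linalg.svd(centered, compute_uv=False)
var_ratio = np.cumsum(sing**2) / np.sum(sing**2)

# Smallest number of principal directions explaining 99% of the variance.
k99 = int(np.searchsorted(var_ratio, 0.99) + 1)
print(f"{k99}/{d_model} directions explain 99% of variance")
```

On this synthetic data, roughly 150 of 256 directions (about 60%) suffice, matching the shape of the paper's claim; on real attention outputs the exact fraction would depend on the model and layer.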
Problem

Research questions and friction points this paper is trying to address.

Attention outputs occupy low-dimensional residual subspaces
Low-rank structure causes dead feature problem
Proposing subspace-constrained training for sparse autoencoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-dimensional subspace constraint for attention outputs
Subspace-constrained training for sparse autoencoders
Feature initialization into activation active subspace
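The three innovation points above amount to one procedure: estimate the active subspace of the activations, then draw SAE dictionary atoms inside it rather than in the full space. A minimal sketch, assuming a numpy-level view of the idea (the function name `subspace_init` and all parameters are hypothetical, not the paper's API):

```python
import numpy as np

def subspace_init(acts, n_features, var_threshold=0.99, seed=0):
    """Hypothetical sketch of subspace-constrained SAE initialization:
    find the principal subspace of `acts` explaining `var_threshold`
    of the variance, then sample dictionary directions inside it."""
    rng = np.random.default_rng(seed)
    centered = acts - acts.mean(axis=0)
    # Right singular vectors = principal directions of the activations.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = np.cumsum(sing**2) / np.sum(sing**2)
    k = int(np.searchsorted(var_ratio, var_threshold) + 1)
    subspace = vt[:k]                      # (k, d_model) orthonormal basis
    # Random coefficients in the k-dim subspace, mapped back to model space,
    # so every atom lies in the active subspace by construction.
    coeffs = rng.standard_normal((n_features, k))
    decoder = coeffs @ subspace            # (n_features, d_model)
    # Unit-normalize each dictionary atom, as is conventional for SAEs.
    decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)
    return decoder, k

# Example on synthetic activations dominated by a ~20-dim subspace.
rng = np.random.default_rng(1)
scales = np.concatenate([np.full(20, 5.0), np.full(44, 0.05)])
acts = rng.standard_normal((1000, 64)) * scales
decoder, k = subspace_init(acts, n_features=128)
```

Because every atom starts inside the subspace that the activations actually occupy, random initializations that would never receive gradient signal (the source of dead features) are avoided by construction.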
Junxuan Wang
Shanghai Innovation Institute
Xuyang Ge
OpenMOSS Team, School of Computer Science, Fudan University
Wentao Shu
Shanghai Innovation Institute
Zhengfu He
Shanghai Innovation Institute
Mechanistic Interpretability · Large Language Models
Xipeng Qiu
OpenMOSS Team, School of Computer Science, Fudan University