OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse autoencoders (SAEs) commonly suffer from feature absorption—where specialized features capture instances of general features, leaving representation holes—and feature composition—where independent features merge into composite representations—leading to incomplete and less interpretable decompositions. To address this, we propose OrtSAE, an SAE variant incorporating orthogonality regularization that explicitly penalizes high pairwise cosine similarity between learned feature directions, thereby encouraging low-correlation, atomic, and disentangled features. The regularization scales linearly with the number of features, ensuring computational efficiency and scalability. Experiments demonstrate that OrtSAE increases the discovery of distinct features by 9%, reduces feature absorption by 65% and feature composition by 15%, improves accuracy by 6% on spurious correlation removal, and maintains baseline performance on other downstream tasks. Our core contribution is a lightweight orthogonality constraint that yields highly disentangled and interpretable sparse representations without compromising efficiency or task performance.
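The summary's key mechanism—penalizing pairwise cosine similarity between feature directions while keeping per-step cost linear—can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the random-pair-sampling scheme used to keep cost linear in the number of sampled pairs, and the squared-similarity penalty form are all assumptions.

```python
import numpy as np

def orthogonality_penalty(decoder, num_samples=64, rng=None):
    """Illustrative penalty: mean squared cosine similarity over randomly
    sampled pairs of decoder feature directions. Sampling keeps per-step
    cost proportional to num_samples rather than quadratic in the number
    of features. `decoder` has shape (n_features, d_model); all names
    here are hypothetical, not the paper's code."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = decoder.shape[0]
    # Normalize each feature direction to unit length (guard zero norms).
    norms = np.linalg.norm(decoder, axis=1, keepdims=True)
    unit = decoder / np.clip(norms, 1e-8, None)
    # Sample random feature pairs and drop accidental self-pairs.
    i = rng.integers(0, n, size=num_samples)
    j = rng.integers(0, n, size=num_samples)
    mask = i != j
    if not np.any(mask):
        return 0.0
    # Cosine similarity of each sampled pair, penalized quadratically.
    sims = np.sum(unit[i[mask]] * unit[j[mask]], axis=1)
    return float(np.mean(sims ** 2))
```

For a perfectly orthogonal decoder the penalty is zero, and it approaches one as features collapse onto a shared direction, so minimizing it pushes features apart.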

📝 Abstract
Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that aims to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance for other downstream tasks compared to traditional SAEs.
Problem

Research questions and friction points this paper is trying to address.

SAEs suffer from feature absorption: specialized features capture instances of general features, leaving holes in the representation
They also exhibit feature composition: independent features merge into composite representations
Both issues reduce feature distinctness and the interpretability of the learned decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enforces orthogonality between learned features
Penalizes high pairwise cosine similarity between SAE feature directions
Regularization cost scales linearly with SAE size
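The innovation above amounts to adding one extra term to the standard SAE objective. The sketch below shows how such a term might be combined with reconstruction and sparsity losses; the coefficients, function name, and the full-Gram-matrix formulation (shown for clarity—the paper's scheme keeps this cost linear in SAE size) are assumptions, not the authors' values.

```python
import numpy as np

def ortsae_objective(x, x_hat, codes, decoder,
                     sparsity_coef=1e-3, ortho_coef=1e-2):
    """Hedged sketch of a combined OrtSAE-style objective.
    x, x_hat: (batch, d_model) inputs and reconstructions;
    codes: (batch, n_features) sparse activations;
    decoder: (n_features, d_model) feature directions.
    Coefficients are illustrative placeholders."""
    # Standard SAE terms: reconstruction error and L1 sparsity.
    recon = np.mean((x - x_hat) ** 2)
    sparsity = np.mean(np.abs(codes))
    # Orthogonality term: mean squared off-diagonal cosine similarity.
    norms = np.linalg.norm(decoder, axis=1, keepdims=True)
    unit = decoder / np.clip(norms, 1e-8, None)
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))
    ortho = np.mean(off_diag ** 2)
    return recon + sparsity_coef * sparsity + ortho_coef * ortho
```

With perfect reconstruction, zero codes, and an orthogonal decoder, every term vanishes; in training, the `ortho_coef` weight would trade off feature disentanglement against reconstruction quality.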