Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

This work identifies “feature hedging”—a phenomenon wherein sparse autoencoders (SAEs) degrade feature unidentifiability when the dictionary width is narrower than the true number of features and latent features are correlated; reconstruction loss forces SAEs to merge correlated feature components, compromising interpretability. This issue critically explains the substantial performance gap between unsupervised SAEs and supervised baselines in large language models (LLMs). We formally define feature hedging, analyze its mechanism via theoretical modeling and controlled toy experiments, and validate it empirically on LLM activation data through training, reconstruction error attribution, and diagnostic analysis. To mitigate hedging, we propose Matryoshka SAE, a novel variant employing hierarchical sparsity constraints. Experiments confirm that feature hedging is pervasive across mainstream LLM SAEs; our approach significantly improves feature unidentifiability and interpretability, substantially narrowing the performance gap with supervised baselines.

Technology Category

Application Category

📝 Abstract

It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying"true features"on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Our work shows there remain fundamental issues with SAEs, but we are hopeful that that highlighting feature hedging will catalyze future advances that allow SAEs to achieve their full potential of interpreting LLMs at scale.

Problem

Research questions and friction points this paper is trying to address.

SAEs merge correlated features when too narrow

Feature hedging reduces monosemanticity in LLM SAEs

Current SAEs underperform due to feature hedging

Innovation

Methods, ideas, or system contributions that make the work stand out.

Studying feature hedging in sparse autoencoders

Proposing improved matryoshka SAE variant

Analyzing SAE issues via toy models

🔎 Similar Papers

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models