The Rate-Distortion-Polysemanticity Tradeoff in SAEs

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty sparse autoencoders (SAEs) have in learning monosemantic representations while simultaneously minimizing rate and distortion, which limits their utility for mechanistic interpretability. It introduces a rate–distortion–polysemanticity trade-off and argues that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, in particular by feature co-occurrence probabilities. Through theoretical and empirical analysis under toy-model assumptions, it shows that restricting an SAE to be monosemantic necessarily increases rate and distortion. For real-world settings where the data-generating process is unknown, it derives necessary conditions that a polysemanticity measure should satisfy and benchmarks existing proxy metrics on SAEs trained on large language models. By framing polysemanticity as a data-driven phenomenon, the study offers a perspective for interpretability research that should inform the design of SAE architectures and optimization strategies.
📝 Abstract
Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic (highly interpretable) representations, limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability that features co-occur. Finally, we extend the analysis to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy when the data-generating process is unknown, and we benchmark existing proxy metrics on SAEs trained on Large Language Models. Taken together, our findings show that polysemanticity is a data problem that should be accounted for when addressing it at the architectural and optimization level.
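To make the two quantities named in the abstract concrete, here is a minimal NumPy sketch of how rate and distortion might be measured for a toy SAE. This listing does not give the paper's formal definitions, so the sketch assumes distortion is the mean squared reconstruction error and rate is approximated by the average number of active features per input (an L0 proxy). The ReLU architecture, the random untrained weights, and all function and variable names are hypothetical illustrations, not the authors' method.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Toy ReLU sparse autoencoder: returns (codes, reconstruction)."""
    z = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations
    x_hat = z @ W_dec + b_dec                # linear decoder
    return z, x_hat

def distortion(x, x_hat):
    """Assumed distortion: mean squared reconstruction error per input."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def rate_l0(z):
    """Assumed rate proxy: average number of active features per input."""
    return float(np.mean(np.sum(z > 0.0, axis=1)))

# Synthetic inputs and random, untrained weights (illustration only).
rng = np.random.default_rng(0)
n, d, m = 256, 8, 32                         # inputs, input dim, dictionary size
x = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = W_enc.T.copy()                       # tied decoder, an arbitrary choice
b_dec = np.zeros(d)

# A more negative encoder bias silences more features, lowering the rate.
for bias in (0.0, -0.5, -1.0, -2.0):
    z, x_hat = sae_forward(x, W_enc, np.full(m, bias), W_dec, b_dec)
    print(f"bias={bias:+.1f}  rate={rate_l0(z):6.2f}  "
          f"distortion={distortion(x, x_hat):8.3f}")
```

With untrained weights the distortion values carry no meaning on their own; the sketch only shows how the two axes of the tradeoff could be computed before a polysemanticity measure is added as a third.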
Problem

Research questions and friction points this paper is trying to address.

Sparse Autoencoders
Rate-Distortion Tradeoff
Polysemanticity
Mechanistic Interpretability
Monosemantic Representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders
Rate-Distortion Tradeoff
Polysemanticity
Monosemanticity
Mechanistic Interpretability