The Rate-Distortion-Polysemanticity Tradeoff in SAEs

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses why sparse autoencoders (SAEs) that jointly minimize rate and distortion often fail to learn monosemantic representations, which limits their usefulness for mechanistic interpretability. It introduces a rate-distortion-polysemanticity tradeoff and argues that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, particularly the probability that features co-occur. Under toy-modeling assumptions, theoretical and empirical analysis shows that restricting an SAE to be monosemantic comes with an increase in rate and distortion. For real-world settings where the data-generating process is unknown, the paper derives necessary conditions that a polysemanticity measure should satisfy and benchmarks existing proxy metrics on SAEs trained on large language models. By framing polysemanticity as a data problem, the study informs the design of SAE architectures and optimization strategies.

📝 Abstract

Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic (highly interpretable) representations, limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability that features co-occur. Finally, we extend the analysis to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy when the data-generating process is unknown, and we benchmark existing proxy metrics on SAEs trained on Large Language Models. Taken together, our findings show that polysemanticity is a data problem that should be accounted for when addressing it at the architectural and optimization level.
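
To make the tension in the abstract concrete, here is a minimal toy sketch rather than the paper's method. It assumes rate is proxied by the mean number of active latents per input (L0) and distortion by the mean squared reconstruction error; the paper may define both differently. The helper names (`sae_forward`, `rate_l0`, `distortion_mse`) and the hand-set weights are hypothetical, chosen only for illustration. Two ground-truth features always co-occur, and a one-latent-per-feature dictionary is compared with a dictionary that merges both features into a single latent.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec):
    """ReLU sparse autoencoder: returns latent codes and reconstructions."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    return z, z @ W_dec

def rate_l0(z, tol=1e-8):
    """Assumed rate proxy: mean number of active latents per input."""
    return float((np.abs(z) > tol).sum(axis=1).mean())

def distortion_mse(x, x_hat):
    """Assumed distortion proxy: mean squared reconstruction error per input."""
    return float(((x - x_hat) ** 2).sum(axis=1).mean())

# Toy data (an assumption for illustration): two ground-truth features,
# the standard basis directions of R^2, that always co-occur with the
# same magnitude a.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.5, size=(1000, 1))
features = np.eye(2)
x = a * features[0] + a * features[1]        # shape (1000, 2)

# Monosemantic dictionary: one latent per ground-truth feature.
W_mono = np.eye(2)
z_mono, xhat_mono = sae_forward(x, W_mono, np.zeros(2), W_mono)

# Polysemantic dictionary: a single latent aligned with the summed direction.
d = np.array([[1.0, 1.0]]) / np.sqrt(2.0)    # decoder, shape (1, 2)
z_poly, xhat_poly = sae_forward(x, d.T, np.zeros(1), d)

print("mono: rate", rate_l0(z_mono), "distortion", distortion_mse(x, xhat_mono))
print("poly: rate", rate_l0(z_poly), "distortion", distortion_mse(x, xhat_poly))
```

In this hand-built case the monosemantic dictionary activates two latents per input while the merged latent reconstructs the same inputs with one, at numerically the same distortion. This is the kind of pressure the abstract attributes to feature co-occurrence: when features tend to fire together, a rate-minimizing SAE is rewarded for representing them jointly.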

Problem

Research questions and friction points this paper is trying to address.

Sparse Autoencoders

Rate-Distortion Tradeoff

Polysemanticity

Mechanistic Interpretability

Monosemantic Representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders

Rate-Distortion Tradeoff

Polysemanticity