🤖 AI Summary
This work systematically evaluates the practical utility of sparse autoencoders (SAEs) as interpretability tools for large language model (LLM) activations under four realistic challenges: data scarcity, class imbalance, label noise, and covariate shift. Methodologically, it compares SAEs against strong non-sparse baselines, including linear and nonlinear probes as well as ensemble methods, on real downstream probing tasks, and introduces a robust evaluation framework for classification under these challenges. Results show that SAEs fail to consistently outperform the baselines in any challenge scenario; moreover, their purported advantages, such as detecting spurious correlations and diagnosing poor data quality, can be replicated with simple non-sparse approaches. The study questions the empirical foundation of SAEs as concept-level explanation tools, showing that their performance gains lack robustness, and contributes both a methodological critique and a rigorously grounded evaluation paradigm for interpretable AI.
📝 Abstract
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, evidence for the validity of their interpretations is limited, since there is no ground truth for the concepts an LLM uses, and a growing number of works have identified problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Because detecting concepts is difficult in these challenging settings, we hypothesize that SAEs' basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods combining SAEs with baselines that consistently outperform ensembles using baselines alone. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, we achieve similar results with simple non-SAE baselines. Though we cannot discount SAEs' utility on other tasks, our findings highlight the shortcomings of current SAEs and the need to rigorously evaluate interpretability methods on downstream tasks with strong baselines.
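The comparison at the heart of the abstract, a linear probe trained on raw activations versus the same probe trained on SAE latents, can be sketched as follows. This is a minimal illustration with entirely synthetic data: the "activations" are random Gaussians, the "concept" label is a synthetic linear rule, and the "SAE encoder" is a placeholder ReLU of a random linear map rather than a trained autoencoder. None of the dimensions, datasets, or probe architectures here correspond to the paper's actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae, n = 64, 256, 500
acts = rng.normal(size=(n, d_model))          # stand-in for LLM activations
w_true = rng.normal(size=d_model)
labels = (acts @ w_true > 0).astype(float)    # synthetic binary concept

# Placeholder SAE encoder: ReLU of a random linear map into a wider,
# overcomplete basis. A real SAE encoder would be trained to reconstruct
# activations under a sparsity penalty.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
latents = np.maximum(acts @ W_enc, 0.0)

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of logistic loss
        b -= lr * float(np.mean(p - y))
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0).astype(float) == y))

w_raw, b_raw = fit_linear_probe(acts, labels)
w_sae, b_sae = fit_linear_probe(latents, labels)
print("raw-activation probe acc:", accuracy(acts, labels, w_raw, b_raw))
print("SAE-latent probe acc:", accuracy(latents, labels, w_sae, b_sae))
```

In the paper's setup, the interesting question is not in-distribution accuracy but which input basis degrades more gracefully under scarcity, imbalance, noise, and shift; the abstract reports that the SAE basis does not consistently win.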