🤖 AI Summary
This work addresses the poor cross-domain generalization of sparse autoencoders (SAEs) in language model interpretability. Using "answerability" as a semantic anchor, we systematically evaluate the in-distribution (ID) and out-of-distribution (OOD) transfer performance of SAE features from Gemma 2 across multiple source datasets. We quantitatively demonstrate, for the first time, that SAE feature generalization is highly unstable: in-domain, SAE features underperform linear probes on the residual stream, and those probes themselves exhibit substantial OOD variance. Our findings confirm that current SAE-based interpretability methods lack generalization guarantees, underscoring the need for a predictive, quantitative framework for feature generalizability. The core contribution is the first cross-dataset generalization evaluation paradigm targeting an abstract semantic capability (answerability), empirically revealing generalization limitations in both SAE features and residual-stream linear probes.
📝 Abstract
Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, yet these features can manifest differently in each context. We examine this through "answerability": a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features demonstrate inconsistent transfer ability, and residual stream probes similarly show high variance out of distribution. Overall, this demonstrates the need for quantitative methods to predict feature generalization in SAE-based interpretability.
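To make the comparison concrete, below is a minimal sketch of the two probing setups being contrasted: a dense linear probe on residual-stream activations versus a classifier built on a single SAE feature, each trained in-distribution and evaluated out of distribution. The array names, shapes, and the feature-selection heuristic are illustrative assumptions, not the paper's implementation; random placeholders stand in for cached Gemma 2 activations.

```python
# Sketch of the ID/OOD probing comparison (hypothetical data, not the
# paper's code). In practice the arrays would hold cached Gemma 2
# residual-stream activations and SAE feature activations, and the
# ID/OOD split would correspond to two different answerability datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder activations: residual-stream vectors (n, d_model) and SAE
# feature activations (n, n_features), with binary answerability labels.
resid_id = rng.normal(size=(1000, 2304))
resid_ood = rng.normal(size=(500, 2304))
sae_id = rng.normal(size=(1000, 16384))
sae_ood = rng.normal(size=(500, 16384))
y_id = rng.integers(0, 2, 1000)
y_ood = rng.integers(0, 2, 500)

# (1) Dense linear probe on the residual stream, trained in-distribution,
# then scored both ID and OOD.
probe = LogisticRegression(max_iter=1000).fit(resid_id, y_id)
print("residual probe  ID acc:", probe.score(resid_id, y_id))
print("residual probe OOD acc:", probe.score(resid_ood, y_ood))

# (2) Single-SAE-feature classifier: pick the feature whose mean activation
# separates the classes best in-distribution, threshold at the midpoint of
# the class means, and reuse that feature and threshold OOD.
diff = sae_id[y_id == 1].mean(0) - sae_id[y_id == 0].mean(0)
best = int(np.abs(diff).argmax())
sign = np.sign(diff[best])
thresh = (sae_id[y_id == 1][:, best].mean() + sae_id[y_id == 0][:, best].mean()) / 2
for name, X, y in [("ID", sae_id, y_id), ("OOD", sae_ood, y_ood)]:
    pred = (sign * (X[:, best] - thresh) > 0).astype(int)
    print(f"SAE feature {name} acc:", (pred == y).mean())
```

The gap between the ID and OOD scores of each classifier is the quantity of interest here: the paper's finding is that this gap is large and inconsistent for SAE features, and that even the residual-stream probe's OOD score varies substantially across dataset pairs.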