🤖 AI Summary
This work addresses the poor cross-domain generalization of sparse autoencoders (SAEs) in language model interpretability. Using "answerability" as a semantic anchor, we systematically evaluate the in-distribution (ID) and out-of-distribution (OOD) transfer performance of SAE features from Gemma 2 across multiple source datasets. We quantitatively demonstrate, for the first time, that SAE feature generalization is highly unstable: in-domain, SAE features underperform linear probes on the residual stream, and those probes themselves exhibit substantial OOD variance. Our findings confirm that current SAE-based interpretability methods lack generalization guarantees, underscoring the need for a predictive, quantitative framework for feature generalizability. The core contribution is the first cross-dataset generalization evaluation paradigm targeting an abstract semantic capability (answerability), empirically revealing generalization limitations in both SAE features and residual-stream linear probes.
📝 Abstract
Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, yet these features can manifest differently in each context. We examine this through "answerability": a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features demonstrate inconsistent transfer ability, and residual stream probes similarly show high variance out of distribution. Overall, this demonstrates the need for quantitative methods to predict feature generalization in SAE-based interpretability.
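To make the comparison concrete, below is a minimal sketch of the two probing setups being contrasted: a dense linear probe on residual-stream activations versus a classifier built on a single SAE feature, each trained in-distribution and evaluated out of distribution. The array names, shapes, and the feature-selection heuristic are illustrative assumptions, not the paper's implementation; random placeholders stand in for cached Gemma 2 activations.

```python
# Sketch of the ID/OOD probing comparison (hypothetical data, not the
# paper's code). In practice the arrays would hold cached Gemma 2
# residual-stream activations and SAE feature activations, and the
# ID/OOD split would correspond to two different answerability datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder activations: residual-stream vectors (n, d_model) and SAE
# feature activations (n, n_features), with binary answerability labels.
resid_id = rng.normal(size=(1000, 2304))
resid_ood = rng.normal(size=(500, 2304))
sae_id = rng.normal(size=(1000, 16384))
sae_ood = rng.normal(size=(500, 16384))
y_id = rng.integers(0, 2, 1000)
y_ood = rng.integers(0, 2, 500)

# (1) Dense linear probe on the residual stream, trained in-distribution,
# then scored both ID and OOD.
probe = LogisticRegression(max_iter=1000).fit(resid_id, y_id)
print("residual probe  ID acc:", probe.score(resid_id, y_id))
print("residual probe OOD acc:", probe.score(resid_ood, y_ood))

# (2) Single-SAE-feature classifier: pick the feature whose mean activation
# separates the classes best in-distribution, threshold at the midpoint of
# the class means, and reuse that feature and threshold OOD.
diff = sae_id[y_id == 1].mean(0) - sae_id[y_id == 0].mean(0)
best = int(np.abs(diff).argmax())
sign = np.sign(diff[best])
thresh = (sae_id[y_id == 1][:, best].mean() + sae_id[y_id == 0][:, best].mean()) / 2
for name, X, y in [("ID", sae_id, y_id), ("OOD", sae_ood, y_ood)]:
    pred = (sign * (X[:, best] - thresh) > 0).astype(int)
    print(f"SAE feature {name} acc:", (pred == y).mean())
```

The gap between the ID and OOD scores of each classifier is the quantity of interest here: the paper's finding is that this gap is large and inconsistent for SAE features, and that even the residual-stream probe's OOD score varies substantially across dataset pairs.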