🤖 AI Summary
This paper identifies “feature absorption” as a failure mode of sparse autoencoders (SAEs) used to decompose large language model activations: a seemingly monosemantic high-level latent (e.g., “mathematics”) fails to fire on inputs it should cover because finer-grained latents (e.g., “algebra”, “geometry”) absorb the feature, undermining the latent's interpretability.
Method: We design a synthetic first-letter identification task in which ground-truth labels are available for every token in the vocabulary, and use it as a diagnostic framework for quantitatively evaluating how monosemantic and interpretable SAE latents are.
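As a hedged sketch of how such a ground-truth evaluation could work (the function and variable names below are illustrative, not taken from the paper): with first-letter labels for every vocabulary token and an SAE latent's activations on those tokens, the latent can be scored as a classifier for "token starts with a given letter".

```python
# Illustrative sketch: score one SAE latent against ground-truth
# first-letter labels for every vocabulary token. All names are hypothetical.

def latent_f1(activations, tokens, letter, threshold=0.0):
    """Precision/recall/F1 of a latent treated as a 'starts with `letter`' classifier.

    activations: the latent's activation on each token (floats)
    tokens: the token strings, in the same order
    """
    tp = fp = fn = 0
    for act, tok in zip(activations, tokens):
        fires = act > threshold
        has_feature = tok.strip().lower().startswith(letter)
        if fires and has_feature:
            tp += 1
        elif fires and not has_feature:
            fp += 1
        elif not fires and has_feature:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this framing, a latent with high precision but noticeably imperfect recall is exactly the signature one would probe further for absorption.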
Contribution/Results: We provide systematic empirical evidence that feature absorption is pervasive and robust: varying SAE width or sparsity does not reliably reduce it, so hyperparameter tuning alone cannot resolve the problem, pointing to deeper conceptual issues in how SAEs decompose activations. The work also contributes a benchmark and diagnostic paradigm for feature splitting and absorption failures, offering a standardized testbed and methodological framework for diagnosing representational pathologies in dictionary-learning-based interpretability methods.
📝 Abstract
Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
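The absorption pattern described above, where a seemingly monosemantic latent stays silent on tokens it clearly should cover, suggests a simple diagnostic. The sketch below is hypothetical (names and data layout are assumptions, not the paper's implementation): it flags tokens that carry the feature per the ground truth but on which the main latent does not fire, and records which other latent fires most strongly there as the candidate absorber.

```python
# Hypothetical sketch of flagging candidate absorption cases: tokens that
# carry the feature (per ground truth) but on which the seemingly
# monosemantic latent stays silent while some other latent fires.

def candidate_absorptions(main_latent_acts, other_latent_acts, tokens, letter):
    """Return (token, absorbing_latent_index) pairs where the main
    'starts with `letter`' latent fails to fire on a token it should cover.

    main_latent_acts: the main latent's activation per token (floats)
    other_latent_acts: per-token dict {latent_index: activation} of other active latents
    """
    flagged = []
    for main_act, others, tok in zip(main_latent_acts, other_latent_acts, tokens):
        if tok.strip().lower().startswith(letter) and main_act <= 0.0:
            if others:  # which other latent most plausibly absorbed the feature?
                absorber = max(others, key=others.get)
                flagged.append((tok, absorber))
    return flagged
```

In this framing, absorption shows up as false negatives of the high-level latent that coincide with strong activations of a narrower latent on the same tokens.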