Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

📅 2025-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse autoencoders (SAEs) are widely used for mechanistic interpretability in large language models (LLMs), yet the dominant top-$k$ sparsity constraint lacks theoretical grounding for the hyperparameter $k$. Method: Grounded in the linear representation and superposition hypotheses, we propose the Approximate Feature Activation (AFA) theory and derive a closed-form bound on reconstruction error. We formally define and quantify over- and under-activation of feature vectors relative to inputs, enabling parameter-free sparsity control. Based on this, we introduce the top-AFA SAE architecture—eliminating manual $k$ tuning—and design the ZF plot for diagnostic visualization. Results: Experiments show that top-AFA SAE achieves reconstruction error competitive with state-of-the-art top-$k$ methods—without hyperparameter tuning—while providing, for the first time, a theoretically grounded, empirically verifiable framework for sparse feature activation and an associated interpretability toolkit.

📝 Abstract
Sparse autoencoders (SAEs) have emerged as a workhorse of modern mechanistic interpretability, but leading SAE approaches with top-$k$ style activation functions lack theoretical grounding for selecting the hyperparameter $k$. SAEs are based on the linear representation hypothesis (LRH), which assumes that the representations of large language models (LLMs) are linearly encoded, and the superposition hypothesis (SH), which states that there can be more features in the model than its dimensionality. We show that, based on the formal definitions of the LRH and SH, the magnitude of sparse feature vectors (the latent representations learned by SAEs of the dense embeddings of LLMs) can be approximated using their corresponding dense vector with a closed-form error bound. To visualize this, we propose the ZF plot, which reveals a previously unknown relationship between LLM hidden embeddings and SAE feature vectors, allowing us to make the first empirical measurement of the extent to which feature vectors of pre-trained SAEs are over- or under-activated for a given input. Correspondingly, we introduce Approximate Feature Activation (AFA), which approximates the magnitude of the ground-truth sparse feature vector, and propose a new evaluation metric derived from AFA to assess the alignment between inputs and activations. We also leverage AFA to introduce a novel SAE architecture, the top-AFA SAE, leading to SAEs that: (a) are more in line with theoretical justifications; and (b) obviate the need to tune SAE sparsity hyperparameters. Finally, we empirically demonstrate that top-AFA SAEs achieve reconstruction loss comparable to that of state-of-the-art top-$k$ SAEs, without requiring the hyperparameter $k$ to be tuned. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
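The core contrast in the abstract, a fixed global $k$ versus a per-input sparsity level driven by an AFA-style target magnitude, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the actual AFA estimator is derived from the input's dense embedding with a closed-form bound, whereas here it is replaced by a placeholder scalar `afa_target`, and `top_afa_select` is a hypothetical greedy rule for reaching that target.

```python
import numpy as np

def topk_select(z, k):
    """Baseline top-k: keep the k largest activations, zero the rest.
    k is a global hyperparameter shared by every input."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out

def top_afa_select(z, afa_target):
    """Hypothetical top-AFA-style selection (a sketch, not the paper's
    algorithm): greedily keep the largest activations until the norm of
    the kept vector reaches a per-input target magnitude, so the number
    of active features emerges from the input instead of a fixed k."""
    out = np.zeros_like(z)
    for i in np.argsort(z)[::-1]:        # indices in descending order
        if np.linalg.norm(out) >= afa_target:
            break
        out[i] = z[i]
    return out

rng = np.random.default_rng(0)
z = np.abs(rng.normal(size=64))          # toy post-ReLU activations
sparse = top_afa_select(z, afa_target=3.0)
k_chosen = int((sparse != 0).sum())      # sparsity level per input, not tuned
```

The design point is that `k_chosen` varies with the input's estimated feature magnitude, which is what lets the top-AFA architecture drop manual $k$ tuning.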
Problem

Research questions and friction points this paper is trying to address.

Lack of theoretical grounding for selecting hyperparameter k in SAEs
Need to empirically measure feature vector activation in SAEs
Requirement for SAE architectures aligning with theoretical justifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes ZF plot for visualizing LLM-SAE relationships
Introduces Approximate Feature Activation (AFA) method
Develops top-AFA SAE architecture eliminating hyperparameter tuning
Sewoong Lee
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Adam Davies
PhD Candidate, University of Illinois Urbana-Champaign
NLP · interpretability · cognitive science · OOD generalization · synthetic data
Marc E. Canby
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Julia Hockenmaier
Professor, University of Illinois at Urbana-Champaign (UIUC)
Natural Language Processing · Computational Linguistics · Artificial Intelligence