Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

📅 2025-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse autoencoders (SAEs) are widely used for mechanistic interpretability in large language models (LLMs), yet the dominant top-$k$ sparsity constraint lacks theoretical grounding for the hyperparameter $k$. Method: Grounded in the linear representation and superposition hypotheses, we propose the Approximate Feature Activation (AFA) theory and derive a closed-form bound on reconstruction error. We formally define and quantify over- and under-activation of feature vectors relative to inputs, enabling parameter-free sparsity control. Based on this, we introduce the top-AFA SAE architecture—eliminating manual $k$ tuning—and design the ZF plot for diagnostic visualization. Results: Experiments show that top-AFA SAE achieves reconstruction error competitive with state-of-the-art top-$k$ methods—without hyperparameter tuning—while providing, for the first time, a theoretically grounded, empirically verifiable framework for sparse feature activation and an associated interpretability toolkit.

📝 Abstract
Sparse autoencoders (SAEs) have emerged as a workhorse of modern mechanistic interpretability, but leading SAE approaches with top-$k$ style activation functions lack theoretical grounding for selecting the hyperparameter $k$. SAEs are based on the linear representation hypothesis (LRH), which assumes that the representations of large language models (LLMs) are linearly encoded, and the superposition hypothesis (SH), which states that there can be more features in the model than its dimensionality. We show that, based on the formal definitions of the LRH and SH, the magnitude of sparse feature vectors (the latent representations learned by SAEs of the dense embeddings of LLMs) can be approximated using their corresponding dense vector with a closed-form error bound. To visualize this, we propose the ZF plot, which reveals a previously unknown relationship between LLM hidden embeddings and SAE feature vectors, allowing us to make the first empirical measurement of the extent to which feature vectors of pre-trained SAEs are over- or under-activated for a given input. Correspondingly, we introduce Approximate Feature Activation (AFA), which approximates the magnitude of the ground-truth sparse feature vector, and propose a new evaluation metric derived from AFA to assess the alignment between inputs and activations. We also leverage AFA to introduce a novel SAE architecture, the top-AFA SAE, leading to SAEs that: (a) are more in line with theoretical justifications; and (b) obviate the need to tune SAE sparsity hyperparameters. Finally, we empirically demonstrate that top-AFA SAEs achieve reconstruction loss comparable to that of state-of-the-art top-$k$ SAEs, without requiring the hyperparameter $k$ to be tuned. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
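The core contrast in the abstract, a fixed global $k$ versus a per-input sparsity level driven by an AFA-style target magnitude, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the actual AFA estimator is derived from the input's dense embedding with a closed-form bound, whereas here it is replaced by a placeholder scalar `afa_target`, and `top_afa_select` is a hypothetical greedy rule for reaching that target.

```python
import numpy as np

def topk_select(z, k):
    """Baseline top-k: keep the k largest activations, zero the rest.
    k is a global hyperparameter shared by every input."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out

def top_afa_select(z, afa_target):
    """Hypothetical top-AFA-style selection (a sketch, not the paper's
    algorithm): greedily keep the largest activations until the norm of
    the kept vector reaches a per-input target magnitude, so the number
    of active features emerges from the input instead of a fixed k."""
    out = np.zeros_like(z)
    for i in np.argsort(z)[::-1]:        # indices in descending order
        if np.linalg.norm(out) >= afa_target:
            break
        out[i] = z[i]
    return out

rng = np.random.default_rng(0)
z = np.abs(rng.normal(size=64))          # toy post-ReLU activations
sparse = top_afa_select(z, afa_target=3.0)
k_chosen = int((sparse != 0).sum())      # sparsity level per input, not tuned
```

The design point is that `k_chosen` varies with the input's estimated feature magnitude, which is what lets the top-AFA architecture drop manual $k$ tuning.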
Problem

Research questions and friction points this paper is trying to address.

Lack of theoretical grounding for selecting hyperparameter k in SAEs
Need to empirically measure feature vector activation in SAEs
Requirement for SAE architectures aligning with theoretical justifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes ZF plot for visualizing LLM-SAE relationships
Introduces Approximate Feature Activation (AFA) method
Develops top-AFA SAE architecture eliminating hyperparameter tuning
Sewoong Lee
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Adam Davies
PhD Candidate, University of Illinois Urbana-Champaign
NLP · interpretability · cognitive science · OOD generalization · synthetic data
Marc E. Canby
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign
Julia Hockenmaier
Professor, University of Illinois at Urbana-Champaign (UIUC)
Natural Language Processing · Computational Linguistics · Artificial Intelligence