Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional sparse autoencoders (SAEs) trained on broad-domain data suffer from fixed latent dimensionality, leading them to predominantly capture high-frequency, generic patterns. This results in substantial linear “dark matter” reconstruction error and feature entanglement, severely undermining mechanistic interpretability. To address this, we propose a domain-constrained SAE framework tailored to medical text, trained on Gemma-2’s 20th-layer activations using JumpReLU nonlinearity over 195K clinical Q&A samples. By reallocating latent capacity toward low-frequency yet semantically rich clinical concepts, our method effectively suppresses dark matter and mitigates feature splitting/absorption. Experiments show that, compared to general-purpose SAEs, our approach explains 20% more variance, significantly reduces residual reconstruction error, and improves loss recovery. Both automated and human evaluations confirm that the learned latent features exhibit clear, clinically grounded semantic meaning—enhancing interpretability and utility for medical AI analysis.

📝 Abstract
Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only high-frequency, generic patterns. This often results in significant linear "dark matter" in reconstruction error and produces latents that fragment or absorb each other, complicating interpretation. We show that restricting SAE training to a well-defined domain (medical text) reallocates capacity to domain-specific features, improving both reconstruction fidelity and interpretability. Training JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, we find that domain-confined SAEs explain up to 20% more variance, achieve higher loss recovery, and reduce linear residual error compared to broad-domain SAEs. Automated and human evaluations confirm that learned features align with clinically meaningful concepts (e.g., "taste sensations" or "infectious mononucleosis"), rather than frequent but uninformative tokens. These domain-specific SAEs capture relevant linear structure, leaving a smaller, more purely nonlinear residual. We conclude that domain confinement mitigates key limitations of broad-domain SAEs, enabling more complete and interpretable latent decompositions, and suggesting the field may need to question "foundation-model" scaling for general-purpose SAEs.
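To make the method concrete, here is a minimal sketch of a JumpReLU SAE forward pass. This is an illustration, not the paper's code: the dimensions, randomly initialized weights, and the threshold `theta` are all hypothetical stand-ins for trained parameters, and in practice the input would be a Gemma-2 layer-20 activation vector rather than random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; real SAEs use the model's hidden size and a
# much larger latent budget.
d_model, d_latent = 16, 64

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)
theta = 0.05  # JumpReLU threshold (learned during training in practice)

def jumprelu(z, theta):
    """JumpReLU: keep pre-activations that exceed the threshold unchanged,
    zero the rest -- a hard gate rather than a shifted ReLU."""
    return np.where(z > theta, z, 0.0)

def sae_forward(x):
    """Encode an activation vector into sparse latents, then reconstruct it."""
    z = jumprelu(x @ W_enc + b_enc, theta)
    x_hat = z @ W_dec + b_dec
    return z, x_hat

x = rng.normal(size=d_model)   # stand-in for a layer-20 activation
z, x_hat = sae_forward(x)
residual = x - x_hat           # the reconstruction error ("dark matter")
```

The hard gate is what distinguishes JumpReLU from a plain ReLU: sub-threshold latents are exactly zero (giving sparsity), while surviving latents are not shrunk, which helps reconstruction fidelity.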
Problem

Research questions and friction points this paper is trying to address.

Improving interpretability of sparse autoencoders in domain-specific contexts
Reducing linear dark matter and feature fragmentation in autoencoders
Enhancing feature alignment with meaningful domain concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific sparse autoencoders improve interpretability
Training on medical text enhances feature relevance
Reduces linear residual error significantly
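The variance and residual-error claims above can be grounded in the standard fraction-of-variance-explained metric. The helper below is an illustrative sketch (not code from the paper), applied here to random placeholder data:

```python
import numpy as np

def variance_explained(X, X_hat):
    """Fraction of variance in activations X explained by reconstructions
    X_hat; the unexplained remainder corresponds to the residual error
    ("dark matter") the paper aims to shrink."""
    resid = ((X - X_hat) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))      # placeholder activation matrix
perfect = variance_explained(X, X)  # 1.0 for an exact reconstruction
```

A broad-domain SAE and a domain-confined SAE can then be compared by computing this metric over the same held-out medical-text activations; the paper reports up to 20% more variance explained for the domain-confined variant.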