🤖 AI Summary
Post-hoc training of sparse autoencoders (SAEs) yields ambiguous concept discovery, unstable features, and poor cross-checkpoint comparability—making it difficult to distinguish whether a concept is unrepresented by the model or merely undetected due to SAE failure.
Method: We propose TopK Language Models, which integrate a TopK activation function directly into the Transformer's hidden representations at chosen layers, inherently inducing sparsity and interpretability in the latent representations—building sparsity into the architecture rather than relying on post-hoc SAE training.
Contribution/Results: This approach eliminates SAEs’ training dependency and architectural sensitivity, enabling stable feature tracking across training steps, precise neuron-level steering, and rigorous analysis of concept evolution. Experiments show that TopK LMs retain original model performance while substantially improving interpretability and controllability. The method establishes a more reliable, reproducible sparse representation paradigm for probing language model internals.
📝 Abstract
Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer from several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear whether the failure to discover a particular concept lies with the SAE or with the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
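To make the core mechanism concrete, here is a minimal sketch of a TopK activation applied to a batch of hidden states. This is an illustrative NumPy implementation, not the paper's code: the function name, the exact placement inside the transformer, and details such as tie-breaking are assumptions; the paper's architecture may differ.

```python
import numpy as np

def topk_activation(h, k):
    """Keep the k largest activations in each row and zero out the rest.

    Applied to a transformer layer's hidden states, this yields sparse
    representations analogous to the latent features of a TopK SAE
    (illustrative sketch; details of the paper's implementation may differ).
    """
    h = np.asarray(h, dtype=float)
    out = np.zeros_like(h)
    # Indices of the k largest values in each row (unordered among themselves).
    idx = np.argpartition(h, -k, axis=-1)[..., -k:]
    # Copy only those k values into the otherwise-zero output.
    np.put_along_axis(out, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return out

# One hidden-state vector of width 5; only the 2 largest activations survive.
h = np.array([[0.1, 2.0, -0.5, 3.0, 0.7]])
sparse = topk_activation(h, k=2)
```

Because every hidden state has exactly k nonzero entries by construction, the sparsity level is fixed at training time rather than encouraged indirectly via an L1 penalty, which is what makes the resulting features directly comparable across checkpoints.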