🤖 AI Summary
This work addresses two sparse autoencoder (SAE) analyses that are hard to scale: matching semantically similar features across layers, and compressing large feature circuits into interpretable supernodes. The authors treat both as a single problem: estimating semantic distances between SAE features that lie on different activation manifolds. They represent each feature as an activation-weighted distribution over the hidden states that express it, project these distributions into a shared reference space, and compare them using Wasserstein distance. The authors prove that the representation is invariant to activation rescaling, stable under perturbations, and able to recover true matches under finite-sample margin conditions. Experimentally, the approach outperforms baselines based on decoder vectors or large language models, captures subtle functional differences between related features, and compresses circuits into interpretable supernodes automatically.
📝 Abstract
Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as estimating semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in prior work, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, it automatically compresses large feature circuits into interpretable supernodes.
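
To make the idea of comparing activation-weighted distributions concrete, the following is a minimal sketch and not the authors' implementation. It assumes the hidden states of both features have already been mapped into a shared reference space of equal dimension (the abstract does not say how that projection is built), and it substitutes a sliced Wasserstein-1 distance, averaged over random directions, for whichever Wasserstein variant the paper uses. All function names are hypothetical.

```python
import math
import random


def normalize(activations):
    """Turn nonnegative activations into probability weights."""
    total = sum(activations)
    if total <= 0:
        raise ValueError("feature never activates on these samples")
    return [a / total for a in activations]


def wasserstein_1d(xs, ws, ys, vs):
    """Exact W1 between two weighted 1-D samples: integral of |F - G|."""
    ws, vs = normalize(ws), normalize(vs)
    # Merge both samples into one sorted sweep of (position, mass_u, mass_v).
    events = sorted(
        [(x, w, 0.0) for x, w in zip(xs, ws)]
        + [(y, 0.0, v) for y, v in zip(ys, vs)]
    )
    cdf_u = cdf_v = dist = 0.0
    prev = events[0][0]
    for pos, mass_u, mass_v in events:
        # Area between the two CDFs on the interval [prev, pos].
        dist += abs(cdf_u - cdf_v) * (pos - prev)
        cdf_u += mass_u
        cdf_v += mass_v
        prev = pos
    return dist


def random_directions(dim, n_dirs, seed=0):
    """Sample unit vectors used as 1-D projection axes."""
    rng = random.Random(seed)
    dirs = []
    for _ in range(n_dirs):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        dirs.append([c / norm for c in v])
    return dirs


def project(states, direction):
    """Dot product of every hidden state with one direction."""
    return [sum(s * d for s, d in zip(state, direction)) for state in states]


def sliced_wasserstein(states_a, acts_a, states_b, acts_b, n_dirs=64, seed=0):
    """Assumed stand-in metric: mean 1-D W1 over random projections.

    states_*: hidden states (lists of floats) already in the shared space.
    acts_*:   the feature's activation on each of those hidden states.
    """
    dirs = random_directions(len(states_a[0]), n_dirs, seed)
    total = 0.0
    for d in dirs:
        total += wasserstein_1d(
            project(states_a, d), acts_a, project(states_b, d), acts_b
        )
    return total / n_dirs
```

Because the activations are normalized into weights before any comparison, multiplying all activations of a feature by a positive constant leaves the distance unchanged, which mirrors the rescaling invariance claimed in the abstract. The stability and finite-sample recovery results are properties of the authors' construction and are not demonstrated by this sketch.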