🤖 AI Summary
This study addresses the need to uncover the internal structure of language models and how they respond to input perturbations. To this end, it imports a principle from spectroscopy into language model analysis and proposes a susceptibility-based clustering method. By perturbing the distribution of context tokens and employing stochastic gradient Langevin dynamics (SGLD) to approximate a localized Gibbs posterior, the approach combines conductance-based clustering with covariance analysis to identify semantic units organized by similar causal mechanisms within the model. Applied to Pythia-14M, the method discovers 510 interpretable clusters capturing patterns in syntax, code, and mathematical notation, with 50% corresponding to features extracted by sparse autoencoders, validating both the efficacy and the interpretability of the proposed framework.
📝 Abstract
Spectroscopy infers the internal structure of physical systems by measuring their response to perturbations. We apply this principle to neural networks: perturbing the data distribution by upweighting a token $y$ in context $x$, we measure the model's response via susceptibilities $\chi_{xy}$, which are covariances between component-level observables and the perturbation, computed over a localized Gibbs posterior via stochastic gradient Langevin dynamics (SGLD). Theoretically, we show that susceptibilities decompose as a sum over modes of the data distribution, explaining why tokens that follow their contexts "for similar reasons" cluster together in susceptibility space. Empirically, we apply this methodology to Pythia-14M, developing a conductance-based clustering algorithm that identifies 510 interpretable clusters ranging from grammatical patterns to code structure to mathematical notation. Comparing to sparse autoencoders, 50% of our clusters match SAE features, validating that both methods recover similar structure.
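The core measurement described in the abstract can be sketched in a few lines: run SGLD around the trained weights to sample a localized Gibbs posterior, then estimate a susceptibility as the covariance between a component-level observable and a perturbation observable over those samples. The toy loss, observables, and all hyperparameters below (`beta`, `gamma`, `eps`) are illustrative assumptions for a two-parameter model, not the paper's actual model or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w):
    # Stand-in for the model's training-loss gradient (here L(w) = ||w||^2 / 2).
    return w

def perturbation_obs(w):
    # Stand-in for the extra loss from upweighting token y in context x.
    return w[0] ** 2

def component_obs(w):
    # Stand-in for a component-level observable (e.g., a per-head loss).
    return w[0] ** 2 + 0.1 * w[1]

def sgld_susceptibility(w0, beta=10.0, gamma=50.0, eps=1e-3,
                        steps=20000, burn=2000):
    """Sample the localized Gibbs posterior p(w) ~ exp(-beta*L(w) - gamma/2*||w-w0||^2)
    via SGLD, then return Cov(component observable, perturbation observable)."""
    w = w0.copy()
    obs, pert = [], []
    for t in range(steps):
        drift = beta * loss_grad(w) + gamma * (w - w0)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if t >= burn:
            obs.append(component_obs(w))
            pert.append(perturbation_obs(w))
    obs, pert = np.array(obs), np.array(pert)
    # The susceptibility is the posterior covariance of the two observables.
    return np.mean((obs - obs.mean()) * (pert - pert.mean()))

chi = sgld_susceptibility(np.zeros(2))
```

Because both toy observables load on the same $w_0^2$ mode, `chi` comes out positive: components that respond to a perturbation "for the same reason" get a large susceptibility, which is what the clustering step then groups on.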