🤖 AI Summary
This work investigates the interpretability of small language models (3M-parameter Transformers) under data distribution shifts (e.g., The Pile → GitHub or legal text). We propose a linear response framework grounded in Bayesian statistical mechanics, modeling the network as a stochastic dynamical system. Local SGLD sampling and perturbation probing are used to quantify the linear susceptibility of individual network components to distributional change. To our knowledge, this is the first application of statistical-mechanical linear response theory to LLM interpretability; it establishes a rigorous connection between susceptibility and the local learning coefficient from singular learning theory. We find that the response matrix exhibits low-rank structure, enabling unsupervised functional decomposition, e.g., disentangling multigram heads from induction heads. Furthermore, we derive signed, token-level attribution scores and quantitatively characterize how distribution shift alters the local geometry of the loss landscape.
📄 Abstract
We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical-mechanical system. A small, controlled perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently from local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. Assembling a set of perturbations (probes) yields a response matrix whose low-rank structure separates functional modules, such as multigram and induction heads, in a 3M-parameter transformer. Susceptibilities connect the local learning coefficient of singular learning theory with linear response theory and quantify how the local geometry of the loss landscape deforms under shifts in the data distribution.
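The estimation pipeline can be illustrated in a toy setting: draw local SGLD samples from the tempered posterior of a one-dimensional quadratic "model", then read off the first-order response of an observable to a data-distribution shift from a posterior covariance, χ ≈ −nβ · Cov(O, L_shift − L_base). This is a minimal sketch under assumed quadratic losses; the function names, the choice of observable, and all constants are illustrative stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for average losses on the base and shifted data
# distributions, and for an observable localized on one component
# (here just the single parameter w itself).
def loss_base(w):
    return 0.5 * w**2

def grad_base(w):
    return w

def loss_shift(w):
    return 0.5 * (w - 1.0) ** 2

def observable(w):
    return w

# Local SGLD: Langevin dynamics targeting the tempered posterior
# exp(-n * beta * L_base(w)) near w = 0.
n, beta, eps, steps = 100, 1.0, 1e-3, 20_000
w, samples = 0.0, []
for _ in range(steps):
    w = w - 0.5 * eps * n * beta * grad_base(w) + rng.normal(0.0, np.sqrt(eps))
    samples.append(w)
ws = np.array(samples[steps // 2:])  # discard burn-in

# Linear response: to first order in the shift size h,
#   d/dh E[O] = -n * beta * Cov(O, L_shift - L_base).
O = observable(ws)
dL = loss_shift(ws) - loss_base(ws)
susceptibility = -n * beta * np.cov(O, dL)[0, 1]
print(f"estimated susceptibility: {susceptibility:.3f}")
```

In this quadratic example the tilted posterior mean moves linearly with the shift size, so the covariance estimator should recover a value close to 1; in the paper's setting the same covariance would be accumulated per token, giving the signed attribution scores.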