🤖 AI Summary
This work investigates the interpretability of small language models (3M-parameter Transformers) under data distribution shifts (e.g., The Pile → GitHub or legal text). We propose a linear response framework grounded in Bayesian statistical mechanics, modeling the network as a stochastic dynamical system. Local SGLD sampling and perturbation probing are used to quantify the linear susceptibility of individual network components to distributional change. To our knowledge, this is the first application of statistical-mechanical linear response theory to LLM interpretability; it establishes a rigorous connection between susceptibility and the local learning coefficient from singular learning theory. We find that the response matrix exhibits low-rank structure, enabling unsupervised functional decomposition, e.g., disentangling multigram heads from induction heads. Furthermore, we derive signed, token-level attribution scores and quantitatively characterize how distribution shift alters the local geometry of the loss landscape.
📄 Abstract
We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical-mechanical system. A small, controlled perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently from local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. Assembling a set of perturbations (probes) yields a response matrix whose low-rank structure separates functional modules, such as multigram and induction heads, in a 3M-parameter transformer. Susceptibilities connect the local learning coefficient of singular learning theory with linear response theory and quantify how the local geometry of the loss landscape deforms under shifts in the data distribution.
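The estimation pipeline can be illustrated in a toy setting: draw local SGLD samples from the tempered posterior of a one-dimensional quadratic "model", then read off the first-order response of an observable to a data-distribution shift from a posterior covariance, χ ≈ −nβ · Cov(O, L_shift − L_base). This is a minimal sketch under assumed quadratic losses; the function names, the choice of observable, and all constants are illustrative stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for average losses on the base and shifted data
# distributions, and for an observable localized on one component
# (here just the single parameter w itself).
def loss_base(w):
    return 0.5 * w**2

def grad_base(w):
    return w

def loss_shift(w):
    return 0.5 * (w - 1.0) ** 2

def observable(w):
    return w

# Local SGLD: Langevin dynamics targeting the tempered posterior
# exp(-n * beta * L_base(w)) near w = 0.
n, beta, eps, steps = 100, 1.0, 1e-3, 20_000
w, samples = 0.0, []
for _ in range(steps):
    w = w - 0.5 * eps * n * beta * grad_base(w) + rng.normal(0.0, np.sqrt(eps))
    samples.append(w)
ws = np.array(samples[steps // 2:])  # discard burn-in

# Linear response: to first order in the shift size h,
#   d/dh E[O] = -n * beta * Cov(O, L_shift - L_base).
O = observable(ws)
dL = loss_shift(ws) - loss_base(ws)
susceptibility = -n * beta * np.cov(O, dL)[0, 1]
print(f"estimated susceptibility: {susceptibility:.3f}")
```

In this quadratic example the tilted posterior mean moves linearly with the shift size, so the covariance estimator should recover a value close to 1; in the paper's setting the same covariance would be accumulated per token, giving the signed attribution scores.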