🤖 AI Summary
This study addresses the challenge of identifying sparse functional circuits in neural networks for mechanistic interpretability. We propose Local Loss Landscape Decomposition (L3D), a method that locates low-rank subnetworks directly in parameter space, revealing the sparse functional circuits on which model behavior depends. L3D is the first approach to align local parameter directions with the ability to reconstruct loss gradients, removing the reliance on activation-space analysis. It integrates local loss curvature estimation, low-rank subspace decomposition, and gradient-alignment constraints, and supports causal interventions for validation. On controlled synthetic models, L3D recovers ground-truth subnetworks nearly perfectly; on real-world Transformers and CNNs, it extracts sample-selective, semantically interpretable sparse circuits and empirically verifies their functional necessity. The core contribution is the interpretable localization and causal validation of functional circuits from a parameter-geometric perspective.
📝 Abstract
Much of mechanistic interpretability has focused on understanding the activation spaces of large neural networks. However, activation-space approaches reveal little about the underlying circuitry used to compute features. To better understand the circuits a model employs, we introduce a new decomposition method called Local Loss Landscape Decomposition (L3D). L3D identifies a set of low-rank subnetworks: directions in parameter space, a subset of which can reconstruct the gradient of the loss between any sample's output and a reference output vector. We design a series of progressively more challenging toy models with well-defined subnetworks and show that L3D recovers the associated subnetworks nearly perfectly. We also investigate the extent to which perturbing the model along a given subnetwork's direction affects only the relevant subset of samples. Finally, we apply L3D to a real-world Transformer and a convolutional neural network, demonstrating its potential to identify interpretable, relevant circuits in parameter space.
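To make the core idea concrete, here is a minimal illustrative sketch (not the paper's actual algorithm): if the per-sample loss gradients of a model secretly lie near a low-rank subspace of parameter space, a low-rank decomposition of the stacked gradients recovers a small set of directions from which each sample's gradient can be reconstructed. The dimensions, the use of plain SVD, and the synthetic "ground-truth directions" are all assumptions for illustration; L3D's actual decomposition and gradient-alignment constraints differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 per-sample loss gradients in a 50-dim parameter
# space, generated from 3 ground-truth low-rank directions plus small noise.
true_dirs = rng.standard_normal((3, 50))
coeffs = rng.standard_normal((100, 3))
grads = coeffs @ true_dirs + 0.01 * rng.standard_normal((100, 50))

# Low-rank decomposition of the stacked gradients. Plain SVD stands in for
# L3D's decomposition, which additionally enforces sparsity/alignment.
U, S, Vt = np.linalg.svd(grads, full_matrices=False)
k = 3
directions = Vt[:k]            # candidate parameter-space directions

# Reconstruct each sample's gradient from the k recovered directions.
proj = grads @ directions.T    # per-sample coefficients
recon = proj @ directions
rel_err = np.linalg.norm(grads - recon) / np.linalg.norm(grads)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this toy setting the relative reconstruction error is tiny because the gradients are nearly rank-3 by construction; the paper's progressively harder toy models probe exactly when such recovery remains possible.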