🤖 AI Summary
This work targets fine-grained interpretability of Transformer models: which components contribute to a prediction and how they compose. It introduces Unpack, a method built on the key-value structure φ(S)U shared by attention and MLP blocks. From a single forward pass followed by a backward recursion, Unpack recovers interaction strengths between any two components, end-to-end computational paths with K/Q/V composition labels, and token-level attributions, all without model intervention, gradients, or additional training. Notably, Unpack is presented as the first approach to unify these explanatory signals from a single forward evaluation, without reliance on ground-truth circuit annotations. On GPT-2 small, Unpack recovers all three composition connections of the known indirect object identification circuit (Wang et al., 2023), including whether each routes through K, Q, or V. Across the Pythia model series (160M–6.9B), it consistently shows suppressed attribution at the duplicate-detection position of a repeated name, a pattern absent in matched control prompts, indicating that the method tracks mechanistic structure across scales.
📝 Abstract
Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both the attention and MLP sublayers follow a shared key-value template $\phi(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at https://github.com/Fun-Cry/unpacklm.
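
To make the shared template concrete, the sketch below writes one attention head and one MLP in the form φ(S)U, under simplifying assumptions: a single query position, no biases, random weights, and the tanh approximation of GELU. It is not the Unpack recursion from the paper, only an illustration of the structure the abstract refers to. All names (`kv_output`, `S_attn`, `U_attn`, `S_mlp`, and the toy dimensions) are hypothetical and chosen for this example. The point it shows is that the output is a sum over "slots" (source positions for attention, hidden neurons for the MLP), so credit can be split per slot exactly.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over the last axis.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(s):
    # tanh approximation of GELU (an assumption; models may use the exact form).
    return 0.5 * s * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (s + 0.044715 * s**3)))

def kv_output(phi, S, U):
    """Return phi(S) @ U together with the per-slot terms phi(S)[i] * U[i]."""
    weights = phi(S)                    # shape (n_slots,)
    contrib = weights[:, None] * U      # shape (n_slots, d_out)
    return contrib.sum(axis=0), contrib

rng = np.random.default_rng(0)
d_model, d_head, n_ctx, d_mlp = 8, 4, 5, 16   # toy sizes, hypothetical
x = rng.normal(size=(n_ctx, d_model))         # residual stream, one row per token

# Attention head, query at the last position: slots are source positions.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
q = x[-1] @ W_Q
S_attn = (x @ W_K) @ q / np.sqrt(d_head)      # scores S, shape (n_ctx,)
U_attn = x @ W_V @ W_O                        # values U, shape (n_ctx, d_model)
attn_out, attn_contrib = kv_output(softmax, S_attn, U_attn)

# MLP at the last position: slots are hidden neurons.
W_in = rng.normal(size=(d_model, d_mlp))
W_out = rng.normal(size=(d_mlp, d_model))
S_mlp = x[-1] @ W_in                          # pre-activations S, shape (d_mlp,)
mlp_out, mlp_contrib = kv_output(gelu, S_mlp, W_out)

# Per-slot contributions add up to the sublayer output in both cases.
print(np.allclose(attn_contrib.sum(axis=0), attn_out))
print(np.allclose(mlp_contrib.sum(axis=0), mlp_out))
```

Because both sublayers reduce to the same weighted sum over rows of U, a single decomposition routine can in principle serve both. How Unpack propagates that credit backward through layers and assigns K/Q/V labels is specified in the paper and the linked repository, not here.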