🤖 AI Summary
This work addresses how subword units are integrated into semantically coherent "internal lexicons" during the early inference stages of language models such as GPT-2. We propose a purely weight-based analytical method, requiring no forward passes, that interprets the structural properties of first-layer attention weights through an explainable decomposition framework, quantifying the contributions of positional, token-level, and mixed effects. For the first time, we mathematically characterize the weight-level origins of two empirically observed phenomena: (i) attention bias toward neighboring tokens and (ii) subword detokenization, revealing that coarse semantic recombination emerges already at the first layer. Our approach departs from conventional activation-based probing paradigms in model interpretability, offering a lightweight, efficient, and inference-free methodology for mechanistic analysis of transformer architectures.
📝 Abstract
According to the stages-of-inference hypothesis, the early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model's "inner vocabulary". Prior analysis of this detokenization stage has relied predominantly on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components to patch, and observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2 that yields interpretable terms quantifying the relative contributions of position-related, token-related, and mixed effects. By examining these terms, we derive weight-based explanations of the attention bias toward nearby tokens and of the attention patterns that perform detokenization.
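The kind of decomposition described above can be sketched in a few lines of algebra. In GPT-2, a first-layer pre-softmax attention logit between positions i and j is a bilinear form in the sums of token and positional embeddings, so it splits exactly into token-token, position-position, and mixed terms. The sketch below uses random matrices with a hypothetical dimension `d`; the variable names and shapes are illustrative assumptions, not the paper's actual notation or GPT-2 weights:

```python
import numpy as np

# Illustrative sketch (not the paper's exact method): the pre-softmax logit
#   (t_i + p_i) @ W_Q @ W_K.T @ (t_j + p_j)
# expands into four bilinear terms, grouped here into token-related,
# position-related, and mixed contributions.

rng = np.random.default_rng(0)
d = 8  # hypothetical model/head dimension

t_i, p_i = rng.normal(size=d), rng.normal(size=d)  # query-side token / positional embeddings
t_j, p_j = rng.normal(size=d), rng.normal(size=d)  # key-side token / positional embeddings
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

A = W_Q @ W_K.T  # combined query-key interaction matrix

full = (t_i + p_i) @ A @ (t_j + p_j)

tok_tok = t_i @ A @ t_j                   # purely token-related effect
pos_pos = p_i @ A @ p_j                   # purely position-related effect
mixed = t_i @ A @ p_j + p_i @ A @ t_j     # mixed token-position effects

# The decomposition is exact: the terms sum back to the full logit.
assert np.isclose(full, tok_tok + pos_pos + mixed)
```

Because the identity holds at the weight level, each term can be studied directly from the matrices, without running the model on any input: for instance, the position-position term alone would be a natural place to look for a bias toward nearby tokens.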