🤖 AI Summary
This work investigates why Transformers learn seemingly irrelevant abstract features during next-token prediction training. By decomposing the gradient signal of the prediction objective, the authors link its constituent components to the emergence of hidden representations, revealing that these ostensibly redundant features arise from the indirect influence of future tokens. They introduce a quantitative framework to measure such effects, validated on toy tasks and through interpretability analyses of OthelloGPT, small-scale language models, and pretrained large language models. The approach elucidates the origins of world models and syntactic structures in learned representations and shows that features with extremely high or extremely low influence on future tokens tend to relate to formal reasoning domains such as code.
📝 Abstract
Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and of syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding the hidden features of Transformers through the lens of their development during training.