🤖 AI Summary
This work investigates how Transformer language models learn semantic associations—such as between “bird” and “flew”—from training data. By analyzing gradient dynamics during early training, the study provides the first analytical decomposition of layer weights as linear combinations of three basis functions: bigram statistics, token exchangeability, and contextual mapping. This formulation establishes a direct link between linguistic statistical properties in the data and the internal mechanisms of the model. The theoretically derived closed-form expression for the weights closely matches empirical observations from training large-scale models, revealing that the attention mechanism captures semantic relationships in a highly structured manner. These findings offer a principled understanding of how semantic structure emerges from data-driven learning in Transformer architectures.
📝 Abstract
Semantic associations, such as the link between "bird" and "flew", are foundational for language modeling, as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and for developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Our analysis reveals that each set of transformer weights admits a closed-form expression as a simple composition of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations through these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem sheds light on interpreting the learned associations in transformers.
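To make the corpus-statistics side of the result concrete, here is a minimal sketch of the first of the three basis functions the abstract names, an empirical bigram statistic estimated from raw text. The toy corpus, function names, and word-level tokenization are illustrative assumptions, not the paper's notation or setup:

```python
from collections import Counter

# Toy word-level corpus; real models operate on subword tokens.
corpus = "the bird flew the bird sang the cat sat".split()

# Count adjacent-pair occurrences and how often each token appears
# as the left element of a pair.
pair_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev, nxt):
    """Empirical P(next token = nxt | current token = prev)."""
    total = context_counts[prev]
    return pair_counts[(prev, nxt)] / total if total else 0.0

print(bigram_prob("bird", "flew"))  # 0.5: "bird" is followed by "flew" in 1 of 2 occurrences
```

In the paper's framing, statistics of this kind (together with token-interchangeability and context mappings) combine to give the closed-form weight expressions; this sketch only shows how the bigram component is read off the data.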