🤖 AI Summary
This work investigates the intrinsic learning mechanisms of Transformers on sparse modular addition, focusing on the representation and evolution of task invariance within the R² embedding space.
Method: We construct a 2D embedding visualization sandbox to systematically track training dynamics across attention heads and feed-forward networks (FFNs) layer-wise.
Contribution/Results: We identify and name a novel circuit—“clustering heads”—that explicitly separates modular addition equivalence classes via interpretable clustering behavior, exhibiting a two-phase learning trajectory: coarse-grained clustering followed by boundary refinement. We show that the emergence of this circuit is sensitive to weight initialization and curriculum learning strategies, and that high-curvature geometry induced by normalization layers explains the characteristic loss spikes observed during training. Our approach achieves fine-grained interpretability of Transformer training on a controlled task, offering a new paradigm for analyzing structured inductive biases in deep sequence models.
📝 Abstract
This paper introduces the sparse modular addition task and examines how transformers learn it. We focus on transformers with embeddings in $\mathbb{R}^2$ and introduce a visual sandbox that provides comprehensive visualizations of each layer throughout the training process. We reveal a type of circuit, called "clustering heads," which learns the problem's invariants. We analyze the training dynamics of these circuits, highlighting two-stage learning, loss spikes due to high curvature or normalization layers, and the effects of initialization and curriculum learning.
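To make the task concrete, here is a minimal data-generation sketch for one plausible instantiation of sparse modular addition. The abstract does not spell out the exact task definition, so the specific choices below (sequence length `seq_len`, number of relevant positions `k`, modulus `p`, and the convention that the first `k` positions are the relevant ones) are illustrative assumptions, not the paper's actual setup.

```python
import random

def make_example(seq_len=12, k=3, p=7, rng=random):
    """Generate one (sequence, label) pair for a hypothetical
    sparse modular addition task.

    Assumption: only the first k positions are "relevant"; the label
    is their sum mod p, so the label is invariant to the remaining
    seq_len - k positions. A clustering circuit that learns this
    invariance can group sequences by their equivalence class mod p.
    """
    xs = [rng.randrange(p) for _ in range(seq_len)]
    label = sum(xs[:k]) % p
    return xs, label

# Example: two sequences that agree on the relevant positions share a label.
xs, y = make_example()
xs2 = xs[:3] + [0] * 9          # same relevant prefix, different tail
_, y2 = sum(xs2[:3]) % 7, sum(xs2[:3]) % 7
```

The invariance is what the paper's "clustering heads" reportedly learn to represent: sequences with the same residue class form clusters in the 2D embedding space.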