🤖 AI Summary
Problem: Existing theoretical frameworks fail to characterize the empirically observed two-phase learning dynamics of Transformer training, e.g., the progression on the Counterfact dataset from syntactically incorrect to syntactically correct to semantically correct outputs (as observed when training GPT-2).
Method: We propose the first rigorous theoretical explanation by modeling a disentangled two-type feature structure (structured syntactic and semantic features) via feature learning analysis, in-context learning theory, and spectral analysis of the attention weights.
Contribution/Results: We prove that Transformer optimization intrinsically exhibits two distinct phases: an early phase that converges to low-order syntactic features, followed by a late phase that transitions to high-order semantic features. Crucially, we establish a formal theoretical link between this two-phase behavior and the spectral properties of the attention matrix, specifically the separation and evolution of its eigenvalue spectrum over the course of training. We further demonstrate the universality of this phenomenon across architectures and tasks, providing the first provable foundation for the gradual emergence of compositional generalization in Transformers.
📝 Abstract
Transformers may exhibit two-stage training dynamics in real-world training. For instance, when training GPT-2 on the Counterfact dataset, the model's answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses do not account for this two-stage phenomenon. In this paper, we theoretically demonstrate how such two-stage training dynamics arise in transformers. Specifically, we analyze the dynamics of transformers with feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. Such disentangled feature structures are common in practice; for example, natural language contains syntax and semantics, and proteins contain primary and secondary structures. To the best of our knowledge, this is the first rigorous result on a two-stage optimization process in transformers. Additionally, a corollary indicates that this two-stage process is closely related to the spectral properties of the attention weights, which accords well with empirical findings.
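The corollary on spectral properties suggests a simple empirical diagnostic: track the eigenvalue spectrum of the combined query-key matrix across training checkpoints and watch for a growing separation between leading and trailing eigenvalues. The sketch below is an illustration of that diagnostic, not the paper's construction; the matrices `W_Q`, `W_K` and the rank-one "late checkpoint" perturbation are hypothetical stand-ins, assuming only that late-stage weights concentrate energy along a dominant feature direction.

```python
import numpy as np

def attention_spectrum(W_Q, W_K):
    """Eigenvalue magnitudes of the combined query-key matrix W_K^T W_Q,
    sorted in descending order."""
    eigvals = np.linalg.eigvals(W_K.T @ W_Q)
    return np.sort(np.abs(eigvals))[::-1]

def spectral_gap(spectrum, k=1):
    """Ratio of the k-th to the (k+1)-th largest eigenvalue magnitude."""
    return spectrum[k - 1] / max(spectrum[k], 1e-12)

# Toy stand-ins for two training checkpoints (hypothetical, for illustration).
rng = np.random.default_rng(0)
d = 16
# "Early" checkpoint: near-random weights.
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)
W_K = rng.normal(size=(d, d)) / np.sqrt(d)
spec_early = attention_spectrum(W_Q, W_K)

# "Late" checkpoint: same weights plus energy concentrated along one
# unit feature direction u, mimicking a dominant learned feature.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
spec_late = attention_spectrum(W_Q + 4.0 * np.outer(u, u),
                               W_K + 4.0 * np.outer(u, u))
```

Under this toy setup, `spec_late` shows a much larger leading eigenvalue (and hence a larger `spectral_gap`) than `spec_early`, which is the kind of eigenvalue separation the corollary associates with the transition between the two training stages.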