🤖 AI Summary
Problem: Existing theoretical frameworks fail to characterize the empirically observed two-phase learning dynamics of Transformer training, e.g., the progression on the Counterfact dataset from syntactically incorrect to syntactically correct to semantically correct outputs (as observed when training GPT-2).
Method: We propose the first rigorous theoretical explanation by modeling a disentangled two-type feature structure (structured syntactic and semantic features) via feature learning analysis, in-context learning theory, and spectral analysis of the attention weights.
Contribution/Results: We prove that Transformer optimization intrinsically exhibits two distinct phases: an early phase that converges to low-order syntactic features, followed by a late phase that transitions to high-order semantic features. Crucially, we establish a formal theoretical link between this two-phase behavior and the spectral properties of the attention matrix, specifically the separation and evolution of its eigenvalue spectrum over the course of training. We further demonstrate the universality of this phenomenon across architectures and tasks, providing the first provable foundation for the gradual emergence of compositional generalization in Transformers.
📝 Abstract
Transformers may exhibit two-stage training dynamics in real-world training. For instance, when training GPT-2 on the Counterfact dataset, the model's answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses do not account for this two-stage phenomenon. In this paper, we theoretically demonstrate how such two-stage training dynamics arise in transformers. Specifically, we analyze the dynamics of transformers with feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. Such disentangled feature structures are common in practice; for example, natural language contains syntax and semantics, and proteins contain primary and secondary structures. To the best of our knowledge, this is the first rigorous result on a two-stage optimization process in transformers. Additionally, a corollary indicates that this two-stage process is closely related to the spectral properties of the attention weights, which accords well with empirical findings.
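The corollary on spectral properties suggests a simple empirical diagnostic: track the eigenvalue spectrum of the combined query-key matrix across training checkpoints and watch for a growing separation between leading and trailing eigenvalues. The sketch below is an illustration of that diagnostic, not the paper's construction; the matrices `W_Q`, `W_K` and the rank-one "late checkpoint" perturbation are hypothetical stand-ins, assuming only that late-stage weights concentrate energy along a dominant feature direction.

```python
import numpy as np

def attention_spectrum(W_Q, W_K):
    """Eigenvalue magnitudes of the combined query-key matrix W_K^T W_Q,
    sorted in descending order."""
    eigvals = np.linalg.eigvals(W_K.T @ W_Q)
    return np.sort(np.abs(eigvals))[::-1]

def spectral_gap(spectrum, k=1):
    """Ratio of the k-th to the (k+1)-th largest eigenvalue magnitude."""
    return spectrum[k - 1] / max(spectrum[k], 1e-12)

# Toy stand-ins for two training checkpoints (hypothetical, for illustration).
rng = np.random.default_rng(0)
d = 16
# "Early" checkpoint: near-random weights.
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)
W_K = rng.normal(size=(d, d)) / np.sqrt(d)
spec_early = attention_spectrum(W_Q, W_K)

# "Late" checkpoint: same weights plus energy concentrated along one
# unit feature direction u, mimicking a dominant learned feature.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
spec_late = attention_spectrum(W_Q + 4.0 * np.outer(u, u),
                               W_K + 4.0 * np.outer(u, u))
```

Under this toy setup, `spec_late` shows a much larger leading eigenvalue (and hence a larger `spectral_gap`) than `spec_early`, which is the kind of eigenvalue separation the corollary associates with the transition between the two training stages.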