🤖 AI Summary
This work addresses the challenge of determining optimal early-stopping points during Transformer training. Methodologically, it leverages random matrix theory to formulate a validation-free early-stopping criterion by modeling the heavy-tailed evolution of the spectral density of self-attention matrices. It is the first to partition training dynamics into three distinct phases, and it introduces novel spectral metrics, including power-law exponents and spectral stability, as convergence indicators. The key contribution is eliminating reliance on validation sets, enabling real-time, theory-driven monitoring of training progression and diagnosis of structural convergence. Extensive experiments across diverse architectures (e.g., ViT, BERT) and tasks (e.g., image classification, language modeling) demonstrate that the proposed criterion accurately pinpoints the onset of convergence, cutting unnecessary computation while consistently improving training efficiency and generalization stability.
📝 Abstract
This work introduces a novel theoretical framework grounded in Random Matrix Theory (RMT) for analyzing Transformer training dynamics. We focus on the underlying mechanisms that drive performance improvements and derive principled early-stopping criteria. Empirically, we observe that the spectral density of the shallow self-attention value matrix V consistently evolves into a heavy-tailed distribution. Using a power-law (PL) fit to this spectral density as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. This staging provides guidance for preliminary stopping decisions. Crucially, we propose two consistent, validation-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong alignment between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer training.