Early-stopping for Transformer model training

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of determining optimal early stopping points during Transformer training. Methodologically, it leverages random matrix theory to formulate a validation-free early stopping criterion by modeling the heavy-tailed evolution of the spectral density of self-attention matrices. It is the first to partition training dynamics into three distinct phases and introduces novel spectral metrics—including power-law exponents and spectral stability—as convergence indicators. The key contribution lies in eliminating reliance on validation sets, enabling real-time, theory-driven monitoring of training progression and structural convergence diagnosis. Extensive experiments across diverse architectures (e.g., ViT, BERT) and tasks (e.g., image classification, language modeling) demonstrate that the proposed criterion accurately pinpoints convergence onset, reducing unnecessary computation while consistently improving both training efficiency and generalization stability.

📝 Abstract
This work introduces a novel theoretical framework grounded in Random Matrix Theory (RMT) for analyzing Transformer training dynamics. We focus on the underlying mechanisms that drive performance improvements and derive principled early-stopping criteria. Empirically, we observe that the spectral density of the shallow self-attention matrix V consistently evolves into a heavy-tailed distribution. Utilizing the PL (Power Law) fit to this matrix as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. This staging provides guidance for preliminary stopping decisions. Crucially, we propose two consistent and validation-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong alignment between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer model training.
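The abstract's central probe is a power-law (PL) fit to the empirical spectral density of a weight matrix. As a minimal sketch of that idea (not the paper's exact fitting procedure), the tail exponent can be estimated with the standard continuous maximum-likelihood formula of Clauset et al. applied to the top eigenvalues of W^T W; the `k_frac` tail fraction is an illustrative choice:

```python
import numpy as np

def powerlaw_alpha(W, k_frac=0.1):
    """Estimate the power-law tail exponent alpha of a weight matrix's
    empirical spectral density, assuming p(lambda) ~ lambda^(-alpha)
    in the tail. Illustrative probe, not the paper's exact method."""
    # Eigenvalues of W^T W are the squared singular values of W
    eigs = np.sort(np.linalg.svd(W, compute_uv=False) ** 2)[::-1]
    k = max(2, int(k_frac * len(eigs)))  # number of tail samples used
    tail = eigs[:k]
    # Continuous MLE with x_min set to the smallest tail sample
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

# A heavy-tailed random matrix should yield a smaller alpha
# than a Gaussian (Marchenko-Pastur-like) one.
rng = np.random.default_rng(0)
gauss = rng.normal(size=(512, 512))
heavy = rng.standard_t(df=2.5, size=(512, 512))
alpha_gauss = powerlaw_alpha(gauss)
alpha_heavy = powerlaw_alpha(heavy)
```

In this framing, a decreasing alpha during training would signal the emergence of heavy-tailed structure in the self-attention matrix V, the phenomenon the abstract uses to demarcate training stages.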
Problem

Research questions and friction points this paper is trying to address.

Analyzing Transformer training dynamics using Random Matrix Theory
Developing early-stopping criteria without validation data
Identifying spectral signatures indicating training convergence stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-stopping via Random Matrix Theory framework
Spectral density analysis with Power Law fitting
Validation-free criteria using heavy-tailed dynamics metrics
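The validation-free criterion listed above amounts to monitoring a spectral metric for stabilization rather than tracking held-out loss. A hedged sketch of such a monitor (the tolerance, patience, and class name here are illustrative assumptions, not the paper's exact criterion):

```python
class SpectralEarlyStopper:
    """Validation-free stopping sketch: halt training once the
    power-law exponent of a probe layer's spectrum stops drifting.
    Thresholds are illustrative, not the paper's exact criterion."""

    def __init__(self, tol=0.02, patience=3):
        self.tol = tol            # max change in alpha to count as stable
        self.patience = patience  # consecutive stable checks required
        self.history = []
        self.stable = 0

    def update(self, alpha):
        """Record one checkpoint's alpha; return True when training
        has plausibly entered the convergence-saturation stage."""
        if self.history and abs(alpha - self.history[-1]) < self.tol:
            self.stable += 1
        else:
            self.stable = 0
        self.history.append(alpha)
        return self.stable >= self.patience

# Usage: alphas measured at successive checkpoints flatten out,
# triggering the stop signal without any validation set.
stopper = SpectralEarlyStopper()
for alpha in [5.1, 4.2, 3.5, 3.1, 2.95, 2.94, 2.93, 2.93]:
    if stopper.update(alpha):
        break
```

The design choice mirrors the abstract's three-stage view: large swings in alpha correspond to structural exploration, shrinking swings to heavy-tailed stabilization, and a plateau to convergence saturation.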
Jing He
School of Mathematics, Shandong University, PR China
Hua Jiang
Assistant Professor, Yunnan University, China
Cheng Li
Huawei Technologies Ltd., PR China
Siqian Xin
Shandong University-Zhong Tai Securities Institute for Financial Studies, Shandong University, PR China
Shuzhen Yang
Shandong University-Zhong Tai Securities Institute for Financial Studies, Shandong University, PR China