🤖 AI Summary
This work addresses the challenge of determining optimal early-stopping points during Transformer training. Methodologically, it leverages random matrix theory to formulate a validation-free early-stopping criterion by modeling the heavy-tailed evolution of the spectral density of self-attention matrices. It is the first to partition training dynamics into three distinct phases, and it introduces novel spectral metrics, including power-law exponents and spectral stability, as convergence indicators. The key contribution is eliminating reliance on validation sets, enabling real-time, theory-driven monitoring of training progression and diagnosis of structural convergence. Extensive experiments across diverse architectures (e.g., ViT, BERT) and tasks (e.g., image classification, language modeling) demonstrate that the proposed criterion accurately pinpoints the onset of convergence, cutting unnecessary computation while consistently improving training efficiency and generalization stability.
📝 Abstract
This work introduces a novel theoretical framework grounded in Random Matrix Theory (RMT) for analyzing Transformer training dynamics. We focus on the underlying mechanisms that drive performance improvements and derive principled early-stopping criteria. Empirically, we observe that the spectral density of the shallow self-attention value matrix V consistently evolves into a heavy-tailed distribution. Using a power-law (PL) fit to this spectral density as a probe, we demarcate training into three stages: structural exploration, heavy-tailed structure stabilization, and convergence saturation. This staging provides guidance for preliminary stopping decisions. Crucially, we propose two consistent, validation-free criteria: a quantitative metric for heavy-tailed dynamics and a novel spectral signature indicative of convergence. The strong alignment between these criteria highlights the utility of RMT for monitoring and diagnosing the progression of Transformer training.