GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping

📅 2025-09-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Traditional early stopping for large Transformer training relies on computationally expensive global validation loss monitoring. To address this, we propose GradES, the first gradient-driven, fine-grained early stopping method tailored for Transformer components. GradES eliminates the need for validation inference by dynamically tracking the magnitude of backward gradients for projection matrices in individual attention and feed-forward layers. It independently determines convergence per parameter group using an adaptive threshold τ and freezes converged parameters immediately. This enables parameter-level update termination, jointly accelerating training and improving generalization. Experiments across diverse Transformer architectures and tasks demonstrate that GradES achieves 1.57–7.22× faster training while boosting average accuracy by 1.2%, significantly outperforming conventional validation-loss-based early stopping.

πŸ“ Abstract
Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and feed-forward layer matrices). We found that different components converge at varying rates during fine-tuning. GradES tracks the magnitude of gradients in backpropagation for these matrices during training. When a projection matrix's gradients fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slow-converging matrices to continue learning. By strategically freezing parameters when their gradients converge, GradES speeds up training by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost of early stopping in transformers
Addressing varying convergence rates in transformer components
Eliminating validation passes while preventing overfitting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-based early stopping for transformers
Monitors gradient magnitudes in backpropagation
Freezes converged parameters individually during training
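The mechanism described above can be sketched in plain Python. This is a hypothetical illustration based on the abstract, not the authors' implementation: the class name `GradES`, the `step` method, and the per-matrix gradient-norm dictionary are all assumptions, and a real training loop would additionally stop computing gradients for frozen matrices (e.g. by setting `requires_grad=False` in PyTorch).

```python
# Hypothetical sketch of GradES-style per-matrix early stopping.
# A matrix is frozen once its observed gradient magnitude drops below
# the convergence threshold tau; frozen matrices stay frozen.

class GradES:
    """Track per-matrix gradient magnitudes and freeze converged ones."""

    def __init__(self, matrix_names, tau=1e-4):
        self.tau = tau
        self.frozen = {name: False for name in matrix_names}

    def step(self, grad_norms):
        """grad_norms: dict mapping matrix name -> gradient magnitude
        observed this training step. Returns the names still active."""
        for name, norm in grad_norms.items():
            if not self.frozen[name] and norm < self.tau:
                # Converged: exclude this matrix from further updates.
                self.frozen[name] = True
        return [n for n, f in self.frozen.items() if not f]


# Example: the attention projection converges first and is frozen,
# while the slower feed-forward matrix keeps learning.
es = GradES(["q_proj", "ffn_up"], tau=0.01)
active = es.step({"q_proj": 0.005, "ffn_up": 0.5})
```

In a real fine-tuning run, freezing a matrix this way also skips its optimizer update and gradient computation, which is where the reported speedup would come from.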
Qifu Wen
Department of Computer Science, Boston University Metropolitan College
Xi Zeng
Department of Computer Science, Boston University Metropolitan College
Zihan Zhou
Department of Computer Science, Boston University Metropolitan College
Shuaijun Liu
Information Hub, The Hong Kong University of Science and Technology, Guangzhou
Mehdi Hosseinzadeh
Associate Professor in Computer Engineering, IEEE Senior Member
Data Mining · Machine Learning · Social Networks · E-Marketing · E-Commerce
Reza Rawassizadeh
Associate Professor, Boston University
Digital Health · On-device AI · AI Democratization · Ubiquitous Computing