🤖 AI Summary
Traditional early stopping for large Transformer training relies on computationally expensive global validation loss monitoring. To address this, we propose GradES, the first gradient-driven, fine-grained early stopping method tailored for Transformer components. GradES eliminates the need for validation inference by dynamically tracking the magnitude of backward gradients for projection matrices in individual attention and feed-forward layers. It independently determines convergence per parameter group using an adaptive threshold τ and freezes converged parameters immediately. This enables parameter-level update termination, jointly accelerating training and improving generalization. Experiments across diverse Transformer architectures and tasks demonstrate that GradES achieves 1.57–7.22× faster training while boosting average accuracy by 1.2%, significantly outperforming conventional validation-loss-based early stopping.
📄 Abstract
Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and feed-forward layer matrices). We found that different components converge at varying rates during fine-tuning. GradES tracks the magnitude of gradients in backpropagation for these matrices during training. When a projection matrix's gradients fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slowly converging matrices to continue learning. By strategically freezing parameters once their gradients converge, GradES speeds up training by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy.
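The per-matrix freezing rule described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name `grad_es_step`, the parameter names, and the plain-SGD update are assumptions, and the convergence test here is a simple L2 gradient norm against a fixed threshold `tau` (the paper's $\tau$ may be adaptive).

```python
def grad_es_step(params, grads, lr, tau, frozen):
    """Update each parameter matrix unless its gradient magnitude has fallen
    below the convergence threshold tau; once it has, the matrix is added to
    the `frozen` set and permanently excluded from further updates.

    params/grads: dict mapping a matrix name (e.g. "attn.q_proj") to a
    list-of-lists matrix. Illustrative sketch only, not the paper's code.
    """
    for name, grad in grads.items():
        if name in frozen:
            continue  # converged earlier: skip both the check and the update
        # L2 norm of the flattened gradient serves as the convergence signal
        norm = sum(g * g for row in grad for g in row) ** 0.5
        if norm < tau:
            frozen.add(name)  # freeze this projection matrix from now on
            continue
        # plain SGD step for matrices that are still learning
        params[name] = [[p - lr * g for p, g in zip(prow, grow)]
                       for prow, grow in zip(params[name], grad)]
    return params, frozen


# Example: a nearly-converged attention projection gets frozen, while a
# feed-forward matrix with a large gradient keeps updating.
params = {"attn.q_proj": [[1.0]], "ffn.up_proj": [[1.0]]}
grads = {"attn.q_proj": [[0.001]], "ffn.up_proj": [[0.5]]}
frozen = set()
params, frozen = grad_es_step(params, grads, lr=0.1, tau=0.01, frozen=frozen)
```

Because frozen matrices skip the update loop entirely, no validation pass is ever needed to decide when to stop, and slowly converging components keep training at full rate.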