🤖 AI Summary
Traditional early stopping for large Transformer training relies on computationally expensive global validation loss monitoring. To address this, we propose GradES, the first gradient-driven, fine-grained early stopping method tailored for Transformer components. GradES eliminates the need for validation inference by dynamically tracking the magnitude of backward gradients for projection matrices in individual attention and feed-forward layers. It independently determines convergence per parameter group using an adaptive threshold τ and freezes converged parameters immediately. This enables parameter-level update termination, jointly accelerating training and improving generalization. Experiments across diverse Transformer architectures and tasks demonstrate that GradES achieves 1.57–7.22× faster training while boosting average accuracy by 1.2%, significantly outperforming conventional validation-loss-based early stopping.
📄 Abstract
Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and feed-forward layer matrices). We found that different components converge at varying rates during fine-tuning. GradES tracks the magnitude of gradients in backpropagation for these matrices during training. When a projection matrix's gradients fall below a convergence threshold $\tau$, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slowly converging matrices to continue learning. By strategically freezing parameters once their gradients converge, GradES speeds up training by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy.
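The per-matrix freezing rule described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name `grad_es_step`, the parameter names, and the plain-SGD update are assumptions, and the convergence test here is a simple L2 gradient norm against a fixed threshold `tau` (the paper's $\tau$ may be adaptive).

```python
def grad_es_step(params, grads, lr, tau, frozen):
    """Update each parameter matrix unless its gradient magnitude has fallen
    below the convergence threshold tau; once it has, the matrix is added to
    the `frozen` set and permanently excluded from further updates.

    params/grads: dict mapping a matrix name (e.g. "attn.q_proj") to a
    list-of-lists matrix. Illustrative sketch only, not the paper's code.
    """
    for name, grad in grads.items():
        if name in frozen:
            continue  # converged earlier: skip both the check and the update
        # L2 norm of the flattened gradient serves as the convergence signal
        norm = sum(g * g for row in grad for g in row) ** 0.5
        if norm < tau:
            frozen.add(name)  # freeze this projection matrix from now on
            continue
        # plain SGD step for matrices that are still learning
        params[name] = [[p - lr * g for p, g in zip(prow, grow)]
                       for prow, grow in zip(params[name], grad)]
    return params, frozen


# Example: a nearly-converged attention projection gets frozen, while a
# feed-forward matrix with a large gradient keeps updating.
params = {"attn.q_proj": [[1.0]], "ffn.up_proj": [[1.0]]}
grads = {"attn.q_proj": [[0.001]], "ffn.up_proj": [[0.5]]}
frozen = set()
params, frozen = grad_es_step(params, grads, lr=0.1, tau=0.01, frozen=frozen)
```

Because frozen matrices skip the update loop entirely, no validation pass is ever needed to decide when to stop, and slowly converging components keep training at full rate.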