A Unified Noise-Curvature View of Loss of Trainability

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In continual learning, models often suffer from loss of trainability (LoT)—stagnation or degradation in accuracy—despite sufficient capacity and supervision. This work identifies, for the first time, that LoT under Adam optimization arises from the synergistic deterioration of gradient noise and Hessian curvature fluctuations, unifying noise and curvature perspectives. We propose a hierarchical trainability prediction threshold that jointly incorporates batch-aware noise bounds and curvature fluctuation bounds, proving that either bound alone is insufficient for reliable LoT detection. Further, we design a hierarchical dynamic step-size scheduler integrating CReLU activation, Wasserstein regularization, and L2 decay. Evaluated on multiple continual learning benchmarks, our method significantly improves training stability and final accuracy. The learned learning rate trajectories naturally exhibit principled decay patterns, validating the method’s intrinsic soundness and generalizability.

📝 Abstract
Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT under Adam through an optimization lens and find that single indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy are not reliable predictors. Instead we introduce two complementary criteria, a batch-size-aware gradient-noise bound and a curvature-volatility-controlled bound, which combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layer's effective step below a safe limit, stabilizing training and improving accuracy when combined with concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.
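To make the batch-size-aware part of the criterion concrete, here is a minimal sketch of how a per-layer gradient-noise estimate could be computed from per-sample gradients. The function name `noise_bound` and its exact form are illustrative assumptions; the paper's actual bound is not reproduced here. It relies only on the standard fact that the variance of a mini-batch gradient scales as sigma^2 / B.

```python
import numpy as np

def noise_bound(per_sample_grads: np.ndarray, batch_size: int) -> float:
    """Illustrative batch-size-aware gradient-noise estimate for one layer.

    per_sample_grads: array of shape (num_samples, num_params) holding
    individual-example gradients for this layer's parameters.

    The variance of the averaged mini-batch gradient scales as
    sigma^2 / batch_size, so this estimate shrinks as batches grow.
    """
    # Per-coordinate variance of single-example gradients (sigma^2),
    # summed over the layer's parameters.
    sigma_sq = per_sample_grads.var(axis=0).sum()
    # Mini-batch gradient-noise level: sigma^2 / B.
    return sigma_sq / batch_size
```

Doubling the batch size halves the estimate, which is the sense in which the criterion is "batch-size-aware."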
Problem

Research questions and friction points this paper is trying to address.

Analyzing loss of trainability in continual learning when gradients stop improving performance
Identifying unreliable single indicators like Hessian rank and gradient norms for predicting trainability
Developing predictive thresholds to stabilize training and prevent accuracy degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Batch-size-aware gradient-noise bound criterion
Curvature volatility-controlled bound criterion
Per-layer scheduler with safe step limit
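The three contributions above can be caricatured in a few lines: a per-layer learning rate is clamped so that the effective step implied by the noise and curvature-volatility terms stays below a safe limit. This is a hedged sketch, not the paper's scheduler; `safe_layer_lr`, the additive combination of the two terms, and the `safe_limit` parameter are all assumptions for illustration.

```python
def safe_layer_lr(base_lr: float,
                  noise_term: float,
                  curvature_volatility: float,
                  safe_limit: float = 1.0,
                  eps: float = 1e-8) -> float:
    """Illustrative per-layer step-size clamp (not the paper's exact rule).

    Combines a gradient-noise term and a curvature-volatility term into a
    single threshold, then caps the layer's learning rate so the effective
    step stays below safe_limit.
    """
    # Larger noise or curvature volatility -> smaller admissible step.
    threshold = safe_limit / (noise_term + curvature_volatility + eps)
    return min(base_lr, threshold)
```

When both terms are small the base rate passes through unchanged; as either grows, the layer's rate decays, which is consistent with the decay-like learning-rate trajectories the paper reports.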
Gunbir Singh Baveja
University of British Columbia, Canada CIFAR AI Chair (Amii)
Mark Schmidt
Professor of Computer Science, University of British Columbia
Machine Learning · Optimization