🤖 AI Summary
In continual learning, models often suffer from loss of trainability (LoT), a stagnation or degradation in accuracy despite sufficient capacity and supervision. This work identifies, for the first time, that LoT under Adam arises from the joint deterioration of gradient noise and Hessian curvature fluctuations, unifying the noise and curvature perspectives. We propose a hierarchical trainability prediction threshold that jointly incorporates batch-aware noise bounds and curvature-fluctuation bounds, showing that either bound alone is insufficient for reliable LoT detection. Building on this threshold, we design a hierarchical dynamic step-size scheduler that integrates CReLU activation, Wasserstein regularization, and L2 decay. Evaluated on multiple continual learning benchmarks, our method significantly improves training stability and final accuracy, and the learned learning-rate trajectories naturally exhibit principled decay patterns, supporting the method's soundness and generalizability.
📝 Abstract
Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT under Adam through an optimization lens and find that single indicators, such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy, are not reliable predictors. Instead, we introduce two complementary criteria, a batch-size-aware gradient-noise bound and a curvature-volatility-controlled bound, which combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layer's effective step below a safe limit, stabilizing training and improving accuracy in combination with concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.
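The abstract's core mechanism, a per-layer threshold built from a gradient-noise bound and a curvature-volatility bound, with the learning rate clipped below the tighter of the two, can be sketched as follows. This is a minimal illustration, not the paper's method: the exact bound formulas are not given in the abstract, so both expressions below (`noise_bound`, `curvature_bound`) and the function names are illustrative placeholders.

```python
import numpy as np

def layer_threshold(grad_samples, curvature_samples, batch_size):
    """Hypothetical per-layer trainability threshold combining a
    batch-size-aware gradient-noise bound with a curvature-volatility
    bound. Both formulas are illustrative stand-ins for the paper's."""
    g = np.asarray(grad_samples, dtype=float)
    # Noise bound: shrink the allowed step as per-sample gradient variance
    # grows; larger batches average noise away (1/sqrt(B) scaling).
    noise_bound = 1.0 / (1.0 + g.var() / np.sqrt(batch_size))

    c = np.asarray(curvature_samples, dtype=float)
    # Curvature bound: a 1/L-style cap on the step, further tightened
    # when curvature estimates fluctuate (std term).
    curvature_bound = 1.0 / (np.abs(c).max() + c.std() + 1e-12)

    # The paper argues neither bound alone suffices; combining them means
    # taking the tighter (smaller) of the two per layer.
    return min(noise_bound, curvature_bound)

def schedule_lr(base_lr, thresholds):
    """Clip each layer's learning rate so its effective step stays
    below that layer's predicted safe limit."""
    return {name: min(base_lr, t) for name, t in thresholds.items()}

# Toy usage with two hypothetical layers: a well-conditioned layer keeps
# the base rate, while a high-curvature layer gets clipped.
thresholds = {
    "conv1": layer_threshold([1.0, 1.0, 1.0], [2.0, 2.0], batch_size=32),
    "fc":    0.001,  # pretend this layer's bound came out very tight
}
print(schedule_lr(0.01, thresholds))
```

In a real training loop, `grad_samples` and `curvature_samples` would come from per-layer minibatch gradients and Hessian (or Hessian-trace) estimates gathered during training, and the clipping would be applied to each parameter group's step each iteration.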