🤖 AI Summary
This work identifies a hidden failure mode in continual learning: upstream gradient-modification methods (projection, penalty rescaling, and replay mixing) interfere with Adam's moment estimation, because the modified gradient feeds both the first and second moments, and the resulting conflict leaves forgetting close to, or worse than, the vanilla baseline. To address this, the authors propose adaptive decoupled moment routing, which applies the modified gradient only to the first-moment update while keeping the original gradient for magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. The same conflict appears at 7B scale under LoRA, and the fix is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale. On an 8-domain continual language-modeling benchmark it holds forgetting at 9.4, improving over vanilla (13.2) by 3.8 units; on a 16-domain stream its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units.
📝 Abstract
Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM setting, all shared-routing projection baselines collapse to forgetting close to vanilla (12.5–12.8 vs. 13.2; lower is better). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling is worse than vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure is largely invisible on clean benchmarks.
We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1 − α) inflation of the old-direction effective learning rate, matching measurements within 8% across eight α values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
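
The routing idea can be illustrated with a toy scalar sketch. This is not the authors' implementation; it rests on several assumptions: the gradient along an "old" direction is constant, projection is modeled as scaling that gradient by (1 - alpha), and the overlap-aware adaptive strength is omitted because the abstract does not specify it. The names `adam_step` and `final_step` are hypothetical. Under shared routing the scaled gradient enters both moments, so Adam's normalization cancels the attenuation; under decoupled routing the second moment sees the original gradient and the attenuation survives. In this toy setting the ratio of the two step sizes comes out at about 1/(1 - alpha), which is one way to read the inflation factor described above.

```python
import numpy as np


def adam_step(m, v, g_first, g_second, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update in which each moment may receive a different gradient."""
    m = b1 * m + (1 - b1) * g_first          # first moment: modified gradient
    v = b2 * v + (1 - b2) * g_second ** 2    # second moment: routed separately
    m_hat = m / (1 - b1 ** t)                # standard bias correction
    v_hat = v / (1 - b2 ** t)
    return m, v, lr * m_hat / (np.sqrt(v_hat) + eps)


def final_step(alpha, decoupled, steps=200, g=1.0):
    """Step size along the old direction after `steps` updates (toy model)."""
    m, v, step = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g_mod = (1 - alpha) * g              # assumption: projection attenuates by (1 - alpha)
        g_sec = g if decoupled else g_mod    # decoupled: second moment keeps the original gradient
        m, v, step = adam_step(m, v, g_mod, g_sec, t)
    return step


alpha = 0.5
shared = final_step(alpha, decoupled=False)
routed = final_step(alpha, decoupled=True)
ratio = shared / routed                      # compare with 1 / (1 - alpha)
print(shared, routed, ratio)
```

With shared routing the step stays near the full learning rate regardless of alpha, so the intended attenuation never reaches the parameters. A real setup would have noisy, high-dimensional gradients and a direction-dependent projection, so the toy ratio should be read as intuition only, not as a reproduction of the reported measurements.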