The Importance of Being Lazy: Scaling Limits of Continual Learning

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how model scale and the degree of feature learning shape catastrophic forgetting (CF) in continual learning. Motivated by the susceptibility of neural networks to forgetting in non-stationary environments, and by the lack of a mechanistic understanding of CF, we use a variable parameterization of the architecture that explicitly interpolates between the “lazy” (low feature learning) and “rich” (high feature learning) training regimes. We develop a dynamical mean-field theory of the infinite-width training dynamics and identify a transition, modulated by task similarity, between an effectively lazy regime with low forgetting and a rich regime with significant forgetting. Experiments and theory jointly demonstrate that: (i) increasing model width improves robustness only when it reduces the amount of feature learning; (ii) an optimal, task-similarity-dependent level of feature learning minimizes forgetting; and (iii) this optimum transfers across model scales. Our core contributions are a quantitative theory linking feature learning, task similarity, and forgetting, and a controllable paradigm for continual learning design.

📝 Abstract
Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting (CF) is far from complete. In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature, by differentiating between lazy and rich training regimes through a variable parameterization of the architecture. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize CF, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. We identify a transition modulated by task similarity where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning.
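The lazy-to-rich interpolation described in the abstract can be sketched with a standard output-rescaling trick. The snippet below is a minimal NumPy illustration, assuming a richness parameter gamma in the spirit of lazy-training and mean-field parameterizations; it is not the paper's actual architecture or code. The output is scaled by 1/(gamma·√N) and the learning rate by gamma², so the function-space dynamics stay roughly comparable across gammas while the amount of parameter (feature) movement grows with gamma: small gamma behaves lazily, large gamma learns features.

```python
import numpy as np

# Minimal sketch (hypothetical, not the paper's code): a two-layer network whose degree of
# feature learning is controlled by a richness parameter gamma. The output is scaled by
# 1/(gamma * sqrt(N)) and the learning rate by gamma**2, so small gamma keeps features close
# to initialization (lazy) while large gamma lets them move substantially (rich).

rng = np.random.default_rng(0)
n, N, d = 256, 512, 32                          # samples, hidden width, input dimension
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                            # toy target

def train(gamma, lr=0.05, steps=500):
    W1 = rng.standard_normal((d, N)) / np.sqrt(d)
    w2 = rng.standard_normal(N)
    W1_0 = W1.copy()                            # keep the init to measure how much features move
    for _ in range(steps):
        h = np.tanh(X @ W1)                     # hidden features, shape (n, N)
        f = h @ w2 / (gamma * np.sqrt(N))       # output rescaled by the richness parameter
        err = (f - y) / n                       # squared-loss residual (mean reduction)
        grad_w2 = h.T @ err / (gamma * np.sqrt(N))
        grad_W1 = X.T @ (err[:, None] * w2 * (1.0 - h**2)) / (gamma * np.sqrt(N))
        w2 -= lr * gamma**2 * grad_w2           # lr * gamma**2 keeps the function-space step
        W1 -= lr * gamma**2 * grad_W1           # size roughly fixed across gammas
    return np.linalg.norm(W1 - W1_0) / np.linalg.norm(W1_0)

for gamma in (0.1, 1.0, 10.0):                  # lazier -> richer
    print(f"gamma={gamma:>4}: relative first-layer movement = {train(gamma):.4f}")
```

Running the sketch, the relative movement of the first-layer weights grows with gamma: near-zero movement corresponds to the effectively lazy regime, substantial movement to the rich, feature-learning regime.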
Problem

Research questions and friction points this paper is trying to address.

Why neural networks suffer catastrophic forgetting in non-stationary environments, and why our mechanistic understanding of it is incomplete
How model scale and the degree of feature learning affect forgetting, and how to reconcile contradictory observations on scale in the literature
Whether there is an optimal level of feature learning, and how it depends on task similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

A variable parameterization of the architecture that differentiates the lazy and rich training regimes
Infinite-width dynamics in the feature-learning regime characterized with dynamical mean-field theory, extending prior results limited to the lazy regime
A critical, task-similarity-dependent level of feature learning at which performance is optimal and which transfers across model scales (illustrated in the toy sketch below)
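To make the forgetting-versus-task-similarity relationship concrete, here is a hedged toy illustration (a hypothetical setup, not the paper's experiments): two sequential linear-regression tasks whose teacher vectors have a tunable cosine similarity, with forgetting measured as the increase in task-A loss after training on task B. It only illustrates the metric and the role of task similarity; it does not reproduce the lazy/rich transition, which requires the nonlinear, feature-learning networks analyzed in the paper.

```python
import numpy as np

# Toy illustration (hypothetical setup, not the paper's experiments): two sequential
# linear-regression tasks whose teachers have a tunable cosine similarity. Forgetting is
# the increase in task-A loss after subsequently training on task B.

rng = np.random.default_rng(1)
n, d = 512, 64                                   # samples per task, input dimension
lr, steps = 0.2, 300

def make_task(teacher):
    X = rng.standard_normal((n, d))
    return X, X @ teacher                        # noiseless linear teacher

def train(w, X, y):
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / n       # full-batch gradient descent on MSE
    return w

def forgetting(similarity):
    u = rng.standard_normal(d); u /= np.linalg.norm(u)
    v = rng.standard_normal(d); v -= (v @ u) * u; v /= np.linalg.norm(v)
    teacher_A = u
    teacher_B = similarity * u + np.sqrt(1.0 - similarity**2) * v   # cos(angle) = similarity
    XA, yA = make_task(teacher_A)
    XB, yB = make_task(teacher_B)
    w = train(np.zeros(d), XA, yA)               # task A first
    loss_A_before = np.mean((XA @ w - yA) ** 2)
    w = train(w, XB, yB)                         # then task B
    loss_A_after = np.mean((XA @ w - yA) ** 2)
    return loss_A_after - loss_A_before

for s in (0.0, 0.5, 0.9, 1.0):
    print(f"task similarity {s:.1f}: forgetting = {forgetting(s):.3f}")
```

In this toy setting the forgetting shrinks as the two teachers become more similar, mirroring the qualitative role that task similarity plays in the paper's analysis.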
👥 Authors
Jacopo Graldi
Dept. of Information Technology and Electrical Engineering, ETH Zurich, Switzerland
Alessandro Breccia
Dept. of Physics and Astronomy, University of Padua, Italy
Giulia Lanzillotta
PhD fellow at ETH AI Center
continual learning, bio-inspired learning, general artificial intelligence
Thomas Hofmann
Dept. of Computer Science, ETH Zurich
Lorenzo Noci
PhD Student, ETH Zürich
deep learning, machine learning