🤖 AI Summary
This work investigates the incremental learning phenomenon in over-parameterized matrix factorization—specifically, why gradient flow with small initialization learns the singular components of the target matrix in descending order of singular value. We derive a closed-form dynamical characterization based on Riccati-type matrix differential equations. Our analysis rigorously establishes time-scale separation as the core mechanism: components associated with larger singular values evolve as fast variables and converge rapidly, while those tied to smaller singular values act as slow variables, activating later. Tuning the initialization scale gives precise control over the learning sequence and yields controllable low-rank approximations. Methodologically, the approach integrates analytical tools from matrix differential equations, gradient flow dynamics modeling, and symmetric decomposition theory. It yields the first quantitative characterization of the entire learning trajectory and provides an extensible theoretical framework for asymmetric factorizations.
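To make the Riccati-type mechanism concrete, here is a brief sketch of the symmetric setup under standard notation (the symbols $U$, $\Sigma$, $X$ are assumed for illustration, not taken from the paper). For the loss $L(U) = \tfrac{1}{4}\|UU^\top - \Sigma\|_F^2$ with a symmetric PSD target $\Sigma$, gradient flow reads

$$\dot U = -\nabla L(U) = -(UU^\top - \Sigma)\,U,$$

and the product $X := UU^\top$ then evolves as

$$\dot X = \dot U U^\top + U \dot U^\top = -(X - \Sigma)X - X(X - \Sigma),$$

a matrix Riccati-type equation: quadratic in $X$, which is what makes a closed-form solution of the full trajectory tractable.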
📝 Abstract
Many theoretical studies on neural networks attribute their excellent empirical performance to the implicit bias or regularization induced by first-order optimization algorithms when training networks under certain initialization assumptions. One example is the incremental learning phenomenon in gradient flow (GF) on an overparameterized matrix factorization problem with small initialization: GF learns a target matrix by sequentially learning its singular values in decreasing order of magnitude over time. In this paper, we develop a quantitative understanding of this incremental learning behavior for GF on the symmetric matrix factorization problem, using its closed-form solution obtained by solving a Riccati-like matrix differential equation. We show that incremental learning emerges from a time-scale separation among the dynamics corresponding to learning different components of the target matrix. Decreasing the initialization scale makes these time-scale separations more prominent, allowing one to find low-rank approximations of the target matrix. Lastly, we discuss possible avenues for extending this analysis to asymmetric matrix factorization problems.
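The phenomenon described in the abstract is easy to observe numerically. Below is a minimal sketch (not the paper's code; the target spectrum, initialization scale, and step size are illustrative choices) that runs gradient descent, a discretization of gradient flow, on the symmetric loss $\tfrac{1}{4}\|UU^\top - S\|_F^2$ with a rank-3 PSD target. With a small initialization scale, the top eigenvalue of $UU^\top$ is learned first, then the second, then the third.

```python
import numpy as np

# Illustrative simulation of incremental learning in symmetric matrix
# factorization. S is a rank-3 PSD target with eigenvalues 4 > 2 > 1;
# with small initialization, gradient descent picks these up one at a
# time, in decreasing order. All constants here are assumptions.
rng = np.random.default_rng(0)

d, k = 10, 10                          # ambient dimension, factor width
target_eigs = np.array([4.0, 2.0, 1.0])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
S = Q[:, :3] @ np.diag(target_eigs) @ Q[:, :3].T   # rank-3 PSD target

alpha = 1e-4                           # small initialization scale
U = alpha * rng.standard_normal((d, k))

eta = 1e-3                             # step size (small, for stability)
snapshots = []
for t in range(21001):
    E = U @ U.T - S
    U -= eta * (E @ U)                 # gradient of (1/4)||UU^T - S||_F^2
    if t % 1500 == 0:
        # Record the top three eigenvalues of UU^T, largest first.
        eigs = np.sort(np.linalg.eigvalsh(U @ U.T))[::-1][:3]
        snapshots.append(eigs)

for i, eigs in enumerate(snapshots):
    print(f"step {i * 1500:6d}: top eigenvalues ~ {np.round(eigs, 3)}")
```

The printed trajectory shows the staircase pattern the paper analyzes: the eigenvalue near 4 saturates first while the one near 1 is still close to zero, and shrinking `alpha` further stretches these plateaus apart, which is exactly what enables reading off low-rank approximations along the trajectory.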