🤖 AI Summary
This work investigates the independence and sequentiality of feature learning during neural network training, focusing on the linearly independent evolution of hidden-layer feature bases and its impact on optimization dynamics. We introduce the "effective rank" to quantify feature-basis diversity and empirically discover, for the first time, that it grows in a stepwise (staircase-like) manner during training, exhibiting a strong negative correlation with the training loss; we term this the "staircase phenomenon." We theoretically establish a lower bound on the loss that decreases monotonically as the effective rank grows. Numerical experiments, supported by matrix analysis and asymptotic theory, demonstrate that advanced optimizers accelerate effective-rank growth and skip redundant staircase steps, thereby significantly improving convergence speed. Our core contribution is a quantitative link among feature independence, effective rank, and optimization efficiency, providing a novel perspective on the intrinsic mechanisms of deep learning.
📝 Abstract
In recent years, deep learning, powered by neural networks, has achieved widespread success in solving high-dimensional problems, particularly those with low-dimensional feature structures. This success stems from the ability of neural networks to identify and learn low-dimensional features tailored to the problem at hand. How neural networks extract such features during training remains a fundamental question in deep learning theory. In this work, we propose a novel perspective by interpreting the neurons in the last hidden layer of a neural network as basis functions that represent essential features. To explore the linear independence of these basis functions throughout the training dynamics, we introduce the concept of 'effective rank'. Our extensive numerical experiments reveal a notable phenomenon: the effective rank increases progressively during the learning process, exhibiting a staircase-like pattern, while the loss function concurrently decreases as the effective rank rises. We refer to this observation as the 'staircase phenomenon'. Specifically, for deep neural networks, we rigorously prove the negative correlation between the loss function and the effective rank, showing that the lower bound of the loss function decreases as the effective rank increases. Therefore, to achieve a rapid descent of the loss function, it is critical to promote swift growth of the effective rank. Finally, we evaluate existing advanced learning methodologies and find that they quickly attain a higher effective rank, thereby avoiding redundant staircase steps and accelerating the decline of the loss function.
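To make the idea concrete, here is a minimal sketch of how one might measure the effective rank of last-hidden-layer features: evaluate the hidden neurons on a batch of sample points, treat each neuron as a column of a feature matrix, and count the singular values above a relative tolerance. The paper's precise definition may differ; the network, tolerance, and all names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def effective_rank(features, tol=1e-6):
    # Numerical rank proxy: number of singular values exceeding
    # tol times the largest singular value of the feature matrix.
    s = np.linalg.svd(features, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > tol * s[0]))

# Toy setup (assumed for illustration): a random one-hidden-layer tanh
# network evaluated on sampled inputs. Rows are samples; each column is
# one hidden neuron viewed as a basis function.
rng = np.random.default_rng(0)
n_samples, d_in, n_hidden = 200, 2, 50
X = rng.standard_normal((n_samples, d_in))
W = rng.standard_normal((d_in, n_hidden))
b = rng.standard_normal(n_hidden)
Phi = np.tanh(X @ W + b)  # feature matrix, shape (n_samples, n_hidden)

r = effective_rank(Phi)
print(r)  # bounded above by min(n_samples, n_hidden)
```

Tracking this quantity at checkpoints during training is what would expose the staircase-like growth the abstract describes: plateaus where the rank (and loss) stalls, separated by jumps when a new linearly independent feature direction emerges.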