Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens

📅 2025-07-07

🤖 AI Summary
This work addresses two limitations of neural scaling laws: (1) performance scaling exponents alone cannot distinguish architectures or training regimes whose internal dynamics differ, and (2) it has been unclear how the loss of feature learning, as finite-width models approach the infinite-width limit, affects scaling behavior. Using the neural tangent kernel (NTK) as a lens, we empirically analyze training dynamics under data and model scaling. On standard vision tasks, similar scaling exponents can mask opposite internal dynamics. We identify a maximum width above which feature learning is lost; in our setups, this width is more than ten times smaller than typical large language model widths. We also quantify the transition between kernel-driven and feature-driven scaling regimes. Together, these findings show that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks.

📝 Abstract
Scaling laws offer valuable insights into the relationship between neural network performance and computational cost, yet their underlying mechanisms remain poorly understood. In this work, we empirically analyze how neural networks behave under data and model scaling through the lens of the neural tangent kernel (NTK). This analysis establishes a link between performance scaling and the internal dynamics of neural networks. Our findings on standard vision tasks show that similar performance scaling exponents can occur even when the internal model dynamics show opposite behavior. This demonstrates that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks. We also address a previously unresolved issue in neural scaling: how convergence to the infinite-width limit affects scaling behavior in finite-width models. To this end, we investigate how feature learning is lost as the model width increases and quantify the transition between kernel-driven and feature-driven scaling regimes. We identify the maximum model width that supports feature learning, which, in our setups, we find to be more than ten times smaller than typical large language model widths.
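
The abstract does not say how the internal dynamics are measured. As an illustration only, one commonly used proxy for feature learning is how much the empirical NTK changes during training: a nearly constant kernel suggests kernel-driven (lazy) behavior, while a large change suggests that features are being learned. The sketch below computes this for a toy two-layer tanh network in NTK parameterization. The architecture, function names, widths, and hyperparameters are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def init_params(width, dim, rng):
    """Standard-normal weights for f(x) = a . tanh(W x / sqrt(dim)) / sqrt(width)."""
    return rng.standard_normal((width, dim)), rng.standard_normal(width)


def forward(params, X):
    W, a = params
    H = X @ W.T / np.sqrt(X.shape[1])            # pre-activations, shape (n, width)
    return np.tanh(H) @ a / np.sqrt(W.shape[0])  # scalar output per sample


def jacobian(params, X):
    """Gradient of each output w.r.t. all parameters, shape (n, width*dim + width)."""
    W, a = params
    n, d = X.shape
    m = W.shape[0]
    act = np.tanh(X @ W.T / np.sqrt(d))
    d_a = act / np.sqrt(m)                                                # df/da
    d_W = ((1 - act**2) * a)[:, :, None] * X[:, None, :] / np.sqrt(m * d)  # df/dW
    return np.concatenate([d_W.reshape(n, -1), d_a], axis=1)


def empirical_ntk(params, X):
    """Empirical NTK: K[i, j] = <grad f(x_i), grad f(x_j)>."""
    J = jacobian(params, X)
    return J @ J.T


def train(params, X, y, lr, steps):
    """Full-batch gradient descent on 0.5 * mean squared error."""
    W, a = params[0].copy(), params[1].copy()
    n = X.shape[0]
    for _ in range(steps):
        resid = forward((W, a), X) - y
        grad = jacobian((W, a), X).T @ resid / n
        W -= lr * grad[:W.size].reshape(W.shape)
        a -= lr * grad[W.size:]
    return W, a


def relative_ntk_change(width, X, y, lr=0.3, steps=300, seed=0):
    """Frobenius-norm change of the empirical NTK over training, relative to init."""
    rng = np.random.default_rng(seed)
    p0 = init_params(width, X.shape[1], rng)
    K0 = empirical_ntk(p0, X)
    KT = empirical_ntk(train(p0, X, y, lr, steps), X)
    return np.linalg.norm(KT - K0) / np.linalg.norm(K0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 5))   # toy inputs (assumption)
    y = np.sin(X[:, 0])                # toy regression target (assumption)
    for width in (8, 64, 512, 2048):
        print(width, relative_ntk_change(width, X, y))
```

In this toy setting the relative kernel change is expected to shrink as the width grows, which is the qualitative signature of a network drifting from feature-driven toward kernel-driven behavior. Locating the width at which that transition happens for realistic architectures is the kind of question the paper investigates.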

Problem

Research questions and friction points this paper is trying to address.

Link performance scaling to internal neural network dynamics
Understand how convergence to the infinite-width limit affects scaling behavior in finite-width models
Determine the maximum model width that still supports feature learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing neural networks via the neural tangent kernel (NTK)
Linking performance scaling to internal dynamics
Identifying maximum width for feature learning