🤖 AI Summary
Existing Neural Tangent Kernel (NTK) theory relies on convergence bounds derived from the smallest eigenvalue, which are overly pessimistic and fail to explain the rapid convergence observed in practical neural network training. This work proposes a refined analytical framework based on the alignment between Label-NTK and Residual-NTK, revealing for the first time that the projections of labels and training residuals onto NTK eigenvectors scale proportionally with their corresponding eigenvalues. Leveraging this insight, the authors derive tight convergence and improved generalization bounds that depend on the full spectral structure of the NTK. Combining NTK linearized dynamics, spectral analysis, theoretical proofs, and extensive experiments across multiple datasets—including both MLPs and CNNs—the proposed bounds significantly outperform classical worst-case results, more accurately capture real-world training dynamics, and validate theoretical predictions on standard benchmarks.
📝 Abstract
The Neural Tangent Kernel (NTK) framework explains optimization in over-parameterized neural networks via approximately linearized dynamics, yielding exponential convergence guarantees. However, existing results are often overly pessimistic and do not match the fast training in practice, as they depend on the smallest NTK eigenvalue, which is typically extremely small in practice. In this work, we develop sharper convergence guarantees by characterizing the interaction between data labels and the NTK eigen-spectrum. We identify two key phenomena, Label-NTK alignment and Residual-NTK alignment, showing that projections of labels and residuals onto NTK eigenvectors scale with the corresponding eigenvalues. We provide empirical evidence and theoretical justification under mild data assumptions. Exploiting these alignment properties, we derive a refined convergence bound that depends on the full spectrum and closely matches practical training dynamics, significantly improving over classical worst-case results. We further obtain improved generalization bounds. Experiments on MLPs and CNNs across multiple datasets validate our theory.