🤖 AI Summary
This work addresses a gap in the generalization theory of gradient flow by proposing the Loss Path Kernel (LPK), a data-dependent, path-aware kernel that jointly models the full training trajectory and the data distribution, overcoming a fundamental limitation of static kernels such as the NTK, which ignore optimization dynamics. Leveraging the LPK, the authors derive a computable, tight upper bound on the Rademacher complexity that quantitatively links the evolution of loss gradient norms during training to generalization error. The bound is both theoretically rigorous and empirically meaningful: it recovers known generalization behavior for overparameterized kernel regression and correlates strongly with measured generalization gaps on real-world datasets. It also gives a unified characterization of neural network feature learning and kernel methods, revealing their essential differences through the lens of training-path geometry and data structure.
📝 Abstract
Gradient-based optimization methods have shown remarkable empirical success, yet their theoretical generalization properties remain only partially understood. In this paper, we establish a generalization bound for gradient flow that aligns with the classical Rademacher complexity bounds for kernel methods (specifically, those based on the RKHS norm and kernel trace) through a data-dependent kernel called the loss path kernel (LPK). Unlike static kernels such as the NTK, the LPK captures the entire training trajectory, adapting to both data and optimization dynamics, leading to tighter and more informative generalization guarantees. Moreover, the bound highlights how the norm of the training loss gradients along the optimization trajectory influences the final generalization performance. The key technical ingredients in our proof combine stability analysis of gradient flow with uniform convergence via Rademacher complexity. Our bound recovers existing kernel regression bounds for overparameterized neural networks and shows the feature learning capability of neural networks compared to kernel methods. Numerical experiments on real-world datasets validate that our bounds correlate well with the true generalization gap.
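The abstract describes the LPK as a kernel built from loss gradients accumulated along the optimization trajectory. The sketch below, which is an illustration rather than the paper's exact construction, discretizes gradient flow with small gradient-descent steps on a tiny linear model and accumulates the Gram matrix of per-example loss gradients, K(x, x') ≈ Σ_t ⟨∇_θ ℓ(θ_t; x), ∇_θ ℓ(θ_t; x')⟩ Δt. All function names and the model are illustrative assumptions.

```python
import numpy as np

def loss_grad(theta, x, y):
    """Gradient of the squared loss 0.5*(theta·x - y)^2 w.r.t. theta."""
    return (theta @ x - y) * x

def train_and_lpk(X, y, steps=200, lr=0.01):
    """Run full-batch gradient descent (a discretization of gradient flow)
    and accumulate a loss-path-kernel-style Gram matrix over the trajectory.
    This is a hypothetical sketch, not the paper's definition."""
    n, d = X.shape
    theta = np.zeros(d)
    K = np.zeros((n, n))  # accumulated path-kernel Gram matrix
    for _ in range(steps):
        # Per-example loss gradients at the current parameters theta_t.
        G = np.stack([loss_grad(theta, X[i], y[i]) for i in range(n)])
        K += lr * (G @ G.T)            # Δt = lr in the time discretization
        theta -= lr * G.mean(axis=0)   # gradient step on the average loss
    return theta, K

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta, K = train_and_lpk(X, y)
```

By construction `K` is a sum of positive semidefinite Gram matrices, so it is itself symmetric PSD, and its trace aggregates the squared loss-gradient norms along the trajectory, mirroring the quantity the bound depends on.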