🤖 AI Summary
This work investigates how decreasing the bandwidth of a translation-invariant kernel during training affects the generalization performance of kernel regression solved with gradient descent. To address the strong dependence on model selection and the difficulty of controlling overfitting, we propose a bandwidth-decay strategy and establish theoretically, for the first time, that it induces the double-descent phenomenon and enables benign overfitting. Furthermore, we introduce the “zero-bandwidth limit” paradigm: decreasing the bandwidth all the way to zero removes the need to select a bandwidth by cross-validation or marginal likelihood maximization. We extend this principle to the neural tangent kernel (NTK) by modifying the NTK during training so that it behaves as if its bandwidth were decreasing to zero. Experiments on synthetic and real-world datasets show that kernel regression with a decreasing bandwidth outperforms fixed-bandwidth baselines, and that the modified NTK makes neural networks overfit more benignly and converge in fewer iterations. The core contribution is a theoretical link between bandwidth dynamics and generalization behavior, which yields a model-selection-free approach to adaptive kernel learning.
📝 Abstract
We investigate changing the bandwidth of a translation-invariant kernel during training when solving kernel regression with gradient descent. We present a theoretical bound on the out-of-sample generalization error that advocates for decreasing the bandwidth (and thus increasing the model complexity) during training. We further use the bound to show that kernel regression exhibits double descent behavior when the model complexity is expressed as the minimum allowed bandwidth during training. Decreasing the bandwidth all the way to zero results in benign overfitting and also circumvents the need for model selection. We demonstrate the double descent behavior on real and synthetic data, and we show that kernel regression with a decreasing bandwidth outperforms kernel regression with a constant bandwidth, whether that bandwidth is selected by cross-validation or by marginal likelihood maximization. Finally, we apply our findings to neural networks: by modifying the neural tangent kernel (NTK) during training so that it behaves as if its bandwidth were decreasing to zero, we can make the network overfit more benignly and converge in fewer iterations.
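To make the idea of kernel gradient descent with a shrinking bandwidth concrete, here is a minimal sketch. It is not the authors' algorithm: the Gaussian kernel, the geometric decay schedule, the function-space update rule, and every name and default value (`kgd_decreasing_bandwidth`, `sigma_start`, `sigma_min`, `decay`, `lr`) are assumptions chosen for illustration only.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (translation-invariant) kernel between rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def kgd_decreasing_bandwidth(
    x_train, y_train, x_test,
    sigma_start=1.0,   # assumed initial bandwidth
    sigma_min=0.05,    # assumed minimum allowed bandwidth
    decay=0.99,        # assumed geometric decay factor per step
    lr=0.5,            # assumed step size (<= 1 keeps the update stable)
    n_steps=2000,
):
    """Function-space gradient descent on squared error with a kernel
    whose bandwidth shrinks at every step (illustrative schedule)."""
    n = len(x_train)
    f_train = np.zeros(n)
    f_test = np.zeros(len(x_test))
    sigma = sigma_start
    for _ in range(n_steps):
        residual = y_train - f_train
        k_train = gaussian_kernel(x_train, x_train, sigma)
        k_test = gaussian_kernel(x_test, x_train, sigma)
        # Both updates use the same residual, computed before the step.
        f_test = f_test + (lr / n) * (k_test @ residual)
        f_train = f_train + (lr / n) * (k_train @ residual)
        # Shrink the bandwidth, but never below the allowed minimum.
        sigma = max(sigma * decay, sigma_min)
    return f_train, f_test
```

In this sketch, `sigma_min` plays the role of the "minimum allowed bandwidth" that the abstract uses as the complexity measure. Early steps with a wide kernel fit the smooth part of the target, and later steps with a narrow kernel fit the remaining residual more locally. Inputs are expected as 2-D arrays of shape `(n_samples, n_features)`.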