🤖 AI Summary
Existing generalization bounds for over-parameterized shallow neural networks are loose or even vacuous, as they scale with the square root of the network width due to their dependence on the spectral norm of the initialization matrix. This work addresses this limitation by measuring the distance from initialization with a path norm and introducing a new layer-wise peeling technique. For general Lipschitz activation functions, the authors establish a non-vacuous generalization upper bound that depends on the path-norm of the distance from initialization and exhibits only logarithmic dependence on the network width. This bound significantly improves upon existing results and, for the first time, provides matching upper and lower bounds up to constant factors for over-parameterized shallow networks. The theoretical analysis combines Rademacher complexity with initialization-dependent constraints, and experiments confirm that the proposed approach yields substantially tighter generalization guarantees.
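The square-root-of-width scaling mentioned above is a standard random-matrix fact: for an $m \times d$ matrix with i.i.d. standard Gaussian entries, the spectral norm concentrates around $\sqrt{m} + \sqrt{d}$. The snippet below is an illustrative numpy check of this growth; the widths, input dimension, and $\mathcal{N}(0,1)$ initialization are hypothetical choices for demonstration, not the paper's exact setup.

```python
import numpy as np

def init_spectral_norms(widths, d=50, seed=0):
    """Spectral norm of an i.i.d. Gaussian initialization matrix
    W0 of shape (m, d) as the width m grows.  Random-matrix theory
    predicts ||W0||_2 ~ sqrt(m) + sqrt(d), i.e. square-root growth
    in the width -- which is why generalization bounds built on
    ||W0||_2 blow up for very wide (over-parameterized) networks.
    """
    rng = np.random.default_rng(seed)
    return {m: float(np.linalg.norm(rng.normal(size=(m, d)), 2))
            for m in widths}

# Widening the layer 64x roughly multiplies the spectral norm
# of the initialization by sqrt(64) = 8 (up to the sqrt(d) term).
norms = init_spectral_norms([100, 6400])
```

Any bound proportional to this quantity therefore grows without limit as the network is over-parameterized, which is exactly the failure mode the paper's initialization-dependent analysis avoids.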
📝 Abstract
Overparameterized neural networks often exhibit a benign overfitting property, in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of the distance from initialization, motivated by the empirical observation that this distance is often significantly smaller than the norm of the weights themselves. However, the existing initialization-dependent complexity analyses cannot fully exploit the power of initialization, since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width and is therefore not effective for overparameterized models. In this paper, we develop the first \emph{fully} initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization, and are derived by introducing a new peeling technique to handle the challenges arising from the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.
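To make the central quantity concrete, here is a minimal numpy sketch of one plausible path-norm of the distance from initialization for a shallow network $f(x) = v^\top \sigma(Wx)$. The specific aggregation used here (summing, over all input-to-output paths, the product of the absolute output-weight deviation and the absolute hidden-weight deviation) is an assumption for illustration; the paper's precise definition may differ.

```python
import numpy as np

def path_norm_of_distance(W, V, W0, V0):
    """Illustrative path-norm of the distance from initialization for a
    shallow network f(x) = V @ sigma(W @ x), with hidden weights W of
    shape (m, d) and output weights V of shape (m,).

    Assumed (hypothetical) definition: sum over every input->hidden->output
    path of |V_j - V0_j| * |W_ji - W0_ji|.  Note this is zero at
    initialization and, unlike the spectral norm of W0, does not grow
    with the width m when the weights stay close to where they started.
    """
    dW = np.abs(W - W0)   # (m, d): hidden-layer deviations
    dV = np.abs(V - V0)   # (m,):   output-layer deviations
    # For unit j, the paths through it contribute dV[j] * sum_i dW[j, i].
    return float(dV @ dW.sum(axis=1))

# Tiny example: 2 hidden units, 2 inputs, zero initialization.
W0, V0 = np.zeros((2, 2)), np.zeros(2)
W = np.array([[1.0, 2.0], [3.0, 4.0]])
V = np.array([1.0, 1.0])
pn = path_norm_of_distance(W, V, W0, V0)  # -> 10.0
```

Because every term measures a deviation from initialization rather than an absolute weight magnitude, this kind of quantity can stay small even for very wide networks, which is the intuition behind the width-logarithmic bounds stated above.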