🤖 AI Summary
This work addresses the challenge of deriving generalization bounds for parallel positively homogeneous neural networks, a class that includes deep linear/ReLU networks, single-layer multi-head attention, and matrix/tensor decompositions. We propose the first unified convex relaxation framework, embedding the non-convex empirical risk minimization problem into a convex space of prediction functions. By introducing a controllable bias term, our approach enables generalization analysis without relying on conventional parameter-norm or complexity-based regularizers. Leveraging the theory of positively homogeneous functions and empirical process techniques, we derive structured risk bounds whose sample complexity scales near-linearly with network width across multiple model classes, substantially improving upon existing results. Our framework establishes a novel theoretical pathway for analyzing generalization in non-convex neural networks, offering both conceptual unification and quantitative advances.
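As a rough sketch of the model class (notation illustrative, not necessarily the paper's own): a parallel positively homogeneous network of width $k$ sums $k$ maps, each positively homogeneous of some degree $L > 0$ in its own parameter block,
\[
f(x;\Theta) = \sum_{i=1}^{k} \phi(x;\theta_i), \qquad \phi(x;\alpha\,\theta_i) = \alpha^{L}\,\phi(x;\theta_i) \quad \text{for all } \alpha \ge 0,
\]
a template that covers, for example, rank-one factors in matrix/tensor decompositions, hidden units of two-layer linear or ReLU networks, and individual attention heads.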
📝 Abstract
We propose a general framework for deriving generalization bounds for parallel positively homogeneous neural networks, a class of networks whose input-output map decomposes as the sum of positively homogeneous maps. Examples of such networks include matrix factorization and sensing, single-layer multi-head attention mechanisms, tensor factorization, deep linear and ReLU networks, and more. Our general framework is based on linking the non-convex empirical risk minimization (ERM) problem to a closely related convex optimization problem over prediction functions, which provides a global, achievable lower bound on the ERM problem. We exploit this convex lower bound to perform generalization analysis in the convex space while controlling the discrepancy between the convex model and its non-convex counterpart. We apply our general framework to a wide variety of models, including low-rank matrix sensing, structured matrix sensing, two-layer linear networks, two-layer ReLU networks, and single-layer multi-head attention mechanisms, achieving generalization bounds with a sample complexity that scales almost linearly with the network width.
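A minimal sketch of the lower bound referred to above, assuming a convex loss $\ell$ and writing $\mathcal{F}$ for a convex set of prediction functions that contains every input-output map the network can realize (both symbols are ours, for illustration only):
\[
\min_{\Theta}\ \frac{1}{n}\sum_{j=1}^{n} \ell\bigl(f(x_j;\Theta),\, y_j\bigr)
\;\ge\;
\min_{g \in \mathcal{F}}\ \frac{1}{n}\sum_{j=1}^{n} \ell\bigl(g(x_j),\, y_j\bigr),
\]
since enlarging the feasible set can only decrease the minimum. Generalization is then analyzed for the convex problem on the right, while the discrepancy between the convex minimizer and the functions the non-convex network can actually realize is kept under control.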