🤖 AI Summary
This work addresses the minimal parameter count required for Lipschitz-robust interpolation by overparameterized models. It extends the Bubeck–Sellke lower bound, originally established for squared loss and scalar responses, to general Bregman divergences (e.g., cross-entropy, squared error) and vector-valued outputs. The authors reformulate the proof as a bias–variance decomposition, which removes the reliance on Rademacher complexity; instead, they directly use properties of Bregman divergences, concentration inequalities, and Lipschitz constraints. Their key contribution is a unified robustness law: for any Bregman loss, $d$-dimensional inputs, and $m$-dimensional outputs, achieving Lipschitz interpolation necessitates $\Omega(n + d)$ parameters, where $n$ is the sample size. This lower bound shows that overparameterization is necessary under generalized losses and in multi-output learning, unifying and generalizing prior results on robust interpolation.
📝 Abstract
In contemporary deep learning practice, models are often trained to near-zero loss, i.e., to nearly interpolate the training data. However, the number of parameters in the model is usually far larger than the number of data points $n$, the theoretical minimum needed for interpolation: a phenomenon referred to as overparameterization. In an interesting piece of work, Bubeck and Sellke considered a natural notion of interpolation: the model is said to interpolate when its training loss goes below the loss of the conditional expectation of the response given the covariate. For this notion of interpolation and for a broad class of covariate distributions (specifically, those satisfying a natural notion of concentration of measure), they showed that overparameterization is necessary for robust interpolation, i.e., when the interpolating function is required to be Lipschitz. Their main proof technique applies to regression with square loss against a scalar response, but they remark that, via a connection to Rademacher complexity and using tools such as the Ledoux-Talagrand contraction inequality, their result can be extended to more general losses, at least in the case of scalar response variables. In this work, we recast the original proof technique of Bubeck and Sellke in terms of a bias-variance type decomposition, and show that this view directly unlocks a generalization to Bregman divergence losses (even for vector-valued responses), without the use of tools such as Rademacher complexity or the Ledoux-Talagrand contraction principle. Bregman divergences are a natural class of losses: for them, the best estimator is the conditional expectation of the response given the covariate, and they include other practical losses such as the cross-entropy loss. Our work thus gives a more general understanding of the main proof technique of Bubeck and Sellke and demonstrates its broad utility.
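
To make the Bregman-divergence claims concrete, the sketch below checks two standard facts numerically on a toy sample: that square loss and cross-entropy both arise as Bregman divergences $D_\phi(y, a) = \phi(y) - \phi(a) - \langle \nabla\phi(a), y - a \rangle$, and that the average loss splits as $\mathbb{E}[D_\phi(Y, a)] = \mathbb{E}[D_\phi(Y, \mu)] + D_\phi(\mu, a)$ with $\mu = \mathbb{E}[Y]$, which is why the mean is the best constant predictor. This is an illustration of the general identity only, not the paper's proof; all function names (`bregman`, `decomposition_gap`, and so on) and the toy data are assumptions made for this example.

```python
import math

def bregman(phi, grad_phi, y, a):
    """Bregman divergence D_phi(y, a) = phi(y) - phi(a) - <grad phi(a), y - a>."""
    inner = sum(g * (yi - ai) for g, yi, ai in zip(grad_phi(a), y, a))
    return phi(y) - phi(a) - inner

# phi(x) = ||x||^2 generates the squared Euclidean distance ||y - a||^2.
def sq_norm(x):
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]

# phi(p) = sum_i p_i log p_i (negative entropy) generates the KL divergence
# on the probability simplex; for a one-hot y this is the cross-entropy -log a_k.
def neg_entropy(p):
    return sum(v * math.log(v) for v in p if v > 0)  # convention: 0 log 0 = 0

def neg_entropy_grad(p):
    return [math.log(v) + 1.0 for v in p]  # requires strictly positive entries

def mean_vec(ys):
    n = len(ys)
    return [sum(col) / n for col in zip(*ys)]

def decomposition_gap(phi, grad_phi, ys, a):
    """Return mean D(y, a) - (mean D(y, mu) + D(mu, a)); zero up to rounding."""
    mu = mean_vec(ys)
    total = sum(bregman(phi, grad_phi, y, a) for y in ys) / len(ys)
    noise = sum(bregman(phi, grad_phi, y, mu) for y in ys) / len(ys)
    bias = bregman(phi, grad_phi, mu, a)
    return total - (noise + bias)

# Toy vector-valued responses: one-hot labels over 3 classes (hypothetical data).
ONE_HOT = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
samples = [ONE_HOT[k] for k in [0] * 5 + [1] * 3 + [2] * 2]
prediction = [0.2, 0.5, 0.3]  # an arbitrary strictly positive probability vector

gap_sq = decomposition_gap(sq_norm, sq_norm_grad, samples, prediction)
gap_kl = decomposition_gap(neg_entropy, neg_entropy_grad, samples, prediction)
print(f"square-loss gap: {gap_sq:.2e}, cross-entropy gap: {gap_kl:.2e}")
```

Because the second term $D_\phi(\mu, a)$ is nonnegative and vanishes at $a = \mu$, the decomposition shows directly that no predictor beats the (conditional) mean in expected Bregman loss, which is the benchmark that defines interpolation in the abstract above.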