🤖 AI Summary
Reliable uncertainty quantification is critical for deep learning models in out-of-distribution (OOD) and safety-critical applications, yet Bayesian approaches suffer from challenges in explicit prior specification and high computational cost. This paper proposes an implicit regularization-based variational inference framework that requires no handcrafted priors. It theoretically establishes, for the first time, that standard stochastic gradient descent (SGD) on overparameterized linear models is equivalent to generalized variational inference, and identifies the network parameterization as the key determinant of the induced implicit regularizer. The method relies only on standard SGD training, introducing beneficial inductive biases automatically and without additional hyperparameter tuning. Empirically, it significantly improves uncertainty calibration and OOD detection on both in-distribution and OOD benchmarks, while incurring negligible time and memory overhead compared to standard training.
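The implicit-regularization phenomenon the paper builds on can be seen in the simplest overparameterized setting. The sketch below (an illustration of the general idea, not the paper's method) runs plain gradient descent on an underdetermined least-squares problem from a zero initialization; among the infinitely many interpolating solutions, it converges to the minimum-norm one, i.e. the optimizer itself supplies the regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100  # overparameterized: far more parameters than data points
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Plain gradient descent on the squared loss, starting from zero.
# Step size chosen below the stability threshold 2 / ||X||_2^2.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-norm interpolating solution, computed in closed form.
w_min_norm = np.linalg.pinv(X) @ y

# Gradient descent recovers it without any explicit penalty term.
print(np.allclose(w, w_min_norm, atol=1e-6))
```

Because the iterates never leave the row space of `X` when initialized at zero, the limit is the least-norm interpolant; the paper's contribution is characterizing the analogous bias of (stochastic) gradient descent on variational objectives, where the choice of parameterization determines which implicit regularizer appears.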
📝 Abstract
Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters and optimization procedure. However, deploying deep learning models out-of-distribution, in sequential decision-making tasks, or in safety-critical domains, necessitates reliable uncertainty quantification, not just a point estimate. The machinery of modern approximate inference -- Bayesian deep learning -- should answer the need for uncertainty quantification, but its effectiveness has been challenged by our inability to define useful explicit inductive biases through priors, as well as the associated computational burden. Instead, in this work we demonstrate, both theoretically and empirically, how to regularize a variational deep network implicitly via the optimization procedure, just as for standard deep learning. We fully characterize the inductive bias of (stochastic) gradient descent in the case of an overparametrized linear model as generalized variational inference and demonstrate the importance of the choice of parametrization. Finally, we show empirically that our approach achieves strong in- and out-of-distribution performance without tuning of additional hyperparameters and with minimal time and memory overhead over standard deep learning.