Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks

📅 2024-07-04
🏛️ International Conference on Machine Learning
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the origins of generalization in overparameterized neural networks, aiming to disentangle the relative contributions of SGD-induced optimization bias and architecture-induced bias (governed by depth and width) to generalization performance. Method: controlled ablation experiments across the width and depth dimensions, comparing randomly initialized networks that achieve zero training error with their SGD-trained counterparts. Contribution/Results: the paper provides empirical evidence that the improved generalization seen with increased width arises primarily from SGD's implicit regularization, whereas the degraded generalization seen with increased depth stems from an architectural bias that is largely independent of optimization dynamics. These conclusions hold in the low-sample regime; notably, random and SGD-trained networks exhibit similar depth-dependent generalization trends, pointing to the dominance of architectural bias in depth. The study delineates the respective roles of optimization and architecture biases, offering experimental evidence toward understanding generalization in deep learning.

📝 Abstract
Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic gradient descent (SGD) and a possible simplicity bias arising from the neural network architecture. The goal of this paper is to disentangle the factors that influence generalization stemming from optimization and architectural choices by studying random and SGD-optimized networks that achieve zero training error. We experimentally show, in the low sample regime, that overparameterization in terms of increasing width is beneficial for generalization, and this benefit is due to the bias of SGD and not due to an architectural bias. In contrast, for increasing depth, overparameterization is detrimental for generalization, but random and SGD-optimized networks behave similarly, so this can be attributed to an architectural bias. For more information, see https://bias-sgd-or-architecture.github.io .
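The abstract's core experimental design can be illustrated with a minimal numpy sketch: produce two interpolating (zero-training-error) networks of the same architecture, one by rejection-sampling random initializations and one by gradient training, then compare their test error. The toy dataset, one-hidden-layer architecture, and all hyperparameters below are hypothetical stand-ins, not the paper's actual benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny linearly separable 2-D dataset: label = sign of x[0].
X_train = np.array([[-1.0, 0.5], [-0.5, -1.0], [1.0, 0.3], [0.6, -0.8]])
y_train = np.array([-1, -1, 1, 1])
X_test = rng.normal(size=(200, 2))
y_test = np.sign(X_test[:, 0])

WIDTH = 64  # width of the single hidden layer: the overparameterization knob

def predict(params, X):
    W1, b1, w2 = params
    h = np.tanh(X @ W1 + b1)
    return np.sign(h @ w2)

def zero_one_error(params, X, y):
    return float(np.mean(predict(params, X) != y))

def random_params():
    return (rng.normal(size=(2, WIDTH)) / np.sqrt(2),
            rng.normal(size=WIDTH) * 0.1,
            rng.normal(size=WIDTH) / np.sqrt(WIDTH))

# "Random" network: rejection-sample initializations until one interpolates
# the training set; no gradient-based optimization is involved.
for _ in range(10000):
    random_net = random_params()
    if zero_one_error(random_net, X_train, y_train) == 0:
        break

# "SGD" network: full-batch gradient steps on the logistic loss from a
# fresh initialization until it also interpolates.
W1, b1, w2 = random_params()
n = len(y_train)
for _ in range(2000):
    h = np.tanh(X_train @ W1 + b1)           # hidden activations
    margins = y_train * (h @ w2)
    g = -y_train / (1 + np.exp(margins))     # d(logistic loss)/d(logit)
    grad_w2 = h.T @ g / n
    g_h = np.outer(g, w2) * (1 - h ** 2)     # backprop through tanh
    grad_W1 = X_train.T @ g_h / n
    grad_b1 = g_h.mean(axis=0)
    W1 -= 0.5 * grad_W1
    b1 -= 0.5 * grad_b1
    w2 -= 0.5 * grad_w2
sgd_net = (W1, b1, w2)

# Both fit the training data perfectly; the question the paper asks is
# which one generalizes better on held-out data.
err_rand_train = zero_one_error(random_net, X_train, y_train)
err_sgd_train = zero_one_error(sgd_net, X_train, y_train)
err_rand_test = zero_one_error(random_net, X_test, y_test)
err_sgd_test = zero_one_error(sgd_net, X_test, y_test)
```

In this sketch the width/depth ablation would amount to repeating the comparison while varying `WIDTH` or stacking more hidden layers, and tracking how the test-error gap between the random and SGD-trained interpolators changes.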
Problem

Research questions and friction points this paper is trying to address.

Neural Networks
Stochastic Gradient Descent (SGD)
Learning Adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Networks
SGD Training
Network Width and Depth