🤖 AI Summary
This work addresses gradient vanishing/exploding issues in deep ResNets as depth increases, systematically investigating training stability mechanisms in ultra-deep networks. Leveraging probabilistic analysis and continuous-limit modeling, we rigorously prove that, under standard i.i.d. initialization, the layer-wise output scaling factor α_L = 1/√L is the unique nontrivial stable scaling regime; moreover, this scaling induces a Neural Stochastic Differential Equation (Neural SDE) as the continuous limit, challenging the conventional belief that deep ResNets are discretizations of Neural ODEs. We further uncover a strong coupling between the scaling factor and the regularity of the weights as a function of layer index. Supported by theoretical derivation and large-scale initialization/scaling experiments, our framework distinguishes three regimes: gradient explosion, stable training, and near-identity degeneration. Crucially, we establish that the scaling α_L and the weight regularity jointly govern performance, both before and after training.
📝 Abstract
Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d.~initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. In the continuous-time limit, this scaling factor corresponds to a neural stochastic differential equation, contrary to the widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the neural ODE regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between the scaling factor and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
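The three regimes at initialization can be checked numerically. The sketch below (not from the paper; the linear residual layers, width, and depth are illustrative assumptions) iterates $h_{l+1} = h_l + \alpha_L W_l h_l$ with i.i.d. Gaussian weights and compares the output norm for $\alpha_L \in \{1, \frac{1}{\sqrt{L}}, \frac{1}{L}\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 200  # width and depth (illustrative choices)

def final_norm(alpha):
    """Output norm of a toy linear residual network
    h_{l+1} = h_l + alpha * W_l @ h_l with i.i.d. Gaussian W_l."""
    h = np.ones(d) / np.sqrt(d)  # unit-norm input
    for _ in range(L):
        # standard i.i.d. initialization: entries N(0, 1/d)
        W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
        h = h + alpha * (W @ h)
    return float(np.linalg.norm(h))

print(final_norm(1.0))        # alpha_L = 1: norm explodes with depth
print(final_norm(L ** -0.5))  # alpha_L = 1/sqrt(L): norm stays O(1)
print(final_norm(1.0 / L))    # alpha_L = 1/L: near-identity mapping
```

Under i.i.d. initialization the squared norm grows by a factor of roughly $1 + \alpha_L^2$ per layer, so only $\alpha_L = \frac{1}{\sqrt{L}}$ keeps the depth-$L$ product bounded away from both infinity and the identity, matching the claimed trichotomy.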