🤖 AI Summary
This work addresses gradient vanishing/exploding issues in deep ResNets as depth increases, systematically investigating training stability mechanisms in ultra-deep networks. Leveraging probabilistic analysis and continuous-limit modeling, we rigorously prove that, under standard i.i.d. initialization, the layer-wise output scaling factor α_L = 1/√L is the unique nontrivial stable scaling regime; moreover, this scaling induces a Neural Stochastic Differential Equation (Neural SDE) as the continuous limit, challenging the conventional belief that deep ResNets are discretizations of Neural ODEs. We further uncover a strong coupling between the scaling factor and the regularity of the weights as a function of layer index. Supported by theoretical derivation and large-scale initialization/scaling experiments, our framework distinguishes three regimes: gradient explosion, stable training, and near-identity degeneration. Crucially, we establish that the scaling α_L and the weight regularity jointly govern performance, both before and after training.
📝 Abstract
Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d.~initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. In the continuous-time limit, this scaling factor corresponds to a neural stochastic differential equation, contrary to the widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the neural ODE regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between the scaling factor and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
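The three regimes at initialization can be checked numerically. The sketch below (not from the paper; the linear residual layers, width, and depth are illustrative assumptions) iterates $h_{l+1} = h_l + \alpha_L W_l h_l$ with i.i.d. Gaussian weights and compares the output norm for $\alpha_L \in \{1, \frac{1}{\sqrt{L}}, \frac{1}{L}\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 200  # width and depth (illustrative choices)

def final_norm(alpha):
    """Output norm of a toy linear residual network
    h_{l+1} = h_l + alpha * W_l @ h_l with i.i.d. Gaussian W_l."""
    h = np.ones(d) / np.sqrt(d)  # unit-norm input
    for _ in range(L):
        # standard i.i.d. initialization: entries N(0, 1/d)
        W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
        h = h + alpha * (W @ h)
    return float(np.linalg.norm(h))

print(final_norm(1.0))        # alpha_L = 1: norm explodes with depth
print(final_norm(L ** -0.5))  # alpha_L = 1/sqrt(L): norm stays O(1)
print(final_norm(1.0 / L))    # alpha_L = 1/L: near-identity mapping
```

Under i.i.d. initialization the squared norm grows by a factor of roughly $1 + \alpha_L^2$ per layer, so only $\alpha_L = \frac{1}{\sqrt{L}}$ keeps the depth-$L$ product bounded away from both infinity and the identity, matching the claimed trichotomy.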