🤖 AI Summary
The scaling factor in residual connections of ResNets critically influences generalization, yet its mechanistic role and robustness across hyperparameter configurations remain poorly understood.
Method: We establish the first finite-width field-theoretic framework for ResNets and analytically derive the input response function to characterize signal propagation.
Contribution/Results: Our theory reveals that the empirically optimal scaling interval corresponds to the regime of maximal input sensitivity. Moreover, the optimal scaling value depends only weakly on network depth and weight variance, which explains its empirical stability across diverse hyperparameter settings. This work provides the first analytical solution for the residual scaling factor and yields interpretable, theoretically grounded guidelines for its selection, connecting empirical practice with a rigorous understanding of signal propagation in deep residual networks.
📝 Abstract
Residual networks have significantly better trainability, and thus performance, than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters are not yet understood. For feed-forward networks, finite-size theories have led to important insights into signal propagation and hyperparameter tuning. Here we derive a systematic finite-size field theory for residual networks to study signal propagation and its dependence on the scaling of the residual branch. We derive analytical expressions for the response function, a measure of the network's sensitivity to inputs, and show that for deep networks the empirically found values of the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a theoretical framework to study ResNets at finite size.
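To make the notion of input sensitivity concrete, the following is a minimal numerical sketch, not the paper's field-theoretic calculation. It assumes a toy residual update h_{l+1} = h_l + rho * tanh(W_l h_l) with Gaussian weights of variance sigma_w^2 / width, and it uses a finite-difference estimate as a crude proxy for the response function. The update rule, the tanh nonlinearity, and the names `residual_forward`, `input_sensitivity`, `make_weights`, `rho`, and `sigma_w` are all illustrative assumptions rather than definitions taken from the abstract.

```python
import numpy as np


def make_weights(depth, width, sigma_w, seed=0):
    """Draw one Gaussian weight matrix per layer (assumed variance sigma_w^2 / width)."""
    rng = np.random.default_rng(seed)
    scale = sigma_w / np.sqrt(width)
    return [rng.standard_normal((width, width)) * scale for _ in range(depth)]


def residual_forward(x, weights, rho):
    """Toy residual network (assumed form): h <- h + rho * tanh(W h) for each layer."""
    h = np.asarray(x, dtype=float)
    for W in weights:
        h = h + rho * np.tanh(W @ h)
    return h


def input_sensitivity(x, weights, rho, eps=1e-4, seed=0):
    """Finite-difference proxy for sensitivity: |f(x + eps*d) - f(x)| / eps
    for a random unit-norm perturbation direction d."""
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(np.shape(x))
    d /= np.linalg.norm(d)
    out = residual_forward(x, weights, rho)
    out_perturbed = residual_forward(x + eps * d, weights, rho)
    return np.linalg.norm(out_perturbed - out) / eps


if __name__ == "__main__":
    width, depth = 64, 20
    weights = make_weights(depth, width, sigma_w=1.0)
    x = np.random.default_rng(1).standard_normal(width)
    # Scan a few residual scalings and print the sensitivity estimate for each.
    for rho in (0.0, 0.1, 0.5, 1.0):
        print(f"rho={rho:.1f}  sensitivity={input_sensitivity(x, weights, rho):.4f}")
```

With `rho = 0` the residual branch is switched off, the network reduces to the identity map, and the estimate is 1 up to floating-point error. Scanning `rho` for a single weight draw only hints at how sensitivity varies; the analysis described in the abstract concerns analytical expressions at finite size, not a single-sample estimate like this one.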