🤖 AI Summary
Why do layer-wise stacking and parameter copying in deep residual networks significantly improve training efficiency?
Method: The authors establish a rigorous theoretical equivalence between stacking and Nesterov's accelerated gradient descent, introducing a potential-function analysis that tolerates errors in the updates and thereby extends beyond the reach of classical acceleration theory.
Contribution/Results: They provide the first formal convergence proof for deep linear residual networks trained via stacking, demonstrating a square-root speedup in training complexity. Proof-of-concept experiments validate both the acceleration effect and its robustness to initialization errors. This work unifies the optimization principles underlying two empirically successful but previously disparate heuristics: stacking in residual networks and warm-start classifier initialization in boosting, revealing their shared foundation in implicit acceleration.
📝 Abstract
Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential-function analysis of Nesterov's accelerated gradient method that allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.
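The acceleration the paper attributes to stacking is the classical gap between Nesterov's accelerated gradient method and plain gradient descent. Below is a minimal, hypothetical sketch (not the paper's construction; all names and parameters are illustrative) comparing the two on an ill-conditioned quadratic, where the accelerated method needs roughly the square root as many iterations:

```python
import numpy as np

# Illustrative sketch (not the paper's construction): Nesterov's accelerated
# gradient vs. plain gradient descent on an ill-conditioned quadratic
# f(w) = 0.5 * w^T A w, whose unique minimizer is w = 0.

def steps_to_converge(method, A, max_steps=20_000, tol=1e-6):
    lam = np.linalg.eigvalsh(A)
    mu, L = lam.min(), lam.max()           # strong convexity / smoothness
    eta = 1.0 / L                          # standard step size
    kappa = L / mu                         # condition number
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # Nesterov momentum
    w = np.ones(A.shape[0])
    v = w.copy()                           # extrapolation point
    for t in range(1, max_steps + 1):
        if method == "gd":
            w = w - eta * (A @ w)
        else:  # "nag"
            w_next = v - eta * (A @ v)     # gradient step at extrapolated point
            v = w_next + beta * (w_next - w)
            w = w_next
        if np.linalg.norm(A @ w) < tol:    # gradient-norm stopping rule
            return t
    return max_steps

A = np.diag(np.linspace(1e-3, 1.0, 50))   # condition number 1000
print(steps_to_converge("gd", A), steps_to_converge("nag", A))
```

On this example gradient descent needs on the order of the condition number many iterations, while the accelerated method needs on the order of its square root; the paper's potential-function analysis argues that a similar gap survives the errors in updates that stacking's copy-initialization introduces.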