🤖 AI Summary
Why do layer-wise stacking and parameter copying in deep residual networks significantly improve training efficiency?
Method: The authors establish a rigorous theoretical equivalence between stacking and Nesterov's accelerated gradient descent, introducing a potential-function analysis that tolerates errors in the updates and thereby extends beyond the reach of classical acceleration theory.
Contribution/Results: They provide the first formal convergence proof for deep linear residual networks trained via stacking, demonstrating a square-root speedup in training complexity. Proof-of-concept experiments validate both the acceleration effect and its robustness to initialization errors. This work unifies the optimization principles underlying two empirically successful but previously disparate heuristics: stacking in residual networks and warm-start classifier initialization in boosting, revealing their shared foundation in implicit acceleration.
📝 Abstract
Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential-function analysis of Nesterov's accelerated gradient method that allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.
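The acceleration the paper attributes to stacking is the classical gap between Nesterov's accelerated gradient method and plain gradient descent. Below is a minimal, hypothetical sketch (not the paper's construction; all names and parameters are illustrative) comparing the two on an ill-conditioned quadratic, where the accelerated method needs roughly the square root as many iterations:

```python
import numpy as np

# Illustrative sketch (not the paper's construction): Nesterov's accelerated
# gradient vs. plain gradient descent on an ill-conditioned quadratic
# f(w) = 0.5 * w^T A w, whose unique minimizer is w = 0.

def steps_to_converge(method, A, max_steps=20_000, tol=1e-6):
    lam = np.linalg.eigvalsh(A)
    mu, L = lam.min(), lam.max()           # strong convexity / smoothness
    eta = 1.0 / L                          # standard step size
    kappa = L / mu                         # condition number
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # Nesterov momentum
    w = np.ones(A.shape[0])
    v = w.copy()                           # extrapolation point
    for t in range(1, max_steps + 1):
        if method == "gd":
            w = w - eta * (A @ w)
        else:  # "nag"
            w_next = v - eta * (A @ v)     # gradient step at extrapolated point
            v = w_next + beta * (w_next - w)
            w = w_next
        if np.linalg.norm(A @ w) < tol:    # gradient-norm stopping rule
            return t
    return max_steps

A = np.diag(np.linspace(1e-3, 1.0, 50))   # condition number 1000
print(steps_to_converge("gd", A), steps_to_converge("nag", A))
```

On this example gradient descent needs on the order of the condition number many iterations, while the accelerated method needs on the order of its square root; the paper's potential-function analysis argues that a similar gap survives the errors in updates that stacking's copy-initialization introduces.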